Chapter 8 Item Response Theory

In the chapter on reliability, we introduced classical test theory. Classical test theory is a measurement theory of how test scores relate to a construct. Classical test theory provides a way to estimate the relation between the measure (or item) and the construct. For instance, with a classical test theory approach, to estimate the relation between an item and the construct, you would compute an item–total correlation. An item–total correlation is the correlation of an item with the total score on the measure (e.g., sum score). The item–total correlation approximates the relation between an item and the construct. However, the item–total correlation is a crude estimate of the relation between an item and the construct. And there are many other ways to characterize the relation between an item and a construct. One such way is with item response theory (IRT).
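
As a rough sketch of this classical test theory approach (assuming the mydataIRT data frame of dichotomous item responses that is loaded later in Section 8.2.2), corrected item–total correlations can be computed by correlating each item with the sum of the remaining items; the resulting values should correspond to the total.r_if_rm column of the itemstats() output shown in Section 8.2.3.

Code
# Corrected item–total correlation: correlate each item with the
# sum score computed from the remaining items
sapply(colnames(mydataIRT), function(item){
  cor(mydataIRT[, item],
      rowSums(mydataIRT[, setdiff(colnames(mydataIRT), item)]))
})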

8.1 Overview of IRT

Unlike classical test theory, which is a measurement theory of how test scores relate to a construct, IRT is a measurement theory that describes how an item is related to a construct. For instance, given a particular person’s level on the construct, what is their chance of answering “TRUE” on a particular item?

IRT is an approach to latent variable modeling. In IRT, we estimate a person’s construct score (i.e., level on the construct) based on their item responses. The construct is estimated as a latent factor that represents the common variance among all items as in structural equation modeling or confirmatory factor analysis. The person’s level on the construct is called theta (\(\theta\)). When dealing with performance-based tests, theta is sometimes called “ability.”

8.1.1 Item Characteristic Curve

In IRT, we can plot an item characteristic curve (ICC). The ICC is a plot of the probability of a symptom being present (or of a correct response) as a function of a person’s standing on a latent continuum. For instance, we can create empirical ICCs, based on observed endorsement rates, that can take any shape (see Figure 8.1).
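
As a rough sketch (not the book’s plotting code), an empirical ICC can be computed by tabulating the proportion of respondents endorsing an item at each sum-score level; the example below assumes the mydataIRT data frame that is loaded later in Section 8.2.2.

Code
# Empirical ICC for Item.1: proportion endorsing the item at each
# level of the sum score
sumScore <- rowSums(mydataIRT)
endorsementByScore <- tapply(mydataIRT[, "Item.1"], sumScore, mean)

plot(as.numeric(names(endorsementByScore)), endorsementByScore,
     type = "b", xlab = "Sum Score", ylab = "Proportion Endorsing Item.1")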


Figure 8.1: Empirical Item Characteristic Curves of the Probability of Endorsement of a Given Item as a Function of the Person’s Sum Score.

In a model-implied ICC, we fit a logistic (sigmoid) curve to each item’s probability of a symptom being present as a function of a person’s level on the latent construct. The model-implied ICCs for the same 10 items from Figure 8.1 are depicted in Figure 8.2.


Figure 8.2: Item Characteristic Curves of the Probability of Endorsement of a Given Item as a Function of the Person’s Level on the Latent Construct.

ICCs can be summed across items to get the test characteristic curve (TCC):


Figure 8.3: Test Characteristic Curve of the Expected Total Score on the Test as a Function of the Person’s Level on the Latent Construct.

An ICC provides more information than an item–total correlation. Visually, we can see the utility of various items by looking at the items’ ICC plots. For instance, consider what might be a useless item for diagnostic purposes. For a particular item, among those with a low total score (level on the construct), 90% respond with “TRUE” to the item, whereas among everyone else, 100% respond with “TRUE” (see Figure 8.4). This item has a ceiling effect and provides only a little information about who would be considered above clinical threshold for a disorder. So, the item is not very clinically useful.


Figure 8.4: Item Characteristic Curve of an Item with a Ceiling Effect That is not Diagnostically Useful.

Now, consider a different item. For those with a low level on the construct, 0% respond with “TRUE”, so it has a floor effect and tells us nothing about the lower end of the construct. But for those with a higher level on the construct, 70% respond with “TRUE” (see Figure 8.5). So, the item tells us something about the higher end of the distribution and could be diagnostically useful. Thus, an ICC allows us to quickly gauge the utility of an item.


Figure 8.5: Item Characteristic Curve of an Item With a Floor Effect That is Diagnostically Useful.

8.1.2 Parameters

We can estimate up to four parameters in an IRT model and can glean up to four key pieces of information from an item’s ICC:

  1. Difficulty (severity)
  2. Discrimination
  3. Guessing
  4. Inattention/careless errors

8.1.2.1 Difficulty (Severity)

The item’s difficulty parameter is the item’s location on the latent construct. It is quantified by the intercept, i.e., the location on the x-axis of the inflection point of the ICC. In a 1- or 2-parameter model, the inflection point is where 50% of the sample endorses the item (or gets the item correct), that is, the point on the x-axis where the ICC crosses .5 probability on the y-axis (i.e., the level on the construct at which the probability of endorsing the item is equal to the probability of not endorsing the item). Item difficulty is similar to item means or intercepts in structural equation modeling or factor analysis. Some items are more useful at the higher levels of the construct, whereas other items are more useful at the lower levels of the construct. See Figure 8.6 for an example of an item with a low difficulty and an item with a high difficulty.


Figure 8.6: Item Characteristic Curves of an Item With Low Difficulty Versus High Difficulty. The dashed horizontal line indicates a probability of item endorsement of .50. The dashed vertical line is the item difficulty, i.e., the person’s level on the construct (the location on the x-axis) at the inflection point of the item characteristic curve. In a two-parameter logistic model, the inflection point corresponds to a probability of item endorsement of 50%. Thus, in a two-parameter logistic model, the difficulty of an item is the person’s level on the construct where the probability of endorsing the item is 50%.

When dealing with a measure of clinical symptoms (e.g., depression), the difficulty parameter is sometimes called severity, because symptoms that are endorsed less frequently tend to be more severe [e.g., suicidal behavior; Krueger et al. (2004)]. One way of thinking about the severity parameter of an item is: “How severe does your psychopathology have to be for half of people to endorse the symptom?”

When dealing with a measure of performance, aptitude, or intelligence, the parameter would be more likely to be called difficulty: “How high does your ability have to be for half of people to pass the item?” An item with a low difficulty would be considered easy, because even people with a low ability tend to pass the item. An item with a high difficulty would be considered difficult, because only people with a high ability tend to pass the item.
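
As a small numeric sketch (with hypothetical parameter values, using R’s standard logistic function), an easy item is passed with high probability by a person of average ability, a difficult item with low probability, and the probability is exactly .50 when the person’s theta equals the item’s difficulty.

Code
a <- 1                       # discrimination (hypothetical)
bEasy <- -2                  # low difficulty (easy item; hypothetical)
bHard <- 2                   # high difficulty (difficult item; hypothetical)
theta <- 0                   # person of average ability

plogis(a * (theta - bEasy))  # high probability of endorsement (~.88)
plogis(a * (theta - bHard))  # low probability of endorsement (~.12)
plogis(a * (bHard - bHard))  # probability is .50 when theta equals the difficulty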

8.1.2.2 Discrimination

The item’s discrimination parameter is how well the item can distinguish between those who are higher versus lower on the construct, that is, how strongly the item is correlated with the construct (i.e., the latent factor). It is similar to the factor loading in structural equation modeling or factor analysis. It is quantified by the slope of the ICC, i.e., the steepness of the line at its steepest point. The steeper the slope, the narrower the range of construct levels over which the item shifts from being mostly failed (or not endorsed) to mostly passed (or endorsed).

Some items have ICCs that go up fast (have a steep slope). These items provide a fine distinction between people with lower versus higher levels on the construct and therefore have high discrimination. Other items go up gradually (have a less steep slope); these items provide less precision and information and therefore have low discrimination. See Figure 8.7 for an example of an item with a low discrimination and an item with a high discrimination.


Figure 8.7: Item Characteristic Curves of an Item With Low Discrimination Versus High Discrimination. The discrimination of an item is the slope of the line at its inflection point.
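
Here is a similar sketch with hypothetical values: for two items with the same difficulty (b = 0), the item with higher discrimination separates people just below versus just above the difficulty much more sharply.

Code
theta <- c(-0.5, 0.5)      # people just below and just above the difficulty (b = 0)

plogis(0.5 * (theta - 0))  # low discrimination (a = 0.5): ~.44 vs. ~.56
plogis(2.5 * (theta - 0))  # high discrimination (a = 2.5): ~.22 vs. ~.78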

8.1.2.3 Guessing

The item’s guessing parameter is reflected by the lower asymptote of the ICC. If the item has a lower asymptote above zero, it suggests that the probability of getting the item correct (or endorsing the item) never reaches zero, for any level of the construct. On an educational test, this could correspond to the person’s likelihood of being able to answer the item correctly by chance just by guessing. For example, for a 4-option multiple choice test, a respondent would be expected to get a given item correct 25% of the time just by guessing. See Figure 8.8 for an example of an item from a true/false exam and Figure 8.9 for an example of an item from a 4-option multiple choice exam.


Figure 8.8: Item Characteristic Curve of an Item From a True/False Exam, Where Test Takers Get the Item Correct at Least 50% of the Time.


Figure 8.9: Item Characteristic Curve of an Item From a 4-Option Multiple Choice Exam, Where Test Takers Get the Item Correct at Least 25% of the Time.

8.1.2.4 Inattention/Careless Errors

The item’s inattention (or careless error) parameter is reflected by the upper asymptote of the ICC. If the item has an upper asymptote below one, it suggests that the probability of getting the item correct (or endorsing the item) never reaches one, for any level on the construct. See Figure 8.10 for an example of an item whose probability of endorsement (or of getting it correct) never exceeds .85.


Figure 8.10: Item Characteristic Curve of an Item Where the Probability of Getting an Item Correct Never Exceeds .85.

8.1.3 Models

IRT models can be fit that estimate one or more of these four item parameters.

8.1.3.1 1-Parameter and Rasch Models

A Rasch model estimates the item difficulty parameter and holds everything else fixed across items. It fixes the item discrimination to be one for each item. In the Rasch model, the probability that a person \(j\) with a level on the construct of \(\theta\) gets a score of one (instead of zero) on item \(i\), based on the difficulty (\(b\)) of the item, is estimated using Equation (8.1):

\[\begin{equation} P(X = 1|\theta_j, b_i) = \frac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}} \tag{8.1} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the fourPL() function that estimates the probability of item endorsement as a function of the item characteristics from the Rasch model and the person’s level on the construct (theta). To estimate the probability of endorsement from the Rasch model, specify \(b\) and \(\theta\), while keeping the defaults for the other parameters.

Code
library("petersenlab") #to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
Code
fourPL <- function(a = 1, b, c = 0, d = 1, theta){
  c + (d - c) * (exp(a * (theta - b))) / (1 + exp(a * (theta - b)))
}
Code
fourPL(b, theta)
Code
fourPL(b = 1, theta = 0)
[1] 0.2689414

A one-parameter logistic (1-PL) IRT model, similar to a Rasch model, estimates the item difficulty parameter, and holds everything else fixed across items (see Figure 8.11). The one-parameter logistic model holds the item discrimination fixed across items, but does not fix it to one, unlike the Rasch model.

In the one-parameter logistic model, the probability that a person \(j\) with a level on the construct of \(\theta\) gets a score of one (instead of zero) on item \(i\), based on the difficulty (\(b\)) of the item and the items’ (fixed) discrimination (\(a\)), is estimated using Equation (8.2):

\[\begin{equation} P(X = 1|\theta_j, b_i, a) = \frac{e^{a(\theta_j - b_i)}}{1 + e^{a(\theta_j - b_i)}} \tag{8.2} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the fourPL() function that estimates the probability of item endorsement as a function of the item characteristics from the one-parameter logistic model and the person’s level on the construct (theta). To estimate the probability of endorsement from the one-parameter logistic model, specify \(a\), \(b\), and \(\theta\), while keeping the defaults for the other parameters.

Code
fourPL(a, b, theta)
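
For example, with hypothetical values for the (common) discrimination and an item’s difficulty:

Code
fourPL(a = 1.5, b = 1, theta = 0)
[1] 0.1824255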

Rasch and one-parameter logistic models are common and are the easiest to fit. However, they make a fairly strict assumption: that all items have the same discrimination.


Figure 8.11: One-Parameter Logistic Model in Item Response Theory.

A one-parameter logistic model is only valid if the empirical ICCs of the items do not cross (see Figure 8.12).


Figure 8.12: Empirical Item Characteristic Curves of the Probability of Endorsement of a Given Item as a Function of the Person’s Sum Score. The empirical item characteristic curves of these items do not cross each other.

8.1.3.2 2-Parameter

A two-parameter logistic (2-PL) IRT model estimates item difficulty and discrimination, and it holds the asymptotes fixed across items (see Figure 8.13). Two-parameter logistic models are also common.

In the two-parameter logistic model, the probability that a person \(j\) with a level on the construct of \(\theta\) gets a score of one (instead of zero) on item \(i\), based on the difficulty (\(b\)) and discrimination (\(a\)) of the item, is estimated using Equation (8.3):

\[\begin{equation} P(X = 1|\theta_j, b_i, a_i) = \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}} \tag{8.3} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the fourPL() function that estimates the probability of item endorsement as a function of the item characteristics from the two-parameter logistic model and the person’s level on the construct (theta). To estimate the probability of endorsement from the two-parameter logistic model, specify \(a\), \(b\), and \(\theta\), while keeping the defaults for the other parameters.

Code
fourPL(a, b, theta)
Code
fourPL(a = 0.6, b = 0, theta = -1)
[1] 0.3543437

Figure 8.13: Two-Parameter Logistic Model in Item Response Theory.

8.1.3.3 3-Parameter

A three-parameter logistic (3-PL) IRT model estimates item difficulty, discrimination, and guessing (lower asymptote), and it holds the upper asymptote fixed across items (see Figure 8.14). This model provides information about where an item bottoms out (i.e., the item’s lower asymptote). Three-parameter logistic models are less commonly estimated because the guessing parameter adds considerable computational complexity and requires a large sample size, and it is often not as important as difficulty and discrimination. Nevertheless, 3-parameter logistic models are sometimes estimated in the education literature to account for getting items correct by random guessing.

In the three-parameter logistic model, the probability that a person \(j\) with a level on the construct of \(\theta\) gets a score of one (instead of zero) on item \(i\), based on the difficulty (\(b\)), discrimination (\(a\)), and guessing parameter (\(c\)) of the item, is estimated using Equation (8.4):

\[\begin{equation} P(X = 1|\theta_j, b_i, a_i, c_i) = c_i + (1 - c_i) \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}} \tag{8.4} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the fourPL() function that estimates the probability of item endorsement as a function of the item characteristics from the three-parameter logistic model and the person’s level on the construct (theta). To estimate the probability of endorsement from the three-parameter logistic model, specify \(a\), \(b\), \(c\), and \(\theta\), while keeping the defaults for the other parameters.

Code
fourPL(a, b, c, theta)
Code
fourPL(a = 0.8, b = -1, c = .25, theta = -1)
[1] 0.625

Figure 8.14: Three-Parameter Logistic Model in Item Response Theory.

8.1.3.4 4-Parameter

A four-parameter logistic (4-PL) IRT model estimates item difficulty, discrimination, guessing, and careless errors (see Figure 8.15). The fourth parameter adds considerable computational complexity and is rare to estimate.

In the four-parameter logistic model, the probability that a person \(j\) with a level on the construct of \(\theta\) gets a score of one (instead of zero) on item \(i\), based on the difficulty (\(b\)), discrimination (\(a\)), guessing parameter (\(c\)), and careless error parameter (\(d\)) of the item, is estimated using Equation (8.5) (Magis, 2013):

\[\begin{equation} P(X = 1|\theta_j, b_i, a_i, c_i, d_i) = c_i + (d_i - c_i) \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}} \tag{8.5} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the fourPL() function that estimates the probability of item endorsement as a function of the item characteristics from the four-parameter logistic model and the person’s level on the construct (theta). To estimate the probability of endorsement from the four-parameter logistic model, specify \(a\), \(b\), \(c\), \(d\), and \(\theta\).

Code
fourPL(a, b, c, d, theta)
Code
fourPL(a = 1.5, b = 1, c = .15, d = 0.85, theta = 3)
[1] 0.8168019

Figure 8.15: Four-Parameter Logistic Model in Item Response Theory.

8.1.3.5 Graded Response Model

Graded response models and generalized partial credit models can be estimated with one, two, three, or four parameters. However, they use polytomous data (not dichotomous data), as described in the section below.

The two-parameter graded response model takes the general form of Equation (8.6):

\[\begin{equation} P(X_{ji} = x_{ji}|\theta_j) = P^*_{x_{ji}}(\theta_j) - P^*_{x_{ji} + 1}(\theta_j) \tag{8.6} \end{equation}\]

where:

\[\begin{equation} P^*_{x_{ji}}(\theta_j) = P(X_{ji} \geq x_{ji}|\theta_j, b_{ic}, a_i) = \frac{e^{a_i(\theta_j - b_{ic})}}{1 + e^{a_i(\theta_j - b_{ic})}} \tag{8.7} \end{equation}\]

In the model, \(a_i\) is an item-specific discrimination parameter, \(b_{ic}\) is an item- and category-specific difficulty (threshold) parameter, and \(\theta_j\) is an estimate of a person’s standing on the latent variable. In the model, \(i\) indexes items, \(c\) indexes the rated response categories, and \(j\) indexes participants.
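
As a minimal sketch with hypothetical parameter values, the category probabilities implied by Equations (8.6) and (8.7) can be computed by differencing the boundary probabilities, where the boundary probability for the lowest category is 1 and the boundary probability above the highest category is 0:

Code
a <- 1.5              # item discrimination (hypothetical)
b <- c(-1, 0, 1.5)    # thresholds: k = 4 categories, so k - 1 = 3 thresholds (hypothetical)
theta <- 0.5          # person's level on the construct

# Boundary probabilities P*(X >= category), per Equation (8.7)
pStar <- c(1, plogis(a * (theta - b)), 0)

# Category probabilities, per Equation (8.6)
categoryProbabilities <- pStar[-length(pStar)] - pStar[-1]
categoryProbabilities
sum(categoryProbabilities)  # category probabilities sum to 1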

8.1.4 Type of Data

IRT models are most commonly estimated with binary or dichotomous data. That is, the measures have questions or items with two response categories, or responses that can be collapsed into two groups (e.g., true/false, correct/incorrect, endorsed/not endorsed). IRT models can also be estimated with polytomous data (e.g., Likert scales), which adds computational complexity. IRT models with polytomous data can be fit with a graded response model or a generalized partial credit model.

For example, see Figure 8.16 for item boundary characteristic curves for an item from a 5-level Likert scale (based on a cumulative distribution). If an item has \(k\) response categories, it has \(k - 1\) thresholds. For example, an item with a 5-level Likert scale (1 = strongly disagree; 2 = disagree; 3 = neither agree nor disagree; 4 = agree; 5 = strongly agree) has 4 thresholds: one from 1–2, one from 2–3, one from 3–4, and one from 4–5. An item boundary characteristic curve is the probability that a person selects a response category higher than a given category of a polytomous item. As depicted, one Likert scale item does the equivalent work of 4 binary items. See Figure 8.17 for the same 5-level Likert scale item plotted with item response category characteristic curves (based on a static, non-cumulative distribution).


Figure 8.16: Item Boundary Characteristic Curves From Two-Parameter Graded Response Model in Item Response Theory.


Figure 8.17: Item Response Category Characteristic Curves From Two-Parameter Graded Response Model in Item Response Theory.

IRT does not handle continuous data well, with some exceptions (Y. Chen et al., 2019) such as in a Bayesian framework (Bürkner, 2021). If you want to use continuous data, you might consider moving to a factor analysis framework.

8.1.5 Sample Size

Sample size requirements depend on the complexity of the model. A 1-parameter model often requires ~100 participants. A 2-parameter model often requires ~1,000 participants. A 3-parameter model often requires ~10,000 participants.

8.1.6 Reliability (Information)

IRT conceptualizes reliability in a different way than classical test theory does. Both IRT and classical test theory conceptualize reliability as involving the precision of a measure’s scores. In classical test theory, (im)precision—as operationalized by the standard error of measurement—is estimated with a single index across the whole range of the construct. That is, in classical test theory, the same standard error of measurement applies to all scores in the population (Embretson, 1996). However, IRT estimates how much measurement precision (information) or imprecision (standard error of measurement) each item, and the test as a whole, has at different construct levels. This allows IRT to conceptualize reliability in such a way that precision/reliability can differ at different construct levels, unlike in classical test theory (Embretson, 1996). Thus, IRT does not have one index of reliability; rather, its estimate of reliability differs at different levels on the construct.

Based on an item’s difficulty and discrimination, we can calculate how much information each item provides. In IRT, information is how much measurement precision or consistency an item (or the measure) provides. In other words, information is the degree to which an item (or measure) reduces the standard error of measurement, that is, how much it reduces uncertainty of a person’s level on the construct. As a reminder (from Equation (4.11)), the standard error of measurement is calculated as:

\[ \text{standard error of measurement (SEM)} = \sigma_x \sqrt{1 - r_{xx}} \]

where \(\sigma_x = \text{standard deviation of observed scores on the item } x\), and \(r_{xx} = \text{reliability of the item } x\). The standard error of measurement is used to generate confidence intervals for people’s scores. In IRT, the standard error of measurement (at a given construct level) can be calculated as the inverse of the square root of the amount of test information at that construct level, as in Equation (8.8):

\[\begin{equation} \text{SEM}(\theta) = \frac{1}{\sqrt{\text{information}(\theta)}} \tag{8.8} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the standardErrorIRT() function that estimates the standard error of measurement at a person’s level on the construct (theta) from the amount of information that the item (or test) provides.

Code
standardErrorIRT <- function(information){
  1/sqrt(information)
}
Code
standardErrorIRT(information)
Code
standardErrorIRT(0.6)
[1] 1.290994
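
For comparison, here is a quick example of the classical test theory standard error of measurement from Equation (4.11), using hypothetical values (a standard deviation of 15 and a reliability of .90):

Code
15 * sqrt(1 - .90)
[1] 4.743416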

The standard error of measurement tends to be higher (i.e., reliability/information tends to be lower) at the extreme levels of the construct where there are fewer items.

The formula for information for item \(i\) at construct level \(\theta\) in a Rasch model is in Equation (8.9) (Baker & Kim, 2017):

\[\begin{equation} \text{information}_i(\theta) = P_i(\theta)Q_i(\theta) \tag{8.9} \end{equation}\]

where \(P_i(\theta)\) is the probability of getting a one instead of a zero on item \(i\) at a given level on the latent construct, and \(Q_i(\theta) = 1 - P_i(\theta)\).

The petersenlab package (Petersen, 2024b) contains the itemInformation() function that estimates the amount of information an item provides as a function of the item characteristics from the Rasch model and the person’s level on the construct (theta). To estimate the amount of information an item provides in a Rasch model, specify \(b\) and \(\theta\), while keeping the defaults for the other parameters.

Code
itemInformation <- function(a = 1, b, c = 0, d = 1, theta){
  P <- NULL
  information <- NULL

  for(i in 1:length(theta)){
    P[i] <- fourPL(b = b, a = a, c = c, d = d, theta = theta[i])
    information[i] <- ((a^2) * (P[i] - c)^2 * (d - P[i])^2) / ((d - c)^2 * P[i] * (1 - P[i]))
  }

  return(information)
}
Code
itemInformation(b, theta)
Code
itemInformation(b = 1, theta = 0)
[1] 0.1966119

The formula for information for item \(i\) at construct level \(\theta\) in a two-parameter logistic model is in Equation (8.10) (Baker & Kim, 2017):

\[\begin{equation} \text{information}_i(\theta) = a^2_iP_i(\theta)Q_i(\theta) \tag{8.10} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the itemInformation() function that estimates the amount of information an item provides as a function of the item characteristics from the two-parameter logistic model and the person’s level on the construct (theta). To estimate the amount of information an item provides in a two-parameter logistic model, specify \(a\), \(b\), and \(\theta\), while keeping the defaults for the other parameters.

Code
itemInformation(a, b, theta)
Code
itemInformation(a = 0.6, b = 0, theta = -1)
[1] 0.08236233

The formula for information for item \(i\) at construct level \(\theta\) in a three-parameter logistic model is in Equation (8.11) (Baker & Kim, 2017):

\[\begin{equation} \text{information}_i(\theta) = a^2_i\bigg[\frac{Q_i(\theta)}{P_i(\theta)}\bigg]\bigg[\frac{(P_i(\theta) - c_i)^2}{(1 - c_i)^2}\bigg] \tag{8.11} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the itemInformation() function that estimates the amount of information an item provides as a function of the item characteristics from the three-parameter logistic model and the person’s level on the construct (theta). To estimate the amount of information an item provides in a three-parameter logistic model, specify \(a\), \(b\), \(c\), and \(\theta\), while keeping the defaults for the other parameters.

Code
itemInformation(a, b, c, theta)
Code
itemInformation(a = 0.8, b = -1, c = .25, theta = -1)
[1] 0.096

The formula for information for item \(i\) at construct level \(\theta\) in a four-parameter logistic model is in Equation (8.12) (Magis, 2013):

\[\begin{equation} \text{information}_i(\theta) = \frac{a^2_i[P_i(\theta) - c_i]^2[d_i - P_i(\theta)]^2}{(d_i - c_i)^2 P_i(\theta)[1 - P_i(\theta)]} \tag{8.12} \end{equation}\]

The petersenlab package (Petersen, 2024b) contains the itemInformation() function that estimates the amount of information an item provides as a function of the item characteristics from the four-parameter logistic model and the person’s level on the construct (theta). To estimate the amount of information an item provides in a four-parameter logistic model, specify \(a\), \(b\), \(c\), \(d\), and \(\theta\).

Code
itemInformation(a, b, c, d, theta)
Code
itemInformation(a = 1.5, b = 1, c = .15, d = 0.85, theta = 3)
[1] 0.01503727

Reliability at a given level of the construct (\(\theta\)) can be estimated as in Equation (8.13):

\[ \begin{aligned} \text{reliability}(\theta) &= \frac{\text{information}(\theta)}{\text{information}(\theta) + \sigma^2(\theta)} \\ &= \frac{\text{information}(\theta)}{\text{information}(\theta) + 1} \end{aligned} \tag{8.13} \]

where \(\sigma^2(\theta)\) is the variance of theta, which is fixed to one in most IRT models.

The petersenlab package (Petersen, 2024b) contains the reliabilityIRT() function that estimates the amount of reliability an item or a measure provides as a function of its information and the variance of people’s construct levels (\(\theta\)).

Code
reliabilityIRT <- function(information, varTheta = 1){
  information / (information + varTheta)
}
Code
reliabilityIRT(information, varTheta = 1)
Code
reliabilityIRT(10)
[1] 0.9090909

Consider some hypothetical items depicted with ICCs in Figure 8.18.


Figure 8.18: Item Characteristic Curves From Two-Parameter Logistic Model in Item Response Theory. The dashed horizontal line indicates a probability of item endorsement of .50. The dashed vertical line is the item difficulty, i.e., the person’s level on the construct (the location on the x-axis) at the inflection point of the item characteristic curve. In a two-parameter logistic model, the inflection point corresponds to a probability of item endorsement of 50%. Thus, in a two-parameter logistic model, the difficulty of an item is the person’s level on the construct where the probability of endorsing the item is 50%.

We can present the ICC in terms of an item information curve (see Figure 8.19). On the x-axis, the information peak is located at the difficulty/severity of the item. The higher the discrimination, the higher the information peak on the y-axis.


Figure 8.19: Item Information From Two-Parameter Logistic Model in Item Response Theory. The dashed vertical line is the item difficulty, which is located at the peak of the item information curve.

We can aggregate (sum) information across items to determine how much information the measure as a whole provides. This is called the test information curve (see Figure 8.20). Note that we get more information from Likert/multiple-response items than from binary/dichotomous items. Having 10 items with a 5-level response scale yields as much information as 40 dichotomous items.


Figure 8.20: Test Information Curve From Two-Parameter Logistic Model in Item Response Theory.
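
As a rough sketch of how item information aggregates into test information (using the itemInformation() function defined above and hypothetical two-parameter logistic item parameters), item information can be summed at each level of theta:

Code
discrimination <- c(1.2, 0.8, 2.0, 1.5)   # hypothetical a parameters
difficulty <- c(-1.5, -0.5, 0.5, 1.5)     # hypothetical b parameters
thetaGrid <- seq(-4, 4, by = 0.1)

# Test information: sum of item information at each level of theta
testInformation <- rowSums(sapply(seq_along(difficulty), function(i){
  itemInformation(a = discrimination[i], b = difficulty[i], theta = thetaGrid)
}))

plot(thetaGrid, testInformation, type = "l",
     xlab = "Theta", ylab = "Test Information")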

Based on test information, we can calculate the standard error of measurement (see Figure 8.21). Notice how the degree of (un)reliability differs at different construct levels.


Figure 8.21: Test Standard Error of Measurement From Two-Parameter Logistic Model in Item Response Theory.

Based on test information, we can estimate the reliability (see Figure 8.22). Notice how the degree of (un)reliability differs at different construct levels.


Figure 8.22: Test Reliability From Two-Parameter Logistic Model in Item Response Theory.

8.1.7 Efficient Assessment

One of the benefits of IRT is for item selection to develop brief assessments. For instance, you could use two items to estimate where the person is on the construct: low, middle, or high (see Figure 8.23). If the responses to the two items do not meet expectations, for instance, the person passes the difficult item but fails the easy item, we would keep assessing additional items to determine their level on the construct. If two items perform similarly, that is, they have the same difficulty and discrimination, they are redundant, and we can sacrifice one of them. This leads to greater efficiency and better measurement in terms of reliability and validity. For more information on designing and evaluating short forms compared to their full-scale counterparts, see Smith et al. (2000).


Figure 8.23: Visual Representation of an Efficient Assessment Based on Item Characteristic Curves from Two-Parameter Logistic Model in Item Response Theory.

IRT forms the basis of computerized adaptive testing, which is discussed in Chapter 21. As discussed earlier, briefer measures can increase reliability and validity of measurement if the items are tailored to the ability level of the participant. The idea of adaptive testing is that, instead of having a standard scale for all participants, the items adapt to each person. An example of a measure that has used computerized adaptive testing is the Graduate Record Examination (GRE).

With adaptive testing, it is important to develop a comprehensive item bank that spans the difficulty range of interest. The starting construct level is the 50th percentile. If the respondent gets the first item correct, the test moves to the next item that would provide the most information for the person, based on a split of the remaining sample (e.g., the 75th percentile). And so on… The goal of adaptive testing is to find the construct level at which the respondent has about a 50% chance of getting items right. Adaptive testing is a promising approach that saves time because it tailors which items are administered to which person (based on their construct level) to get the most reliable estimate in the shortest time possible. However, it assumes that if you get a more difficult item correct, you would have gotten easier items correct, which might not be true in all contexts (especially for constructs that are not unidimensional).
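
As a conceptual sketch (not the algorithm used by any operational adaptive test), one step of adaptive item selection can be implemented by choosing, from the items not yet administered, the item that provides the most information at the current theta estimate; the item bank below is hypothetical, and itemInformation() is the function defined earlier in this chapter.

Code
# Hypothetical bank of two-parameter logistic items
itemBank <- data.frame(
  item = paste0("item", 1:5),
  a = c(1.5, 1.0, 2.0, 1.2, 0.8),
  b = c(-2, -1, 0, 1, 2))

currentTheta <- 0.5  # current estimate of the person's level on the construct

# Information each item provides at the current theta estimate
itemBank$information <- sapply(seq_len(nrow(itemBank)), function(i){
  itemInformation(a = itemBank$a[i], b = itemBank$b[i], theta = currentTheta)
})

# Administer the item that is most informative at the current estimate
itemBank[which.max(itemBank$information), ]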

Although most uses of IRT have been in cognitive and educational testing, IRT may also benefit other domains of assessment including clinical assessment (Gibbons et al., 2016; Reise & Waller, 2009; Thomas, 2019).

8.1.7.1 A Good Measure

According to IRT, a good measure should:

  1. fit your goals of the assessment, in terms of the range of interest regarding levels on the construct,
  2. have good items that yield lots of information, and
  3. have a good set of items that densely cover the construct within the range of interest, without redundancy.

First, a good measure should fit your goals of the assessment, in terms of the “range of interest” or the “target range” of levels on the construct. For instance, if your goal is to perform diagnosis, you would only care about the high end of the construct (e.g., 1–3 standard deviations above the mean)—there is no use discriminating between “nothing”, “almost nothing”, and “a little bit.” For secondary prevention, i.e., early identification of risk to prevent something from getting worse, you would be interested in finding people with elevated risk—e.g., you would need to know who is 1 or more standard deviations above the mean, but you would not need to discriminate beyond that. For assessing individual differences, you would want items that discriminate across the full range, including at the lower end. The items’ difficulty should span the range of interest.

Second, a good measure should have good items that yield lots of information. For example, the items should have strong discrimination, that is, the items are strongly related to the construct. The items should have sufficient variability in responses. This can be achieved by having items with more response options (e.g., Likert/multiple-choice items, as opposed to binary items), items that differ in difficulty, and (at least some) items that are not too difficult or too easy (to avoid ceiling/floor effects).

Third, a good measure should have a good set of items that densely cover the construct within the range of interest, without redundancy. The items should not have the same difficulty or they would be considered redundant, and one of the redundant items could be dropped. The items’ difficulty should densely cover the construct within the range of interest. For instance, if the construct range of interest is 1–2 standard deviations above the mean, the items should have difficulty that densely cover this range (e.g., 1.0, 1.05, 1.10, 1.15, 1.20, 1.25, 1.30, …, 2.0).

With items that (1) span the range of interest, (2) have high discrimination and information, and (3) densely cover the range of interest without redundancy, the measure should have high information in the range of interest. This would allow it to efficiently and accurately assess the construct for the intended purpose.

An example of a bad measure for assessing the full range of individual differences is depicted in terms of ICCs in Figure 8.24 and in terms of test information in Figure 8.25. The measure performs poorly for the intended purpose, because its items do not (a) span the range of interest (−3 to 3 standard deviations from the mean of the latent construct), (b) have high discrimination and information, and (c) densely cover the range of interest without redundancy.


Figure 8.24: Visual Representation of a Bad Measure Based on Item Characteristic Curves of Items From a Bad Measure Estimated from Two-Parameter Logistic Model in Item Response Theory.


Figure 8.25: Visual Representation of a Bad Measure Based on the Test Information Curve.

An example of a good measure for distinguishing clinical-range versus sub-clinical range is depicted in terms of ICCs in Figure 8.26 and in terms of test information in Figure 8.27. The measure is good for the intended purpose, in terms of having items that (a) span the range of interest (1–3 standard deviations above the mean of the latent construct), (b) have high discrimination and information, and (c) densely cover the range of interest without redundancy.


Figure 8.26: Visual Representation of a Good Measure (For Distinguishing Clinical-Range Versus Sub-clinical Range) Based on Item Characteristic Curves of Items From a Good Measure Estimated From Two-Parameter Logistic Model in Item Response Theory.


Figure 8.27: Visual Representation of a Good Measure (For Distinguishing Clinical-Range Versus Sub-clinical Range) Based on the Test Information Curve.

8.1.8 Assumptions of IRT

IRT has several assumptions:

  • monotonicity
  • unidimensionality
  • item invariance
  • local independence

8.1.8.1 Monotonicity

The monotonicity assumption holds that a person’s probability of endorsing a higher level on the item increases as a person’s level on the latent construct increases. For instance, for each item assessing externalizing problems, as a child increases in their level of externalizing problems, they are expected to be rated with a higher level on that item. Monotonicity can be evaluated in multiple ways. For instance, monotonicity can be evaluated using visual inspection of empirical item characteristic curves. Another way to evaluate monotonicity is with Mokken scale analysis, such as using the mokken package in R.
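
For instance, a minimal sketch of checking monotonicity with the mokken package (applied to the mydataIRT data frame loaded in Section 8.2.2) might look like the following:

Code
library("mokken")

monotonicityCheck <- check.monotonicity(mydataIRT)
summary(monotonicityCheck)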

8.1.8.2 Unidimensionality

The unidimensionality assumption holds that the items have one predominant dimension, which reflects the underlying (latent) construct. The dimensionality of a set of items can be evaluated using factor analysis. Although items that are intended to assess a given latent construct are expected to be unidimensional, models have been developed that allow multiple latent dimensions, as shown in Section 8.6. These multidimensional IRT models allow borrowing information from a given latent factor in the estimation of other latent factor(s) to account for the covariation.

8.1.8.3 Item Invariance

The item invariance assumption holds that the items function similarly (i.e., have the same parameters) for all people and subgroups in the population. The extent to which items may violate the item invariance assumption can be evaluated empirically using tests of differential item functioning (DIF). Tests of measurement invariance are the equivalent of tests of differential item functioning for factor analysis/structural equation models. Tests of differential item functioning and measurement invariance are described in the chapter on test bias.

8.1.8.4 Local Independence

The local independence assumption holds that the items are uncorrelated when controlling for the latent dimension. That is, IRT models assume that the items’ errors (residuals) are uncorrelated with each other. Factor analysis and structural equation models can relax this assumption and allow items’ error terms to correlate with each other.
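
One way to examine possible violations of local independence in the mirt package is to inspect pairwise residual statistics after fitting a model; for example, the code below (applied to the Rasch model fit in Section 8.4.1) requests Q3 residual correlations, which should be close to zero if local independence holds.

Code
# Q3 residual correlations among item pairs; values far from zero
# suggest local dependence (raschModel is fit in Section 8.4.1)
residuals(raschModel, type = "Q3")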

8.2 Getting Started

8.2.1 Load Libraries

Code
library("petersenlab") #to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("mirt")
library("lavaan")
library("semTools")
library("semPlot")
library("lme4")
library("MOTE")
library("tidyverse")
library("here")
library("tinytex")

8.2.2 Load Data

LSAT7 is a data set from the mirt package (Chalmers, 2020) that contains five items from the Law School Admissions Test.

Code
mydataIRT <- expand.table(LSAT7)

8.2.3 Descriptive Statistics

Code
itemstats(mydataIRT, ts.tables = TRUE)
$overall
    N mean_total.score sd_total.score ave.r  sd.r alpha
 1000            3.707          1.199 0.143 0.052 0.453

$itemstats
          N  mean    sd total.r total.r_if_rm alpha_if_rm
Item.1 1000 0.828 0.378   0.530         0.246       0.396
Item.2 1000 0.658 0.475   0.600         0.247       0.394
Item.3 1000 0.772 0.420   0.611         0.313       0.345
Item.4 1000 0.606 0.489   0.592         0.223       0.415
Item.5 1000 0.843 0.364   0.461         0.175       0.438

$proportions
           0     1
Item.1 0.172 0.828
Item.2 0.342 0.658
Item.3 0.228 0.772
Item.4 0.394 0.606
Item.5 0.157 0.843

$total.score_frequency
      0  1   2   3   4   5
Freq 12 40 114 205 321 308

$total.score_means
              0        1
Item.1 2.313953 3.996377
Item.2 2.710526 4.224924
Item.3 2.359649 4.104922
Item.4 2.827411 4.278878
Item.5 2.426752 3.945433

$total.score_sds
              0         1
Item.1 1.162389 0.9841483
Item.2 1.058885 0.9038319
Item.3 1.087593 0.9043068
Item.4 1.103158 0.8661396
Item.5 1.177807 1.0415877

8.3 Comparison of Scoring Approaches

A measure that is a raw symptom count (i.e., a count of how many symptoms a person endorses) is low in precision and has a high standard error of measurement. Some diagnostic measures provide an ordinal response scale for each symptom. For example, the Structured Clinical Interview for DSM Disorders (SCID) provides a response scale from 0 to 2, where 0 = the symptom is absent, 1 = the symptom is sub-threshold, and 2 = the symptom is present. If your measure were a raw symptom sum, as opposed to a count of how many symptoms were present, the measure would be slightly more precise and have a somewhat smaller standard error of measurement.

A weighted symptom sum is the classical test theory analog of IRT. In classical test theory, proportion correct (or endorsed) would correspond to item difficulty and the item–total correlation (i.e., a point-biserial correlation) would correspond to item discrimination. If we were to compute a weighted sum of each item according to its strength of association with the construct (i.e., the item–total correlation), this measure would be somewhat more precise than the raw symptom sum, but it is not a latent variable method.
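
As a rough sketch (assuming the mydataIRT data frame from Section 8.2.2 and using item–total correlations as weights), a raw sum and a weighted item sum could be computed as follows; this is a classical test theory approximation, not the IRT scoring described next.

Code
# Raw sum score (symptom count analog)
rawSum <- rowSums(mydataIRT)

# Weights: item–total (point-biserial) correlations
itemTotalR <- as.vector(cor(mydataIRT, rawSum))

# Weighted item sum
weightedSum <- as.matrix(mydataIRT) %*% itemTotalR

cor(rawSum, weightedSum)  # the two scores are usually highly correlated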

In IRT analysis, the weight for each item influences the estimate of a person’s level on the construct. IRT down-weights the poorly discriminating items and up-weights the strongly discriminating items. This leads to greater precision and a lower standard error of measurement than non-latent scoring approaches.

According to Embretson (1996), many perspectives have changed because of IRT. First, according to classical test theory, longer tests are more reliable than shorter tests, as described in Section 4.5.5.4 in the chapter on reliability. However, according to IRT, shorter tests (i.e., tests with fewer items) can be more reliable than longer tests. Item selection using IRT can lead to briefer assessments that have greater reliability than longer scales. For example, adaptive tests tailor the difficulty of the items to the ability level of the participant.

Second, in classical test theory, a score’s meaning is tied to its location in a distribution (i.e., the norm-referenced standard). In IRT, however, the people and items are calibrated on a common scale. Based on a child’s IRT-estimated ability level (i.e., level on the construct), we can have a better sense of what the child knows and does not know, because it indicates the difficulty level at which they would tend to get items correct 50% of the time; the person would likely fail items with a higher difficulty compared to this level, whereas the person would likely pass items with a lower difficulty compared to this level. Consider Binet’s distribution of ability that arranges the items from easiest to most difficult. Based on the item difficulty and content of the items and the child’s performance, we can have a better indication that a child can perform items successfully in a particular range (e.g., count to 10) but might not be able to perform more difficult items (e.g., tie their shoes). From an intervention perspective, this would allow working in the “window of opportunity” or the zone of proximal development. Thus, IRT can provide more meaningful understanding of a person’s ability compared to traditional classical test theory interpretations such as the child being at the “63rd percentile” for a child of their age, which lacks conceptual meaning.

According to Cooper & Balsis (2009), our current diagnostic system relies heavily on how many symptoms a person endorses as an index of severity, but this assumes that all symptom endorsements have the same overall weight (severity). Using IRT, we can determine the relative severity of each item (symptom)—and it is clear that some symptoms indicate more severity than others. From this analysis, a respondent can endorse fewer, more severe items, and have overall more severe psychopathology than an individual who endorses more, less severe items. Basically, not all items are equally severe—know your items!

8.4 Rasch Model (1-Parameter Logistic)

A one-parameter logistic (1PL) item response theory (IRT) model is a model fit to dichotomous data, which estimates a different difficulty (\(b\)) parameter for each item. Discrimination (\(a\)) is not estimated (i.e., it is fixed at the same value—one—across items). Rasch models were fit using the mirt package (Chalmers, 2020).

8.4.1 Fit Model

Code
raschModel <- mirt(
  data = mydataIRT,
  model = 1,
  itemtype = "Rasch",
  SE = TRUE)

8.4.2 Model Summary

Code
summary(raschModel)
          F1    h2
Item.1 0.511 0.261
Item.2 0.511 0.261
Item.3 0.511 0.261
Item.4 0.511 0.261
Item.5 0.511 0.261

SS loadings:  1.304 
Proportion Var:  0.261 

Factor correlations: 

   F1
F1  1
Code
coef(raschModel, simplify = TRUE, IRTpars = TRUE)
$items
       a      b g u
Item.1 1 -1.868 0 1
Item.2 1 -0.791 0 1
Item.3 1 -1.461 0 1
Item.4 1 -0.521 0 1
Item.5 1 -1.993 0 1

$means
F1 
 0 

$cov
      F1
F1 1.022

8.4.3 Factor Scores

One can obtain factor scores (i.e., theta) and their associated standard errors for each participant using the fscores() function.

Code
raschModel_factorScores <- fscores(raschModel, full.scores.SE = TRUE)
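
For instance, the first few theta estimates (labeled F1 by default) and their standard errors can be inspected as follows:

Code
head(raschModel_factorScores)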

8.4.4 Plots

8.4.4.1 Test Curves

The test curves suggest that the measure is most reliable (i.e., provides the most information and has the smallest standard error of measurement) at lower levels of the construct.

8.4.4.1.1 Test Characteristic Curve

A test characteristic curve (TCC) plot of the expected total score as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.28.

Code
plot(raschModel, type = "score")

Figure 8.28: Test Characteristic Curve From Rasch Item Response Theory Model.

8.4.4.1.2 Test Information Curve

A plot of test information as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.29.

Code
plot(raschModel, type = "info")

Figure 8.29: Test Information Curve From Rasch Item Response Theory Model.

8.4.4.1.3 Test Reliability

The estimate of marginal reliability is below:

Code
marginal_rxx(raschModel)
[1] 0.4205639

A plot of test reliability as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.30.

Code
plot(raschModel, type = "rxx")

Figure 8.30: Test Reliability From Rasch Item Response Theory Model.

8.4.4.1.4 Test Standard Error of Measurement

A plot of test standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.31.

Code
plot(raschModel, type = "SE")

Figure 8.31: Test Standard Error of Measurement From Rasch Item Response Theory Model.

8.4.4.1.5 Test Information Curve and Test Standard Error of Measurement

A plot of test information and standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.32.

Code
plot(raschModel, type = "infoSE")

Figure 8.32: Test Information Curve and Standard Error of Measurement From Rasch Item Response Theory Model.

8.4.4.2 Item Curves

8.4.4.2.1 Item Characteristic Curves

Item characteristic curve (ICC) plots of the probability of item endorsement (or getting the item correct) as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.33 and 8.34.

Code
plot(raschModel, type = "itemscore", facet_items = FALSE)

Figure 8.33: Item Characteristic Curves From Rasch Item Response Theory Model.

Code
plot(raschModel, type = "itemscore", facet_items = TRUE)

Figure 8.34: Item Characteristic Curves From Rasch Item Response Theory Model.

8.4.4.2.2 Item Information Curves

Plots of item information as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.35 and 8.36.

Code
plot(raschModel, type = "infotrace", facet_items = FALSE)

Figure 8.35: Item Information Curves from Rasch Item Response Theory Model.

Code
plot(raschModel, type = "infotrace", facet_items = TRUE)

Figure 8.36: Item Information Curves from Rasch Item Response Theory Model.

8.4.5 CFA

A one-parameter logistic model can also be fit in a CFA framework, sometimes called item factor analysis. The item factor analysis models were fit in the lavaan package (Rosseel et al., 2022).

Code
onePLModel_cfa <- '
# Factor Loadings (i.e., discrimination parameters)
latent =~ loading*Item.1 + loading*Item.2 + loading*Item.3 + 
  loading*Item.4 + loading*Item.5

# Item Thresholds (i.e., difficulty parameters)
Item.1 | threshold1*t1
Item.2 | threshold2*t1
Item.3 | threshold3*t1
Item.4 | threshold4*t1
Item.5 | threshold5*t1
'

onePLModel_cfa_fit = sem(
  model = onePLModel_cfa,
  data = mydataIRT,
  ordered = c("Item.1", "Item.2", "Item.3", "Item.4","Item.5"),
  mimic = "Mplus",
  estimator = "WLSMV",
  std.lv = TRUE,
  parameterization = "theta")

summary(
  onePLModel_cfa_fit,
  fit.measures = TRUE,
  rsquare = TRUE,
  standardized = TRUE)
lavaan 0.6.17 ended normally after 13 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        10
  Number of equality constraints                     4

  Number of observations                          1000

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                22.305      24.361
  Degrees of freedom                                 9           9
  P-value (Chi-square)                           0.008       0.004
  Scaling correction factor                                  0.926
  Shift parameter                                            0.283
    simple second-order correction (WLSMV)                        

Model Test Baseline Model:

  Test statistic                               244.385     228.667
  Degrees of freedom                                10          10
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.072

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.943       0.930
  Tucker-Lewis Index (TLI)                       0.937       0.922
                                                                  
  Robust Comparative Fit Index (CFI)                         0.895
  Robust Tucker-Lewis Index (TLI)                            0.884

Root Mean Square Error of Approximation:

  RMSEA                                          0.038       0.041
  90 Percent confidence interval - lower         0.019       0.022
  90 Percent confidence interval - upper         0.059       0.061
  P-value H_0: RMSEA <= 0.050                    0.808       0.738
  P-value H_0: RMSEA >= 0.080                    0.000       0.000
                                                                  
  Robust RMSEA                                               0.080
  90 Percent confidence interval - lower                     0.038
  90 Percent confidence interval - upper                     0.122
  P-value H_0: Robust RMSEA <= 0.050                         0.106
  P-value H_0: Robust RMSEA >= 0.080                         0.539

Standardized Root Mean Square Residual:

  SRMR                                           0.065       0.065

Parameter Estimates:

  Parameterization                               Theta
  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  latent =~                                                             
    Item.1  (ldng)    0.599    0.038   15.831    0.000    0.599    0.514
    Item.2  (ldng)    0.599    0.038   15.831    0.000    0.599    0.514
    Item.3  (ldng)    0.599    0.038   15.831    0.000    0.599    0.514
    Item.4  (ldng)    0.599    0.038   15.831    0.000    0.599    0.514
    Item.5  (ldng)    0.599    0.038   15.831    0.000    0.599    0.514

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    Itm.1|1 (thr1)   -1.103    0.056  -19.666    0.000   -1.103   -0.946
    Itm.2|1 (thr2)   -0.474    0.048   -9.826    0.000   -0.474   -0.407
    Itm.3|1 (thr3)   -0.869    0.052  -16.706    0.000   -0.869   -0.745
    Itm.4|1 (thr4)   -0.313    0.047   -6.678    0.000   -0.313   -0.269
    Itm.5|1 (thr5)   -1.174    0.058  -20.092    0.000   -1.174   -1.007

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Item.1            1.000                               1.000    0.736
   .Item.2            1.000                               1.000    0.736
   .Item.3            1.000                               1.000    0.736
   .Item.4            1.000                               1.000    0.736
   .Item.5            1.000                               1.000    0.736
    latent            1.000                               1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    Item.1            0.858                               0.858    1.000
    Item.2            0.858                               0.858    1.000
    Item.3            0.858                               0.858    1.000
    Item.4            0.858                               0.858    1.000
    Item.5            0.858                               0.858    1.000

R-Square:
                   Estimate
    Item.1            0.264
    Item.2            0.264
    Item.3            0.264
    Item.4            0.264
    Item.5            0.264
Code
fitMeasures(
  onePLModel_cfa_fit,
  fit.measures = c(
    "chisq", "df", "pvalue",
    "baseline.chisq","baseline.df","baseline.pvalue",
    "rmsea", "cfi", "tli", "srmr"))
          chisq              df          pvalue  baseline.chisq     baseline.df 
         22.305           9.000           0.008         244.385          10.000 
baseline.pvalue           rmsea             cfi             tli            srmr 
          0.000           0.038           0.943           0.937           0.065 
Code
residuals(onePLModel_cfa_fit, type = "cor")
$type
[1] "cor.bollen"

$cov
       Item.1 Item.2 Item.3 Item.4 Item.5
Item.1  0.000                            
Item.2 -0.038  0.000                     
Item.3  0.026  0.168  0.000              
Item.4  0.032 -0.061  0.012  0.000       
Item.5  0.022 -0.129  0.001 -0.104  0.000

$mean
Item.1 Item.2 Item.3 Item.4 Item.5 
     0      0      0      0      0 

$th
Item.1|t1 Item.2|t1 Item.3|t1 Item.4|t1 Item.5|t1 
        0         0         0         0         0 
Code
modificationindices(onePLModel_cfa_fit, sort. = TRUE)
Code
compRelSEM(onePLModel_cfa_fit)
latent 
 0.467 
Code
AVE(onePLModel_cfa_fit)
latent 
 0.264 
Code
onePLModel_cfa_factorScores <- lavPredict(onePLModel_cfa_fit)

A path diagram of the one-parameter item factor analysis is in Figure 8.37.

Code
semPaths(
  onePLModel_cfa_fit,
  what = "std",
  layout = "tree2",
  edge.label.cex = 1.5)
Item Factor Analysis Diagram of One-Parameter Logistic Model.

Figure 8.37: Item Factor Analysis Diagram of One-Parameter Logistic Model.

8.4.6 Mixed Model

A Rasch model can also be fit in a mixed model framework. The Rasch model below was fit using the lme4 package (Bates et al., 2022).

First, we convert the data from wide form to long form for the mixed model:

Code
mydataIRT_long <- mydataIRT %>% 
  mutate(ID = 1:nrow(mydataIRT)) %>% 
  pivot_longer(cols = Item.1:Item.5) %>% 
  rename(
    item = name,
    response = value)

Then, we can estimate the Rasch model using a logit or probit link:

Code
raschModel_mixed_logit <- glmer(
  response ~ -1 + item + (1|ID),
  mydataIRT_long, 
  family = binomial(link = "logit"))

summary(raschModel_mixed_logit)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: response ~ -1 + item + (1 | ID)
   Data: mydataIRT_long

     AIC      BIC   logLik deviance df.resid 
  5354.5   5393.6  -2671.3   5342.5     4994 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7785 -0.6631  0.3599  0.5639  1.5080 

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) 0.9057   0.9517  
Number of obs: 5000, groups:  ID, 1000

Fixed effects:
           Estimate Std. Error z value Pr(>|z|)    
itemItem.1  1.85118    0.09903  18.693  < 2e-16 ***
itemItem.2  0.78581    0.08025   9.792  < 2e-16 ***
itemItem.3  1.45004    0.09015  16.085  < 2e-16 ***
itemItem.4  0.51687    0.07787   6.637  3.2e-11 ***
itemItem.5  1.97373    0.10221  19.310  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           itmI.1 itmI.2 itmI.3 itmI.4
itemItem.2 0.173                      
itemItem.3 0.192  0.178               
itemItem.4 0.158  0.169  0.166        
itemItem.5 0.192  0.170  0.191  0.155 
Code
raschModel_mixed_probit <- glmer(
  response ~ -1 + item + (1|ID),
  mydataIRT_long, 
  family = binomial(link = "probit"))

summary(raschModel_mixed_probit)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( probit )
Formula: response ~ -1 + item + (1 | ID)
   Data: mydataIRT_long

     AIC      BIC   logLik deviance df.resid 
  5362.0   5401.1  -2675.0   5350.0     4994 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7185 -0.6990  0.3660  0.5725  1.4276 

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) 0.2992   0.547   
Number of obs: 5000, groups:  ID, 1000

Fixed effects:
           Estimate Std. Error z value Pr(>|z|)    
itemItem.1  1.10493    0.05634  19.612  < 2e-16 ***
itemItem.2  0.47513    0.04801   9.896  < 2e-16 ***
itemItem.3  0.87437    0.05262  16.615  < 2e-16 ***
itemItem.4  0.31178    0.04689   6.649 2.95e-11 ***
itemItem.5  1.16956    0.05739  20.379  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           itmI.1 itmI.2 itmI.3 itmI.4
itemItem.2 0.155                      
itemItem.3 0.176  0.158               
itemItem.4 0.142  0.149  0.147        
itemItem.5 0.180  0.154  0.175  0.141 

One can extract item difficulty and people’s factor scores (i.e., theta), as adapted from James Uanhoro (https://www.jamesuanhoro.com/post/2018/01/02/using-glmer-to-perform-rasch-analysis/; archived at: https://perma.cc/84WP-TQBG):

Code
# Item difficulty
item.diff <- -1 * coef(summary(raschModel_mixed_logit))[,"Estimate"] # Regression coefficients * -1
item.diff <- data.frame(
  item = paste("Item", 1:5, sep = "."),
  item.diff = as.numeric(item.diff))

item.diff
Code
# Factor Scores (Theta)
raschModel_mixed_logit_theta <- ranef(raschModel_mixed_logit)$ID[,"(Intercept)"]
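
If needed, the theta estimates can be kept alongside the person IDs so they can be merged with other data. Below is a minimal sketch (not code from the original source) that pairs each empirical Bayes estimate with its ID:

Code
# Pair each person's empirical Bayes theta estimate (logit scale) with their ID
raschModel_mixed_logit_thetaByID <- data.frame(
  ID = as.integer(rownames(ranef(raschModel_mixed_logit)$ID)),
  theta = raschModel_mixed_logit_theta)

head(raschModel_mixed_logit_thetaByID)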

8.5 Two-Parameter Logistic Model

A two-parameter logistic (2PL) IRT model is fit to dichotomous data and estimates a different difficulty (\(b\)) and discrimination (\(a\)) parameter for each item. 2PL models were fit using the mirt package (Chalmers, 2020).
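
To make these parameters concrete, the 2PL models the probability of endorsement as a logistic function of \(a(\theta - b)\). Below is a minimal sketch (not code from the original source) of the 2PL item response function:

Code
# 2PL item response function: probability of endorsement given theta,
# discrimination (a), and difficulty (b)
prob2PL <- function(theta, a, b){
  1 / (1 + exp(-a * (theta - b)))
}

# For an item with a = 1 and b = 0, a person at theta = b endorses the item
# with probability .50; higher discrimination makes the curve steeper around b
prob2PL(theta = c(-1, 0, 1), a = 1, b = 0)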

8.5.1 Fit Model

Code
twoPLModel <- mirt(
  data = mydataIRT,
  model = 1,
  itemtype = "2PL",
  SE = TRUE)

8.5.2 Model Summary

Code
summary(twoPLModel)
          F1    h2
Item.1 0.502 0.252
Item.2 0.536 0.287
Item.3 0.708 0.501
Item.4 0.410 0.168
Item.5 0.397 0.157

SS loadings:  1.366 
Proportion Var:  0.273 

Factor correlations: 

   F1
F1  1
Code
coef(twoPLModel, simplify = TRUE, IRTpars = TRUE)
$items
           a      b g u
Item.1 0.988 -1.879 0 1
Item.2 1.081 -0.748 0 1
Item.3 1.706 -1.058 0 1
Item.4 0.765 -0.635 0 1
Item.5 0.736 -2.520 0 1

$means
F1 
 0 

$cov
   F1
F1  1

8.5.3 Factor Scores

Code
twoPLModel_factorScores <- fscores(twoPLModel, full.scores.SE = TRUE)

8.5.4 Plots

8.5.4.1 Test Curves

The test curves suggest that the measure is most reliable (i.e., provides the most information and has the smallest standard error of measurement) at lower levels of the construct.

8.5.4.1.1 Test Characteristic Curve

A test characteristic curve (TCC) plot of the expected total score as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.38.

Code
plot(twoPLModel, type = "score")
Test Characteristic Curve From Two-Parameter Logistic Item Response Theory Model.

Figure 8.38: Test Characteristic Curve From Two-Parameter Logistic Item Response Theory Model.

8.5.4.1.2 Test Information Curve

A plot of test information as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.39.

Code
plot(twoPLModel, type = "info")
Test Information Curve From Two-Parameter Logistic Item Response Theory Model.

Figure 8.39: Test Information Curve From Two-Parameter Logistic Item Response Theory Model.

8.5.4.1.3 Test Reliability

The estimate of marginal reliability is below:

Code
marginal_rxx(twoPLModel)
[1] 0.4417618

A plot of test reliability as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.40.

Code
plot(twoPLModel, type = "rxx")
Test Reliability From Two-Parameter Logistic Item Response Theory Model.

Figure 8.40: Test Reliability From Two-Parameter Logistic Item Response Theory Model.

8.5.4.1.4 Test Standard Error of Measurement

A plot of test standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.41.

Code
plot(twoPLModel, type = "SE")
Test Standard Error of Measurement From Two-Parameter Logistic Item Response Theory Model.

Figure 8.41: Test Standard Error of Measurement From Two-Parameter Logistic Item Response Theory Model.

8.5.4.1.5 Test Information Curve and Standard Errors

A plot of test information and standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.42.

Code
plot(twoPLModel, type = "infoSE")
Test Information Curve and Standard Error of Measurement From Two-Parameter Logistic Item Response Theory Model.

Figure 8.42: Test Information Curve and Standard Error of Measurement From Two-Parameter Logistic Item Response Theory Model.
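
The test information, standard error of measurement, and reliability curves above are simple transformations of one another: the conditional SEM is \(1/\sqrt{I(\theta)}\), and, with the latent variance fixed to 1, one common definition of conditional reliability is \(I(\theta)/[I(\theta) + 1]\). Below is a minimal sketch (the theta grid is an arbitrary choice) that computes these quantities from the model using mirt's testinfo() function:

Code
thetaGrid <- matrix(seq(from = -4, to = 4, by = 0.1))
testInformation <- testinfo(twoPLModel, Theta = thetaGrid)

conditionalSEM <- 1 / sqrt(testInformation)                        # standard error of measurement
conditionalReliability <- testInformation / (testInformation + 1)  # assumes latent variance = 1

head(data.frame(
  theta = as.numeric(thetaGrid),
  information = testInformation,
  sem = conditionalSEM,
  reliability = conditionalReliability))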

8.5.4.2 Item Curves

8.5.4.2.1 Item Characteristic Curves

Item characteristic curve (ICC) plots of the probability of item endorsement (or getting the item correct) as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.43 and 8.44.

Code
plot(twoPLModel, type = "itemscore", facet_items = FALSE)
Item Characteristic Curves From Two-Parameter Logistic Item Response Theory Model.

Figure 8.43: Item Characteristic Curves From Two-Parameter Logistic Item Response Theory Model.

Code
plot(twoPLModel, type = "itemscore", facet_items = TRUE)
Item Characteristic Curves From Two-Parameter Logistic Item Response Theory Model.

Figure 8.44: Item Characteristic Curves From Two-Parameter Logistic Item Response Theory Model.

8.5.4.2.2 Item Information Curves

Plots of item information as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.45 and 8.46.

Code
plot(twoPLModel, type = "infotrace", facet_items = FALSE)
Item Information Curves From Two-Parameter Logistic Item Response Theory Model.

Figure 8.45: Item Information Curves From Two-Parameter Logistic Item Response Theory Model.

Code
plot(twoPLModel, type = "infotrace", facet_items = TRUE)
Item Information Curves From Two-Parameter Logistic Item Response Theory Model.

Figure 8.46: Item Information Curves From Two-Parameter Logistic Item Response Theory Model.

8.5.4.3 Convert Discrimination To Factor Loading

As described by Aiden Loe (archived at https://perma.cc/H3QN-JAWW), one can convert a discrimination parameter to a standardized factor loading using Equation (8.14):

\[\begin{equation} f = \frac{a}{\sqrt{1 + a^2}} \tag{8.14} \end{equation}\]

where \(a\) is the discrimination divided by 1.702 (the constant that rescales the logistic metric to the normal-ogive metric).

The petersenlab package (Petersen, 2024b) contains the discriminationToFactorLoading() function that converts discrimination parameters to standardized factor loadings.

Code
discriminationParameters <- coef(
  twoPLModel,
  simplify = TRUE)$items[,1]

discriminationParameters
   Item.1    Item.2    Item.3    Item.4    Item.5 
0.9879254 1.0808847 1.7058006 0.7651853 0.7357980 
Code
discriminationToFactorLoading(discriminationParameters)
   Item.1    Item.2    Item.3    Item.4    Item.5 
0.5020091 0.5360964 0.7078950 0.4100462 0.3968194 
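
As a check on the function above, Equation (8.14) can also be applied by hand. Below is a minimal sketch (not code from the original source):

Code
# Rescale the discrimination from the logistic to the normal-ogive metric,
# then apply Equation (8.14)
manualFactorLoading <- function(discrimination){
  a <- discrimination / 1.702
  a / sqrt(1 + a^2)
}

manualFactorLoading(discriminationParameters)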

8.5.5 CFA

A two-parameter logistic model can also be fit in a CFA framework, sometimes called item factor analysis. The item factor analysis models were fit in the lavaan package (Rosseel et al., 2022).

Code
twoPLModel_cfa <- '
# Factor Loadings (i.e., discrimination parameters)
latent =~ loading1*Item.1 + loading2*Item.2 + loading3*Item.3 + 
  loading4*Item.4 + loading5*Item.5

# Item Thresholds (i.e., difficulty parameters)
Item.1 | threshold1*t1
Item.2 | threshold2*t1
Item.3 | threshold3*t1
Item.4 | threshold4*t1
Item.5 | threshold5*t1
'

twoPLModel_cfa_fit = sem(
  model = twoPLModel_cfa,
  data = mydataIRT,
  ordered = c("Item.1", "Item.2", "Item.3", "Item.4","Item.5"),
  mimic = "Mplus",
  estimator = "WLSMV",
  std.lv = TRUE,
  parameterization = "theta")

summary(
  twoPLModel_cfa_fit,
  fit.measures = TRUE,
  rsquare = TRUE,
  standardized = TRUE)
lavaan 0.6.17 ended normally after 28 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        10

  Number of observations                          1000

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 9.131      11.688
  Degrees of freedom                                 5           5
  P-value (Chi-square)                           0.104       0.039
  Scaling correction factor                                  0.784
  Shift parameter                                            0.041
    simple second-order correction (WLSMV)                        

Model Test Baseline Model:

  Test statistic                               244.385     228.667
  Degrees of freedom                                10          10
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.072

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.982       0.969
  Tucker-Lewis Index (TLI)                       0.965       0.939
                                                                  
  Robust Comparative Fit Index (CFI)                         0.943
  Robust Tucker-Lewis Index (TLI)                            0.886

Root Mean Square Error of Approximation:

  RMSEA                                          0.029       0.037
  90 Percent confidence interval - lower         0.000       0.008
  90 Percent confidence interval - upper         0.058       0.064
  P-value H_0: RMSEA <= 0.050                    0.871       0.757
  P-value H_0: RMSEA >= 0.080                    0.001       0.004
                                                                  
  Robust RMSEA                                               0.079
  90 Percent confidence interval - lower                     0.015
  90 Percent confidence interval - upper                     0.139
  P-value H_0: Robust RMSEA <= 0.050                         0.176
  P-value H_0: Robust RMSEA >= 0.080                         0.547

Standardized Root Mean Square Residual:

  SRMR                                           0.045       0.045

Parameter Estimates:

  Parameterization                               Theta
  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  latent =~                                                             
    Item.1  (ldn1)    0.587    0.099    5.916    0.000    0.587    0.506
    Item.2  (ldn2)    0.627    0.099    6.338    0.000    0.627    0.531
    Item.3  (ldn3)    0.979    0.175    5.594    0.000    0.979    0.699
    Item.4  (ldn4)    0.479    0.076    6.325    0.000    0.479    0.432
    Item.5  (ldn5)    0.417    0.084    4.961    0.000    0.417    0.384

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    Itm.1|1 (thr1)   -1.097    0.070  -15.602    0.000   -1.097   -0.946
    Itm.2|1 (thr2)   -0.480    0.052   -9.200    0.000   -0.480   -0.407
    Itm.3|1 (thr3)   -1.043    0.108   -9.623    0.000   -1.043   -0.745
    Itm.4|1 (thr4)   -0.298    0.045   -6.556    0.000   -0.298   -0.269
    Itm.5|1 (thr5)   -1.091    0.060  -18.265    0.000   -1.091   -1.007

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Item.1            1.000                               1.000    0.743
   .Item.2            1.000                               1.000    0.718
   .Item.3            1.000                               1.000    0.511
   .Item.4            1.000                               1.000    0.813
   .Item.5            1.000                               1.000    0.852
    latent            1.000                               1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    Item.1            0.862                               0.862    1.000
    Item.2            0.847                               0.847    1.000
    Item.3            0.715                               0.715    1.000
    Item.4            0.902                               0.902    1.000
    Item.5            0.923                               0.923    1.000

R-Square:
                   Estimate
    Item.1            0.257
    Item.2            0.282
    Item.3            0.489
    Item.4            0.187
    Item.5            0.148
Code
fitMeasures(
  twoPLModel_cfa_fit,
  fit.measures = c(
    "chisq", "df", "pvalue",
    "baseline.chisq","baseline.df","baseline.pvalue",
    "rmsea", "cfi", "tli", "srmr"))
          chisq              df          pvalue  baseline.chisq     baseline.df 
          9.131           5.000           0.104         244.385          10.000 
baseline.pvalue           rmsea             cfi             tli            srmr 
          0.000           0.029           0.982           0.965           0.045 
Code
residuals(twoPLModel_cfa_fit, type = "cor")
$type
[1] "cor.bollen"

$cov
       Item.1 Item.2 Item.3 Item.4 Item.5
Item.1  0.000                            
Item.2 -0.043  0.000                     
Item.3 -0.064  0.060  0.000              
Item.4  0.077 -0.026 -0.026  0.000       
Item.5  0.091 -0.069 -0.004 -0.006  0.000

$mean
Item.1 Item.2 Item.3 Item.4 Item.5 
     0      0      0      0      0 

$th
Item.1|t1 Item.2|t1 Item.3|t1 Item.4|t1 Item.5|t1 
        0         0         0         0         0 
Code
modificationindices(twoPLModel_cfa_fit, sort. = TRUE)
Code
compRelSEM(twoPLModel_cfa_fit)
latent 
 0.468 
Code
AVE(twoPLModel_cfa_fit)
latent 
 0.296 
Code
twoPLModel_cfa_factorScores <- lavPredict(twoPLModel_cfa_fit)
Code
semPaths(
  twoPLModel_cfa_fit,
  what = "std",
  layout = "tree2",
  edge.label.cex = 1.5)
Item Factor Analysis Diagram of Two-Parameter Logistic Model.

Figure 8.47: Item Factor Analysis Diagram of Two-Parameter Logistic Model.

8.6 Two-Parameter Multidimensional Logistic Model

A 2PL multidimensional IRT model allows multiple dimensions (latent factors); it is fit to dichotomous data and estimates a different difficulty (\(b\)) and discrimination (\(a\)) parameter for each item. Multidimensional IRT models were fit using the mirt package (Chalmers, 2020). In this example, I estimate a 2PL multidimensional IRT model with two latent factors.

8.6.1 Fit Model

Code
twoPL2FactorModel <- mirt(
  data = mydataIRT,
  model = 2,
  itemtype = "2PL",
  SE = TRUE)

8.6.2 Model Summary

Code
summary(twoPL2FactorModel)

Rotation:  oblimin 

Rotated factor loadings: 

            F1      F2    h2
Item.1  0.7943 -0.0111 0.623
Item.2  0.0804  0.4630 0.255
Item.3 -0.0129  0.8628 0.734
Item.4  0.2794  0.1925 0.165
Item.5  0.2930  0.1772 0.165

Rotated SS loadings:  0.801 1.027 

Factor correlations: 

      F1 F2
F1 1.000   
F2 0.463  1
Code
coef(twoPL2FactorModel, simplify = TRUE)
$items
           a1     a2     d g u
Item.1 -2.007  0.870 2.648 0 1
Item.2 -0.849 -0.522 0.788 0 1
Item.3 -2.153 -1.836 2.483 0 1
Item.4 -0.756 -0.028 0.485 0 1
Item.5 -0.757  0.000 1.864 0 1

$means
F1 F2 
 0  0 

$cov
   F1 F2
F1  1  0
F2  0  1

8.6.3 Factor Scores

Code
twoPL2FactorModel_factorScores <- fscores(twoPL2FactorModel, full.scores.SE = TRUE)

8.6.4 Compare model fit

The modified model with two factors and the original one-factor model are considered “nested” models. The original model is nested within the modified model because the modified model includes all of the terms of the original model along with additional terms. Model fit of nested models can be compared with a chi-square difference test.

Code
anova(twoPLModel, twoPL2FactorModel)

According to a chi-square difference test comparing the two nested models, the two-factor model fits significantly better than the one-factor model.

8.6.5 Plots

8.6.5.1 Test Curves

8.6.5.1.1 Test Characteristic Curve

A test characteristic curve (TCC) plot of the expected total score as a function of a person’s level on each latent construct (theta; \(\theta\)) is in Figure 8.48.

Code
plot(twoPL2FactorModel, type = "score")
Test Characteristic Curve From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.48: Test Characteristic Curve From Two-Parameter Multidimensional Item Response Theory Model.

8.6.5.1.2 Test Information Curve

A plot of test information as a function of a person’s level on each latent construct (theta; \(\theta\)) is in Figure 8.49.

Code
plot(twoPL2FactorModel, type = "info")
Test Information Curve From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.49: Test Information Curve From Two-Parameter Multidimensional Item Response Theory Model.

8.6.5.1.3 Test Standard Error of Measurement

A plot of test standard error of measurement (SEM) as a function of a person’s level on each latent construct (theta; \(\theta\)) is in Figure 8.50.

Code
plot(twoPL2FactorModel, type = "SE")
Test Standard Error of Measurement From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.50: Test Standard Error of Measurement From Two-Parameter Multidimensional Item Response Theory Model.

8.6.5.2 Item Curves

8.6.5.2.1 Item Characteristic Curves

Item characteristic curve (ICC) plots of the probability of item endorsement (or getting the item correct) as a function of a person’s level on each latent construct (theta; \(\theta\)) are in Figures 8.51 and 8.52.

Code
plot(twoPL2FactorModel, type = "itemscore", facet_items = FALSE)
Item Characteristic Curves From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.51: Item Characteristic Curves From Two-Parameter Multidimensional Item Response Theory Model.

Code
plot(twoPL2FactorModel, type = "itemscore", facet_items = TRUE)
Item Characteristic Curves From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.52: Item Characteristic Curves From Two-Parameter Multidimensional Item Response Theory Model.

8.6.5.2.2 Item Information Curves

Plots of item information as a function of a person’s level on each latent construct (theta; \(\theta\)) are in Figures 8.53 and 8.54.

Code
plot(twoPL2FactorModel, type = "infotrace", facet_items = FALSE)
Item Information Curves From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.53: Item Information Curves From Two-Parameter Multidimensional Item Response Theory Model.

Code
plot(twoPL2FactorModel, type = "infotrace", facet_items = TRUE)
Item Information Curves From Two-Parameter Multidimensional Item Response Theory Model.

Figure 8.54: Item Information Curves From Two-Parameter Multidimensional Item Response Theory Model.

8.6.6 CFA

A two-parameter multidimensional model can also be fit in a CFA framework, sometimes called item factor analysis. The item factor analysis models were fit in the lavaan package (Rosseel et al., 2022).

Code
twoPLModelMultidimensional_cfa <- '
# Factor Loadings (i.e., discrimination parameters)
latent1 =~ loading1*Item.1 + loading4*Item.4 + loading5*Item.5
latent2 =~ loading2*Item.2 + loading3*Item.3

# Item Thresholds (i.e., difficulty parameters)
Item.1 | threshold1*t1
Item.2 | threshold2*t1
Item.3 | threshold3*t1
Item.4 | threshold4*t1
Item.5 | threshold5*t1
'

twoPLModelMultidimensional_cfa_fit = sem(
  model = twoPLModelMultidimensional_cfa,
  data = mydataIRT,
  ordered = c("Item.1", "Item.2", "Item.3", "Item.4","Item.5"),
  mimic = "Mplus",
  estimator = "WLSMV",
  std.lv = TRUE,
  parameterization = "theta")

summary(
  twoPLModelMultidimensional_cfa_fit,
  fit.measures = TRUE,
  rsquare = TRUE,
  standardized = TRUE)
lavaan 0.6.17 ended normally after 41 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        11

  Number of observations                          1000

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 1.882       2.469
  Degrees of freedom                                 4           4
  P-value (Chi-square)                           0.757       0.650
  Scaling correction factor                                  0.775
  Shift parameter                                            0.039
    simple second-order correction (WLSMV)                        

Model Test Baseline Model:

  Test statistic                               244.385     228.667
  Degrees of freedom                                10          10
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.072

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.023       1.018
                                                                  
  Robust Comparative Fit Index (CFI)                         1.000
  Robust Tucker-Lewis Index (TLI)                            1.032

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.033       0.038
  P-value H_0: RMSEA <= 0.050                    0.994       0.989
  P-value H_0: RMSEA >= 0.080                    0.000       0.000
                                                                  
  Robust RMSEA                                               0.000
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.088
  P-value H_0: Robust RMSEA <= 0.050                         0.798
  P-value H_0: Robust RMSEA >= 0.080                         0.072

Standardized Root Mean Square Residual:

  SRMR                                           0.021       0.021

Parameter Estimates:

  Parameterization                               Theta
  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  latent1 =~                                                            
    Item.1  (ldn1)    0.731    0.139    5.268    0.000    0.731    0.590
    Item.4  (ldn4)    0.560    0.094    5.962    0.000    0.560    0.488
    Item.5  (ldn5)    0.472    0.098    4.832    0.000    0.472    0.427
  latent2 =~                                                            
    Item.2  (ldn2)    0.660    0.114    5.774    0.000    0.660    0.551
    Item.3  (ldn3)    1.265    0.358    3.529    0.000    1.265    0.784

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  latent1 ~~                                                            
    latent2           0.696    0.090    7.718    0.000    0.696    0.696

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    Itm.1|1 (thr1)   -1.172    0.095  -12.280    0.000   -1.172   -0.946
    Itm.2|1 (thr2)   -0.488    0.055   -8.923    0.000   -0.488   -0.407
    Itm.3|1 (thr3)   -1.202    0.218   -5.502    0.000   -1.202   -0.745
    Itm.4|1 (thr4)   -0.308    0.048   -6.442    0.000   -0.308   -0.269
    Itm.5|1 (thr5)   -1.114    0.066  -16.862    0.000   -1.114   -1.007

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .Item.1            1.000                               1.000    0.652
   .Item.4            1.000                               1.000    0.761
   .Item.5            1.000                               1.000    0.818
   .Item.2            1.000                               1.000    0.697
   .Item.3            1.000                               1.000    0.385
    latent1           1.000                               1.000    1.000
    latent2           1.000                               1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    Item.1            0.807                               0.807    1.000
    Item.4            0.873                               0.873    1.000
    Item.5            0.904                               0.904    1.000
    Item.2            0.835                               0.835    1.000
    Item.3            0.620                               0.620    1.000

R-Square:
                   Estimate
    Item.1            0.348
    Item.4            0.239
    Item.5            0.182
    Item.2            0.303
    Item.3            0.615
Code
fitMeasures(
  twoPLModelMultidimensional_cfa_fit,
  fit.measures = c(
    "chisq", "df", "pvalue",
    "baseline.chisq","baseline.df","baseline.pvalue",
    "rmsea", "cfi", "tli", "srmr"))
          chisq              df          pvalue  baseline.chisq     baseline.df 
          1.882           4.000           0.757         244.385          10.000 
baseline.pvalue           rmsea             cfi             tli            srmr 
          0.000           0.000           1.000           1.023           0.021 
Code
residuals(twoPLModelMultidimensional_cfa_fit, type = "cor")
$type
[1] "cor.bollen"

$cov
       Item.1 Item.4 Item.5 Item.2 Item.3
Item.1  0.000                            
Item.4  0.008  0.000                     
Item.5  0.034 -0.048  0.000              
Item.2  0.000  0.016 -0.028  0.000       
Item.3 -0.031  0.009  0.032  0.000  0.000

$mean
Item.1 Item.4 Item.5 Item.2 Item.3 
     0      0      0      0      0 

$th
Item.1|t1 Item.4|t1 Item.5|t1 Item.2|t1 Item.3|t1 
        0         0         0         0         0 
Code
modificationindices(twoPLModelMultidimensional_cfa_fit, sort. = TRUE)
Code
compRelSEM(twoPLModelMultidimensional_cfa_fit)
latent1 latent2 
  0.321   0.426 
Code
AVE(twoPLModelMultidimensional_cfa_fit)
latent1 latent2 
  0.263   0.504 
Code
twoPLModelMultidimensional_cfa_factorScores <- lavPredict(twoPLModelMultidimensional_cfa_fit)
Code
semPaths(
  twoPLModelMultidimensional_cfa_fit,
  what = "std",
  layout = "tree2",
  edge.label.cex = 1.5)
Item Factor Analysis Diagram of Two-Parameter Multidimensional Logistic Model.

Figure 8.55: Item Factor Analysis Diagram of Two-Parameter Multidimensional Logistic Model.

8.7 Three-Parameter Logistic Model

A three-parameter logistic (3PL) IRT model is fit to dichotomous data and estimates a different difficulty (\(b\)), discrimination (\(a\)), and guessing parameter for each item. 3PL models were fit using the mirt package (Chalmers, 2020).

8.7.1 Fit Model

Code
threePLModel <- mirt(
  data = mydataIRT,
  model = 1,
  itemtype = "3PL",
  SE = TRUE)

8.7.2 Model Summary

Code
summary(threePLModel)
          F1    h2
Item.1 0.509 0.259
Item.2 0.750 0.562
Item.3 0.700 0.489
Item.4 0.397 0.158
Item.5 0.411 0.169

SS loadings:  1.637 
Proportion Var:  0.327 

Factor correlations: 

   F1
F1  1
Code
coef(threePLModel, simplify = TRUE, IRTpars = TRUE)
$items
           a      b     g u
Item.1 1.007 -1.853 0.000 1
Item.2 1.928 -0.049 0.295 1
Item.3 1.667 -1.068 0.000 1
Item.4 0.736 -0.655 0.000 1
Item.5 0.767 -2.436 0.000 1

$means
F1 
 0 

$cov
   F1
F1  1
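
To see what the guessing parameter implies, the 3PL adds a lower asymptote (\(g\)) to the 2PL curve, so the probability of endorsement approaches \(g\) (rather than zero) as theta decreases. Below is a minimal sketch (not code from the original source) using the estimates for Item.2 above:

Code
# 3PL item response function: the lower asymptote (g) reflects guessing
prob3PL <- function(theta, a, b, g){
  g + (1 - g) / (1 + exp(-a * (theta - b)))
}

# Item.2 estimates from above: even at very low theta, the probability of
# endorsement stays near g = .295
prob3PL(theta = c(-4, 0, 4), a = 1.928, b = -0.049, g = 0.295)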

8.7.3 Factor Scores

Code
threePLModel_factorScores <- fscores(threePLModel, full.scores.SE = TRUE)

8.7.4 Plots

8.7.4.1 Test Curves

The test curves suggest that the measure is most reliable (i.e., provides the most information and has the smallest standard error of measurement) at lower levels of the construct.

8.7.4.1.1 Test Characteristic Curve

A test characteristic curve (TCC) plot of the expected total score as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.56.

Code
plot(threePLModel, type = "score")
Test Characteristic Curve From Three-Parameter Logistic Item Response Theory Model.

Figure 8.56: Test Characteristic Curve From Three-Parameter Logistic Item Response Theory Model.

8.7.4.1.2 Test Information Curve

A plot of test information as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.57.

Code
plot(threePLModel, type = "info")
Test Information Curve From Three-Parameter Logistic Item Response Theory Model.

Figure 8.57: Test Information Curve From Three-Parameter Logistic Item Response Theory Model.

8.7.4.1.3 Test Reliability

The estimate of marginal reliability is below:

Code
marginal_rxx(threePLModel)
[1] 0.4681812

A plot of test reliability as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.58.

Code
plot(threePLModel, type = "rxx")
Test Reliability From Three-Parameter Logistic Item Response Theory Model.

Figure 8.58: Test Reliability From Three-Parameter Logistic Item Response Theory Model.

8.7.4.1.4 Test Standard Error of Measurement

A plot of test standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.59.

Code
plot(threePLModel, type = "SE")
Test Standard Error of Measurement From Three-Parameter Logistic Item Response Theory Model.

Figure 8.59: Test Standard Error of Measurement From Three-Parameter Logistic Item Response Theory Model.

8.7.4.1.5 Test Information Curve and Standard Errors

A plot of test information and standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.60.

Code
plot(threePLModel, type = "infoSE")
Test Information Curve and Standard Error of Measurement From Three-Parameter Logistic Item Response Theory Model.

Figure 8.60: Test Information Curve and Standard Error of Measurement From Three-Parameter Logistic Item Response Theory Model.

8.7.4.2 Item Curves

8.7.4.2.1 Item Characteristic Curves

Item characteristic curve (ICC) plots of the probability of item endorsement (or getting the item correct) as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.61 and 8.62.

Code
plot(threePLModel, type = "itemscore", facet_items = FALSE)
Item Characteristic Curves From Three-Parameter Logistic Item Response Theory Model.

Figure 8.61: Item Characteristic Curves From Three-Parameter Logistic Item Response Theory Model.

Code
plot(threePLModel, type = "itemscore", facet_items = TRUE)
Item Characteristic Curves From Three-Parameter Logistic Item Response Theory Model.

Figure 8.62: Item Characteristic Curves From Three-Parameter Logistic Item Response Theory Model.

8.7.4.2.2 Item Information Curves

Plots of item information as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.63 and 8.64.

Code
plot(threePLModel, type = "infotrace", facet_items = FALSE)
Item Information Curves From Three-Parameter Logistic Item Response Theory Model.

Figure 8.63: Item Information Curves From Three-Parameter Logistic Item Response Theory Model.

Code
plot(threePLModel, type = "infotrace", facet_items = TRUE)
Item Information Curves From Three-Parameter Logistic Item Response Theory Model.

Figure 8.64: Item Information Curves From Three-Parameter Logistic Item Response Theory Model.

8.8 Four-Parameter Logistic Model

A four-parameter logistic (4PL) IRT model is fit to dichotomous data and estimates a different difficulty (\(b\)), discrimination (\(a\)), guessing, and careless-errors parameter for each item. 4PL models were fit using the mirt package (Chalmers, 2020).

8.8.1 Fit Model

Code
fourPLModel <- mirt(
  data = mydataIRT,
  model = 1,
  itemtype = "4PL",
  SE = TRUE,
  technical = list(NCYCLES = 2000))

8.8.2 Model Summary

Code
summary(fourPLModel)
          F1    h2
Item.1 0.834 0.695
Item.2 0.980 0.961
Item.3 0.762 0.580
Item.4 0.876 0.768
Item.5 0.648 0.420

SS loadings:  3.425 
Proportion Var:  0.685 

Factor correlations: 

   F1
F1  1
Code
coef(fourPLModel, simplify = TRUE, IRTpars = TRUE)
$items
           a      b     g     u
Item.1 2.570 -1.619 0.000 0.911
Item.2 8.490  0.093 0.370 0.992
Item.3 2.002 -0.646 0.271 0.999
Item.4 3.094 -1.224 0.000 0.708
Item.5 1.450 -2.166 0.002 0.920

$means
F1 
 0 

$cov
   F1
F1  1
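
To see what the careless-errors parameter implies, the 4PL adds an upper asymptote (\(u\)) so the probability of endorsement approaches \(u\) (rather than 1) as theta increases. Below is a minimal sketch (not code from the original source) using the estimates for Item.4 above:

Code
# 4PL item response function: lower asymptote (g) = guessing;
# upper asymptote (u) = careless errors
prob4PL <- function(theta, a, b, g, u){
  g + (u - g) / (1 + exp(-a * (theta - b)))
}

# Item.4 estimates from above: even at very high theta, the probability of
# endorsement tops out near u = .708
prob4PL(theta = c(-4, 0, 4), a = 3.094, b = -1.224, g = 0, u = 0.708)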

8.8.3 Factor Scores

Code
fourPLModel_factorScores <- fscores(fourPLModel, full.scores.SE = TRUE)

8.8.4 Plots

8.8.4.1 Test Curves

The test curves suggest that the measure is most reliable (i.e., provides the most information and has the smallest standard error of measurement) at middle to lower levels of the construct.

8.8.4.1.1 Test Characteristic Curve

A test characteristic curve (TCC) plot of the expected total score as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.65.

Code
plot(fourPLModel, type = "score")
Test Characteristic Curve From Four-Parameter Logistic Item Response Theory Model.

Figure 8.65: Test Characteristic Curve From Four-Parameter Logistic Item Response Theory Model.

8.8.4.1.2 Test Information Curve

A plot of test information as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.66.

Code
plot(fourPLModel, type = "info")
Test Information Curve From Four-Parameter Logistic Item Response Theory Model.

Figure 8.66: Test Information Curve From Four-Parameter Logistic Item Response Theory Model.

8.8.4.1.3 Test Reliability

The estimate of marginal reliability is below:

Code
marginal_rxx(fourPLModel)
[1] 0.5060376

A plot of test reliability as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.67.

Code
plot(fourPLModel, type = "rxx")
Test Reliability From Four-Parameter Logistic Item Response Theory Model.

Figure 8.67: Test Reliability From Four-Parameter Logistic Item Response Theory Model.

8.8.4.1.4 Test Standard Error of Measurement

A plot of test standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.68.

Code
plot(fourPLModel, type = "SE")
Test Standard Error of Measurement From Four-Parameter Logistic Item Response Theory Model.

Figure 8.68: Test Standard Error of Measurement From Four-Parameter Logistic Item Response Theory Model.

8.8.4.1.5 Test Information Curve and Standard Errors

A plot of test information and standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.69.

Code
plot(fourPLModel, type = "infoSE")
Test Information Curve and Standard Error of Measurement From Four-Parameter Logistic Item Response Theory Model.

Figure 8.69: Test Information Curve and Standard Error of Measurement From Four-Parameter Logistic Item Response Theory Model.

8.8.4.2 Item Curves

8.8.4.2.1 Item Characteristic Curves

Item characteristic curve (ICC) plots of the probability of item endorsement (or getting the item correct) as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.70 and 8.71.

Code
plot(fourPLModel, type = "itemscore", facet_items = FALSE)
Item Characteristic Curves From Four-Parameter Logistic Item Response Theory Model.

Figure 8.70: Item Characteristic Curves From Four-Parameter Logistic Item Response Theory Model.

Code
plot(fourPLModel, type = "itemscore", facet_items = TRUE)
Item Characteristic Curves From Four-Parameter Logistic Item Response Theory Model.

Figure 8.71: Item Characteristic Curves From Four-Parameter Logistic Item Response Theory Model.

8.8.4.2.2 Item Information Curves

Plots of item information as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.72 and 8.73.

Code
plot(fourPLModel, type = "infotrace", facet_items = FALSE)
Item Information Curves From Four-Parameter Logistic Item Response Theory Model.

Figure 8.72: Item Information Curves From Four-Parameter Logistic Item Response Theory Model.

Code
plot(fourPLModel, type = "infotrace", facet_items = TRUE)
Item Information Curves From Four-Parameter Logistic Item Response Theory Model.

Figure 8.73: Item Information Curves From Four-Parameter Logistic Item Response Theory Model.

8.9 Graded Response Model

A two-parameter graded response model (GRM) is an IRT model fit to polytomous data (in this case, a 1–4 Likert scale). It estimates four parameters for each item: a difficulty parameter for each of three threshold transitions [1–2 (\(b_1\)), 2–3 (\(b_2\)), and 3–4 (\(b_3\))] and a discrimination parameter (\(a\)). GRM models were fit using the mirt package (Chalmers, 2020).

8.9.1 Fit Model

Science is a data set from the mirt package (Chalmers, 2020) that contains four items evaluating people’s attitudes to science and technology on a 1–4 Likert scale. The data are from the Consumer Protection and Perceptions of Science and Technology section of the 1992 Euro-Barometer Survey of people in Great Britain.

Code
gradedResponseModel <- mirt(
  data = Science,
   model = 1,
   itemtype = "graded",
   SE = TRUE)

8.9.2 Model Summary

Code
summary(gradedResponseModel)
           F1    h2
Comfort 0.522 0.273
Work    0.584 0.342
Future  0.803 0.645
Benefit 0.541 0.293

SS loadings:  1.552 
Proportion Var:  0.388 

Factor correlations: 

   F1
F1  1
Code
coef(gradedResponseModel, simplify = TRUE, IRTpars = TRUE)
$items
            a     b1     b2    b3
Comfort 1.042 -4.669 -2.534 1.407
Work    1.226 -2.385 -0.735 1.849
Future  2.293 -2.282 -0.965 0.856
Benefit 1.095 -3.058 -0.906 1.542

$means
F1 
 0 

$cov
   F1
F1  1
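
The GRM treats each threshold as a 2PL curve for the cumulative probability of responding in that category or higher; the probability of a specific category is the difference between adjacent cumulative probabilities. Below is a minimal sketch (not code from the original source) using the estimates for the Comfort item above:

Code
# Graded response model category probabilities for one item
grmCategoryProbs <- function(theta, a, b){
  # Cumulative probabilities P(X >= 2), P(X >= 3), P(X >= 4): one 2PL curve per threshold
  pStar <- sapply(b, function(bk) 1 / (1 + exp(-a * (theta - bk))))
  pStar <- cbind(1, matrix(pStar, nrow = length(theta)), 0)  # add P(X >= 1) = 1 and P(X >= 5) = 0
  probs <- pStar[, -ncol(pStar)] - pStar[, -1]               # P(X = k)
  colnames(probs) <- paste0("category", seq_len(length(b) + 1))
  probs
}

# Comfort item estimates from above
grmCategoryProbs(theta = c(-2, 0, 2), a = 1.042, b = c(-4.669, -2.534, 1.407))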

8.9.3 Factor Scores

Code
gradedResponseModel_factorScores <- fscores(gradedResponseModel, full.scores.SE = TRUE)

8.9.4 Plots

8.9.4.1 Test Curves

The test curves suggest that the measure is most reliable (i.e., provides the most information and has the smallest standard error of measurement) across a wide range of construct levels. In general, this measure with polytomous (Likert-scale) items provides more information than the measure with binary items that was examined above. This is consistent with the idea that polytomous items tend to provide more information than binary/dichotomous items.

8.9.4.1.1 Test Characteristic Curve

A test characteristic curve (TCC) plot of the expected total score as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.74.

Code
plot(gradedResponseModel, type = "score")
Test Characteristic Curve From Graded Response Model.

Figure 8.74: Test Characteristic Curve From Graded Response Model.

8.9.4.1.2 Test Information Curve

A plot of test information as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.75.

Code
plot(gradedResponseModel, type = "info")
Test Information Curve From Graded Response Model.

Figure 8.75: Test Information Curve From Graded Response Model.

8.9.4.1.3 Test Reliability

The estimate of marginal reliability is below:

Code
marginal_rxx(gradedResponseModel)
[1] 0.6687901

A plot of test reliability as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.76.

Code
plot(gradedResponseModel, type = "rxx")
Test Reliability From Graded Response Model.

Figure 8.76: Test Reliability From Graded Response Model.

8.9.4.1.4 Test Standard Error of Measurement

A plot of test standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.77.

Code
plot(gradedResponseModel, type = "SE")
Test Standard Error of Measurement From Graded Response Model.

Figure 8.77: Test Standard Error of Measurement From Graded Response Model.

8.9.4.1.5 Test Information Curve and Standard Errors

A plot of test information and standard error of measurement (SEM) as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.78.

Code
plot(gradedResponseModel, type = "infoSE")
Test Information Curve and Standard Error of Measurement From Graded Response Model.

Figure 8.78: Test Information Curve and Standard Error of Measurement From Graded Response Model.

8.9.4.2 Item Curves

8.9.4.2.1 Item Characteristic Curves

Item characteristic curve (ICC) plots of the expected score on the item as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.79 and 8.80.

Code
plot(gradedResponseModel, type = "itemscore", facet_items = FALSE)
Item Characteristic Curves From Graded Response Model.

Figure 8.79: Item Characteristic Curves From Graded Response Model.

Code
plot(gradedResponseModel, type = "itemscore", facet_items = TRUE)
Item Characteristic Curves From Graded Response Model.

Figure 8.80: Item Characteristic Curves From Graded Response Model.

8.9.4.2.2 Item Information Curves

Plots of item information as a function of a person’s level on the latent construct (theta; \(\theta\)) are in Figures 8.81 and 8.82.

Code
plot(gradedResponseModel, type = "infotrace", facet_items = FALSE)
Item Information Curves From Graded Response Model.

Figure 8.81: Item Information Curves From Graded Response Model.

Code
plot(gradedResponseModel, type = "infotrace", facet_items = TRUE)
Item Information Curves From Graded Response Model.

Figure 8.82: Item Information Curves From Graded Response Model.

8.9.4.2.3 Item Response Category Characteristic Curves

A plot of the probability of item threshold endorsement as a function of a person’s level on the latent construct (theta; \(\theta\)) is in Figure 8.83.

Code
plot(gradedResponseModel, type = "trace")
Item Response Category Characteristic Curves From Graded Response Model.

Figure 8.83: Item Response Category Characteristic Curves From Graded Response Model.

8.9.4.2.4 Item Boundary Characteristic Curves (aka Item Operation Characteristic Curves)

A plot of item boundary characteristic curves is in Figure 8.84. The plot was adapted from an example by Aiden Loe: https://aidenloe.github.io/irtplots.html (archived at https://perma.cc/D4YH-RV6N).

Code
modelCoefficients <- coef(
  gradedResponseModel,
  IRTpars = TRUE,
  simplify = TRUE)$items

theta <- seq(from = -6, to = 6, by = .1)

difficultyThresholds <- grep(
  "b",
  dimnames(modelCoefficients)[[2]],
  value = TRUE)
numberDifficultyThresholds <- length(difficultyThresholds)
items <- dimnames(modelCoefficients)[[1]]
numberOfItems <- length(items)

lst <- lapply(
  1:numberOfItems,
  function(x) data.frame(
    matrix(ncol = numberDifficultyThresholds + 1,
           nrow = length(theta),
           dimnames = list(NULL, c("theta", difficultyThresholds)))))

for(i in 1:numberOfItems){
  for(j in 1:numberDifficultyThresholds){
    lst[[i]][,1] <- theta
    lst[[i]][,j + 1] <- fourPL(
      a = modelCoefficients[i,1],
      b = modelCoefficients[i,j + 1],
      theta = theta)
  }
}

names(lst) <- items
dat <- bind_rows(lst, .id = "item")
longer_data <- pivot_longer(
  dat,
  cols = all_of(difficultyThresholds))

ggplot(
  longer_data,
  aes(theta, value, group = interaction(item, name), color = item)) +
  geom_line() +
  ylab("Probability of Endorsing an Item Response Category that is Higher than the Boundary") +
  theme_bw() +
  theme(axis.title.y = element_text(size = 10))
Item Boundary Category Characteristic Curves From Graded Response Model.

Figure 8.84: Item Boundary Category Characteristic Curves From Graded Response Model.

8.10 Conclusion

Item response theory is a measurement theory and advanced modeling approach that allows estimating latent variables as the common variance from multiple items, and it allows estimating how the items relate to the construct (latent variable). IRT holds promise to enable the development of briefer assessments, including short forms and adaptive assessments, that have strong reliability and validity. However, there are situations where IRT models may not be preferable, such as when assessing a formative construct, when using small sample sizes, or when assumptions of IRT are violated.

8.11 Suggested Readings

If you are interested in learning more about IRT, I highly recommend the book by Embretson & Reise (2000).

8.12 Exercises

8.12.1 Questions

Note: Several of the following questions use data from the Children of the National Longitudinal Survey of Youth (CNLSY). The CNLSY is a publicly available longitudinal data set provided by the Bureau of Labor Statistics (https://www.bls.gov/nls/nlsy79-children.htm#topical-guide; archived at https://perma.cc/EH38-HDRN). The CNLSY data file for these exercises is located on the book’s page of the Open Science Framework (https://osf.io/3pwza). Children’s behavior problems were rated in 1988 (time 1: T1) and then again in 1990 (time 2: T2) on the Behavior Problems Index (BPI). Below are the items corresponding to the Antisocial subscale of the BPI:

  1. cheats or tells lies
  2. bullies or is cruel/mean to others
  3. does not seem to feel sorry after misbehaving
  4. breaks things deliberately
  5. is disobedient at school
  6. has trouble getting along with teachers
  7. has sudden changes in mood or feeling
  1. Fit a one-parameter (Rasch) model to the seven items of the Antisocial subscale of the BPI at T1. This will estimate the difficulty for each item threshold (one threshold from 0 to 1, and one threshold from 1 to 2), while constraining the discrimination for each item to be the same.
    1. Which item has the lowest difficulty (i.e., severity) in terms of endorsing a score of one (i.e., “sometimes true”) as opposed to zero (i.e., “not true”)? Which item has the highest difficulty in terms of endorsing a score of 2 (i.e., “often true”)? What do these estimates of item difficulty indicate?
  2. Fit a graded response model to the seven items of the Antisocial subscale of the BPI at T1. This will estimate the difficulty for each item threshold (one threshold from 0 to 1, and one threshold from 1 to 2), while allowing each item to have a different discrimination.
    1. Provide a figure of the item characteristic curves.
    2. Provide a figure of the item boundary characteristic curves.
    3. Which item has the lowest discrimination? Which item has the highest discrimination? What do these estimates of item discrimination indicate?
    4. Provide a figure of the item information curves.
    5. Examining the item information curves, which item provides the most information at upper construct levels (2–4 standard deviations above the mean)? Which item provides the most information at lower construct levels (2–4 standard deviations below the mean)?
    6. Provide a figure of the test information curve.
    7. Examining the test information curve, where (at what construct levels) does the measure do the best job of assessing? Based on its information curve, describe what purposes the test would be better- or worse-suited for.
  3. Fit a multidimensional graded response model to the seven items of the Antisocial subscale of the BPI at T1, by estimating two latent factors.
    1. Which items loaded onto Factor 1? Which items loaded onto Factor 2? Provide a possible explanation as to why some of the items “broke off” (from Factor 1) and loaded onto a separate factor (Factor 2).
    2. The one-factor graded response model (in #2) and the two-factor graded response model are considered “nested” models. The one-factor model is nested within the two-factor model because the two-factor model includes all of the terms of the one-factor model along with additional terms. Model fit of nested models can be directly compared with a chi-square difference test. Did the two-factor model fit better than the one-factor model?

8.12.2 Answers

    1. Item 7 (“sudden changes in mood or feeling”) has the lowest difficulty in terms of endorsing a score of one \((b_1 = -0.95)\). Item 5 (“disobedient at school”) has the highest difficulty in terms of endorsing a score of two \((b_2 = 3.55)\). The difficulty parameter indicates the construct level at the inflection point of the item characteristic curve. In a one- or two-parameter model, the inflection point occurs where 50% of respondents endorse the item. Thus, in this model, the difficulty parameter indicates the construct level at which 50% of respondents endorse the item. It takes a very high level of antisocial behavior for a child to be endorsed as being often disobedient at school, whereas it does not take a high construct level for a child to be endorsed as sometimes showing sudden changes in mood.
    1. Below is a figure of item characteristic curves:
Exercise 1a: Item Characteristic Curves.

Figure 8.85: Exercise 1a: Item Characteristic Curves.

    1. Below is a figure of item boundary characteristic curves:
Exercise 2b: Item Boundary Characteristic Curves.

Figure 8.86: Exercise 2b: Item Boundary Characteristic Curves.

    1. Item 7 (“sudden changes in mood or feeling”) has the lowest discrimination \((a = 0.89)\). Item 6 (“has trouble getting along with teachers”) has the highest discrimination \((a = 2.06)\). The discrimination parameter represents the steepness of the slope of the item characteristic curve. It indicates how strongly endorsing an item discriminates (differentiates) between lower versus higher construct levels. In other words, it indicates how strongly the item is associated with the construct. Item 7 shows the weakest association with the construct, whereas item 6 shows the strongest association with the construct. That suggests that “trouble getting along with teachers” is more core to the construct of antisocial behavior than “sudden changes in mood.”
    2. Below is a figure of item information curves:
Exercise 2c: Item Information Curves.

Figure 8.87: Exercise 2c: Item Information Curves.

    1. Item 6 (“has trouble getting along with teachers”) provides the most information at upper construct levels (2–4 standard deviations above the mean). Item 7 (“has sudden changes in mood or feeling”) provides the most information at lower construct levels (2–4 standard deviations below the mean). Item 1 (“cheats or tells lies”) provides the most information at somewhat low construct levels (0–2 standard deviations below the mean).
Exercise 2e: Test Information Curve.

Figure 8.88: Exercise 2e: Test Information Curve.

    1. The measure does the best job of assessing (i.e., provides the most information) at construct levels from 1–3 standard deviations above the mean. Because the measure provides the most information at upper construct levels and provides little information at lower construct levels, the measure would be best used for assessing clinical versus sub-clinical levels of antisocial behavior rather than assessing individual differences in antisocial behavior across a community sample.
    1. Items 1, 2, 3, 4, and 7 loaded onto Factor 1. Items 5 and 6 loaded onto Factor 2. Items 5 (“disobedient at school”) and 6 (“trouble getting along with teachers”) both deal with school-related antisocial behavior. Thus, the items assessing school-related antisocial behavior may share variance owing to the shared context of the behavior (school).
    2. Yes, the two-factor model fit significantly better than the one-factor model according to a chi-square difference test \((\Delta\chi^2[df = 6.00] = 273.84, p < .001)\). Thus, antisocial behavior may not be a monolithic construct, but may depend on the context in which the behavior occurs.

References

Baker, F. B., & Kim, S.-H. (2017). The basics of item response theory using R. Springer.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2022). lme4: Linear mixed-effects models using Eigen and S4. https://github.com/lme4/lme4/
Bürkner, P.-C. (2021). Bayesian item response modeling in R with brms and Stan. Journal of Statistical Software, 100(5), 1–54. https://doi.org/10.18637/jss.v100.i05
Chalmers, P. (2020). mirt: Multidimensional item response theory. https://CRAN.R-project.org/package=mirt
Chen, Y., Prudêncio, R. B. C., Diethe, T., & Flach, P. (2019). \(\beta\)3-IRT: A new item response model and its applications. arXiv:1903.04016. https://arxiv.org/abs/1903.04016
Cooper, L. D., & Balsis, S. (2009). When less is more: How fewer diagnostic criteria can indicate greater severity. Psychological Assessment, 21(3), 285–293. https://doi.org/10.1037/a0016698
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341–349. https://doi.org/10.1037/1040-3590.8.4.341
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists (Vol. 4). Lawrence Erlbaum Associates.
Gibbons, R. D., Weiss, D. J., Frank, E., & Kupfer, D. (2016). Computerized adaptive diagnosis and testing of mental health disorders. Annual Review of Clinical Psychology, 12(1), 83–104. https://doi.org/10.1146/annurev-clinpsy-021815-093634
Krueger, R. F., Nichol, P. E., Hicks, B. M., Markon, K. E., Patrick, C. J., Iacono, W. G., & McGue, M. (2004). Using latent trait modeling to conceptualize an alcohol problems continuum. Psychological Assessment, 16(2), 107–119. https://doi.org/10.1037/1040-3590.16.2.107
Magis, D. (2013). A note on the item information function of the four-parameter logistic model. Applied Psychological Measurement, 37(4), 304–315. https://doi.org/10.1177/0146621613475471
Petersen, I. T. (2024b). petersenlab: A collection of R functions by the Petersen Lab. https://doi.org/10.5281/zenodo.7602890
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5(1), 27–48. https://doi.org/10.1146/annurev.clinpsy.032408.153553
Rosseel, Y., Jorgensen, T. D., & Rockwood, N. (2022). lavaan: Latent variable analysis. https://lavaan.ugent.be
Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12(1), 102–111. https://doi.org/10.1037/1040-3590.12.1.102
Thomas, M. L. (2019). Advances in applications of item response theory to clinical assessment. Psychological Assessment, 31(12), 1442–1455. https://doi.org/10.1037/pas0000597

Feedback

Please consider providing feedback about this textbook, so that I can make it as helpful as possible. You can provide feedback at the following link: https://forms.gle/95iW4p47cuaphTek6