An introduction to mixture item response theory models
R.J. De Ayala (University of Nebraska, Lincoln, United States) and S.Y. Santiago (Consolidated School District of New Britain, Connecticut, United States)
Article history: Received 15 July 2015; received in revised form 23 December 2015; accepted 21 January 2016.

Abstract

Mixture item response theory (IRT) allows one to address situations that involve a mixture of latent subpopulations that are qualitatively different but within which a measurement model based on a continuous latent variable holds. In this modeling framework, one can characterize students both by their location on a continuous latent variable and by their latent class membership. For example, in a study of risky youth behavior this approach would make it possible to estimate an individual's propensity to engage in risky youth behavior (i.e., on a continuous scale) and to use these estimates to identify youth who might be at the greatest risk given their class membership. Mixture IRT can be used with binary response data (e.g., true/false, agree/disagree, endorsement/non-endorsement, correct/incorrect, presence/absence of a behavior), Likert response scales, partial credit scoring, nominal scales, or rating scales. In the following, we present mixture IRT modeling and two examples of its use. Data needed to reproduce analyses in this article are available as supplemental online materials at http://dx.doi.org/10.1016/j.jsp.2016.01.002.
1. Introduction

Consider the situation in which we are interested in measuring mathematics ability in young children. We may conceptualize that young children differ from one another along a continuum of mathematics ability. Stated another way, we consider our construct to be continuous, with some children exhibiting more (or less) mathematics ability than other children. Depending on their mathematics ability, these children would be located at different points along the (latent) continuum that represents our construct. Alternatively, our conceptualization of why young children differ from one another might involve the cognitive skills the children use in solving the math items. In this case, we are considering our construct to be discrete and to consist of (latent) classes of children that differ from one another in their distinctive patterns of mathematical problem-solving behaviors. Thus, we have a construct that can be conceptualized in two ways.

In the following, we provide a didactic introduction to a modeling framework that integrates continuous and categorical latent variables. Our categorical latent variable reflects qualitatively different subgroups within each of which there is a continuous latent variable. We begin by first presenting a latent variable modeling framework for continuous latent variables. Following this, we introduce an alternative latent variable approach for when we conceptualize a latent variable as categorical in nature. Subsequently, we integrate these two approaches to form a mixture item response theory model and then present examples of the application of mixture item response theory.
2. Item response theory

Item response theory (IRT) has emerged as a popular approach for solving various measurement problems with continuous latent variables. IRT has been used in the development of measures of mathematics achievement (e.g., Clements, Sarama, & Liu, 2008), numeracy (e.g., Purpura, Reid, Eiland, & Baroody, 2015), and temperament (e.g., Primi, Wechsler, de Cassia Nakano, Oakland, & Souza Lobo Gusso, 2014), and is used with the WRAT-3 (Henington, 2004). Moreover, IRT is used nationally (e.g., the California Achievement Test, the California Test of Basic Skills [CTB/McGraw-Hill, 1987; CTB/MacMillan/McGraw-Hill, 1991]), in state testing programs (e.g., the Maryland State Department of Education's High School Functional Assessment program), as well as in municipal programs (e.g., the Portland School District).

IRT consists of a family of models that specify the probability of a response given person and item characteristics, such as a person's mathematics ability and the item's difficulty. Different models exist for dichotomous (i.e., binary) and polytomous response data. Dichotomous responses arise from questions that require a true/false, agree/disagree, or endorsement/non-endorsement response, or from a response that is scored as correct or incorrect. In contrast, polytomous response data come from Likert response scales, partial credit scoring, nominal scales, or rating scales.

When we have model-data fit, IRT provides several advantages over classical test theory. These advantages include item statistics that are independent of the sample of individuals to whom an instrument is administered, ability statistics that are independent of the instrument used, and an index of the accuracy of individual ability estimates. Sample-independent item statistics (e.g., item difficulty) mean that how difficult the items are is meaningful for all samples of respondents. This is in stark contrast to classical item difficulty (i.e., the item's p-value or proportion correct), whose interpretation is tied to the sample that was administered the instrument. For example, an item administered to a high ability group will have a large p-value (i.e., it is an easy item), whereas the same item administered to a low ability group will have a small p-value (i.e., it is a hard item). Similarly, in classical test theory an individual administered an easy, say, algebra test will have a higher ability estimate (i.e., number correct) than if he or she had been administered a difficult algebra test. This is the case even though the individual's true ability is the same in both administrations. However, in IRT our ability statistics transcend the administered instrument. That is, with IRT the ability estimate will be invariant across administrations of the two algebra examinations regardless of their respective levels of difficulty.

The simplest of the IRT models contains a single parameter representing an item's location on the latent construct. This model is the one-parameter model; some also refer to this model as the Rasch model (Rasch, 1980). In addition, this model contains a single parameter representing the person's location on the latent construct. In the context of proficiency assessment the person's location is typically referred to as ability (e.g., algebra ability) and the item location is referred to as item difficulty.
However, because the latent construct can reflect either a cognitive (e.g., algebra ability) or an affective (e.g., degree of anxiousness) domain, we use the general term 'location' to transcend the nature of the latent construct. That is, referring to a person's location on an affective domain as ability is a non sequitur. Unlike classical test theory, in IRT these item and person statistics are placed on the same latent continuum. This allows us to talk about people and items directly and in a comparative fashion. For instance, in proficiency assessment an item may be too easy for a person located at one point on the continuum, but too hard for another individual located at a different point on the scale. The one-parameter model can be represented using a logistic framework as

$$p(X_{ij} = 1 \mid \theta_i, \delta_j) = \frac{e^{\alpha(\theta_i - \delta_j)}}{1 + e^{\alpha(\theta_i - \delta_j)}} \tag{1}$$
where p is the probability of a response of 1 (X_ij = 1) given person i's location (θ_i) and item j's location (δ_j) on the latent continuum; α is related to item discrimination, is constant across items, and is sometimes set to 1.0 (e.g., the Rasch model). (If we want the probability of a response of 0 (X_ij = 0), we simply take the complement of Eq. (1), i.e., p(X_ij = 0) = 1 − p(X_ij = 1 | θ_i, δ_j).) In words, Eq. (1) states that the probability of a response of 1 on item j (e.g., a correct response) is a function of the distance between person i's location (e.g., ability) and item j's location (e.g., difficulty).

For convenience, the latent construct continuum represented by theta is typically shown using a standard score scale. Although the range of item and person locations is influenced by the specific application, typically the range is from −3 to 3, with greater positive values indicating more of the construct than lower values. In applications of IRT, it is sometimes convenient to transform this scale to eliminate negative values or to a scale that has intrinsic meaning, such as a proportion or a number correct scale. These transformations are easily accomplished through linear or nonlinear transformations. Details may be found in de Ayala (2009).

As is true for most statistical models, IRT models are predicated on assumptions. One of these assumptions is the dimensionality assumption. In short, this assumption states that the number of latent variables underlying the data is consistent with the model. For example, in the one-parameter model we have a single latent construct (represented by θ) that accounts for the manifest (observed) responses. As such, this model is appropriate for modeling data that are (essentially) unidimensional. A second assumption is that the responses are conditionally independent of one another. Stated another way, the probability of responding to an item depends only on the latent construct and nothing else. As a result, the responses to one item are independent of the responses to another item conditional on θ. It is because of this assumption that we are able to use a product of item probabilities in Eq. (2). Our third assumption is that the pattern of responses to an item is consistent with the functional form specified by the model. For dichotomous data this functional form appears as an S-shaped curve (i.e., an ogive) when graphed.
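To make Eq. (1) concrete, here is a minimal sketch in Python (ours, not part of the original article); the function name is ours, and the person and item locations match the worked forced-choice temperament example that appears later in the text (θ = 2, δ = −1).

import math

def p_one_parameter(theta, delta, alpha=1.0):
    """Eq. (1): probability of a response of 1 under the one-parameter model,
    given person location theta and item location delta."""
    z = alpha * (theta - delta)
    return math.exp(z) / (1.0 + math.exp(z))

# Person located at 2.0 responding to an item located at -1.0 (see the
# forced-choice E-I example later in the text).
print(round(p_one_parameter(2.0, -1.0), 2))        # 0.95
print(round(1.0 - p_one_parameter(2.0, -1.0), 2))  # probability of a response of 0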
Fig. 1. Item response function for an E–I item with a location of −1.
Fig. 1 shows an example of this graph, which is referred to as an item response function. This function shows the probability of a response of 1 as a function of the latent construct.

Typically, when we administer an instrument to a child to measure, for example, reading proficiency, we obtain responses to multiple items. However, Eq. (1) only provides the probability of a response to a single item. Therefore, we need to be able to utilize Eq. (1) with multiple responses simultaneously. We now show how this is done. Let all of child i's item responses on an instrument (X_i1, X_i2, …, X_iL) be collectively symbolized by x_i (i.e., child i's responses to L items are represented by the response vector x_i = (X_i1, X_i2, …, X_iL)). For example, given a three-item instrument (i.e., L = 3), if child i provides a response of 1 (e.g., responds 'True') to the first and third items and a response of 0 (e.g., responds 'False') to the second item, then his or her response vector would be x_i = (1,0,1). To obtain the probability for child i's responses to L items we assume conditional independence, apply Eq. (1) to each item, and multiply the resulting probabilities. For convenience we let p_ij = p(X_ij = 1 | θ_i, δ_j), so, symbolically, we have

$$p(\mathbf{x}_i) = \prod_{j=1}^{L} p_{ij}^{X_{ij}} (1 - p_{ij})^{(1 - X_{ij})} \tag{2}$$

where p_ij is obtained by Eq. (1) and ∏ is the product symbol: $\prod_{j=1}^{3} y_j = y_1 y_2 y_3$.
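As an illustration of Eq. (2), the following sketch (ours; the three item locations are hypothetical) multiplies the Eq. (1) probabilities across a response vector and compares two candidate child locations.

import math

def p_item(theta, delta, alpha=1.0):
    # Eq. (1): probability of a response of 1 on a single item
    z = alpha * (theta - delta)
    return math.exp(z) / (1.0 + math.exp(z))

def p_response_vector(theta, deltas, responses, alpha=1.0):
    """Eq. (2): probability of a response vector for a person located at theta,
    assuming conditional independence of the item responses."""
    prob = 1.0
    for delta, x in zip(deltas, responses):
        p = p_item(theta, delta, alpha)
        prob *= p**x * (1.0 - p)**(1 - x)
    return prob

# Hypothetical three-item instrument and the response vector (1, 0, 1);
# the candidate location with the larger probability is the more likely one.
deltas = [-1.0, 0.0, 1.0]
responses = [1, 0, 1]
for theta in (1.2, -1.5):
    print(theta, round(p_response_vector(theta, deltas, responses), 3))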
We can use Eq. (2) to determine which child location was most likely to have produced the observed response vector. For example, assume we observe a child's responses to be (1,0,1) and, according to Eq. (2), this vector has a corresponding probability of .8 (i.e., p(x_i) = .8) if the child is located at 1.2. Alternatively, if the child is located at −1.5 the probability falls to .3 (i.e., p(x_i) = .3). Thus, we would say that a child located at 1.2 is more likely to have produced the response vector (1,0,1) than a child located at −1.5. Extending this idea beyond two child locations to all possible child locations, we would say that the child is located at 1.2 if the probability of .8 is the largest of all the possible probabilities. The person and item locations (θ_i, δ_j) are the parameters estimated for the one-parameter IRT model. Typically, this estimation utilizes maximum likelihood to determine the optimal estimates of θ_i and δ_j.

A related and more general model is Birnbaum's (1968) two-parameter logistic (2PL) model. With the 2PL model items are allowed to differ from one another in terms of their discrimination as well as their location (δ_j). This model is essentially the same as Eq. (1) except that α in Eq. (1) is replaced by α_j to reflect that item discrimination can vary across the L items.

To provide some context, assume that the item response function displayed in Fig. 1 comes from a temperament scale (cf. Primi et al., 2014) that uses a forced-choice format with two alternatives that represent opposite poles (e.g., extraversion or introversion). For instance, let the item's stem be "When I feel drained from a long work week": (a) "I like to spend time by myself to 'recharge'." or (b) "I like to get together with friends and go out to 'recharge'." A response of 'a' is coded as a response of 1, 0 otherwise. Thus, if all the items on the scale measured extraversion–introversion (E–I) and were coded as above (i.e., the introversion choice is coded as 1), then the right side of the E–I (theta) continuum would reflect introversion and the left side would be associated with extraversion. In our example, our item has a location of −1 (i.e., δ_j = −1) and individuals located to the right of this
location have an increasing probability of endorsing the introversion choice, whereas persons to the left of −1 have a decreasing probability of selecting the introversion choice (i.e., an increasing probability of selecting the extraversion choice). For example, a person located at 2 (θ_i = 2) has a probability of selecting option (a) of .95: p = e^(θ_i − δ_j)/(1 + e^(θ_i − δ_j)) = e^(2 − (−1))/(1 + e^(2 − (−1))) = .95. In contrast, a person located at −2 (θ_i = −2) would have a probability of .27 of selecting option (a) or a probability of .73 of selecting option (b) (i.e., 1.0 − .27).

The application of IRT requires assessing the tenability of each of the assumptions and determining the degree of model-data fit at the item level as well as overall. Assessment of the dimensionality assumption can be accomplished in multiple ways, such as using principal component or (linear or nonlinear) factor analysis. After determining the data's latent structure, we can examine the tenability of the conditional independence and functional form assumptions. The veracity of the conditional independence assumption can be determined through one of several available statistics (e.g., Q3; Yen, 1984). We can assess the functional form assumption graphically by comparing an item's observed response function with the one predicted by the model. If the observed and predicted item response functions closely match, then we have support for the item's fit; we would perform this comparison for each item on the instrument. In addition, item-level and overall model-level statistical indices are available to facilitate model-data fit examination. Upon obtaining sufficient evidence supporting model-data fit, the IRT advantages presented above can be realized in the item and person parameter estimates. More information on IRT can be found in de Ayala (2009), Embretson and Reise (2000), or Hambleton, Swaminathan, and Rogers (1991).

3. Latent class analysis

Demographic variables such as gender and ethnicity reflect pre-defined (manifest) groupings. In latent class analysis (LCA) our groups are unobserved. That is, in contrast to IRT's assumption of a continuous latent variable, in LCA the latent construct is assumed to be categorical and to consist of a set of mutually exclusive and exhaustive latent classes. These latent classes account for the manifest relationship between any two or more items on an instrument (Stouffer, 1950). In short, we believe that our sample consists of a mixture of qualitatively different types of persons. As an example, assume we wish to measure anxiety in children by administering a scale, such as the Taylor Manifest Anxiety Scale (Taylor, 1953). We conceptualize our latent variable, anxiety, as categorical in nature. The latent class analysis of our response data might lead us to classify the children into qualitatively different latent groups so that, for example, one class may be interpreted as representing individuals with incapacitating anxiety and a second class as reflecting children with transient anxiety. Because LCA involves comparing individuals in terms of their latent class memberships, rather than their locations on a continuous latent variable, we can talk about the children in terms of their similarities within a class or their dissimilarities across classes, but not in terms of the degree of their anxiety.
LCA has been applied in the study of bullying behavior (e.g., Bradshaw, Waasdorp, & O'Brennan, 2013; Shao, Liang, Yuan, & Bian, 2014; Wang, Ianotti, & Luk, 2012), mathematical ability (e.g., Yang, Shaftel, Glasnapp, & Poggio, 2005), and cross-cultural comparisons (e.g., Eid, Langeheine, & Diener, 2003). As was true with IRT, LCA can be used with dichotomous or polytomous response data. These data are assumed to be a manifestation of two or more latent classes of respondents. Analogous to IRT, LCA has a conditional independence assumption. Specifically, the item responses are assumed to be independent of one another within a latent class (i.e., conditional on latent class membership).

To facilitate presenting the basic latent class model, assume that we administer an instrument designed to measure general anxiety to a sample of children. For simplicity let their responses be dichotomous. Each item j on the instrument is characterized by a conditional item probability (π_jc) in each of the C latent classes (c = 1, …, C). This conditional item probability specifies the probability that a child in a given latent class gives a response of 1 on the item. For instance, in a proficiency context the conditional item probabilities would reflect the item's difficulty for a latent class. Additionally, each latent class c is characterized by its latent class proportion (π_c), that is, the proportion of children in the sample that belong to each latent class. Because the latent class set is exhaustive, the sum of all the latent class proportions is 1 (i.e., π_1 + π_2 + … + π_C = 1). (Note that we use double subscripts for conditional item probabilities and a single subscript for latent class proportions.) The latent class proportions and conditional item probabilities are the person and item parameters estimated in LCA.

Recall that each of child i's item responses (X_ij) can be collectively symbolized by x_i (i.e., all of child i's responses to L items form the response vector x_i = (X_i1, X_i2, …, X_iL)). To obtain our basic latent class model we need to define the probability of a response vector for a latent class c. This conditional probability of the response vector x_i is

$$p(\mathbf{x}_i \mid c) = \prod_{j=1}^{L} \pi_{jc}^{X_{ij}} (1 - \pi_{jc})^{(1 - X_{ij})}, \tag{3}$$
where p(x_i | c) is the conditional probability for child i's response vector x_i and π_jc is the conditional item probability for item j in latent class c. Stated in words, Eq. (3) tells us the chance of observing the response vector x_i (e.g., (1,0,1)) given that child i is in a particular latent class. These probabilities are conditional on the child's latent class membership and are a function of each item's conditional item probability for the class (π_jc) and child i's response to the item (i.e., X_ij). This equation reflects that a child's responses are assumed to be independent of one another within a latent class (i.e., our conditional independence assumption). Eq. (3) shows that LCA is concerned not with a child's individual item response per se, but with all of the child's responses taken collectively. Stated another way, LCA is concerned with a child's pattern of responses. This is in contrast to IRT, in which
our model (e.g., Eq. (1)) was at the level of a child's individual item response, but could also be used to accommodate all of a child's responses (Eq. (2)).

Eq. (3) forms the basis for predicting to which latent class a child is most likely to belong. In order to predict the class to which a child is most likely to belong (i.e., given his or her response vector) we need to know all the possibilities. That is, what is the probability of having observed child i's response vector if that child is in latent class 1, what is the probability of having observed child i's response vector if he or she is in latent class 2, and so on. To obtain this overall probability for child i's response vector (i.e., irrespective of latent class) we need to take into consideration the latent class sizes (i.e., π_c). These latent class proportions serve to weight the conditional probabilities of child i's response vector (i.e., Eq. (3)). Therefore, we have

$$p(\mathbf{x}_i) = \sum_{c=1}^{C} \pi_c \, p(\mathbf{x}_i \mid c) = \sum_{c=1}^{C} \pi_c \left[ \prod_{j=1}^{L} \pi_{jc}^{X_{ij}} (1 - \pi_{jc})^{(1 - X_{ij})} \right], \tag{4}$$
where p(x_i) is the unconditional probability for child i's response vector x_i, π_c is the latent class proportion for latent class c, and π_jc is the conditional item probability for item j in latent class c; p(x_i | c) is the conditional probability for child i's response vector x_i and is given by Eq. (3). (Note the use of double subscripts on conditional item probabilities and a single subscript on latent class proportions.) Eq. (4) tells us the probability of observing child i's responses regardless of his or her class membership, whereas Eq. (3) tells us the probability of observing child i's responses given a latent class. By way of analogy, Eq. (3) is akin to the probability of selecting a girl when sampling only from the left-handed children in a classroom, whereas Eq. (4) is the probability of selecting a girl from the whole classroom regardless of handedness.

As mentioned above, LCA involves comparing individuals in terms of their latent class memberships. Thus, we need to be able to predict latent class membership for each member of our sample. To make this prediction we need to determine for which latent class each of the children has the highest probability of belonging.¹ This probability is determined by using Eqs. (3) and (4) as well as Bayes' Theorem to obtain Eq. (5). Eq. (5) gives us the (posterior) probability of membership in latent class c given child i's responses:

$$p(c \mid \mathbf{x}_i) = \frac{\pi_c \, p(\mathbf{x}_i \mid c)}{\displaystyle\sum_{c=1}^{C} \pi_c \left[ \prod_{j=1}^{L} \pi_{jc}^{X_{ij}} (1 - \pi_{jc})^{(1 - X_{ij})} \right]}. \tag{5}$$
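To show how Eqs. (3)–(5) work together, here is a minimal sketch (ours; the class proportions and conditional item probabilities are hypothetical) that computes the class-conditional probability of a response vector, its unconditional probability, and the posterior probabilities of class membership.

def p_vector_given_class(responses, pi_c):
    """Eq. (3): probability of a response vector within one latent class,
    given that class's conditional item probabilities."""
    prob = 1.0
    for x, p in zip(responses, pi_c):
        prob *= p**x * (1.0 - p)**(1 - x)
    return prob

def posterior_class_probabilities(responses, class_props, cond_item_probs):
    """Eqs. (4) and (5): weight the class-conditional probabilities by the
    latent class proportions and apply Bayes' theorem."""
    conditionals = [p_vector_given_class(responses, pi_c) for pi_c in cond_item_probs]
    unconditional = sum(pc * pxc for pc, pxc in zip(class_props, conditionals))      # Eq. (4)
    return [pc * pxc / unconditional for pc, pxc in zip(class_props, conditionals)]  # Eq. (5)

# Hypothetical two-class, three-item example.
class_props = [0.6, 0.4]                  # latent class proportions (sum to 1)
cond_item_probs = [[0.9, 0.7, 0.8],       # conditional item probabilities, class 1
                   [0.3, 0.2, 0.4]]       # conditional item probabilities, class 2
print(posterior_class_probabilities([1, 0, 1], class_props, cond_item_probs))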
Eq. (5) is calculated for each latent class given a child's responses. The child is assigned to whichever latent class has the largest membership probability given his or her responses. Therefore, all children with a common response pattern are assigned to the same latent class (i.e., all children in a latent class share a common pattern of responses).

As was the case with IRT, in applying LCA to empirical data we need first to determine the latent structure of our data. To accomplish this, several models involving an increasing number of latent classes are fitted to the observed data. For instance, the simplest model we would start with has two latent classes. We would then proceed to a three-latent class model, and so on. Assessing the model-data fit across this series of models can involve fit indices such as the Akaike information criterion (AIC; Akaike, 1974), the consistent AIC (CAIC; Bozdogan, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), the sample-size adjusted Bayesian information criterion (SABIC; Sclove, 1987), or a statistical significance test such as the chi-squared difference test (i.e., the likelihood-ratio chi-squared test); additional information concerning fit statistics may be found in Nylund, Asparouhov, and Muthén (2007). The chi-squared difference test is applied between successive hierarchical models (e.g., a two- versus a three-latent class model) to determine whether the additional latent class is necessary. A non-significant chi-squared difference test indicates that the model with the larger number of latent classes would not be preferred over the model with the smaller number of classes. In contrast, AIC, CAIC, BIC, and SABIC are not statistical significance tests, but rather information criteria based on the 'log likelihood' (i.e., −2lnL) function used in parameter estimation.²

For each of these information indices we select the model with the lowest index value among a set of competing models. (If more than one model shares the same index value, then we select the most parsimonious model.) These indices reflect relative fit and are meaningful when used for making comparisons across a set of models applied to the same data. Generally speaking, as model complexity increases fit improves and −2lnL decreases, reflecting better fit. For example, a two-class model will have a −2lnL that is larger than that of a three-class model. Although these statistics all utilize the −2lnL of the fitted model, they differ in the penalty imposed for model complexity (i.e., the number of parameters estimated) and, in some cases, sample size. The addition of the penalty seeks to compensate for the decreasing −2lnL associated with increasing model complexity. As such, these indices try to balance model-data fit, as captured in the log likelihood function, with the selection of a parsimonious model by imposing a penalty for model complexity. Because some or all of these indices may be found in the output of LCA estimation programs, we will briefly summarize their differences.

¹ We are actually calculating the probability of a given response pattern for each latent class. All children that provide the same response pattern are classified in the same latent class.
² Technically, when we say 'log likelihood function' we are referring to −2 times the log likelihood. Symbolically, this is represented as −2lnL, where lnL refers to the log likelihood; −2lnL is sometimes referred to as the deviance statistic.
For AIC the penalty is based solely on the number of model parameters, whereas with CAIC, BIC, and SABIC the penalty also involves the sample size. BIC, CAIC, and SABIC differ in the implementation of the sample size penalty.³ With respect to BIC, as the number of parameters estimated and/or the sample size increases, so does the penalty. In other words, BIC's penalty leads BIC to favor models that have fewer parameters over models that are more complex; this is particularly true with large samples. Unlike AIC, as sample size increases BIC tends to select the correct model (Haughton, 1988). As its name implies, SABIC's penalty seeks to moderate the sample size effect used in BIC (or CAIC); CAIC's penalty is more severe than BIC's for a given number of parameters and sample size. Because of the use of sample size and/or the number of parameters estimated, the AIC, CAIC, BIC, and SABIC indices will not always agree with one another in model selection. Thus, they should be considered as providing guidance in model selection rather than determining which model should be selected.

Although the above indices are useful, additional considerations in model selection are parsimony, latent class interpretability, and theoretical implications. In short, once a model has been identified using one or more of the approaches above, the latent classes are examined to determine the characteristics shared by the children in a latent class but not with the children in the other latent classes. In some contexts, the theoretical implications of these latent class interpretations are a consideration. If the latent classes are not interpretable, then the model is suspect. In such cases a model with interpretable latent classes is the preferred model despite not having, for example, a BIC value that is smaller than that of the model with non-interpretable latent classes. It is also important to note that latent classes should not be confused with manifest groupings that arise from the use of one or more thresholds (or cut points) on a continuum.

4. Mixture IRT model

Some empirical situations involve a mixture of latent subpopulations such that there are qualitative differences between the subgroups but within each subpopulation a measurement model holds. For instance, recall our temperament E–I scale example above. With this scale we use a forced-choice format with two alternatives representing opposite poles (e.g., extraversion or introversion). If our sample of children reflected a mixture of latent subpopulations that varied in self-disclosure, such that one group is interpreted as reflecting 'non-discriminatory self-disclosure' whereas the children in the other group reflect 'selective self-disclosure,' then these two classes could influence how the children respond on our temperament E–I scale (cf. Maij-de Meij, Kelderman, & van der Flier, 2005). In this case we are still interested in comparing the children in terms of their degree of E–I, but recognize that a child's propensity to endorse one choice over the other may be affected by whether he or she is a member of the 'selective self-disclosure' class or the 'non-discriminatory self-disclosure' class.

We can model this situation by using mixture IRT, an integration of IRT and LCA. In this modeling framework the children are characterized by both a location parameter (θ_i) and a latent class membership parameter (π_c). Our IRT model holds within each latent class, but an item's parameter estimate(s) can differ across latent classes.
Mixture IRT models have been used to study risky youth behavior (e.g., Finch & Pierson, 2011), self-disclosure (e.g., Maij-de Meij et al., 2005), personality characteristics (e.g., Smit, Kelderman, & van der Flier, 2003), mathematics assessment (e.g., Li, Cohen, Kim, & Cho, 2009), and items that function differently across groups (e.g., Cohen & Bolt, 2005; De Ayala, Kim, Stapleton, & Dayton, 2003). Although IRT models for ordinal or nominal polytomous data, or for dichotomous data arising in the presence of guessing, may be utilized with mixture IRT (e.g., Cohen & Bolt, 2005; Finch & Pierson, 2011; Maij-de Meij et al., 2005; Smit, Kelderman, & van der Flier, 2003), we utilize the one-parameter model (Eq. (1)) for simplicity and without loss of generality. Following Rost (1990) (also see Mislevy & Verhelst, 1990; Rost, 1991), the probability of a response of 1 (e.g., a response of "true" or "agree") on item j by child i in latent class c according to our mixed one-parameter logistic model is
$$p_{ijc} = \frac{e^{\alpha(\theta_{ic} - \delta_{jc})}}{1 + e^{\alpha(\theta_{ic} - \delta_{jc})}} \tag{6}$$
where δ_jc is item j's location in class c, θ_ic is the location for child i who is a member of latent class c, and α reflects a common item discrimination across items and across classes. The subscript c on the child and item location parameters in Eq. (6) reflects that we have this parameter set for each latent class. As was true with IRT, these parameters are continuous within each class.
³ The formulae for AIC, CAIC, BIC, and SABIC are

$$\text{AIC} = -2\ln L + 2p \tag{9}$$

$$\text{BIC} = -2\ln L + p\,\ln(N) \tag{10}$$

$$\text{CAIC} = -2\ln L + p(\ln(N) + 1) \tag{11}$$

$$\text{SABIC} = -2\ln L + p\,\ln(N') \tag{12}$$

where p is the number of parameters in the model, N is the sample size, and N' = (N + 2)/24.
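As a small illustration of the four indices defined in footnote 3, the following sketch (ours; the −2lnL value, parameter count, and sample size are hypothetical) computes AIC, BIC, CAIC, and SABIC from the same deviance.

import math

def information_criteria(deviance, p, N):
    """AIC, BIC, CAIC, and SABIC as defined in footnote 3 (Eqs. (9)-(12));
    deviance is -2lnL, p the number of parameters, N the sample size."""
    n_prime = (N + 2) / 24.0
    return {"AIC": deviance + 2 * p,
            "BIC": deviance + p * math.log(N),
            "CAIC": deviance + p * (math.log(N) + 1),
            "SABIC": deviance + p * math.log(n_prime)}

# Hypothetical values: -2lnL = 26700, 23 estimated parameters, N = 2400.
print(information_criteria(26700.0, 23, 2400))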
The child's parameters characterize how much of the construct (i.e., θ_ic) the child possesses (e.g., his/her degree of E–I) given his/her membership in a latent class (e.g., the 'non-discriminatory self-disclosure' class). If the C latent classes are assumed to be mutually exclusive and jointly exhaustive (as they were in LCA), then we can obtain the probability of a response of 1 on item j by child i irrespective of class membership (i.e., the unconditional probability) by

$$p_{ij} = \sum_{c=1}^{C} \pi_c \, p_{ijc} = \sum_{c=1}^{C} \pi_c \left[ \frac{e^{\alpha(\theta_{ic} - \delta_{jc})}}{1 + e^{\alpha(\theta_{ic} - \delta_{jc})}} \right], \tag{7}$$

where p_ijc is the conditional probability given by Eq. (6) (i.e., the probability is conditional on membership in class c) and π_c is latent class c's proportion. As was true with LCA, the sum of the latent class proportions is equal to 1: $\sum_{c=1}^{C} \pi_c = 1$. In the case of
a one-class solution our mixed one-parameter model simplifies to the one-parameter model shown in Eq. (1).

To estimate a child's parameters we need to generalize from the probability of a response to a single item j given by Eq. (7) to all of child i's observed responses to L items (i.e., child i's response vector, x_i). In a fashion analogous to how Eqs. (2) and (4) were obtained, we have the probability for child i's response vector as

$$p(\mathbf{x}_i) = \sum_{c=1}^{C} \pi_c \, p(\mathbf{x}_i \mid c) = \sum_{c=1}^{C} \pi_c \left[ \prod_{j=1}^{L} p_{ijc}^{X_{ij}} (1 - p_{ijc})^{(1 - X_{ij})} \right], \tag{8}$$

where p(x_i | c) is the conditional probability for the response vector by child i who is a member of latent class c, p_ijc is given by Eq. (6), and all other terms are defined as above.
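Pulling Eqs. (6)–(8) together, here is a minimal sketch (ours, with hypothetical class proportions, child locations, and item locations) of the class-conditional item probability, the unconditional item probability, and the probability of a full response vector under a two-class mixture one-parameter model.

import math

def p_item_in_class(theta_c, delta_jc, alpha=1.0):
    """Eq. (6): probability of a response of 1 on item j for a child in class c."""
    z = alpha * (theta_c - delta_jc)
    return math.exp(z) / (1.0 + math.exp(z))

def p_item_unconditional(thetas, item_deltas, class_props, alpha=1.0):
    """Eq. (7): probability of a response of 1 on item j irrespective of class."""
    return sum(pc * p_item_in_class(t, d, alpha)
               for pc, t, d in zip(class_props, thetas, item_deltas))

def p_vector_mixture(thetas, deltas, responses, class_props, alpha=1.0):
    """Eq. (8): probability of a child's response vector under the mixture model."""
    total = 0.0
    for c, pc in enumerate(class_props):
        prob_c = 1.0
        for j, x in enumerate(responses):
            p = p_item_in_class(thetas[c], deltas[c][j], alpha)
            prob_c *= p**x * (1.0 - p)**(1 - x)
        total += pc * prob_c
    return total

# Hypothetical two-class example with three items.
class_props = [0.8, 0.2]                 # latent class proportions
thetas = [0.5, 0.5]                      # the child's location within each class
deltas = [[-1.0, 0.0, 1.0],              # item locations in class 1
          [-0.5, 0.5, 1.5]]              # item locations in class 2
print(round(p_item_unconditional(thetas, [deltas[0][0], deltas[1][0]], class_props), 4))
print(round(p_vector_mixture(thetas, deltas, [1, 0, 1], class_props), 4))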
When we apply our model we can determine not only a location for each of the children on the construct continuum within each latent class, but also to which latent class he or she is most likely to belong. As an example, using our temperament E–I scale we would have an estimate of a child's E–I capacity given that he or she belongs to, say, the 'selective self-disclosure' class. We would also have estimates of each item's location on the E–I continuum for each class.

As was true with IRT, our first step in applying our model to empirical data is to determine the latent structure of our data. Because we need to determine whether the addition of a multi-class structure is necessary, we start with a one-class solution. We then progress to a two-class model, a three-class model, and more classes if necessary. Determination of the appropriate number of classes proceeds as was done with LCA.

5. Example 1⁴

Assume that as part of a project we need to assess students' cognitive ability. Our sample is comprised of 2400 students in grades 9–12 at nine high schools in an urban public school district. The schools varied in terms of student–teacher ratio, gender, student socio-economic status, ethnicity, free/subsidized lunch, and percentage of ELL students. Our instrument is a proprietary scale developed to measure general intellectual ability and influenced, in part, by the Woodcock-Johnson IV Tests of Cognitive Abilities (Shrank, McGrew, & Mather, 2014) and the Wechsler Nonverbal Scale of Ability (Wechsler & Naglieri, 2006). Our test consists of ten items, each of which is scored as correct or incorrect. As part of our project we collect demographic information on each of the students. Given the heterogeneous nature of our sample, we believe it is reasonable to entertain the possibility that our sample consists of latent subpopulations. As such, we will model our data with a mixture IRT model.

To estimate our model we use MPlus 7.11 (Muthén & Muthén, 2013). We begin by specifying a single latent class solution and proceed to two- and three-latent class models. An example of the syntax used for the two-class model is shown in Table 1. The general layout of the input file is specification of the variables and their nature (names = i1–i10 and categorical = i1–i10), the two-class model (classes = c(2)), the use of categorical and continuous latent variables (type = mixture), and then the model specification. We start with an overall model specification of what is in common across our latent classes. In this example, this section indicates that our latent variable (f) is being measured by each of our ten items (i.e., i1, …, i10). Each of these items has a starting value of 1 and we fix the latent variable's mean to 0 in all classes ([f@0]); this also addresses model identification. The class-specific parts follow. Class 1 (i.e., %c#1%) is presented first and is followed by the second class (i.e., %c#2%). Because we want to estimate a model with a constant discrimination across items within each class, we set each item measuring the continuous factor to 1 (f by i1–i10@1) within a class, but allow the item locations to be free to vary. (Technically, we are allowing the item thresholds to vary, but these thresholds are transformed to the item locations.
Similarly, we are setting the item loadings to be 1, but the loadings are also transformed to be our item discriminations. See Muthén and Muthén (2013) for more information.)

Table 2 shows abridged output for our two-class mixture IRT model. Examining our output we first verified that the data were correctly read and that none of the estimation stages exceeded the maximum number of iterations for that stage. Inspection of our output showed that the data input was correct and that we obtained a converged solution. That is, we see that the Number of observations is 2400 and that each item in the UNIVARIATE PROPORTIONS AND COUNTS section has a count of 2400 (e.g., 564 students answered the first item incorrectly, whereas 1836 answered it correctly). We verified these counts by comparing this output with the counts from our data cleaning phase.⁴
⁴ Simulated data are used in the example. Latent class interpretations exemplify the types of interpretations that are possible.
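Since the footnote notes that the data are simulated, the following sketch (ours; the class proportions and item locations are arbitrary and are not the values underlying the article's data) shows one way to generate dichotomous responses from a two-class mixture one-parameter model for readers who want to experiment.

import math
import random

def simulate_mixture_1pl(n_persons=2400, class_props=(0.85, 0.15),
                         deltas_by_class=None, alpha=1.0, seed=1):
    """Simulate dichotomous responses from a two-class mixture one-parameter model:
    each person is assigned a latent class, given a normally distributed latent
    location, and responds to each item according to Eq. (6)."""
    random.seed(seed)
    if deltas_by_class is None:
        deltas_by_class = [[-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 0.5, 1.0, 1.5, 2.0],
                           [-0.5, 0.0, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, -1.8, 0.5]]
    data, classes = [], []
    for _ in range(n_persons):
        c = 0 if random.random() < class_props[0] else 1
        theta = random.gauss(0.0, 1.0)
        row = []
        for delta in deltas_by_class[c]:
            p = math.exp(alpha * (theta - delta)) / (1.0 + math.exp(alpha * (theta - delta)))
            row.append(1 if random.random() < p else 0)
        data.append(row)
        classes.append(c)
    return data, classes

responses, true_classes = simulate_mixture_1pl()
print(len(responses), sum(true_classes))   # sample size and the number of class 2 members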
Table 1. Mixture IRT model syntax for a two-latent class solution.
The AIC, BIC, and SABIC indices and the latent class proportions for our three models are presented in Table 3. As can be seen, our fit indices suggest that additional latent classes over a one-class solution (i.e., a one-parameter model) are necessary. For example, the AIC value for the one-class model (AIC = 26,860.93) is greater than that for the two- and three-class models. Moreover, although the three-class model has a slightly smaller AIC than the two-class model, when we examine the BIC and SABIC fit indices we see that the two-class model exhibits better fit to the data. Inspecting our latent class proportions shows that our three-class model has a small class (π_3 = 0.02), indicating that we have possibly "over-extracted" one of the classes from our two-class model. Applying this latent class proportion to our 2400 students, we expect to have only 56 students with which to estimate the item locations for this class. Because of the BIC and SABIC values, as well as the belief that we have over-extracted a latent class, we believe that the data can be modeled with a two-latent class mixture one-parameter model.

As mentioned above, determining the number of latent classes also involves being able to interpret the latent classes. This process requires that we assign each student to the latent class for which he or she has the largest membership probability and then determine which characteristics the individuals within a class share. For our example, interpretation of our latent classes showed that students in our larger class tend to report English as their first language, tend not to use the free/subsidized lunch program, tend to exhibit less mobility than students in the other latent class, and have a mean intellectual ability number correct score of 6.24 (see Table 4). In contrast, our other latent class contains students for whom English is not their first language, who tend to be in the free/subsidized lunch program, and who exhibit mobility; their mean intellectual ability number correct score is 4.19.

Fig. 2 shows the item response functions for three of our items for each latent class. Our continuous latent variable, cognitive ability, is measured such that students with higher ability are located towards the upper end of the continuum, whereas lower ability students are found at the lower end. As can be seen, item 1 (dashed line/circle vs. solid line/circle) is estimated to be easier for students in latent class 1 than it is for students who belong to latent class 2 (i.e., δ̂_11 = −1.313, δ̂_12 = −0.660). In contrast, the reverse is true for item 4 (dashed line/triangle vs. solid line/triangle), with estimated locations of δ̂_41 = −0.734 and δ̂_42 = −1.840 in latent classes 1 and 2, respectively. Further, item 7 (dashed line/square vs. solid line/square) is estimated to be located at approximately the same point in each latent class (i.e., δ̂_71 = 0.533, δ̂_72 = 0.545). In short, some items reflect different relationships across classes, whereas others maintain similar relationships across classes.

Fig. 3 shows the differences in our latent classes with respect to the students and their performance on our instrument. This figure contains two double-Y graphs with a histogram depicting the number correct score (i.e., Rawscore) frequency and the students' corresponding proficiency location estimates (dash-dot line) superimposed. (This figure was created using a different estimation program.)
We see that for a given number correct score the students' estimated latent proficiency (θ̂_ic) varies depending on their latent class membership. For example, we see that students who obtained only one question correct (i.e., Rawscore = 1) have a proficiency estimate of about −5.6 if they belong to latent class 1 (top panel), but approximately −2.5 if they are members of latent class 2 (bottom panel); values are obtained from the left ordinate and the dash-dot line.
Table 2 Abridged MPlus Output for a two-latent class solution.
Stated another way, if we ignore the students' latent class membership, we have approximately 100 students who answered only one question correct. As such, our observed (manifest) number correct score would lead us to believe that the cognitive abilities of these students are similar. However, by taking into account the data's latent structure (i.e., latent classes and latent cognitive ability) we have a different interpretation. Specifically, those students who are bilingual and/or for whom English is not their first language (i.e., class 2) are actually estimated to have higher cognitive ability than those students in class 1 (i.e., those for whom English was their first language) who obtained the same number correct score.
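To reproduce the kind of class-specific item response functions shown in Fig. 2, the following sketch (ours) evaluates Eq. (6) over a grid of θ values using the estimated item locations for items 1, 4, and 7 reported above; plotting these values would show the crossing pattern just described.

import math

# Estimated item locations reported in the text for items 1, 4, and 7
# in latent classes 1 and 2.
item_locations = {1: {"class 1": -1.313, "class 2": -0.660},
                  4: {"class 1": -0.734, "class 2": -1.840},
                  7: {"class 1": 0.533, "class 2": 0.545}}

def irf(theta, delta, alpha=1.0):
    """Eq. (6) with a common discrimination: P(X = 1 | theta, delta)."""
    z = alpha * (theta - delta)
    return math.exp(z) / (1.0 + math.exp(z))

# Tabulate each item's response probability in each class over a theta grid.
thetas = [t / 2.0 for t in range(-6, 7)]   # -3.0 to 3.0 in steps of 0.5
for item, locations in item_locations.items():
    for label, delta in locations.items():
        probs = [round(irf(t, delta), 2) for t in thetas]
        print("item", item, label, probs)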
Table 3
Model-data fit indices and latent class proportions.

Index    1 LC        2 LC        3 LC
AIC      26,860.93   26,787.98   26,777.47
BIC      26,924.55   26,921.02   26,975.88
SABIC    26,889.60   26,847.92   26,864.68

π_c      1 LC   2 LC   3 LC
Class 1  1      0.87   0.85
Class 2         0.13   0.13
Class 3                0.02
Table 4
Latent class characteristics.

LC   English first language    Lunch^a   Mobility               Intellectual ability
1    No 6%, Yes 94%            10.1%     No 84.5%, Yes 15.5%    6.24
2    No 83.5%, Yes 16.5%       24.0%     No 23.7%, Yes 76.3%    4.19

^a Free and reduced lunch.
6. Example 2⁴

In this example we use an estimation program, WINMIRA (von Davier, 2001), designed for the one-parameter (Rasch) model for dichotomous and polytomous data. In our study we wish to understand the nature of writing problems in order to identify groups that might have misconceptions and/or are utilizing potentially erroneous strategies that lead to writing problems. Along with a measure of writing knowledge (i.e., planning, organization, multiple drafts, revision strategies), students are given either a verbal or a pictorial writing prompt to assess their writing ability. Two raters holistically judge the students' writing samples for verbal complexity (age-appropriate vocabulary, sentence structure) and syntactic correctness (age-appropriate spelling, handwriting, capitalization, and punctuation). Responses on writing knowledge are scored as correct or incorrect, whereas verbal complexity and syntactic correctness are each judged on a four-point scale with larger values indicating greater competence than lower values. Our data come from a sample of middle schools. We collect data on 1174 students, ages 10–16; seventy-nine percent of the sample is white and females account for 52%. Other demographic information is also collected.

WINMIRA utilizes a graphical user interface. Thus, there is no syntax to present; however, Appendix A contains screen shots. Through the completion of a series of menus and dialogs we import our data (data may be in text format or an SPSS data file), select the items to be analyzed, and specify the number of latent classes, the mixed Rasch model to use, and the output options. As with Example 1, we examine multiple-class models ranging from two to four classes in addition to the "one class" model. The information criteria for the one- through four-class solutions are presented in Table 5, with the abridged output from the two-class solution presented in Table 7.
Fig. 2. Item response functions for items 1, 4, and 7 for each latent class.
Fig. 3. Students' latent proficiency locations and frequency distribution of observed score.
We check for a converged solution by verifying that the "Number of iterations needed" (491 iterations) is less than the "max. number of iterations" (1000 iterations is the maximum; see Table 7). Table 5 shows that all three information criteria indicate that the two-class solution exhibits the best relative model-data fit. Therefore, we proceed to interpret the two-class solution to ensure that these classes are meaningful. Our interpretation reveals that latent class 1 consists of 826 students who are knowledgeable about the writing process, with an average number correct score on the writing knowledge measure of 5.00. Moreover, these students could apply this knowledge in their writing performance, as exhibited in the verbal complexity and syntactic correctness of their writing samples (verbal M = 3.05, syntactic M = 3.05), regardless of prompt type.
Table 5
Model-data fit indices and latent class proportions.

Index    1 LC      2 LC      3 LC      4 LC
AIC      9172.74   9066.71   9075.41   9078.74
BIC      9213.28   9152.87   9207.18   9256.13
SABIC    9221.28   9169.87   9233.18   9291.13

π_c      1 LC   2 LC   3 LC   4 LC
Class 1  1      0.69   0.35   0.52
Class 2         0.31   0.33   0.28
Class 3                0.32   0.17
Class 4                       0.03
Table 6
Latent class characteristics.

LC   Verbal complexity   Syntactic correctness   Prompt type (Verbal / Pictorial)   WK
1    3.05                3.05                    84.5% / 15.5%                      5.00
2    1.98                1.86                    23.7% / 76.3%                      1.65

LC   Prompt type   Verbal complexity   Syntactic correctness
1    Verbal        3.08                3.08
     Pictorial     3.02                3.01
2    Verbal        1.00                1.89
     Pictorial     2.31                1.85

Note: WK = writing knowledge (mean number correct).
In contrast, the second class is composed primarily of students (n = 348) who were less knowledgeable about the writing process (writing knowledge M = 1.65) and whose writing tends not to be rated as being as verbally complex or syntactically correct (verbal M = 1.98, syntactic M = 1.86) as that of members of the first latent class. Additionally, it appears that latent class 2 members tend to do better with pictorial prompts than with verbal prompts; latent class 1 members' performance did not seem to be affected by the type of prompt (see Table 6). With latent class proportions of π_1 = .69 and π_2 = .31, we see that latent class 1 is more than twice the size of latent class 2, indicating that the majority of the students are expected to be proficient in writing.

From Table 7 we see that the output format consists of general information about the data input and descriptive statistics, followed by sections on each latent class, and then goodness-of-fit information (e.g., AIC, BIC) at the model level. Within each class section we have student location estimation (e.g., the Expected Score Frequencies and Personparameters section) followed by item parameter estimation information, such as estimated item locations and standard errors, and item-level fit information. The Expected Score Frequencies and Personparameters section shows that a student who obtained a Rawscore of 1 in CLASS 1 of 2 is unlikely to be observed, because we expect that only .41% of latent class 1 students will have a number correct score of 1. Nevertheless, all students who obtained this score are estimated to be located at −2.053 on the writing ability continuum (θ̂ = −2.053); we use the WLE estimate because these estimates are available for all possible number correct scores. Continuing with latent class 1, we see from the expected category frequencies and item scores section that items 1 and 7 were correctly answered by 97% and 36% of the students, respectively. Our estimates of these items' locations are found in the "threshold parameters: ordinal (partial credit) model" section. For instance, item 1 is relatively easy in this class, with an estimated location of −2.08406 (δ̂_11 = −2.08406), whereas item 7 is comparatively harder, with an estimated location of 1.79190 (δ̂_71 = 1.79190). Because items and students are located on the same continuum, we can predict that a student who correctly answered one item (i.e., θ̂ = −2.053) has a low probability of correctly answering item 7 (i.e., an item located farther up the ability continuum from him or her), but about a 50:50 chance of correctly answering item 1 (i.e., an item located at about the same point as his or her ability).⁵
⁵ This interpretation is obtained by substituting −2.053 for θ_ic and 1.79190 for δ_jc in Eq. (6), with α = 1.0:

$$p_{171} = \frac{e^{\alpha(\theta_{ic} - \delta_{jc})}}{1 + e^{\alpha(\theta_{ic} - \delta_{jc})}} = \frac{e^{1.0(-2.053 - 1.79190)}}{1 + e^{1.0(-2.053 - 1.79190)}} = 0.0209,$$

and again with −2.08406 for δ_jc:

$$p_{111} = \frac{e^{\alpha(\theta_{ic} - \delta_{jc})}}{1 + e^{\alpha(\theta_{ic} - \delta_{jc})}} = \frac{e^{1.0(-2.053 - (-2.08406))}}{1 + e^{1.0(-2.053 - (-2.08406))}} = 0.5078.$$
Table 7 Mixture IRT model output for a two-latent class solution.
As mentioned above, our estimated student and item locations can be transformed to eliminate negative values or to a scale that has intrinsic meaning, such as a proportion or a number correct scale (see de Ayala, 2009). Similar information for latent class 2 is found in the "Final estimates in CLASS 2 of 2 with…" section.

In addition to model-level fit information, WINMIRA also provides item-level fit information in the "item fit assessed by the Q-index" section. The Q-index (Rost & von Davier, 1994) has a range of 0 to 1 where small values are good: a Q-index of 0
reflects perfect fit, a Q-index of .5 indicates random response behavior, and a Q-index of 1 indicates perfect misfit for the model. In contrast to this descriptive use of the Q-index, we can use a transformed Q-index for significance testing. This transformed Q-index, Zq, is asymptotically normal, standardized, and centered at 0. In the "item fit assessed by the Q-index" section we find both the Q-index and Zq for each item in each latent class. For instance, in latent class 1 we see that the Q-index values range from .1687 to .2476, thus reflecting item-data fit for each of our items in this latent class. Similarly, for latent class 2 the Q-index values show that we have item-data fit. From a significance-testing perspective we want nonsignificant Zq values. As can be seen, we have nonsignificant Zq values for each of our items in each of our latent classes. Therefore, in terms of fit, our two-class model exhibits the best relative fit of our four models and our one-parameter (Rasch) model shows item-level fit in each of our latent classes.

Fig. 4 presents a graphical depiction of our proficiency estimates for each of our latent classes. As can be seen, latent class 1 students tend to be distributed at and above a number correct score of 4, whereas latent class 2 students tend to obtain fewer items correct. However, for a given number correct score the students are estimated to have roughly the same writing proficiency. For example, for students in latent class 1 who correctly answered four items we would estimate their writing proficiency to be .416 (see dash-dot line, top panel). However, for latent class 2 students who had a number correct score of 4 our estimated location for them is .349 (see dash-dot line, bottom panel). Because of the class structure we also know that these latter students tend to do better with pictorial prompts than with verbal prompts, as well as the types of problems we would expect to see in their writing (i.e., few related ideas, stories that lack cohesiveness, etc.).
Fig. 4. Students' latent proficiency locations and frequency distribution of number correct score.
7. Conclusion

Mixture IRT models assume that our sample comes from a population that is comprised of latent subpopulations. These subpopulations are qualitatively different from one another and represent different latent classes of individuals. Within each of these classes there is a continuous latent variable. This continuous latent variable is reflected in an IRT model. The combination of latent classes and a continuous latent variable composes the data's latent structure. Our modeling process requires an examination of fit to determine the necessity of using both a continuous latent variable and latent classes. If our analysis reveals that a one-class solution is best, then our mixture IRT model simplifies to an IRT model with a continuous latent variable. Regardless of whether we have one or more latent classes, if we have model-data fit, then our person location estimates will transcend the administered instrument and our item location estimates will be sample independent. Although our presentation was limited to dichotomous data, mixture IRT models may also be applied to polytomous data.

As opposed to the traditional total score approach, an IRT model provides item-level information. However, with IRT modeling an item's one or more parameters are required to be constant across subpopulations. Our mixture IRT model relaxes this requirement and allows the item's parameters to vary across our classes. In the case of polytomous data this means that the item's parameters for the response scale do not have to be constant across subgroups. For instance, if we find that the 'don't know' response category parameter varies across classes, then we know that the different subgroups are using (and possibly interpreting) this response category differently from one another.

With a mixture IRT model we can combine latent classes with manifest groupings (e.g., gender, race/ethnicity) to explore whether our manifest groups are homogeneous or heterogeneous with respect to our latent variable. Therefore, we can explore whether, for example, all African Americans are disadvantaged/advantaged by an instrument (in whole or in part) or whether it is only certain latent class(es) of African Americans that are disadvantaged/advantaged. As a second example, we can determine whether special education students share common cognitive skills in mathematical problem solving (cf. Yang et al., 2005) as well as estimate the students' mathematical problem-solving ability. Mixture IRT models offer the possibility of modeling response or rating data without having to consider the data's latent structure to be solely continuous or solely discrete. As such, a richer interpretation of the student data is possible with a mixture IRT modeling approach.

Appendix A. WINMIRA screen shots

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.jsp.2016.01.002.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 713–723.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord, & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.
Bradshaw, C. P., Waasdorp, T. E., & O'Brennan, L. M. (2013). A latent class approach to examining forms of peer victimization. Journal of Educational Psychology, 105, 839–849.
Appendix A. WINMIRA screen shots

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.jsp.2016.01.002.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord, & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.
Bradshaw, C. P., Waasdorp, T. E., & O'Brennan, L. M. (2013). A latent class approach to examining forms of peer victimization. Journal of Educational Psychology, 105, 839–849.
Clements, D. H., Sarama, J. H., & Liu, X. H. (2008). Development of a measure of early mathematics achievement using the Rasch model: The research-based early maths assessment. Educational Psychology: An International Journal of Experimental Educational Psychology, 28, 457–482. http://dx.doi.org/10.1080/01443410701777272.
Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133–148.
CTB/MacMillan/McGraw-Hill (1991). Comprehensive tests of basic skills, fourth edition (technical report, June, 1991). Monterey, CA: Author.
CTB/McGraw-Hill (1987). California achievement tests, forms E and F, levels 10–20 (technical report). Monterey, CA: Author.
de Ayala, R. J. (2009). Theory and practice of item response theory. New York: Guilford Publishing.
De Ayala, R. J., Kim, S. H., Stapleton, L. M., & Dayton, C. M. (2003). Differential item functioning: A mixture distribution conceptualization. International Journal of Testing, 2, 243–276.
Eid, M., Langeheine, R., & Diener, E. (2003). Comparing typological structures across cultures by multigroup latent class analysis. Journal of Cross-Cultural Psychology, 34, 195–210.
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers.
Finch, W. H., & Pierson, E. E. (2011). A mixture IRT analysis of risky youth behavior. Frontiers in Psychology, 2, 1–10. http://dx.doi.org/10.3389/fpsyg.2011.00098.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Haughton, D. M. A. (1988). On the choice of a model to fit data from an exponential family. The Annals of Statistics, 16, 342–355.
Henington, C. (2004). Wide Range Achievement Test — 3 (WRAT-3). In T. S. Watson, & C. H. Skinner (Eds.), Encyclopedia of school psychology (pp. 377–378). New York, NY: Kluwer Academic/Plenum Publishers.
Li, F., Cohen, A. S., Kim, S. -H., & Cho, S. -J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353–373.
Maij-de Meij, A. M., Kelderman, H., & van der Flier, H. (2005). Latent-trait latent-class analysis of self-disclosure in the work environment. Multivariate Behavioral Research, 40, 435–460.
Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55, 195–215.
Muthén, L. K., & Muthén, B. O. (2013). MPlus Software, Version 7.11. Los Angeles, CA: MPlus.
Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14, 535–569.
Primi, R., Wechsler, S. M., de Cassia Nakano, T., Oakland, T., & Souza Lobo Gusso, R. (2014). Using item response theory methods with the Brazilian temperament scale for students. Journal of Psychoeducational Assessment, 32, 651–662. http://dx.doi.org/10.1177/0734282914528613.
Purpura, D. J., Reid, E. E., Eiland, M. D., & Baroody, A. J. (2015). Using a brief preschool early numeracy skills screener to identify young children with mathematics difficulties. School Psychology Review, 44, 41–59.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press (Original work published 1960).
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271–282.
Rost, J. (1991). A logistic mixture distribution model for polychotomous item responses. British Journal of Mathematical and Statistical Psychology, 44, 75–92.
Rost, J., & von Davier, M. (1994). A conditional item fit index for Rasch models. Applied Psychological Measurement, 18, 171–182.
Schrank, F. A., McGrew, K. S., & Mather, N. (2014). Woodcock-Johnson IV tests of cognitive abilities. Rolling Meadows, IL: Riverside Publishing.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333–343.
Shao, A., Liang, L., Yuan, C., & Bian, Y. (2014). A latent class analysis of bullies, victims and aggressive victims in Chinese adolescence: Relations with social and school adjustments. PLoS One, 9(4), 1–8.
Smit, A., Kelderman, H., & van der Flier, H. (2003). Latent trait latent class analysis of an Eysenck Personality Questionnaire. Methods of Psychological Research Online, 8, 23–50.
Stouffer, S. A. (1950). The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star, & H. A. Clausen (Eds.), Measurement and prediction (pp. 3–45). Princeton, NJ: Princeton University Press.
Taylor, J. A. (1953). A personality scale of manifest anxiety. Journal of Abnormal Psychology, 48, 285–290.
von Davier, M. (2001). WINMIRA 2001 [Computer software]. Retrieved July 1, 2015 from http://www.von-davier.com
Wang, J., Iannotti, R. J., & Luk, J. W. (2012). Patterns of adolescent bullying behaviors: Physical, verbal, exclusion, rumor, and cyber. Journal of School Psychology, 50, 521–534.
Wechsler, D., & Naglieri, J. A. (2006). Wechsler Nonverbal Scale of Ability (WNV). San Antonio, TX: Harcourt Assessment.
Yang, X., Shaftel, J., Glasnapp, D., & Poggio, J. (2005). Latent class analysis of mathematical ability for special education students. Journal of Special Education, 38, 194–207.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.