Measurement equivalence: A non-technical primer on categorical multi-group confirmatory factor analysis in school psychology


Journal of School Psychology 60 (2017) 65–82


Laura L. Pendergast a,⁎, Nathaniel von der Embse a, Stephen P. Kilgus b, Katie R. Eklund b

a Temple University, United States
b University of Missouri, United States

Article history: Received 16 July 2015; Received in revised form 5 June 2016; Accepted 22 November 2016; Available online 3 January 2017

Keywords: Factor analysis; Invariance; Equivalence; Measurement; Bias; Diversity; SABRS; Categorical; School psychology

Abstract

Evidence-based interventions (EBIs) have become a central component of school psychology research and practice, but EBIs are dependent upon the availability and use of evidence-based assessments (EBAs) with diverse student populations. Multi-group confirmatory factor analysis (MG-CFA) is an analytical tool that can be used to examine the validity and measurement equivalence/invariance of scores across diverse groups. The objective of this article is to provide a conceptual and procedural overview of categorical MG-CFA, as well as an illustrated example based on data from the Social and Academic Behavior Risk Screener (SABRS) – a tool designed for use in school-based interventions. This article serves as a non-technical primer on the topic of MG-CFA with ordinal (rating scale) data and does so through the framework of examining equivalence of measures used for EBIs within multi-tiered models – an understudied topic. To go along with the illustrated example, we have provided supplementary files that include sample data, Mplus input code, and an annotated guide for understanding the input code (http://dx.doi.org/10.1016/j.jsp.2016.11.002). Data needed to reproduce analyses in this article are available as supplemental materials (online only) in the Appendix of this article. © 2016 Society for the Study of School Psychology. Published by Elsevier Ltd. All rights reserved.

Over the past decade, school psychologists have been serving a more diverse student population than ever before due to demographic shifts in the United States, as well as the expansion of school psychology internationally (Jimerson, Stewart, Skokut, Cardenas, & Malone, 2009), and this trend is likely to continue (Ortiz, Flanagan, & Dynda, 2008). At the same time, evidence-based interventions (EBIs), often implemented within multi-tiered models, have become a central focus of school psychology practice (NASP, 2010a), training (NASP, 2010b), and research (e.g., Glover & DiPerna, 2007). In a recent publication, the APA Division 16 Working Group on Translating Science to Practice concluded that, although there are many EBIs in school psychology, little is known about the effectiveness of most EBIs among minority populations in diverse contexts, and called for increased research on the use of EBIs across diverse groups (Forman et al., 2013). Although it is critical for researchers to examine the extent to which EBIs, and adaptations thereof, are effective across racial, ethnic, and other groups (Forman et al., 2013), there is a necessary intermediary step related to assessment.

☆ Data needed to reproduce analyses in this article are available as supplemental materials (online only) in the Appendix of this article. ⁎ Corresponding author. E-mail address: [email protected] (L.L. Pendergast). Action Editor: Michael Toland

http://dx.doi.org/10.1016/j.jsp.2016.11.002 0022-4405/© 2016 Society for the Study of School Psychology. Published by Elsevier Ltd. All rights reserved.


The success of EBI programs is dependent upon the availability and use of evidence-based assessments (EBAs; e.g., Mash & Hunsley, 2005; Youngstrom, Findling, Kogos-Youngstrom, & Calabrese, 2005). Assessment procedures used within EBIs must produce reliable scores that allow for valid inferences (Kratochwill & Stoiber, 2002). Moreover, according to the Standards for Educational and Psychological Testing, test users should select measures that are fair for all test takers. “A test that is fair within the meaning of the Standards reflects the same construct(s) for all test takers, and scores from it have the same meaning for all individuals in the intended population.” (AERA, APA, NCME, 2014; p. 50). Given that the outcomes of EBIs must be evaluated using scores from EBAs that are culturally sensitive, fair, and valid, it logically follows that the examination of measurement equivalence (also known as measurement invariance; ME/I) of EBA tools across diverse groups is one necessary pre-requisite for examining the effectiveness of EBIs in multicultural contexts. Although the presence of ME/I by no means guarantees that test scores are culturally sensitive or fair, ME/I does provide one essential piece of evidence in support of comparable (structural) validity across groups. Multi-group confirmatory factor analysis (MG-CFA) is a useful tool for examining various levels of invariance (Chen, 2008; Gregorich, 2006). MG-CFA can be examined in many popular statistical packages (e.g., Mplus, R, EQS) and provides researchers and practitioners with important information about the comparability and usefulness of scores across cultural groups. The purpose of this article is to (a) provide a primer on best practices in conducting MG-CFA1 with ordinal, Likert-type, item response data (as most behavioral and social-emotional rating scales used in school psychology EBIs yield ordinal data), (b) illustrate the use of MG-CFA using ordinal data from the Social and Academic Behavior Risk Screener (SABRS; Kilgus, Chafouleas, & Riley-Tillman, 2013), and (c) emphasize the importance of ensuring that assessment tools produce valid scores across cultural groups. This article is a non-technical, conceptual and procedural overview for novices. Those interested in more technical and mathematically-oriented reviews are referred to other excellent resources (e.g., Kim & Yoon, 2011; Millsap, 2012; Millsap & Yun-Tein, 2004; Vandenberg & Lance, 2000). In this article, the explanations are non-mathematical with the intent of facilitating comprehension at an introductory level for non-statisticians. To our knowledge, this is the first manuscript to focus on ME/I or MG-CFA in the context of measures used for school-based, multi-tiered intervention models. Given that these intervention models are a burgeoning area of research in school psychology, discussion about evaluating the equivalence of measures used within these interventions may be overdue. Moreover, very few authors have examined the issue of categorical MG-CFA with ordinal (item response) data in a non-technical fashion that is accessible to non-statisticians and provided annotated syntax. 
This manuscript is divided into five sections: (a) a review of ME/I and the historical context of the technique, (b) an overview that introduces key terms and concepts, (c) discussion of technical considerations when conducting MG-CFA analyses (e.g., necessary sample size, estimation techniques, etc.), (d) an illustration of the technique using a school-based measure that produces ordinal, Likert-type, item response data, and (e) a discussion of implications, complications, and advanced applications. 1. Section one: what is measurement equivalence/invariance? 1.1. Role of measurement In psychology and education, researchers and practitioners regularly attempt to assess constructs that are believed to exist but cannot be observed directly (e.g., intelligence, mathematics achievement, depression, happiness). Because these constructs cannot be directly observed, the degree to which respondents possess an unobservable trait is typically inferred based on their responses to items (often responses to items on scales with Likert-type formats) that are believed to reflect the underlying construct. Measurement equivalence/invariance (ME/I) has been said to be achieved when the relationships between responses to items (indicators) and latent constructs are the same across groups (Drasgow & Kanfer, 1985). ME/I has often been discussed in the context of broader debates regarding test bias and/or item bias. 1.2. Measurement equivalence/invariance and bias 1.2.1. Historical context for testing concerns Historically, "test bias" has been among the most controversial and commonly misunderstood concepts in assessment and psychometrics. Over the past century, researchers, practitioners, and laypeople have raised concerns about "bias" in psychological tests across cultural groups. Many of these concerns are understandable given historical misuse of psychological test data. As one example, some published studies have compared racial groups based on "intelligence" using scores from tests given in different nations without reasonably sufficient evidence of validity among members of the groups with which the tests were used (e.g., Porteus, 1937). At best, such research reflects poor science (inadequate design and insufficiently supported conclusions; see Linstrum, 2016); some have argued that it embodies scientific racism (e.g., Fairchild, 1991; Richards, 1997). 1.2.2. First generation of bias research Given the troubled history of race and research in many fields (Dennis, 1995; Poortinga, 1995), it is not surprising that racial differences in mean scores on psychological and educational assessments (e.g., IQ tests, college entrance exams, such as the SATs, etc.) are often greeted with skepticism. In fact, the study of measurement invariance and bias in testing originated in response to concerns and skepticism regarding race and gender differences in mean scores on IQ tests and college entrance exams.

1 This primer focuses on the use of categorical MG-CFA with ordinal data. When the term MG-CFA is used, the reader can assume that the authors are referring to categorical MG-CFA with ordinal data except where otherwise noted.


Zumbo (2007) identified three "generations" of the study of measurement bias (later referred to as differential item functioning, DIF). The first generation of bias/DIF research focused on the study of mean differences in test scores by race and gender. Studies from the first generation of bias/DIF research were seminal and paved the way for later studies; however, it is now widely accepted that although mean differences may be useful "red flags" that require further exploration, "bias" cannot be determined solely on the basis of mean differences. If mean differences alone revealed bias, then a typical bathroom scale could be deemed a biased instrument on the basis of gender – a functional bathroom scale would, on average, yield higher scores (i.e., estimates of weight) for men than for women. Of course, these mean differences would be the result of true gender differences in weight rather than "bias." At times, bias is evident, and mean differences in test scores are the result of statistical differences in measurement properties that are unrelated to the intended construct, which has implications for whether and how a particular measure should be used. In other instances, mean differences in scores are reflective of true differences on a particular construct, which has important implications for policy and practice (e.g., differences in academic achievement scores might reflect underlying inequities in school resources). Thus, while it is important to evaluate mean differences in test scores, decisions regarding bias require further investigation. 1.2.3. Second and third generations of bias research In the second generation of bias/DIF research, researchers began to move beyond the examination of mean differences alone (Zumbo, 2007). As a result, bias is now defined as a statistical phenomenon, which is separate from test fairness, and the examination of bias shifted from a sole focus on the test level to the item level (i.e., DIF; Zumbo, 2007). In this framing, bias is a nuisance factor whereby test scores are systematically influenced by a factor that is unrelated to the intended construct. Bias in construct validity has been defined as "the extent to which a test measures different constructs for different groups" (Reynolds & Ramsay, 2003, p. 81) and refers to systematic inaccuracy in measurement (Millsap & Everson, 1993). Bias has been said to be present if a measure systematically overestimates or underestimates scores on a variable for members of a particular group – accounting for overall levels on a particular trait (Reynolds & Ramsay, 2003). Bias can manifest itself in many ways. Scores can be biased as a result of differences in the testing situation, content validity, predictive validity, or construct validity, and bias can manifest at the item level (DIF) or at the scale level (differential test functioning, DTF; AERA, APA, & NCME, 2014; Zumbo, 2007). ME/I is one framework that can be used, alongside others, to evaluate and understand whether and how scores from a psychological measure, or items therein, may be "biased" or contain DIF/DTF across groups. Notably, the presence of ME/I alone is not an assurance that scores are bias-free. Other forms of bias, such as differences in predictive validity of scores, may be present even when ME/I is supported. Likewise, non-invariance across groups does not necessarily suggest that a scale cannot be used with members of a particular group.
Whether a scale should be used with members of a particular group in the presence of non-invariance depends on the research question or purpose of testing. Borsboom (2006) discussed the impact of non-invariance in three contexts: (a) understanding within-group relationships between variables, (b) comparing means across groups, and (c) using test scores for selection. As long as there is evidence to support the validity of scores among members of each group, non-invariance is unlikely to matter if the sole objective of the research is to understand within group relationships between variables. However, Borsboom (2006) and many others (e.g., Vandenberg & Lance, 2000) have clearly demonstrated that invariance is essential for accurate comparison of means across groups. Likewise, when test scores are used for selection of individuals, such as rating scales used to select students for placement within multi-tiered prevention/intervention models, measurement invariance is an important prerequisite. Borsboom (2006) contends that “unless measurement invariance holds, fairness and equity cannot exist in principle. Thus, when the purpose of test use is selection of individuals, measurement invariance is a necessary condition for fair selection procedures.” (p. 179). Zumbo (2007) argued that mere detection of DIF is insufficient. According to Zumbo, the “third generation” of DIF research should involve intensive study of why DIF occurs. Systematic evaluation of ME/I may be an important prerequisite step in this process as it can help researchers to better understand potential differences in how a scale measures a construct across groups and can illuminate the need for adaptations or special considerations in cross-cultural measurement. Beyond the issue of “bias,” examination of measurement invariance is an important aspect of the scale development process. Vandenberg and Lance (2000) noted that violations of ME/I threaten the reliability of observed scores and the validity of interpretations. Establishing ME/I involves a series of steps to ensure that both constructs and measures operate similarly among individuals of different races, genders, cultures, and other diversity variables (language, age, disability, etc.) that are relevant in a particular context. Although many tests used for evaluations for special education services in schools (e.g., intelligence tests) have evidence of ME/I, such evidence is not reported as frequently for measures used in the context of EBIs (e.g., universal screening measures). Measures used in the context of EBIs yield important information and are used for tier placement and allocation of intervention services. Thus, it is important to ascertain that such measures function equitably across groups, and examination of ME/I constitutes one step in this process. Notably, universal screening measures and other tools used in EBIs tend to yield ordinal data at the item response level. Special considerations are needed when examining ME/I with ordinal data, such as data from rating scales using a Likert-type format. 2. Section two: a conceptual overview of ME/I 2.1. What techniques are available to examine ME/I with ordinal data? Rating scales with Likert-type formats are commonly used to assess the outcomes of social, emotional, and behavioral interventions in school settings, and these rating scales usually produce ordinal data. Ordinal variables consist of three or more


mutually exclusive levels, which are presumed to be rank-ordered (Kline, 2010). Though commonly treated as interval variables within psychological and educational research, scales with Likert-type item formats typically lack the necessary characteristics of interval variables, including (a) equal spacing between variable levels, and (b) the potential to be normally distributed. Research suggests that treating Likert-type items as interval is reasonable when a variable possesses many (e.g., 5 to 15) levels, as the resulting data may then have the potential to approximate a normal distribution (Kline, 2010). Yet, most psychological rating scales do not possess such a wide range of item levels, suggesting that they ought to be treated as ordinal (Knapp, 1990). ME/I is typically examined using one of five techniques (see Teresi, 2006): logistic regression (see Abedlaziz, Ismail, & Hussin, 2011), Mantel-Haenszel (see Dorans & Holland, 1992), item response theory (IRT; see Edelen & Reeve, 2007 for a review), multiple indicator multiple cause (MIMIC) modeling (see Woods, 2009, for a review, and Pendergast et al., 2014, for an example), and MG-CFA of mean and covariance structures (Byrne, Shavelson, & Muthén, 1989). Each approach has strengths and weaknesses (for comparisons, see Kim & Yoon, 2011, and Woods, 2009). MG-CFA is a widely used approach, and some have argued that it has become the most commonly used technique for examining ME/I (Chen, 2008; Koh & Zumbo, 2008). MG-CFA is a particularly appealing technique for school psychology researchers because (a) it is an extension of factor analysis (a statistical technique that many school psychology researchers are already familiar with), (b) it can be conducted in many popular and widely available statistical programs (e.g., Mplus, R, EQS, etc.), and (c) it can be conducted with ordinal data. Moreover, the MG-CFA approach has been shown to overcome limitations of other approaches. First, the logistic regression approach to DIF is useful and flexible but relies on the total score as an estimate of the underlying level of the latent trait, whereas other approaches, such as MG-CFA and IRT, provide more sophisticated estimates of the underlying latent trait (Millsap & Everson, 1993). Second, MG-CFA allows for detection of two types of DIF, uniform (constant across ability levels) DIF and non-uniform (varying across ability levels) DIF (see Woods, 2009, for a more detailed explanation of uniform and non-uniform DIF), whereas the MH approach and MIMIC modeling are only sensitive to uniform DIF (Rogers & Swaminathan, 1993; Woods, 2009). Finally, in the context of ordinal/categorical item data, MG-CFA has been shown to be more sensitive than IRT in identifying items with DIF with a lower rate of false positives (Kim & Yoon, 2011). An overview of MG-CFA as a method for evaluating ME/I is provided hereinafter. 2.2. Key concepts in single-group confirmatory factor analysis MG-CFA is an extension of confirmatory factor analysis (CFA). CFA is a technique that is commonly used to examine the structural validity of test scores. To grasp MG-CFA, it is first necessary to understand CFA as applied to single groups. There are many resources that review the concepts and practices of CFA (e.g., Byrne, 2011; Kline, 2010). This article provides a brief review of the concepts of CFA that are particularly relevant to multi-group extensions with ordinal data. 
However, there are many important topics related to CFA that are discussed only briefly here (e.g., necessary sample size, assumptions), and readers are encouraged to review the cited sources to learn more about those topics. When psychologists and educators attempt to measure constructs that cannot be directly observed (e.g., hyperactivity, school climate), they commonly do so using rating scales with Likert-type item formats which produce ordinal data. These unobserved constructs (hyperactivity, school climate) are referred to as latent variables or factors. Because latent variables cannot be observed directly, inferences are made about a person's level or standing on a latent variable based on various indicators, such as responses to items on a questionnaire. Suppose that a researcher wanted to measure depression. Depression cannot be seen, counted, or measured with a ruler. Depression is a latent construct that is often measured using self-report rating scales. Inferences are made about a person's level of depression based on their responses to items that are believed, based on theory, to reflect depression. Overall, students who are more depressed have a higher probability of agreeing (or strongly agreeing) with items that reflect depression (e.g., I feel sad) than those who are less depressed. In the context of CFA, it is believed that differences in item responses are attributable only to (a) differences on the factor, or (b) unique variance (specific reliable variance and error; Kline, 2010). Ideally, differences between responses to items should be explained or accounted for primarily by differences on the latent depression factor. Essentially, factor analysis is used to evaluate the extent to which patterns of inter-relationships between indicators and latent constructs are consistent with expectations based on theory (Fabrigar, Wegener, MacCallum, & Strahan, 1999; Ferguson & Cox, 1993; Keith, Caemmerer, & Reynolds, 2016). If item responses from a particular measure are related to latent constructs in a way that conforms to theory, then validity evidence in support of the internal structure is provided by the data (AERA, APA, NCME, 2014). In other words, if findings indicate that a factor structure in a given data set corresponds to expectations based on a corresponding theory, then those findings serve as evidence (ideally one of many pieces of evidence) suggesting that the scale is measuring the intended construct. There are two forms of factor analysis: exploratory factor analysis (EFA), which is generally used in the earlier phases of scale development and score validation (see Henson & Roberts, 2006, for a review) and CFA, which is typically used in the later phases of scale development (Brown & Moore, 2012). ME/I can be examined using EFA, albeit to a very limited extent. Readers who are interested in examining ME/I via EFA are referred to reviews on that topic (Dolan, Oort, Stoel, & Wicherts, 2009; Lorenzo-Seva & ten Berge, 2006). CFA allows for a more comprehensive examination of ME/I relative to EFA, as it allows for systematic examination of several components of the factor structure (factor loadings, intercepts/thresholds, and residuals) which are described hereinafter. In CFA, it is especially important to have a strong theoretical framework because the researcher must specify the model a priori (Keith et al., 2016; Kline, 2010). When using CFA, the researcher must explicitly indicate the number of factors there are and


which items should load on what factor. In this context, factor loadings (which can be standardized or unstandardized) reflect the relationship between items (indicators) and factors. Other types of parameters in CFA include residuals/unique variance (variance not explained by the factor) and factor variance (an estimation of variability in the factor). In CFA, researchers specify ahead of time whether each parameter should be fixed (e.g., set to 0 or 1) or freely estimated (Brown & Moore, 2012). 2.3. Key concepts in MG-CFA MG-CFA is an extension of CFA that can facilitate the examination of ME/I. Simply put, MG-CFA allows researchers to examine the extent to which one important source of validity evidence, internal structure, is equivalent across groups at different levels. The objective of MG-CFA is not to answer the question, Is this measure invariant/equivalent across groups? Instead, MG-CFA answers a more nuanced question: To what degree is this measure invariant/equivalent across groups? (Vandenberg & Lance, 2000). Invariance testing using MG-CFA involves examining the fit of a series of increasingly restrictive models (Dimitrov, 2010; Gregorich, 2006; Vandenberg & Lance, 2000; van de Schoot, Lugtig, & Hox, 2012). The researcher begins by examining model fit among members of each group separately. A baseline model is established, and model fit is evaluated for the baseline model. If the baseline model fits well, then the researcher can proceed with ME/I testing by using increasingly restrictive equality constraints and evaluating model fit with the constraints. If model fit does not meaningfully change when equality constraints are included, then the measure is said to be equivalent/invariant. Notably, MG-CFA is typically used to evaluate measurement invariance across two groups. However, it can be used with more than two groups and the process is similar but may also involve the researcher selecting one group to be a “referent” group for other groups to be compared against. In analyses that involve many groups (perhaps more than five), MG-CFA may become overly complex and other approaches, such as MIMIC modeling and the newly developed alignment method (Asparouhov & Muthén, 2014) may be preferable. Because this article is intended to be a primer, the focus is on MG-CFA with two groups. MG-CFA with two groups involves a multi-step process whereby researchers can test changes in model fit to examine multiple levels of ME/I: configural, metric, scalar/threshold, and residual. These steps are described hereinafter. 2.3.1. Preliminary steps Important preliminary steps include ensuring that each group has a sufficient sample size (see section on necessary sample size), that the data meet the assumptions of factor analysis (see sections on key concepts in CFA and MG-CFA), and that the model has at least minimally adequate fit (see section on model fit in single-group CFA) for each group. If the aforementioned conditions are met, then invariance testing via MG-CFA may be a viable option. 2.3.2. Step 1: configural invariance The first step of ME/I testing is examining configural invariance – that the same indicators are measuring the same factors across groups (Davidov, Datler, Schmidt, & Schwartz, 2011). In other words, the configural invariance model specifies that the same items load on the same factor(s) in both groups (Bowen & Masa, 2015; Dimitrov, 2010). 
No other constraints are placed upon the model for ME/I testing purposes (aside from those needed in CFA or MG-CFA for identification purposes; see the section on identification for more information). (Notably, if the baseline models were supported, as described in the preliminary steps, it is unlikely that testing the configural model would result in a change in interpretation.) If the configural model is a "good fit" to the data, in other words, if the items are associated with the same latent factor(s) as expected based on theory (see Section 3 for review of criteria for good fit), then configural invariance is said to be established (Dimitrov, 2010). Lack of configural invariance occurs when the overall nature of the construct differs across cultural groups (Chen, 2008). For example, Chen (2007) noted that the construct of individuation (defined as "people's willingness to engage in behaviors publically that differentiate themselves from others"; Maslach, Stapp, & Santee, 1985, p. 732) differs in the United States and China, whereby individuation is a unidimensional construct in the United States, yet, in China, it is best represented by two factors (Chen, 2007; Kwan, Bond, Boucher, Maslach, & Gan, 2002). Thus, a measure of individuation would probably not have configural invariance in the United States and China. However, if configural invariance did hold, then invariance testing could proceed. If configural invariance is not supported, separate, group-specific measures may be needed. However, if configural non-invariance can be attributed to a small number of parameters (e.g., one item), then partial non-invariance (see Section five: extensions and advanced applications of MG-CFA section) can also be considered. 2.3.3. Step 2: metric invariance The next level of invariance is metric invariance (also referred to as weak invariance; Dimitrov, 2010; Meredith, 1993; Meredith & Teresi, 2006). If a scale has metric invariance across groups, then the magnitudes of the relationships between items and the latent factor(s) are equivalent across groups. If the relationships between items and the factor(s) differ significantly across groups, then the measure is considered to be metrically non-invariant. Metric non-invariance can occur when a construct is similar across cultural groups, but some items reflect the latent construct better for members of one culture than another. As an example, Chen (2008) described the use of scales intended to measure depression that include items such as "I feel blue." Most individuals in the United States understand "blue" to be synonymous with "sad," and this item would probably have a high loading on a depression factor in the United States. However, the term "blue," as used to describe mood, does not have a direct translation in Chinese languages. Thus, the item "I feel blue" would probably be frequently misunderstood, which would influence the factor loading and may lead to metric non-invariance (Chen, 2008).
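To make the preliminary steps and Step 1 concrete, the sketch below shows how the baseline and configural models might be fit. The article's own illustration and supplementary syntax use Mplus; this sketch instead uses R's lavaan package (R is one of the programs mentioned above), with a hypothetical one-factor scale, hypothetical item names i1–i6, and a grouping variable named group – none of which come from the article.

```r
# Sketch only: assumes a data frame 'dat' containing ordinal items i1-i6
# and a two-level grouping variable 'group'; all names are hypothetical.
library(lavaan)

model <- 'F1 =~ i1 + i2 + i3 + i4 + i5 + i6'   # one factor, six ordinal indicators
items <- paste0("i", 1:6)

# Preliminary step: evaluate the baseline model separately within each group
fit.g1 <- cfa(model, data = subset(dat, group == 1), ordered = items, estimator = "WLSMV")
fit.g2 <- cfa(model, data = subset(dat, group == 2), ordered = items, estimator = "WLSMV")

# Step 1 (configural): same items load on the same factor in both groups,
# with no cross-group equality constraints beyond identification defaults
fit.config <- cfa(model, data = dat, group = "group", ordered = items, estimator = "WLSMV")
fitMeasures(fit.config, c("chisq.scaled", "cfi.scaled", "rmsea.scaled"))
```

If the baseline and configural models fit acceptably, the same fitted objects can be carried forward to the increasingly constrained models described in the following steps.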


To examine metric invariance, a researcher should examine the fit of a model that is similar to the configural model but that also constrains corresponding factor loadings to be equal across groups. If the factor loadings are comparable (equivalent) across groups, then the fit of the metric invariance model (with factor loadings constrained to be equal across groups) will not be meaningfully different from that of the configural model. If model fit does not change meaningfully with factor loadings constrained to be equal across groups, then metric invariance is supported, and the researcher should proceed to test scalar/threshold invariance. However, if model fit worsens as a result of the constraint for equal factor loadings across groups, then the researcher should (a) cease invariance testing and treat the measure as non-invariant, or (b) consider examining partial invariance (Byrne et al., 1989; see the advanced applications section of this manuscript for more information on partial invariance). 2.3.4. Step 3: scalar/threshold invariance Scalar invariance (also referred to as strong invariance; Dimitrov, 2010) is tested after metric invariance is established, and the fit of the scalar invariance model is compared to that of the metric invariance model (Vandenberg & Lance, 2000). Below, we begin by discussing scalar invariance with continuous data to facilitate conceptual understanding and then proceed to provide an overview of scalar/threshold invariance, which is examined with categorical/ordinal data. 2.3.4.1. Scalar invariance with continuous data. For continuous measures, scalar invariance across groups occurs when metric invariance is supported and intercepts are equivalent across groups. In the context of factor analysis, the intercept refers to the point of origin, and a continuous measure (such as an IQ test) is said to meet criteria for scalar invariance if the intercepts are equivalent across groups (for more information regarding the role of intercepts in factor analysis, see Meredith, 1993; Steinmetz, 2013). Scalar invariance is most commonly discussed in the context of continuous measures. For example, suppose that children from one particular ethnic group frequently misunderstand a word used in the instructions of some subtests on an IQ test because that word is not commonly used among individuals from their cultural background. Then, suppose that misunderstanding of this word led the test to significantly underestimate the IQ of these children. An MG-CFA at the subscale level would reveal scalar non-invariance (see Wicherts & Dolan, 2010, for a review and similar example). With continuous data, a researcher can test scalar invariance by constraining both the factor loadings and intercepts of a model to be equivalent across groups. If the fit of the model with both loadings and intercepts constrained to be equal does not meaningfully differ from the fit of the metric model, then scalar invariance is supported (Vandenberg & Lance, 2000). 2.3.4.2. Scalar/threshold invariance with ordinal data. A depiction of a threshold invariance model is provided in Fig. 1. Examining scalar invariance with ordinal data is more complex than doing so with continuous data, partly because invariance of item thresholds must also be examined (Millsap & Yun-Tein, 2004). As described previously, an item intercept refers to the point of origin. In contrast, from a mathematical perspective, item thresholds are similar to difficulty or location parameters in the IRT context (for a more detailed and technical review of item thresholds, see Kim & Yoon, 2011; Millsap & Yun-Tein, 2004; Xing & Hall, 2015).
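As a sketch in common notation (not drawn verbatim from the article), the categorical CFA model can be written by assuming that each ordinal response $x_j$ arises from an underlying continuous latent response $x_j^*$ that is cut at the item's thresholds:

$$
x_j^* = \lambda_j \eta + \varepsilon_j, \qquad
x_j = c \quad \text{if} \quad \tau_{j,c-1} < x_j^* \le \tau_{j,c}, \quad c = 1, \dots, C,
$$

with $\tau_{j,0} = -\infty$ and $\tau_{j,C} = +\infty$. For a four-category item such as those depicted in Fig. 1, this yields three estimable thresholds ($\tau_{j,1}$, $\tau_{j,2}$, $\tau_{j,3}$), and threshold invariance holds when the corresponding $\tau$ parameters are equal across groups.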
Conceptually, if an item has threshold invariance across groups, then members of both groups have an equal probability of shifting between response categories (e.g., Agree to Strongly Agree) given comparable scores on the latent factor (Kim & Yoon, 2011). As noted by Chen (2008), many factors can influence scalar/threshold invariance – leading individuals with comparable levels of the underlying trait to respond differently to specific items. These factors include (a) social desirability (e.g., certain items may contain content that leads members of one group, but not another, to systematically respond differently – despite comparable levels on the underlying construct – to give the appearance of adherence to social norms), (b) members of a particular group having a desire to overcome perceived group-related deficits (e.g., stereotypes), which influences responses, and (c) groups using

Fig. 1. This figure depicts factor loadings and thresholds across two groups as they might be modeled for a scale that had one factor with three items each of which had four response categories (Strongly Disagree, Disagree, Agree, Strongly Agree). For each respective item, λ represents the factor loadings. τ represents the thresholds with τ1 representing the likelihood of a response shifting between Strongly Disagree and Disagree, τ2 representing likelihood of a response shifting between Disagree and Agree and τ3 representing likelihood of a response shifting between Agree and Strongly Agree. Thresholds in bold are constrained to be equal across groups for identification purposes.


different reference points by which to evaluate their own thoughts and behaviors (e.g., comparing themselves to a meaningfully different reference group). For example, Pendergast et al. (2014) examined the invariance of a measure of post-partum depression among women from research sites in eight different countries. Women from the Indian research site were six times more likely to endorse an item intended to reflect suicidal ideation than women with equivalent levels of overall depressive symptoms (i.e., scores on the latent factor) from the other seven sites. Thus, the scale had scalar non-invariance – women from India systematically over-endorsed the suicidal ideation item relative to their overall level of depressive symptoms. Subsequently, the researchers followed up with women from the sample and consulted experts familiar with this cultural context to better understand how the participants were interpreting the item. Findings revealed that references to death or suicide (e.g., “I just want to die.”) were a part of the common vernacular among women at this particular site and were not expressions of true suicidal ideation. This is an example of scalar/threshold non-invariance. Notably, software packages offer different approaches for the examination of scalar/threshold invariance with ordinal data (see Davidov et al., 2011, for a review of different approaches available in different programs). One commonly used approach, available in Mplus, involves testing a model that is similar to the metric invariance model (factor loadings set to be equal across groups) but also sets intercepts at zero and constrains thresholds to be equal across groups. Notably, it is not possible to test for invariance of intercepts and thresholds simultaneously. However, if thresholds are invariant, invariance of intercepts can be assumed and scalar/threshold invariance is supported (Davidov et al., 2011; Millsap & Yun-Tein, 2004). If the fit of the scalar invariance model (described above) is not meaningfully different from the fit of the metric model (see the section entitled Evaluating Change in Model Fit in MG-CFA with Ordinal Data for a detailed overview of how to evaluate change in model fit), based on changes in fit indices, then scalar invariance is supported (Vandenberg & Lance, 2000). If scalar invariance is not supported, the scale should be treated as non-invariant, and/or partial invariance testing can be considered (see Byrne et al., 1989, for a review). Alternatively, if scalar/threshold invariance is supported, then some have argued that latent mean comparisons can be conducted with confidence (e.g., Vandenberg & Lance, 2000), and the researcher can proceed to test residual invariance. Notably, in the context of ordinal data, residual invariance can only be examined using a theta parameterization (see our section on Model Parameterization). 2.3.5. Step 4: residual invariance (may be optional) In the context of single-group CFA, residuals reflect unique variance (both error and other forms of unique variance). When testing for residual (strict) ME/I, the researcher constrains the residuals to be equal across groups. If model fit does not meaningfully change when the residuals are constrained to be equal, then residual ME/I is supported. The topic of residual invariance in the context of MG-CFA is complex, and there is disagreement in the literature about whether or not residual invariance is a necessary prerequisite condition for comparing latent means across groups. 
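Before returning to that question, it may help to see Steps 2 and 3 sketched in code. Continuing the hypothetical lavaan example begun above (the article's own syntax is in Mplus, and, as noted, programs differ in exactly which identification constraints they impose on thresholds), the metric and scalar/threshold models and the nested comparisons might look like this:

```r
# Step 2 (metric): corresponding factor loadings constrained equal across groups.
fit.metric <- cfa(model, data = dat, group = "group", ordered = items,
                  estimator = "WLSMV", group.equal = c("loadings"))

# Step 3 (scalar/threshold): loadings and item thresholds constrained equal;
# lavaan applies its own default identification constraints for the remainder.
fit.thresh <- cfa(model, data = dat, group = "group", ordered = items,
                  estimator = "WLSMV", group.equal = c("loadings", "thresholds"))

# Scaled chi-square difference tests between nested models ...
lavTestLRT(fit.config, fit.metric, fit.thresh)

# ... and changes in CFI and RMSEA across models (a drop in CFI of more than
# about 0.01 is a commonly cited flag for non-invariance; see Section 3.4).
sapply(list(configural = fit.config, metric = fit.metric, threshold = fit.thresh),
       fitMeasures, fit.measures = c("cfi.scaled", "rmsea.scaled"))
```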
Some have argued that residual invariance is an essential prerequisite for latent mean comparisons (Meredith, 1993). On the other hand, others have claimed that requiring residual invariance prior to latent mean comparisons is unrealistic and that lack of residual invariance may not meaningfully influence latent mean comparisons (Brown, 2006; Chen, 2008; Xing & Hall, 2015). Residual invariance is not discussed in-depth or illustrated here because research on the examination of residual invariance in the context of ordinal data is limited and the question of whether residual invariance is a necessary precondition for comparing latent means remains unresolved. However, interested readers are encouraged to review other resources (e.g., Chen, 2007; Meredith, 1993) for more nuanced discussions of the topic. 3. Section three: technical considerations for CFA and MG-CFA with ordinal data 3.1. Preliminary analyses in CFA with ordinal data 3.1.1. Necessary sample size Prior to beginning analyses, researchers should ensure that they have enough participants to accurately evaluate ME/I. MG-CFA is based on CFA, which is a large-sample technique that generally requires a minimum of 200 participants (200 per group for MG-CFA). That said, a variety of factors should be considered in decisions about sample size, including reliability of indicators, scaling, and strength of factor loadings. For more information about minimum sample sizes needed for CFA, see MacCallum, Widaman, Zhang, and Hong (1999), and Marsh, Balla, and McDonald (1988). In the context of MG-CFA, samples of fewer than 300 participants (150 per group) have been shown to be more prone to error and yield change in model fit indices that are less accurate (Chen, 2007). Notably, MG-CFA techniques have been shown to be fairly robust and to retain power in the context of differing sample sizes across groups as long as each group has a minimum of 200 participants (González-Romá, Hernández, & Gómez-Benito, 2006). However, because a variety of characteristics influence the minimum sample size needed, researchers may consider conducting a simulation study that mimics the sample size, test length, number of response categories, etc. within their sample to determine the specific sample size needed to show invariance for a particular model. 3.1.2. Polychoric correlation and covariance In factor analyses of continuous data, relationships between variables are generally derived based on a matrix of product-moment correlations or covariances. However, product-moment correlations and covariances (and factor analyses that are based on them) do not accurately reflect variance in ordinal data, and relationships between variables (e.g., correlation coefficients) are likely to be artificially low when product-moment correlations or covariances are used. In turn, factor loadings may be attenuated


(Flora, LaBrish, & Chalmers, 2012). Polychoric correlation or covariance matrices (or tetrachoric matrices with binary data) are ones in which the correlations or covariances within the matrix have been calculated and standardized in a way that better reflects relationships between ordinal variables. Factor analyses with ordinal data that are based on polychoric matrices (via estimators such as WLSMV) are likely to yield more accurate results than those based on raw scores (see Forero, Maydeu-Olivares, & Gallardo-Pujol, 2009, for a more detailed review). Notably, there are other methods (aside from polychoric correlations) which can account for ordinal data, and these methods are described in the model estimation section of this article. Because factor analyses are based on correlation and covariance matrices, the primary objective in data screening and assumption testing is ensuring the adequacy of those matrices (Flora et al., 2012). In factor analyses on continuous data, linearity (i.e., linear relationships between variables) is a critical assumption. However, the assumption of linearity is violated when factor analysis is applied to ordinal indicators. To account for this violated assumption, it is important for researchers performing factor analysis on ordinal data to use sufficiently large samples and ensure that outlying cases are identified and accounted for (see Flora et al., 2012, for a review). 3.1.3. Number of responses per category Another preliminary consideration for MG-CFA with ordinal/categorical data involves the number of responses per category or cell (i.e., the number of individuals who endorsed each response option, such as Strongly Agree or Agree, for each item). Researchers conducting MG-CFA with ordinal data (especially in studies with small samples) will commonly encounter situations where zero respondents endorse a particular response category (e.g., Strongly Disagree) for one or more items. Categories with zero respondents are referred to in the literature as "empty cells" or "zero frequency cells." There are many studies focused on addressing sparseness or empty cells in IRT, but fewer studies have examined this issue within a CFA or structural equation modeling (SEM) framework. There are two primary schools of thought regarding how to address zero frequency cells when calculating polychoric (or tetrachoric) correlations (Savalei, 2011). One option is to do nothing, as any adjustment could potentially distort the data. A second option (which is used in many statistical programs, including Mplus) is to replace zero values with low values, such as 0.5. This approach is based on the view that empty cells could influence or impede model estimation and that uniformly low values will not meaningfully impact findings. Savalei (2011) compared the two approaches in a series of Monte Carlo analyses and determined that, under most conditions, the preferable option is to replace zero values with low values, such as 0.5. Notably, some estimation techniques have been shown to be more effective than others in the context of empty cells or sparseness in response categories (see the section on Model Estimation with Ordinal Data for more on this point). Also, sparseness has been shown to be most problematic (i.e., most likely to lead to inaccurate findings) in conditions where (a) sample sizes are small, (b) there are a small number of categories (two or three), (c) data are strongly skewed, and/or (d) there are many items with sparseness or zero frequency cells (Flora et al., 2012).
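Both of these preliminary checks – scanning for empty cells and inspecting the polychoric correlations – are easy to carry out directly. As a hedged sketch in the same hypothetical lavaan setup used above (data frame dat, items i1–i6, grouping variable group):

```r
# Sketch: tabulate response-category counts by group to spot empty cells,
# then compute the polychoric correlation matrix for the ordinal items.
library(lavaan)

lapply(dat[, items], function(x) table(dat$group, x))  # category counts per group
lavCor(dat[, items], ordered = items)                  # polychoric correlations
```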
3.1.4. Identification After verifying that the data are appropriate for factor analysis, the researcher must ensure that the model is identified. A model is said to be identified if there are sufficient numbers of indicators for each latent variable such that it is possible to test the model. A general guideline is that there must be at least two indicators (with uncorrelated errors) per latent variable (with a notable exception if the model includes only one latent variable and no other variables; Kenny, Kashy, & Bolger, 1998; Kline, 2010). Model identification, including "over identification" and "under identification," is a complex topic that is outside the scope of the present article—readers are directed to Kline (2010) for a review of identification. Notably, in scale development, an absolute minimum of three items per latent factor is strongly recommended so as to allow for triangulation of the latent construct. There are various ways to examine identification, including algebraic equations, heuristics, and information matrix techniques. In CFA, one step that a researcher must take to ensure that a model is identified is (a) fixing the loading of one item on each factor to one, or (b) setting the factor variance to one. Generally, the former method (fixing one loading per factor to one) is considered to be preferable because setting the variance to one can result in the pattern coefficients being altered differentially across groups if the factor variances differ across groups (Comşa, 2010; Jöreskog & Sörbom, 1996; Yoon & Millsap, 2007). As such, the former method (fixing one loading per factor to one) was recommended by Millsap and Yun-Tein (2004) in their seminal article, is more commonly used, and is the default in many statistical programs (including Mplus). One potential drawback of this approach is that it assumes that the loading fixed to one is invariant across groups. This may be particularly problematic when examining metric invariance (see the metric invariance section for a description) in cases where there are a small number of items per factor or in the examination of partial invariance (see Section five: extensions and advanced applications of MG-CFA section; Cheung & Rensvold, 1999). Researchers who are concerned about this have two options: (a) set the factor variance to one instead, or (b) run the metric invariance analyses twice – fixing a different item's loading on each factor to one each time. When working with ordinal/categorical data, additional steps are necessary to ensure that the model is identified when examining scalar/threshold invariance. Specifically, when a researcher examines scalar/threshold invariance with ordinal/categorical data, a very large number of freely estimated parameters are typically included in the analysis (i.e., the number of thresholds estimated for each item equals its number of response categories minus one). Because so many parameters are estimated in the model, the researcher must fix additional sets of parameters in order for the model to be identified. Conceptually, there are a wide variety of parameters that one potentially could fix.
However, it is typical for researchers to fix one threshold on each item and two thresholds for one item on each factor (Millsap & Yun-Tein, 2004). Taking these steps (in addition to the aforementioned steps, including fixing one item on each factor to one) satisfies the identification requirements.
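The two scale-setting choices just described can be requested directly in most software. In the hypothetical lavaan sketch used throughout this primer (not the article's Mplus syntax), the marker-variable and factor-variance approaches look like this; lavaan likewise applies default constraints to thresholds for ordinal indicators, which can be inspected in the parameter table:

```r
# Default ("marker") identification: the first indicator's loading is fixed to 1.
fit.marker <- cfa(model, data = dat, ordered = items, estimator = "WLSMV")

# Alternative identification: fix the factor variance to 1 and free all loadings.
fit.stdlv <- cfa(model, data = dat, ordered = items, estimator = "WLSMV",
                 std.lv = TRUE)

parameterEstimates(fit.marker)  # inspect which loadings/thresholds are fixed vs. free
```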


3.1.5. Model parameterization Related to model identification, model parameterization is another consideration in MG-CFA with ordinal data. In this context, model parameterization involves estimating (or not estimating) different sets of parameters within a model. There are two primary methods for model parameterization in categorical MG-CFA: delta and theta. In the delta parameterization, latent factors can be estimated as model parameters, but residual variances for ordinal/categorical items cannot. In contrast, in the theta parameterization, residual variances for ordinal/categorical dependent variables (items) are estimated, but factor residual variances are not included (Muthén & Muthén, 2010). The two approaches generally produce similar results (Muthén & Asparouhov, 2002). It has been argued that, in many instances, the delta parameterization is preferable (Muthén & Asparouhov, 2002). However, when a model has correlated residuals and/or a researcher intends to examine residual invariance with ordinal/categorical indicators, it is not possible to use the delta parameterization. In these instances, the theta parameterization should be used (Muthén & Asparouhov, 2002). Notably, the delta and theta parameterizations are mathematically equivalent and, although they yield different types of information, the fit statistics should be the same with both methods. 3.2. Model estimation with ordinal data Selection of an estimation technique is another important step in conducting CFA. Maximum likelihood (ML) is a commonly used estimation technique. However, ML (with the exception of full-information ML; see below) estimation often produces inaccurate results with non-normal and/or ordinal data (Bollen, 1989; Jöreskog, 2005). For ordinal data, alternative estimation techniques, such as weighted least squares mean and variance adjusted (WLSMV; Beauducel & Herzberg, 2006; Bowen & Masa, 2015; Flora & Curran, 2004) are recommended. With ordinal data, WLSMV bases estimation on a polychoric correlation matrix with adjusted means and variance estimates and, relative to ML, has been shown to produce more accurate estimates of factor loadings; yield more accurate fit indices; perform better with small sample sizes; and perform better in conditions of non-normality (see Beauducel & Herzberg, 2006, for a review and detailed analysis). That said, WLSMV is only one estimation technique available in categorical CFA. Researchers are encouraged to consider both limited-information (e.g., WLSMV) and full-information (e.g., full-information maximum likelihood, FIML) estimation techniques. Both WLSMV and FIML derive estimates based on a matrix of associations between item-level variables. Broadly speaking, limited-information techniques, such as WLSMV (which are based on standardized, polychoric matrices), incorporate less information from matrices (i.e., only the diagonal elements of the weight matrix) into analyses relative to full-information techniques (see Wirth & Edwards, 2007, for a detailed and more technical review of full- and limited-information techniques). As such, limited-information techniques, such as WLSMV, yield biased standard errors and test statistics, which must be corrected. Notably, these errors and test statistics can be easily corrected in Mplus, and research suggests that the corrections are accurate and function well in practice (Flora & Curran, 2004; Forero et al., 2009). Full-information techniques do not require corrections but are more computationally intensive. Additionally, limited-information techniques (e.g., WLSMV) are likely to outperform FIML in conditions of sparseness of responses in a particular category (Maydeu-Olivares & Joe, 2006; Rhemtulla, Brosseau-Liard, & Savalei, 2012).
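Both choices – the estimator and the parameterization – are typically single options in the fitting call. A hedged lavaan sketch (again with the hypothetical names used above; the article's example specifies the analogous options in Mplus):

```r
# Sketch: request WLSMV estimation and the theta parameterization, which is
# needed if residual invariance with categorical indicators will be examined.
fit.theta <- cfa(model, data = dat, group = "group", ordered = items,
                 estimator = "WLSMV", parameterization = "theta")
```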
3.3. Model fit in single-group CFA Evaluation of model fit is another important technical aspect of CFA. In essence, CFA examines the extent to which data correspond to (or fit) a model that the researcher specifies based on theory. Fit indices are used to evaluate the degree of model "fit," or the extent to which the patterns of relationships (or covariances) within the data correspond to the theoretically specified ones. Experts in CFA and structural equation modeling (SEM) largely agree that decisions about model fit should be made based on consideration of multiple criteria (e.g., several fit indices, model parsimony, comparison of competing models; Kline, 2010; Tanaka, 1993). Fit indices are one important tool in evaluating model fit – particularly in a multi-group context. Detailed reviews of many different fit indices can be found elsewhere (e.g., Chen, Sousa, & West, 2005; Kline, 2010). Here three statistics are reviewed: the Satorra-Bentler χ2, the comparative fit index (CFI), and the root mean square error of approximation (RMSEA). The Satorra-Bentler χ2, CFI, and RMSEA were selected because they (a) are commonly used for model comparison in MG-CFA, (b) are produced in CFA analyses with WLSMV estimation, and (c) are available in Mplus – the software program that is illustrated in this paper. Other fit indices (not discussed here), such as the Tucker-Lewis Index (TLI), can also be considered. 3.3.1. Chi-square According to Hu and Bentler (1999), chi-square (χ2) "assesses the magnitude of discrepancy between the sample and fitted covariance matrices" (Hu & Bentler, 1999, p. 2). The χ2 examines whether there is a statistically significant difference between the observed and expected covariance matrices (Hu & Bentler, 1999; Millsap & Yun-Tein, 2004). Therefore, unlike in other analyses (such as ANOVA or regression), here, failure to reject the null hypothesis serves as evidence that the researchers' hypothesis regarding the factor structure of a scale is tenable. Unfortunately, there are many problems associated with the χ2 statistic (Kline, 2010). The most salient issue with the χ2 is that it is sensitive to sample size, and CFA is a technique that requires large samples. In other words, ideally, a non-significant χ2 is evidence of good model fit. However, the χ2 is likely to be statistically significant, even in well-fitting models, solely as a result of the large samples that CFA requires (Bentler & Bonnet, 1980). Thus, over-reliance on the χ2 value can lead to inappropriate rejection of well-fitting models (see Kline, 2010, for a review). In addition, the χ2 assumes multivariate normality, a property which


data from social-emotional and behavioral measures, such as those commonly used in school psychology interventions, often lack. Alternative metrics, such as the Satorra-Bentler χ2, are available in many statistical packages. The Satorra-Bentler χ2 is a mean-adjusted test that better approximates non-normal data (Bryant & Satorra, 2012; Satorra & Bentler, 2001). The use of the Satorra-Bentler χ2 with ordinal data has been supported in the literature (Lei, 2009), but it is also sensitive to sample size (Cheung & Rensvold, 2002). As such, other metrics should be used in conjunction with the Satorra-Bentler χ2 (Tanaka, 1993). 3.3.2. Comparative fit index The comparative fit index (CFI) is a type of incremental fit index that is commonly used in CFA. Incremental fit indices provide indications of model fit by comparing the observed model to a null model (i.e., a model with no covariance among observed variables). If the hypothesized model has substantially better fit than the null model, then the CFI will be high, which supports the hypothesized factor structure (Kline, 2010). 3.3.3. Root mean square error of approximation The root mean square error of approximation (RMSEA) is a fit index that reflects the difference between the hypothesized model and the population covariance matrix (Hooper, Coughlan, & Mullen, 2008). Low RMSEA values (close to zero) indicate that the data fit the model well (Kline, 2010). 3.3.3.1. Cut-offs and fit indices. The topic of strict cut-offs in regard to fit indices in single-group CFA has been a point of debate in the structural equation modeling literature (see Barrett, 2007; Bentler, 2007; Markland, 2007). Some research findings have supported cut-offs, such as CFI ≥ 0.90 (Hu & Bentler, 1995) and RMSEA ≤ 0.08 (Browne & Cudeck, 1993), and these cut-offs are commonly cited in the literature. Hu and Bentler (1999) proposed more rigorous cut-offs (CFI > 0.95; RMSEA < 0.06), which some warn may be overly stringent (e.g., Markland, 2007), emphasizing instead the importance of comparing competing models over rigid adherence to proposed cut-offs (Marsh, Hau, & Wen, 2004). Although there is disagreement in the literature regarding the desired values for cut-off scores in SEM, it is relatively well accepted that researchers should make decisions about model fit criteria a priori (Bentler, 2007; Markland, 2007). Accordingly, readers are encouraged to review the literature on fit indices, consider the nature of their data, and make a priori decisions regarding model fit (Bentler, 2007; Hu & Bentler, 1999; Kline, 2010; Markland, 2007; Schmitt, 2011). It is important to consider fit indices in tandem rather than relying on a single index (Tanaka, 1993). This includes finding a parsimonious model that fits the data well, taking into consideration that additional fit indices may be needed when comparing competing models (Rodgers, 2010). However, judgments of model acceptability should also be supported by empirical data and careful, substantive model conceptualization (McDonald, 2010). 3.4. Evaluating change in model fit in MG-CFA with ordinal data 3.4.1. Change in chi-square Traditionally, models were said to meet criteria for ME/I (e.g., metric invariance, scalar invariance, etc.) based on the value of the change in χ2 (Δχ2) between models (e.g., comparing the metric invariance model to the configural invariance model). Remember that, in the context of CFA, the χ2 value "assesses the magnitude of discrepancy between the sample and fitted covariance matrices" (Hu & Bentler, 1999, p. 2)
3.4. Evaluating change in model fit in MG-CFA with ordinal data

3.4.1. Change in chi-square
Traditionally, models were said to meet criteria for ME/I (e.g., metric invariance, scalar invariance, etc.) based on the value of the change in χ2 (Δχ2) between models (e.g., comparing the metric invariance model to the configural invariance model). Remember that, in the context of CFA, the χ2 value “assesses the magnitude of discrepancy between the sample and fitted covariance matrices” (Hu & Bentler, 1999, p. 2) and that a minimal (non-significant) discrepancy between the sample and the fitted covariance matrices is desirable and indicative of good fit. In the context of MG-CFA, if the model fit worsens (i.e., the χ2 value increases significantly) when equality constraints across groups are applied, that serves as evidence that the scale is non-invariant. Alternatively, if no meaningful change in χ2 is evident when increasingly restrictive equality constraints are applied, then ME/I across groups is supported. In analyses of data that are non-normal (as is common for measures used to evaluate interventions in school psychology), the change in the Satorra-Bentler chi-square (ΔSB-χ2) can be used in place of the standard Δχ2 value (Asparouhov, Muthén, & Muthén, 2006; Dimitrov, 2010; Satorra & Bentler, 2001). However, additional computational procedures are needed to calculate ΔSB-χ2. For more information, readers are directed to Bryant and Satorra (2012) for a helpful review and to Watkins (2012) for freely available software.
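To make the extra computation concrete, the scaled difference test described by Satorra and Bentler (2001) and reviewed by Bryant and Satorra (2012) is generally computed along the following lines (a sketch of the general formula, not an excerpt from those sources). Here T0, df0, and c0 denote the reported scaled chi-square, degrees of freedom, and scaling correction factor of the more restrictive (nested) model, and T1, df1, and c1 the corresponding values for the less restrictive model:

\[
c_d = \frac{df_0\,c_0 - df_1\,c_1}{df_0 - df_1},
\qquad
\Delta \mathrm{SB}\text{-}\chi^2 = \frac{T_0\,c_0 - T_1\,c_1}{c_d},
\]

with the result referred to a chi-square distribution with df0 − df1 degrees of freedom. With WLSMV estimation, the DIFFTEST option in Mplus (used for the analyses reported later in this article) carries out an analogous adjusted difference test so that these quantities need not be computed by hand (Asparouhov, Muthén, & Muthén, 2006).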


3.4.2. Change in CFI and RMSEA
Although the Δχ2 and ΔSB-χ2 can be useful tools, recall from the earlier discussion of the χ2 statistic that many problems (e.g., sensitivity to sample size) are associated with the χ2 value in general. These issues extend to invariance testing (Cheung & Rensvold, 2002). As such, many scholars have recommended the use of alternative metrics. For example, Cheung and Rensvold (2002) and others (e.g., Keith, 2014) recommended that the change in CFI (ΔCFI) be used in conjunction with the Δχ2 to evaluate model fit. As a general rule, a measure can be considered invariant at a particular level if the ΔCFI ≥ −0.01 (i.e., the CFI does not decrease by more than 0.01). Chen (2007) also presented evidence supporting a cut-off of ΔCFI ≥ −0.01, but only for studies with sample sizes > 300 (150 per group) – noting that studies with smaller samples should use stricter cut-offs. She recommended that the ΔCFI be supplemented by a criterion requiring ΔRMSEA < 0.015. Notably, examining change in model fit with ordinal data presents unique challenges.

3.4.3. Change in model fit with ordinal data – overview
Findings from recent studies indicate that there are unique challenges inherent in evaluating change in model fit with ordinal data, and different cut-offs may be warranted in some circumstances. Notably, with ordinal data, the WLSMV estimation technique minimizes bias in factor loadings that is present with other estimation techniques (i.e., maximum likelihood and robust maximum likelihood; ML and MLR, respectively). Thus, WLSMV is considered to be a preferable estimation technique for single-group CFAs because it has been shown to produce relatively unbiased parameter estimates (Beauducel & Herzberg, 2006; Bowen & Masa, 2015; Flora & Curran, 2004). However, change-in-model-fit indices calculated using WLSMV may not perform well in the context of model misspecification. As such, researchers are encouraged to run and compare competing models and to consider examining their findings using multiple estimators (Sass, Schmitt, & Marsh, 2014). Additionally, a variety of cut-offs for establishing non-invariance with the WLSMV estimator have been proposed (discussed further below), and these cut-offs may vary as a function of sample size (e.g., Sass et al., 2014).

3.4.4. A cautionary note
Importantly, identifying cut-offs or other criteria for determining ME/I is complex, and there is not a clear consensus in the field about the best approach to use with either continuous or ordinal data. Presently, all available approaches have notable limitations (see Chen et al., 2005, for a review). One commonly used approach (i.e., evaluating changes in the Satorra-Bentler χ2, CFI, and RMSEA) is illustrated here, but readers are encouraged to review a variety of sources to determine the most appropriate approach for their study (e.g., Chen, 2007; Meade, Johnson, & Braddy, 2008; Sass et al., 2014). The ΔCFI and ΔRMSEA have been shown to be effective indicators of invariance and non-invariance in the context of MG-CFA with categorical indicators when models are correctly specified. However, these and other fit indices may not function as well with incorrectly specified models and ordinal/categorical data (Sass et al., 2014).
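To illustrate the suggestion in Section 3.4.3 to examine findings under more than one estimator, only the ANALYSIS command needs to change between runs in Mplus. The fragment below is a generic sketch under those assumptions; the file name and variable names are placeholders and are not drawn from this study's syntax.

  TITLE:    CFA with ordinal indicators - estimator check;
  DATA:     FILE IS ratings.dat;          ! placeholder data file
  VARIABLE: NAMES ARE y1-y11;
            CATEGORICAL ARE y1-y11;       ! rating-scale items treated as ordinal
  ANALYSIS: ESTIMATOR = WLSMV;            ! re-run with ESTIMATOR = MLR; to obtain
                                          ! robust maximum likelihood estimates
  MODEL:    f1 BY y1-y6;
            f2 BY y7-y11;

Comparing the loadings and fit obtained under the two estimators, as Sass et al. (2014) suggest, can help flag results that are sensitive to the choice of estimator.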
4. Section four: an illustrated example of categorical MG-CFA

4.1. Introduction

4.1.1. Overview of SABRS
The Social and Academic Behavior Risk Screener (SABRS; Kilgus et al., 2013) is a brief, 12-item screener of behavioral risk. The SABRS was selected for this illustrative example for two reasons. First, it yields ordinal data at the item level; thus, the analytical techniques used for examination of measurement equivalence on the SABRS will be applicable to most social, emotional, and behavioral rating scales used in school psychology (including those used within intervention frameworks). Second, the SABRS was designed for use in universal screening and tiered-intervention models. If ME/I across groups is established, that ME/I would serve as one piece of evidence supporting the use of the SABRS. Thus, practitioners using the SABRS for universal screening or evaluation of placements for evidence-based interventions within tiered intervention models could do so with more confidence.

4.1.2. SABRS development
The SABRS was founded upon models of social-behavioral (Walker, Irvin, Noell, & Singer, 1992) and academic (DiPerna, 2006) competence. It provides Social Behavior (SB) and Academic Behavior (AB) subscales, as well as a total scale. The SB subscale consists of six items pertaining to maladaptive (e.g., temper outbursts) and adaptive (e.g., cooperation) behaviors that influence a student's ability to form and maintain age-appropriate relationships. The AB subscale includes six items measuring adaptive (e.g., academic engagement) and maladaptive (e.g., difficulty working independently) behaviors related to a student's ability to participate in, be prepared for, and benefit from academic instruction.
The overall factor structure of SABRS scores (see Kilgus et al., 2013; von der Embse, Pendergast, Kilgus, & Eklund, 2015) has been supported across multiple samples using both EFA and CFA. Studies have supported the internal consistency reliability of SABRS scores across all three scales (α = 0.90–0.94), as well as the criterion-related validity of scores as predictors of various academic and behavioral outcomes (e.g., alternative behavior rating scales, office referrals, suspensions, curriculum-based measurement; Kilgus et al., 2013; Kilgus, Eklund, Nathaniel, Taylor, & Sims, 2016a; Kilgus, Sims, Nathaniel, & Taylor, 2016b). Receiver operating characteristic (ROC) curve analyses have been indicative of the diagnostic accuracy of SABRS cut scores, as measured via acceptable levels of sensitivity and specificity relative to the Social Skills Improvement System (SSIS; Gresham & Elliott, 2008), Behavioral and Emotional Screening System (BESS; Kamphaus & Reynolds, 2007), Student Risk Screening Scale (SRSS; Drummond, 1994), and Student Internalizing Behavior Screener (SIBS; Cook et al., 2011). Ongoing research has evaluated use of the SABRS in multiple gating assessment models (Kilgus et al., 2016a, 2016b).

4.2. Method

4.2.1. Participants
Findings presented here were based on teacher ratings of 864 students from two schools (one elementary and one middle school) in the southeastern United States. Female students comprised 51.7% of the sample. In regard to race/ethnicity, 47.7% of the participants were White, 37.4% were Black, 10.9% were Latino, and 3.7% were Multiracial. Grade level ranged from 1st to 9th, with 23% of students enrolled in 1st or 2nd grade, 23% in 3rd or 4th grade, 20% in 5th or 6th grade, and 34% in 7th–9th grade.


4.2.2. Procedure
All study procedures were approved by the university institutional review board (IRB). Teachers who agreed to participate were given explicit instructions on how to complete the SABRS through an online assessment platform and completed all student ratings via the Qualtrics system. A passive consent procedure was used whereby parents had the option to complete a form and “opt out” if they preferred that their child not participate. Teachers were provided with opt-out forms to be distributed to all students (and their parents). Students for whom no signed opt-out form was received were enrolled in the study. Of the 871 students approached to participate, no opt-out form was received for 864 students (99.20% response rate).

4.2.3. Data analysis
CFAs were conducted in Mplus 6.2 using WLSMV estimation (Beauducel & Herzberg, 2006). The command (syntax) files needed to reproduce the results using Mplus (Muthén & Muthén, 2015) are provided as supplementary materials at http://dx.doi.org/10.1016/j.jsp.2016.11.002. Overall model fit was evaluated based on the RMSEA and the CFI (Kline, 2010; Tanaka, 1993). Criteria for evaluating minimally acceptable model fit were established a priori: RMSEA values ≤ 0.08 and CFI values ≥ 0.90 (Browne & Cudeck, 1993; Hu & Bentler, 1995; Markland, 2007). The analyses in this study focused on examining ME/I across race. The ME/I of the two-factor SABRS structure was assessed by applying increasingly restrictive equality constraints across groups to examine (a) configural invariance, (b) metric invariance, and (c) scalar/threshold invariance. Nested models (i.e., models with increasingly restrictive invariance constraints) were compared using the change in Satorra-Bentler χ2 (ΔSB-χ2), change in CFI (ΔCFI), and change in RMSEA (ΔRMSEA) values. To clarify, Model A is considered “nested” within Model B if Model A is a less complex version of Model B, where one can simply subtract parameters from Model B to arrive at Model A. The models are not considered nested if one has to both add and subtract parameters from one model to arrive at the other. Within the current study, each successive, more restrictive model was compared with the less restrictive model within which it was nested. As the models grew more restrictive, a non-significant Δχ2 (p > 0.05), ΔCFI < 0.01 (Cheung & Rensvold, 2002), and ΔRMSEA < 0.015 indicated that the more restrictive model fit the data comparably to the less restrictive one (Byrne, 2011; Meade et al., 2008; Satorra & Bentler, 2001).

4.2.4. Overview of model
In this study, a two-factor model (presented in Fig. 2) was tested, with Academic Behavior and Social Behavior factors using 11 of the 12 SABRS items. Item 11, Distracted, was identified early in the analyses as a Heywood case (i.e., a negative variance estimate; see Kolenikov & Bollen, 2012, for a review of Heywood cases in factor analysis) and had to be removed from the model. All of the remaining 11 items were included, with items 1–6 loading on the Social Behaviors subscale and items 7–10 and 12 loading on the Academic Behaviors subscale. Additionally, based on findings from prior research and theory, residuals of items (on both factors) that are closely associated with symptoms of externalizing disorders, attention deficit hyperactivity disorder in particular, were allowed to correlate (e.g., Impulsiveness, Disruptive Behavior; Cole, Ciesla, & Steiger, 2007). An intercorrelation between the two factors was specified. Additionally, one loading on each factor was fixed to one for identification purposes. The items whose loadings were fixed to one (Arguing and Production of Acceptable Work) were selected because they had the highest loadings on their respective factors in prior research.
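To make the specification concrete, the Mplus input below sketches how a two-factor categorical MG-CFA of this kind can be set up. This is an illustrative sketch rather than the authors' actual syntax (which is available in the supplementary materials): the item names follow Table 1, but the data file name and group codes are placeholders, and the correlated residual shown is only an example pair.

  TITLE:    Two-factor SABRS MG-CFA across race (illustrative sketch);
  DATA:     FILE IS sabrs.dat;                        ! placeholder file name
  VARIABLE: NAMES ARE ss1r ss2 ss3r ss4r ss5 ss6r
                      sa7 sa8 sa9 sa10r sa12 race;
            CATEGORICAL ARE ss1r-sa12;                ! rating-scale items treated as ordinal
            GROUPING IS race (0 = White 1 = Black);   ! group codes are placeholders
  ANALYSIS: ESTIMATOR = WLSMV;
  MODEL:    sb BY ss1r ss2 ss3r ss4r ss5 ss6r;        ! Mplus fixes the first loading after BY
            ab BY sa7 sa8 sa9 sa10r sa12;             ! at 1 by default; a specific item's loading
                                                      ! can be fixed instead (e.g., sb BY ss1r* ss2@1 ...)
            sb WITH ab;                               ! factor intercorrelation
            ss5 WITH sa8;                             ! example correlated residual only; the pairs
                                                      ! used in the study are in its syntax files
  OUTPUT:   STDYX;

Note that, with categorical outcomes, the Mplus multiple-group defaults already hold loadings and thresholds equal across groups; group-specific MODEL statements are used to relax those constraints for the configural model (Muthén & Asparouhov, 2002). Newer versions of Mplus (7.1 and later) also provide a MODEL = CONFIGURAL METRIC SCALAR convenience option in the ANALYSIS command.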
4.3. Results

4.3.1. Preliminary analyses
Participants from two racial groups (White, n = 412, and Black, n = 323) were included in the analyses of SABRS ME/I across race. Participants from other racial groups were excluded because the sample sizes were too small (<100). Item-level descriptive statistics were examined and are available in Table 1.

Fig. 2. Factor loadings from confirmatory factor analyses of SABRS data with Black and White students.


The SABRS is scored such that higher scores indicate that the child is engaging in more positive (and fewer negative) social and academic behaviors. There were small but statistically significant mean differences in the subscale scores, with teachers rating White students higher than Black students on the social behavior (t = 8.24, p < 0.001) and academic behavior (t = 8.25, p < 0.001) subscales.

4.3.2. Primary analyses
All CFA findings are reported in Table 2 (including SB-χ2 values and p values). In the first step, configural invariance was established. Fit indices for the configural model fell within the specified ranges (CFI = 0.993; RMSEA = 0.077). However, the RMSEA was a bit high by some standards (e.g., > 0.06; Hu & Bentler, 1999). Nonetheless, because empirical recommendations state that multiple, as opposed to single, fit indices should be considered when evaluating fit (e.g., Tanaka, 1993), the two-factor SABRS model was considered to have adequate, albeit borderline, fit across groups. (Note that the df value for the configural model equals the sum of the df values for each group estimated separately, and the configural χ2 value closely approximates the sum of the separate group χ2 values.)
Next, a metric invariance model was tested wherein factor loadings were constrained to be equal across racial groups. The model had adequate fit based on the aforementioned fit criteria (CFI = 0.994; RMSEA = 0.069). The change-in-model-fit indices suggested that the fit of the metric invariance model was not significantly worse than, and in fact was slightly better than, that of the configural invariance model (the Δχ2 was non-significant, the ΔCFI was < 0.01, and the ΔRMSEA was < 0.015). Therefore, metric invariance was supported. Hypothetically (for illustrative purposes), if the ΔCFI had been > 0.01 (e.g., if the CFI for this metric model had been 0.980, but the CFI for the configural model was still 0.993), or if the ΔRMSEA had been > 0.015, then the scale would have demonstrated metric non-invariance. If the scale had been non-invariant, then there would be no need to proceed to testing scalar/threshold and/or latent mean invariance. However, because the change-in-model-fit indices were consistent with a priori and generally accepted criteria, metric invariance for the SABRS across race was supported by these findings.
Subsequently, a scalar/threshold invariance model was tested whereby factor loadings and thresholds were constrained to be equal across groups. The model had adequate fit (CFI = 0.994; RMSEA = 0.062), and the change-in-model-fit indices indicated that the fit of the scalar/threshold model was not significantly different from that of the metric model (the Δχ2 was non-significant, the ΔCFI was < 0.01, and the ΔRMSEA was < 0.015). Therefore, scalar/threshold invariance was supported.
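For readers following along in Mplus, the nested-model comparisons summarized in Table 2 rely on the DIFFTEST procedure, which requires two runs. The fragments below are a generic sketch of that procedure rather than the authors' syntax; the derivatives file name is a placeholder.

  ! Run 1: the less restrictive model (e.g., the configural model)
  ANALYSIS:  ESTIMATOR = WLSMV;
  SAVEDATA:  DIFFTEST IS deriv.dat;      ! saves the derivatives needed for the difference test

  ! Run 2: the more restrictive model (e.g., the metric model)
  ANALYSIS:  ESTIMATOR = WLSMV;
             DIFFTEST IS deriv.dat;      ! requests the adjusted chi-square difference test
                                         ! against the model saved in Run 1

The Δχ2 values and associated p values that Mplus reports from this procedure are the quantities shown in the Δχ2 (df) column of Table 2.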

4.4. SABRS findings: discussion and conclusion
The purpose of these analyses was to examine ME/I of SABRS scores across race (White and Black participants). These findings strongly support the invariance of SABRS scores across race. Specifically, configural, metric, and scalar invariance were supported for Black and White participants. That said, aspects of the fit of these models were borderline across all groups (e.g., RMSEA = 0.077). Some fit indices (CFI) were consistently strong, and the RMSEA consistently fell within the guidelines for adequate fit specified by Browne and Cudeck (1993). Further investigation of the SABRS, among all participants, perhaps using an EFA or IRT approach, would be a useful avenue for scholarly inquiry. Regardless, it is important to note that SABRS scores did meet criteria for adequate fit among all groups examined in this study.
Because configural, metric, and scalar/threshold ME/I were evident in this study, use of these scores for within-group analyses, mean group comparisons, and selection purposes is supported by the findings. However, additional analyses, such as examination of potential predictive bias, are needed, particularly given that the scale is intended to be used in the context of multi-tiered models. Additionally, examination of the underlying reason for mean differences may be warranted. Thus, in regard to the SABRS, support for ME/I is an important starting point, but, as is true with most universal screening measures for social and behavioral functioning, there is much more work to be done.

Table 1
Item-level descriptive statistics for SABRS scores.

                  All participants                      European American      African American
                                                        participants           participants
Item              Mean     SD      Skew     Kurtosis    Mean      SD           Mean      SD
ss1r              2.53     0.74    −1.60    2.07        2.65      0.65         2.37      0.81
ss2               2.10     0.87    −0.49    −0.83       2.25      0.82         1.91      0.87
ss3r              2.74     0.60    −2.64    7.03        2.84      0.48         2.64      0.68
ss4r              2.42     0.82    −1.33    0.96        2.53      0.76         2.19      0.91
ss5               2.27     0.82    −0.71    −0.66       2.43      0.77         2.09      0.87
ss6r              2.40     0.86    −1.35    0.92        2.48      0.83         2.25      0.90
sa7               2.09     0.84    −0.37    −0.99       2.23      0.83         1.90      0.84
sa8               2.08     0.89    −0.51    −0.78       2.22      0.86         1.89      0.89
sa9               2.07     0.89    −0.43    −0.95       2.25      0.83         1.88      0.89
sa10r             2.25     0.93    −1.09    0.22        2.40      0.86         2.12      0.94
sa12              2.03     0.84    −0.27    −1.05       2.18      0.84         1.85      0.85
Social total      14.80    3.66    −1.37    1.44        15.17     3.46         13.44     4.24
Academic total    12.85    4.52    −0.57    −0.62       13.41     4.36         11.53     4.54


Table 2
Fit statistics for measurement equivalence/invariance of SABRS scores across racial groups.

Model                           χ2        df    p        Δχ2 (df)        p        CFI      ΔCFI     RMSEA    RMSEA CI       ΔRMSEA
Black participants (n = 323)    120.970   36    <0.001   –               –        0.993    –        0.085    0.069–0.103    –
White participants (n = 412)    108.534   36    <0.001   –               –        0.993    –        0.070    0.055–0.085    –
Configural model                228.181   72    <0.001   –               –        0.993    –        0.077    0.066–0.088    –
Metric model                    222.201   81    <0.001   15.19 NS (9)    0.086    0.994    0.001    0.069    0.058–0.080    0.008
Scalar/threshold model          242.248   101   <0.001   23.66 NS (20)   0.257    0.994    0        0.062    0.052–0.072    0.007

Note. CFI = comparative fit index; RMSEA = root mean square error of approximation. The change in chi-square was calculated using the Satorra-Bentler chi-square difference test formula via the DIFFTEST function in Mplus 6.12. NS denotes a non-significant Δχ2.

Future research among members of other racial, ethnic, cultural, and disability groups not included in the sample would be very valuable, and replication with other demographic groups and samples is recommended, as ME/I can vary widely depending on the groups examined. In regard to statistical methods, some, but not all, prior studies of the SABRS have supported a bifactor model for the SABRS (Kilgus, Sims, von der Embse, & Riley-Tillman, 2014). Moreover, like most large, school-based studies, this study used nested data. The ideal model to test within the context of this study would be a multilevel, bifactor one. In the context of this paper, we felt that a bifactor, multilevel CFA model would be too complex to be useful for the purposes of demonstrating the MG-CFA technique; thus, a simpler, correlated factors model that has been supported in prior research with single-group EFAs and CFAs (Kilgus et al., 2013) was examined. That said, future research examining ME/I with a bifactor, multilevel SABRS model would be useful. Very few researchers have attempted to examine ME/I in a multilevel model for any scale. However, recent methodological advances (Kim, Yoon, Wen, Luo, & Kwok, 2015) may help to facilitate multilevel invariance analyses and allow researchers to answer many related theoretical and methodological questions.

5. Section five: extensions and advanced applications of MG-CFA

5.1. Meaningful non-invariance: what next?
In this study, ME/I across race was supported for SABRS scores. However, in practice, it is common to encounter situations where invariance does not hold. For example, the SABRS (described previously) has been shown to meet criteria for configural, metric, and scalar/threshold invariance across gender (Kilgus & Pendergast, 2016). However, suppose that the scale developers added the following item, Gossips about classmates, to the social behavior factor. Gossiping is a form of relational aggression that girls have been shown to engage in more commonly than boys (Levin & Arluke, 1985); thus, it is reasonable to expect that gossiping (or responses to items inquiring about gossiping behavior) may be more strongly related to a latent social behavior factor in girls than in boys. This may result in much higher factor loadings for girls than boys (metric non-invariance), and teachers of girls may be more likely to Strongly Agree with that item than teachers of boys who have the same underlying level of social behavior difficulties (threshold non-invariance). When non-invariance occurs, such as in the situation described above, a researcher has four options (Millsap & Kwok, 2004): (a) eliminate the problematic items and use shortened versions of the scale, (b) use the scale despite non-invariance, (c) refrain from using the scale for any comparisons across groups, or (d) examine partial invariance. Each approach has strengths and weaknesses. Eliminating problematic items is a relatively simple solution but could alter the construct and lead to different forms of a scale being used in different studies with different groups. Ignoring non-invariance can lead to inaccurate latent mean comparisons (e.g., Vandenberg & Lance, 2000), and refraining from conducting any latent mean comparisons may prevent researchers from examining research questions that are of theoretical and practical importance. The remaining option is to examine partial invariance. Byrne et al. (1989) coined the term partial invariance and were the first to recommend consideration of it.
Simply put, a model that is partially invariant is one in which some, but not all, model parameters meet criteria for invariance at a particular level. Examining partial invariance involves re-running the invariance analyses (metric, scalar/threshold, etc.) while allowing certain parameters to vary across groups (i.e., not restricting parameters, such as the factor loading and intercept/threshold for the Gossips about classmates item, to be equivalent across groups). The methods for conducting the analyses and interpreting the findings are similar to those used when examining full invariance, except that the researchers can claim only partial invariance when certain item loadings or intercepts/thresholds are excluded from the equivalence constraints.
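In Mplus, a partial invariance model of this kind can be specified by adding a group-specific MODEL command that re-mentions only the parameters to be freed. The fragment below is an illustrative sketch built around the hypothetical Gossips about classmates item; the item name, group labels, and number of thresholds are assumptions made for illustration and are not part of the SABRS.

  ! assumes GROUPING IS gender (1 = boys 2 = girls); in the VARIABLE command
  MODEL:        sb BY ss1r ss2 ss3r ss4r ss5 ss6r gossip;
  MODEL girls:  sb BY gossip;               ! frees the gossip loading for girls
                [gossip$1 gossip$2];        ! frees its thresholds (one per category boundary)
                {gossip@1};                 ! with the delta parameterization, the item's scale
                                            ! factor is typically fixed when its loading and
                                            ! thresholds are freed (Muthén & Asparouhov, 2002)

Because the remaining items keep their default across-group equality constraints, comparing this model with the fully constrained model isolates the contribution of the freed item.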


5.2. ME/I with longitudinal data
This article focuses primarily on the use of categorical MG-CFA in establishing ME/I across independent groups (e.g., across race, gender, etc.). However, categorical MG-CFA can also be a useful tool for examining ME/I among members of the same group over time and for longitudinal structural equation modeling. Examining longitudinal ME/I is necessary because, at times, the psychometric properties of a measure can change over time, which may influence the comparability of latent mean scores. For example, Fokkema, Smits, Kelderman, and Cuijpers (2013) examined longitudinal ME/I of the Beck Depression Inventory (BDI; Beck & Beamesderfer, 1974) before and after patients received treatment for depression and found that the scores were non-invariant and tended to under-estimate depressive symptoms at baseline and over-estimate them at the second time point. After treatment, participants appeared to become better reporters of their symptoms, as evidenced by higher factor loadings and lower residuals of BDI items. Notably, if BDI scores at baseline had been compared to those at follow-up to evaluate intervention effectiveness, it is likely that the effectiveness of the intervention would have been underestimated. Similar rating scales are commonly used to measure the effectiveness of interventions within multi-tiered models in school settings; thus, MG-CFA may have yet another important role in school psychology research. Notably, there are slight differences that apply when examining longitudinal ME/I models relative to those with independent groups (e.g., in longitudinal models, it is necessary to allow correlated residuals for corresponding items across time points). For useful references on longitudinal ME/I, see Coertjens (2014) and Oort (2005).
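A minimal sketch of the correlated-residuals point is shown below; longitudinal ME/I is specified within a single group, with the same items measured at two time points (the variable and factor names are placeholders, not drawn from any particular instrument).

  MODEL: dep1 BY y1a y2a y3a;        ! construct measured at time 1
         dep2 BY y1b y2b y3b;        ! the same construct at time 2
         y1a WITH y1b;               ! correlated residuals for corresponding
         y2a WITH y2b;               ! items across time points
         y3a WITH y3b;

Invariance over time is then examined by constraining corresponding loadings and thresholds to equality across the two occasions, in the same stepwise fashion used for independent groups (see Coertjens, 2014; Oort, 2005).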
5.3. Why is ME/I necessary? Broader conclusions and future directions
As noted by Vandenberg and Lance (2000), “The establishment of measurement invariance across groups is a logical prerequisite to conducting substantive cross-group comparisons (e.g., tests of group mean differences, invariance of structural parameter estimates)” (p. 1). Before a researcher or practitioner can compare scores on a measure (e.g., a rating scale for depression) across groups (e.g., boys and girls) and have confidence that any identified mean differences reflect true differences in the underlying construct, they must ensure that, at a minimum, the measure meets criteria for configural, metric, and scalar/threshold invariance across groups. This is particularly important in school psychology intervention research, as a crucial objective in the field of school psychology is to evaluate the extent to which interventions are effective across diverse groups of students (Forman et al., 2013), and making valid mean comparisons across groups is a necessary step in that process.
Ignoring non-invariance of measures can have significant consequences and can erode the accuracy of research findings. Findings from many studies have demonstrated that metric and scalar non-invariance can lead to inaccuracies in latent means and “bogus interaction effects” based on group differences (Chen, 2008, p. 1002). Further, and more importantly in the context of school psychology intervention research, measurement non-invariance can lead to inaccurate predictions (e.g., Millsap & Kwok, 2004). For example, Millsap and Kwok (2004) examined the impact of various degrees of measurement non-invariance on selection bias (i.e., as in selection of candidates by school admissions committees or employers) and found that measurement non-invariance substantially impacted predictive accuracy (i.e., selection of the correct students or employees relative to a criterion measure). That said, it is important to note that measurement invariance and selection invariance are separate constructs, and measurement invariance does not guarantee selection invariance across groups. Just as multiple forms of validity evidence (e.g., structural, predictive, content) are necessary to use a scale with one group, multiple forms of evidence are needed to use and compare scores across groups (Borsboom, Romeijn, & Wicherts, 2008; Millsap, 1997).
In schools, measures are used to assist school professionals with determining the need for intervention, selecting the appropriate intervention tier (e.g., level of service or intervention), and evaluating responses to intervention (Dowdy, Ritchey, & Kamphaus, 2010). However, decisions about tier placement that are based on non-invariant measures may yield more erroneous predictions for members of certain racial/ethnic groups. Thus, universal screening and tiered intervention models may introduce additional inequality into the schooling process if non-invariant measures are used. Some measures that are widely used in the context of universal screening and school psychology interventions do have evidence of invariance (though usually only configural or metric), but many do not. In the interest of equity, fairness, and ensuring that all students have the opportunity to derive benefit from the fruits of school psychology intervention research, school psychologists have a responsibility to follow the guidelines outlined in the Standards (AERA, APA, & NCME, 2014) and ensure that measures used in the context of school-based interventions operate similarly for members of the racial, ethnic, cultural, gender, and other groups with whom they are used. School psychology researchers and scale developers can support practitioners by using techniques, such as MG-CFA, to examine the ME/I of scales that are commonly used in the context of school-based interventions.

Appendix A. Supplementary data
Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.jsp.2016.11.002.

References
Abedlaziz, N., Ismail, W., & Hussin, Z. (2011). Detecting a gender-related DIF using logistic regression and transformed item difficulty. U.S.-China Education Review, 5, 734–744.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling: A Multidisciplinary Journal, 21, 495–508. http://dx.doi.org/10.1080/10705511.2014.919210.
Asparouhov, T., Muthén, B., & Muthén, B. O. (2006). Robust chi square difference testing with mean and variance adjusted test statistics. Retrieved from http://statmodel2.com/download/webnotes/webnote10.pdf
Barrett, P. (2007). Structural equation modelling: Adjudging model fit. Personality and Individual Differences, 42, 815–824. http://dx.doi.org/10.1016/j.paid.2006.09.018.
Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling, 13, 186–203. http://dx.doi.org/10.1207/s15328007sem1302_2.
Beck, A., & Beamesderfer, A. (1974). Assessment of depression. In P. Pichot (Ed.), Psychological measurements in psychopharmacology: Modern problems in pharmacopsychiatry. Vol. 7. (pp. 151–169). Oxford, UK: Karger.


Bentler, P. M. (2007). On tests and indices for evaluating structural models. Personality and Individual Differences, 42(5), 825–829. http://dx.doi.org/10.1016/j.paid. 2006.09.024. Bentler, P. M., & Bonnet, D. C. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606. http://dx.doi.org/ 10.1037/0033-2909.88.3.588. Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: John Wiley & Sons. Borsboom, D. (2006). When does measurement invariance matter? Medical Care, 44(11), S176–S181. http://dx.doi.org/10.1097/01.mlr.0000245143.08679.cc. Borsboom, D., Romeijn, J. W., & Wicherts, J. M. (2008). Measurement invariance versus selection invariance: Is fair selection possible? Psychological Methods, 13, 75–98. http://dx.doi.org/10.1037/1082-989x.13.2.75. Bowen, N. K., & Masa, R. D. (2015). Conducting measurement invariance tests with ordinal data: A guide for social work researchers. Journal of the Society for Social Work and Research, 6, 229–249. http://dx.doi.org/10.1086/681607. Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford. Brown, T. A., & Moore, M. T. (2012). Confirmatory factor analysis. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 361–379). New York, NY: Guilford Press. Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen, & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage. Bryant, F. B., & Satorra, A. (2012). Principles and practice of scaled difference chi-square testing. Structural Equation Modeling: A Multidisciplinary Journal, 19, 372–398. http://dx.doi.org/10.1080/10705511.2012.687671. Byrne, B. M. (2011). Structural equation modeling with Mplus: Basic concepts, applications and programming. New York, NY: Routledge, Taylor & Francis Group. Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456–466. http://dx.doi.org/10.1037/0033-2909.105.3.456. Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. http://dx.doi.org/10.1080/ 10705510701301834. Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018. http://dx.doi.org/10.1037/e514412014-064. Chen, F. F., Sousa, K. H., & West, S. G. (2005). Teacher's corner: Testing measurement invariance of second-order factor models. Structural Equation Modeling, 12(3), 471–492. http://dx.doi.org/10.1207/s15328007sem1203_7. Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25(1), 1–27. http://dx.doi.org/10.1177/014920639902500101. Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. http://dx.doi. org/10.1207/s15328007sem0902_5. Coertjens, L. (2014). Change in learning strategies during and after the transition to higher education: The impact of statistical choices on growth trend estimates. Antwerpen: Universiteit Antwerpen. Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). 
The insidious effects of failing to include design-driven correlated residuals in latent-variable covariance structure analysis. Psychological Methods, 12, 381–398. http://dx.doi.org/10.1037/1082-989x.12.4.381. Comşa, M. (2010). How to compare means of latent variables across countries and waves: Testing for invariance measurement. An application using eastern European societies. Sociológia, 42(6), 639–669. Cook, C. R., Rasetshwane, K. B., Truelson, E., Grant, S., Dart, E. H., Collins, T. A., & Sprague, J. (2011). Development and validation of the Student Internalizing Behavior Screener: Examination of reliability, validity, and classification accuracy. Assessment for Effective Intervention, 36(2), 71–79. http://dx.doi.org/10.1177/ 1534508410390486. Davidov, E., Datler, G., Schmidt, P., & Schwartz, S. H. (2011). Testing the invariance of values in the Benelux countries with the European Social Survey: Accounting for ordinality. In Davidov, Schmidt, & Billiet (Eds.), Cross-cultural analysis: Methods and applications (pp. 149–168). New York, NY: Routledge. Dennis, R. M. (1995). Social Darwinism, scientific racism, and the metaphysics of race. The Journal of Negro Education, 243–252. http://dx.doi.org/10.2307/2967206. Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement and Evaluation in Counseling and Development, 43, 121–149. http://dx.doi.org/10.1177/0748175610373459. DiPerna, J. C. (2006). Academic enablers and student achievement: Implications for assessment and intervention services in the schools. Psychology in the Schools, 43, 7–17. http://dx.doi.org/10.1002/pits.20125. Dolan, C. V., Oort, F. J., Stoel, R. D., & Wicherts, J. M. (2009). Testing measurement invariance in the target rotated multigroup exploratory factor model. Structural Equation Modeling, 16, 295–314. http://dx.doi.org/10.1080/10705510902751416. Dorans, N. J., & Holland, P. W. (1992). DIF detection and description: Mantel-Haenszel and standardization. ETS research report series, 1. (pp. 35–66). http://dx.doi.org/ 10.1002/j.2333–8504.1992.tb01440.x. Dowdy, E., Ritchey, K., & Kamphaus, R. W. (2010). School-based screening: A population-based approach to inform and monitor children's mental health needs. School Mental Health, 2, 166–176. http://dx.doi.org/10.1007/s12310-010-9036-3. Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662–680. http://dx.doi. org/10.1037//0021-9010.70.4.662. Drummond, T. (1994). The student risk screening scale (SRSS). Grants Pass, OR: Josephine County Mental Health Program. Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16, 5–18. http://dx.doi.org/10.1007/s11136-007-9198-0. Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272–299. http://dx.doi.org/10.1037/1082-989x.4.3.272. Fairchild, H. H. (1991). Scientific racism: The cloak of objectivity. Journal of Social Issues, 47(3), 101–115. http://dx.doi.org/10.1111/j.1540-4560.1991. tb01825.x. Ferguson, E., & Cox, T. (1993). Exploratory factor analysis: A users' guide. International Journal of Selection and Assessment, 1, 84–94. http://dx.doi.org/10.1111/j.14682389.1993.tb00092.x. Flora, D. B., & Curran, P. J. (2004). 
An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9, 466–491. http://dx.doi.org/10.1037/1082-989x.9.4.466. Flora, D. B., LaBrish, C., & Chalmers, R. P. (2012). Old and new ideas for data screening and assumption testing for exploratory and confirmatory factor analysis. Frontiers in Psychology, 3, 1–21. http://dx.doi.org/10.3389/fpsyg.2012.00055. Fokkema, M., Smits, N., Kelderman, H., & Cuijpers, P. (2013). Response shifts in mental health interventions: An illustration of longitudinal measurement invariance. Psychological Assessment, 25, 520–531. http://dx.doi.org/10.1037/a0031669. Forero, C. G., Maydeu-Olivares, A., & Gallardo-Pujol, D. (2009). Factor analysis with ordinal indicators: A Monte Carlo study comparing DWLS and ULS estimation. Structural Equation Modeling, 16, 625–641. http://dx.doi.org/10.1080/10705510903203573. Forman, S. G., Shapiro, E. S., Codding, R. S., Gonzales, J. E., Reddy, L. A., Rosenfield, S. A., ... Stoiber, K. C. (2013). Implementation science and school psychology. School Psychology Quarterly, 28, 77–100. http://dx.doi.org/10.1037/spq0000019. Glover, T. A., & DiPerna, J. C. (2007). Service delivery for response to intervention: Core components and directions for future research. School Psychology Review, 36, 526–540. http://dx.doi.org/10.1016/b978-0-7506-6997-9.50020-1. González-Romá, V., Hernández, A., & Gόmez-Benito, J. (2006). Power and type I error of the mean and covariance structure analysis model for detecting differential item functioning in graded response items. Multivariate Behavioral Research, 41, 29–53. http://dx.doi.org/10.1207/s15327906mbr4101_3. Gregorich, S. E. (2006). Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. Medical Care, 44, S78–S94. http://dx.doi.org/10.1097/01.mlr.0000245454.12228.8f. Gresham, F. M., & Elliott, S. N. (2008). Ssis: Social skills improvement system: Rating scales manual. PsychCorp.


Henson, R. K., & Roberts, J. K. (2006). Use of exploratory analysis in published research: Common errors and some comments on improved practice. Educational and Psychological Measurement, 66, 393–416. http://dx.doi.org/10.1177/0013164405282485. Hooper, D., Coughlan, J., & Mullen, M. R. (2008). Structural equation modelling: Guidelines for determining model fit. Journal of Business Research Methods, 6, 53–60. Hu, L. T., & Bentler, P. M. (1995). Evaluating model fit. In R. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 76–99). Newbury Park, CA: Sage Publishing. Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55. http://dx.doi.org/10.1080/10705519909540118. Jimerson, S. R., Stewart, K., Skokut, M., Cardenas, S., & Malone, H. (2009). How many school psychologists are there in each country of the world? International estimates of school psychologists and school psychologist-to-student ratios. School Psychology International, 30, 555–567. http://dx.doi.org/10.1177/ 0143034309107077. Jöreskog, K. G. (2005). Structural equation modeling with ordinal variables using LISREL. Lincolnwood, IL: Scientific Software International. Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User's reference guide. Scientific Software International. Kamphaus, R. W., & Reynolds, C. R. (2007). BASC-2 behavioral and emotional screening system manual. Circle Pines, MN: Pearson. Keith, T. Z. (2014). Multiple regression and beyond: An introduction to multiple regression and structural equation modeling. New York, NY: Routledge. Keith, T. Z., Caemmerer, J. M., & Reynolds, M. R. (2016). Comparison of methods for factor extraction for cognitive test-like data: Which overfactor, which underfactor? Intelligence, 54, 37–54. http://dx.doi.org/10.1016/j.intell.2015.11.003. Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social psychology. In D. Gilbert, S. Fiske, & G. Lindzey (Eds.), The handbook of social psychology, Vol. 1. (pp. 233–265). Kilgus, S., & Pendergast, L. L. (2016, February). Confirmatory factor analysis and measurement invariance in MPlus. Invited workshop presented at the annual Trainers of School Psychologists annual conference (New Orleans, LA). Kilgus, S. P., Chafouleas, S. M., & Riley-Tillman, T. C. (2013). Development and initial validation of the Social and Academic Behavior Risk Screener for elementary grades. School Psychology Quarterly, 28, 210–226. http://dx.doi.org/10.1521/scpq.17.4.390.20863. Kilgus, S. P., Sims, W. A., von der Embse, N. P., & Riley-Tillman, T. C. (2014). Confirmation of models for interpretation and use of the Social and Academic Behavior Risk Screener (SABRS). School Psychology Quarterly, 30, 335–352. http://dx.doi.org/10.1037/spq0000087. Kilgus, S. P., Eklund, K., Nathaniel, P., Taylor, C. N., & Sims, W. A. (2016a). Psychometric defensibility of the Social, Academic, and Emotional Behavior Risk Screener (SAEBRS) Teacher Rating Scale and multiple gating procedure within elementary and middle school samples. Journal of School Psychology, 58, 21–39. http://dx. doi.org/10.1016/j.jsp.2016.07.001. Kilgus, S. P., Sims, W. A., Nathaniel, P., & Taylor, C. N. (2016b). Technical adequacy of the Social, Academic, and Emotional Behavior Risk Screener in an elementary sample. Assessment for Effective Intervention. http://dx.doi.org/10.1177/1534508415623269 (1534508415623269). Kim, E. 
S., & Yoon, M. (2011). Testing measurement invariance: A comparison of multiple-group categorical CFA and IRT. Structural Equation Modeling, 18, 212–228. http://dx.doi.org/10.1080/10705511.2011.557337. Kim, E. S., Yoon, M., Wen, Y., Luo, W., & Kwok, O. M. (2015). Within-level group factorial invariance with multilevel data: Multilevel factor mixture and multilevel MIMIC models. Structural Equation Modeling: A Multidisciplinary Journal, 1–14. http://dx.doi.org/10.1080/10705511.2014.938217 (ahead-of-print). Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). New York, NY: Guilford Press. Knapp, T. R. (1990). Treating ordinal scales as interval scales: An attempt to resolve the controversy. Nursing Research, 39, 121–123. Koh, K. H., & Zumbo, B. D. (2008). Multi-group confirmatory factor analysis for testing measurement invariance in mixed item format data. Journal of Modern Applied Statistical Methods, 7, 471–477. Kolenikov, S., & Bollen, K. A. (2012). Testing negative error variances: Is a Heywood case a symptom of misspecification? Sociological Methods & Research, 41, 124–167. http://dx.doi.org/10.1177/0049124112442138. Kratochwill, T. R., & Stoiber, K. C. (2002). Evidence-based interventions in school psychology: Conceptual foundations of the procedural and coding manual of division 16 and the society for the study of school psychology task force. School Psychology Quarterly, 17, 341–389. http://dx.doi.org/10.1521/scpq.17.4.390.20863. Kwan, V. S., Bond, M. H., Boucher, H. C., Maslach, C., & Gan, Y. (2002). The construct of individuation: More complex in collectivist than in individualist cultures. Personality and Social Psychology Bulletin, 28, 300–310. http://dx.doi.org/10.1177/0146167202286002. Lei, P. W. (2009). Evaluating estimation methods for ordinal data in structural equation modeling. Quality and Quantity, 43, 495–507. http://dx.doi.org/10.1007/ s11135-007-9133-z. Levin, J., & Arluke, A. (1985). An exploratory analysis of sex differences in gossip. Sex Roles, 12, 281–286. http://dx.doi.org/10.1007/BF00287594. Linstrum, E. (2016). Ruling minds: Psychology in the British empire. Cambridge, MA: Harvard University Press. Lorenzo-Seva, U., & ten Berge, J. M. F. (2006). Tucker's congruence coefficient as a meaningful index of factor similarity. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 2, 57–64. http://dx.doi.org/10.1027/1614-2241.2.2.57. MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84–99. http://dx.doi.org/10.1037//1082-989x.4.1.84. Markland, D. (2007). The golden rule is that there are no golden rules: A commentary on Paul Barrett's recommendations for reporting model fit in structural equation modelling. Personality and Individual Differences, 42, 851–858. http://dx.doi.org/10.1016/j.paid.2006.09.023. Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103, 391–410. http://dx.doi.org/10.1037/0033-2909.103.3.391. Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11(3), 320–341. http://dx.doi.org/10.1207/s15328007sem1103_2. Mash, E. J., & Hunsley, J. (2005). 
Evidence-based assessment of child and adolescent disorders: Issues and challenges. Journal of Clinical Child and Adolescent Psychology, 34, 362–379. http://dx.doi.org/10.1207/s15374424jccp3403_1. Maslach, C., Stapp, J., & Santee, R. T. (1985). Individuation: Conceptual analysis and assessment. Journal of Personality and Social Psychology, 49, 729–738. http://dx.doi. org/10.1037/0022-3514.49.3.729. Maydeu-Olivares, A., & Joe, H. (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71, 713–732. http://dx.doi. org/10.1007/s11336-005-1295-9. McDonald, R. P. (2010). Structural models and the art of approximation. Perspectives on Psychological Science, 5(6), 675–686. http://dx.doi.org/10.1177/ 1745691610388766. Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568. http://dx.doi.org/10.1037/0021-9010.93.3.568. Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525–543. http://dx.doi.org/10.1007/bf02294825. Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44, S69–S77. http://dx.doi.org/10.1097/01.mlr.0000245438.73837. 89. Millsap, R. E. (1997). Invariance in measurement and prediction: Their relationship in the single-factor case. Psychological Methods, 2, 248–260. http://dx.doi.org/10. 1037/1082-989x.2.3.248. Millsap, R. E. (2012). Statistical approaches to measurement invariance. New York, NY: Routledge. Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334. http://dx.doi.org/10.1177/014662169301700401. Millsap, R. E., & Kwok, O. M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115. http://dx. doi.org/10.1037/1082-989x.9.1.93. Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-categorical measures. Multivariate Behavioral Research, 39, 479–515. http://dx.doi.org/10. 1207/s15327906mbr3903_4. Muthén, B., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus. Mplus web notes, no. 4, version 5 (Retrieved from https://www.statmodel.com/download/webnotes/CatMGLong.pdf).


Muthén, L. K., & Muthén, B. O. (2010). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén. Muthén, L. K., & Muthén, B. O. (1998-2015). Mplus User’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén. National Association of School Psychologists (2010a). Model for comprehensive and integrated school psychological services. Retrieved from http://www.nasponline. org/standards/2010standards/2_PracticeModel.pdf National Association of School Psychologists (2010b). Standards for graduate preparation of school psychologists. Retrieved from http://www.nasponline.org/ standards/2010standards/1_Graduate_Preparation.pdf Oort, F. (2005). Using structural equation modeling to detect response shifts and true change. Quality of Life Research, 14, 587–598. http://dx.doi.org/10.1007/s11136004-0830-y. Ortiz, S. O., Flanagan, D. P., & Dynda, A. M. (2008). Best practices in working with culturally diverse children and families. In A. Thomas, & J. Grimes (Eds.), Best practices in school psychology V (pp. 1721–1738). Bethesda, MD: The National Association of School Psychologists. Pendergast, L. L., Scharf, R. J., Rasmussen, Z. A., Seidman, J. C., Schaefer, B. A., Svensen, E., ... Network Investigators, M. A. L. -E. D. (2014). Postpartum depressive symptoms across time and place: Structural invariance of the Self-Reporting Questionnaire among women from the international, multi-site MAL-ED study. Journal of Affective Disorders, 167, 178–186. http://dx.doi.org/10.1016/j.jad.2014.05.039. Poortinga, Y. H. (1995). Cultural bias in assessment: Historical and thematic issues. European Journal of Psychological Assessment, 11, 140–146. http://dx.doi.org/10. 1027/1015-5759.11.3.140. Porteus, S. D. (1937). Primitive intelligence and environment. New York, NY: McMillan. Reynolds, C. R., & Ramsay, M. C. (2003). Bias in psychological assessment: An empirical review and recommendations. In J. R. Graham, & J. A. Naglieri (Eds.), Handbook of psychology (vol. 10): Assessment psychology (pp. 67–93). Hoboken, NJ: John Wiley & Sons. Rhemtulla, M., Brosseau-Liard, P.É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373. http://dx.doi.org/10.1037/a0029315. Richards, G. (1997). Race, racism, and psychology: Towards a reflective history. New York, NY: Routledge. Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65(1), 1. http://dx.doi.org/ 10.1037/a0018326. Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2), 105–116. http://dx.doi.org/10.1177/014662169301700201. Sass, D. A., Schmitt, T. A., & Marsh, H. W. (2014). Evaluating model fit with ordered categorical data within a measurement invariance framework: A comparison of estimators. Structural Equation Modeling: A Multidisciplinary Journal, 21, 167–180. http://dx.doi.org/10.1080/10705511.2014.882658. Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514. http://dx.doi.org/10.1007/ bf02296192. Savalei, V. (2011). What to do about zero frequency cells when estimating polychoric correlations. Structural Equation Modeling, 18, 253–273. http://dx.doi.org/10. 1080/10705511.2011.557339. Schmitt, T. A. 
(2011). Current methodological considerations in exploratory and confirmatory factor analysis. Journal of Psychoeducational Assessment, 29(4), 304–321. http://dx.doi.org/10.1177/0734282911406653. Steinmetz, H. (2013). Analyzing observed composite differences across groups. Methodology, 9, 1–12. http://dx.doi.org/10.1027/1614-2241/a000049. Tanaka, J. S. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen, & J. S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage. Teresi, J. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, S152–S170. http://dx.doi.org/10.1097/01.mlr.0000245142.74628.ab. van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. The European Journal of Developmental Psychology, 9, 486–492. http://dx. doi.org/10.1080/17405629.2012.686740. Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. http://dx.doi.org/10.1177/109442810031002. von der Embse, N., Pendergast, L. L., Kilgus, S. P., & Eklund, K. R. (2015). Evaluating the applied use of a mental health screener: Structural validity of the Social, Academic, and Emotional Behavior Risk Screener. Psychological Assessment. http://dx.doi.org/10.1037/pas0000253. Walker, H. M., Irvin, L. K., Noell, J., & Singer, G. H. (1992). A construct score approach to the assessment of social competence rationale, technological considerations, and anticipated outcomes. Behavior Modification, 16, 448–474. http://dx.doi.org/10.1177/01454455920164003. Watkins, M. W. (2012). ChiSquareDiff (computer software). Phoenix, AZ: Ed & Psych Associates. Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29, 39–47. http://dx.doi.org/10.1111/j.1745-3992.2010.00182.x. Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58. http://dx.doi.org/10.1037/1082989X.12.1.58. Woods, C. M. (2009). Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44, 1–27. http://dx. doi.org/10.1080/00273170802620121. Xing, C., & Hall, J. A. (2015). Confirmatory factor analysis and measurement invariance testing with ordinal data: An application in revising the flirting styles inventory. Communication Methods and Measures, 9, 123–151. http://dx.doi.org/10.1080/19312458.2015.1061651. Yoon, M., & Millsap, R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14(3), 435–463. http://dx.doi.org/10.1080/10705510701301677. Youngstrom, E. A., Findling, R. L., Kogos-Youngstrom, J., & Calabrese, J. R. (2005). Toward an evidence-based assessment of pediatric bipolar disorder. Journal of Clinical Child and Adolescent Psychology, 34, 433–448. http://dx.doi.org/10.1207/s15374424jccp3403_4. Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223–233. http://dx.doi.org/10.1080/15434300701375832.