Methodological artifacts in dimensionality assessment of the Hospital Anxiety and Depression Scale (HADS)


J. Hendrik Straat, L. Andries van der Ark, Klaas Sijtsma
Department of Methodology and Statistics TSB, Tilburg University, Tilburg, The Netherlands

Article history: Received 29 June 2012; received in revised form 29 November 2012; accepted 30 November 2012
Keywords: Anxiety; Depression; Dimensionality; HADS; Mokken scale analysis

Abstract

Objective: The Hospital Anxiety and Depression Scale (HADS) is a brief, self-administered questionnaire for the assessment of anxiety and depression in hospital patients. A recent review [7] discussed the disagreement among different studies with respect to the dimensionality of the HADS, leading Coyne and Van Sonderen [8] to conclude from this disagreement that the HADS must be abandoned. Our study argues that the disagreement is mainly due to a methodological artifact, and that the HADS needs revision rather than abandonment.

Method: We used Mokken scale analysis (MSA) to investigate the dimensionality of the HADS items in a representative sample from the Dutch non-clinical population (N = 3643) and compared the dimensionality structure with the results that Emons, Sijtsma, and Pedersen [11] obtained in a Dutch cardiac-patient sample.

Results: We demonstrated how MSA can retrieve either one scale, two subscales, or three subscales, and that the result not only depends on the data structure but also on choices that the researcher makes. Two 5-item HADS scales for anxiety and depression seemed adequate. Four HADS items constituted a weak scale and contributed little to reliable measurement.

Conclusions: We argued that several psychometric methods show only one level of a hierarchical dimensionality structure and that users of psychometric methods are often unaware of this phenomenon and miss information about other levels. In addition, we argued that a theory about the attribute may guide the researcher but that well-tested theories are often absent.

Introduction

The Hospital Anxiety and Depression Scale (HADS [1,2]) is a brief, self-administered questionnaire for the assessment of the presence and the severity of anxiety and depression in physically ill patients. The HADS consists of two 7-item scales, one measuring anxiety and the other depression. Somatic indicators of anxiety and depression are not part of the HADS because physical illness may interfere with somatic symptoms [3]. For the classification of individuals as anxious or depressed, researchers use the total scores on the 7-item Anxiety and Depression scales [4].

Two literature reviews [5,6] concluded that the HADS is a psychometrically sound, 2-dimensional questionnaire for measuring anxiety and depression. More recently, Cosco, Doyle, Ward, and McGee [7] reported several studies that failed to replicate the HADS' expected 2-dimensional structure and, moreover, disagreed on the dimensionality of the HADS. Coyne and Van Sonderen [8] concluded from this result that the HADS should be abandoned. This article is a response to the literature reviews and has two goals.

Different researchers use different psychometric methods to analyze their HADS data sets. The first goal was to argue that the use of different psychometric methods, rather than different psychological traits driving responses to the items, is responsible for the different dimensionality results. Hence, the search for the "true" HADS structure confounds the effect that different traits exercise on patients' responses with a method effect, and it is the method effect rather than the trait effect that likely produces the divergent dimensionality results found in different studies. Thus, the disagreement about the dimensionality of the HADS mainly reflects a methodological artifact. We believe a discussion of the failure to replicate a 2-dimensional structure should concern the method effect, and should analyze how different methods produce results that are typical of the method more than of the data set.

The second goal of this article was to study the HADS dimensionality using Mokken scale analysis (MSA) [9,10], as Cosco et al. [7] recommended. Mokken scale analysis is a scaling method that is suitable for Likert items [11–13], includes a methodology for finding the dimensionality of a data set, and reduces the method artifact. Mokken scale analysis is particularly suited for revealing the dimensionality of questionnaires that are based on a hierarchical trait structure [12,19], and we argue that the HADS data reflect this hierarchical structure.


Previous studies (e.g., [14–17]) found different dimensionality structures in samples from a non-clinical population and a cardiac-patient population. Hence, we also compared MSA results we found in a Dutch non-clinical sample with MSA results that Emons et al. [11] obtained from a Dutch cardiac-patient sample, so as to explain why studies investigating different populations produce different dimensionality-structure results. The Dutch non-clinical sample data were reused with permission from Denollet (e.g., [18]). The secondary data analysis using MSA was original and had not been done previously by other researchers using these data.

The outline of this article is as follows. First, we discuss the HADS' hierarchical dimensionality structure, and discuss how different psychometric methods used to study the HADS' dimensionality deal with hierarchical trait structures and produce dimensionality results typical of the method. Second, we use MSA to study the HADS' hierarchical structure in a sample from a non-clinical population and compare the results with the MSA results that Emons et al. [11] obtained from a sample of cardiac patients. Third, we discuss the relation of the MSA results to previous findings from Rasch-model analysis, exploratory factor analysis, and confirmatory factor analysis. Finally, we discuss the consequences of the MSA results for the use of the HADS.

Hierarchical attribute structure and method effects

Psychological attributes often have a hierarchical structure [19]. In response to Coyne and Van Sonderen [8], Norton, Sacker, and Done [20] also made this point, based on the argument that researchers using the same dimensionality-assessment method usually found the same dimensionality structure for the HADS but researchers using different methods found different dimensionality structures. Hence, their point is that different methods find different levels of the hierarchy, but that the hierarchy does not become apparent when one does not use different methods or when one uses one method but fails to implement different modes of using the same method.

An example is the following. We assume that a hierarchical attribute structure is reflected in the structure of the item scores that constitute the data. A plausible structure would be that all items correlate positively but that several clusters contain items that correlate higher with one another than with the items from other clusters. This structure could suggest two levels, one at which the common denominator of all items reflected by the positive correlations is described and another at which the different item clusters are identified. If clusters contain sub-clusters of items that share variance that other items do not share, one might even discern a third level.

Researchers have used Rasch-model analysis, exploratory factor analysis, confirmatory factor analysis, and MSA to study HADS dimensionality. How does each of these methods deal with the hypothesized 2-level data structure? Rasch-model analysis uses the Rasch model [21], which is a one-dimensional scaling model, as the criterion for assessing the structure of the items. Given this built-in preference for one-dimensionality, a Rasch-model analysis tends to provide information on which items to retain in the scale and which items to remove so as to produce a subset that is one-dimensional; hence, the end result tends to be one scale and one or more items that are not in the scale.
The method thus tends to select items from different clusters in one scale and hence identifies the first level. Meijer, Sijtsma, and Smid [22] and De Koning, Sijtsma, and Hamers [23] provided rather involved methodologies for identifying data multidimensionality at the second level using Rasch-model analysis. These authors demonstrated that it is possible to use the one-dimensional Rasch model for identifying multidimensionality, but researchers unaware of how the model's perspective tends to lead them to a one-dimensional scale may all too easily overlook possible multidimensionality when the trait structure is 2-level hierarchical or has an even more fine-grained hierarchy.

Exploratory factor analysis is a typical dimensionality-reduction method that focuses on identifying a number of dimensions, each attracting a number of distinct items and explaining a reasonable amount of variance in the item scores. The method does not a priori assume a particular dimensionality structure in the data, and tends to pick up heterogeneity in the inter-item correlation matrix, readily suggesting multidimensionality even when the factor structure is rather weak. Unlike Rasch-model analysis, exploratory factor analysis thus tends to identify the second level of the hierarchy.

Like Rasch-model analysis, confirmatory factor analysis requires the researcher to define the dimensionality structure that serves as the hypothesis to be tested. This is done by a priori fixing the number of hypothesized factors and specifying which items load on which of the fixed factors. The researcher may assume that the trait hierarchy contains a third level, that the item clusters at that level contain few items, and that the items have high factor loadings. As a result, a hypothesized 2-factor model misses the third level and shows misfit to the data. Modification indices may be used to adapt the model and test the adapted model for goodness of fit. The identification of the dimensionality structure that best reproduces the inter-item correlations yields the best model-data fit, and this may entail that the method finds the third level in our example.

Mokken scale analysis is particularly useful for evaluating the different levels of the hierarchical structure. The researcher has to specify a numerical value for a scaling criterion that controls the level at which the hierarchical structure is assessed. Usually, researchers rely on a default option that computer programs provide, and this leads them to accept the particular outcome corresponding to the default all too easily, thus ignoring the trait's hierarchical structure, if present. Sijtsma and Molenaar [10] recommend trying a range of criterion values so as to obtain a picture of the complete hierarchy of the trait and of how the hierarchy is reflected in different item clusterings. In the example, the lowest criterion values would produce one scale that includes most or all items, higher criterion values would produce the second-level subscales, and still higher criterion values would produce small sub-subscales whereas many items would not be included in scales anymore. Even higher criterion values would lead to the conclusion that the whole item set is unscalable.

Thus, dimensionality assessment methods such as the Rasch-model method and exploratory and confirmatory factor analysis may find different dimensionality structures, but if used correctly MSA may well find the hierarchy that the other methods miss. The hierarchical structure implies that there is not one "true" dimensionality but that the dimensionality depends on the level at which the hierarchy is assessed.

This was just one example, but different sets of outcomes are possible, depending on the structure of the attribute. For example, three clusters of items may exist such that within clusters inter-item correlations have approximately the same magnitude and between clusters inter-item correlations are zero. Hence, there is only one level of three zero-correlating dimensions. Rasch-model analysis looks for one-dimensionality in the whole set and thus produces gross misfit, suggesting the rejection of two-thirds of the items and retaining one short scale representing one cluster. Exploratory factor analysis correctly finds the 3-cluster structure. Confirmatory factor analysis likely produces striking misfit for solutions other than the three-factor solution with the items loading on the appropriate factors. For increasing criterion values, MSA continuously finds the correct solution until suddenly all items appear unscalable.
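To make this behavior concrete, the following R sketch (not part of the original analyses) simulates the three-cluster example under stated assumptions: four polytomous items per cluster, within-cluster latent correlations of .5, and zero correlations between clusters. It then runs the automated item selection procedure from the mokken package for increasing lower bounds c; the sample size, cut points, and variable names are illustrative.

```r
## Illustrative simulation of the three-cluster example (assumed setup, not the
## authors' data): items correlate within clusters and not between clusters.
library(MASS)    # mvrnorm() for multivariate normal data
library(mokken)  # aisp() for the automated item selection procedure

set.seed(1)
items_per_cluster <- 4
n_clusters <- 3
J <- items_per_cluster * n_clusters

# Block-diagonal latent correlation matrix: .5 within clusters, 0 between clusters
R <- matrix(0, J, J)
for (k in seq_len(n_clusters)) {
  idx <- ((k - 1) * items_per_cluster + 1):(k * items_per_cluster)
  R[idx, idx] <- .5
}
diag(R) <- 1

# Draw continuous scores and cut them into 4-category item scores (0, 1, 2, 3)
Z <- mvrnorm(n = 1000, mu = rep(0, J), Sigma = R)
X <- apply(Z, 2, function(z) as.integer(cut(z, breaks = c(-Inf, -1, 0, 1, Inf))) - 1L)

# Run the automated item selection procedure for increasing lower bounds c;
# at low c the three clusters are recovered, and for sufficiently high c the
# items become unscalable (labeled 0 in the partitioning)
for (lb in seq(0, .60, by = .05)) {
  partition <- aisp(X, lowerbound = lb)
  cat("c =", format(lb, nsmall = 2), ":", paste(partition, collapse = " "), "\n")
}
```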
In this study, we compare MSA results obtained in a non-clinical sample with MSA results that Emons et al. [11] obtained in a cardiac-patient sample, and we discuss the relation of the MSA results to previous findings from Rasch-model analysis, exploratory factor analysis, and confirmatory factor analysis. We demonstrate the methodological artifacts that emerge from relying on one psychometric method while ignoring the method's peculiarities. By doing this, we do not intend to provide new HADS results but instead to signal a methodological message.


Method

Participants

For the non-clinical sample, Kupper and Denollet [18] approached 3708 Dutch participants, of whom 3643 (98.2%) filled out the HADS. Two gender levels and six age levels (20–29, …, 60–69, 70–80 years) served as stratification criteria, and quota sampling produced twelve equally sized groups. A local ethics committee at Tilburg University (protocol number: 2006/1101) approved this study. Research assistants approached participants personally or by phone. After the study's purpose had been explained, participants received an informed consent form and a questionnaire, and participants returned both documents in closed envelopes to the research assistants (between October 1, 2006 and December 15, 2008). Returned questionnaires were coded by number for the purpose of data collection tracking but were otherwise anonymous. The sample consisted of 50% men; the mean age was 50.12 years (SD = 16.31). For 68 respondents, one to three item scores were missing. Two-way imputation [24,25] was used to replace the missing item scores by estimated scores.
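As an illustration of how such imputations can be computed, the following is a minimal sketch of plain two-way imputation (person mean plus item mean minus overall mean, rounded to an admissible score); the data frame hads and the 0–3 score range are our assumptions, and refs [24,25] describe more refined variants (e.g., adding a random error component).

```r
## Minimal two-way (TW) imputation sketch; `hads` is an assumed data frame with
## the 14 item-score columns and NA for missing responses, not the authors' code.
two_way_impute <- function(X, min_score = 0, max_score = 3) {
  X <- as.matrix(X)
  person_mean  <- rowMeans(X, na.rm = TRUE)  # PM_i: mean of person i's observed item scores
  item_mean    <- colMeans(X, na.rm = TRUE)  # IM_j: mean of item j's observed scores
  overall_mean <- mean(X, na.rm = TRUE)      # OM: grand mean of all observed scores
  for (i in seq_len(nrow(X))) {
    for (j in seq_len(ncol(X))) {
      if (is.na(X[i, j])) {
        tw <- person_mean[i] + item_mean[j] - overall_mean
        X[i, j] <- min(max(round(tw), min_score), max_score)  # round to an admissible score
      }
    }
  }
  X
}
# Usage: hads_complete <- two_way_impute(hads)
```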
Statistical analyses

Mokken scale analysis

Mokken scale analysis assesses the fit of a measurement model known as the monotone homogeneity model [9,10,26]. First, the monotone homogeneity model assumes a single attribute, such as anxiety or depression, to capture the associations between the item scores, implying that the items do not measure any other attribute in common. Second, the monotone homogeneity model assumes a monotone nondecreasing relation between the scores on an item and the attribute. The first assumption ensures that the items measure only one attribute rather than a conglomerate of attributes that hinders a straightforward interpretation of test performance. The second assumption reflects the principle that the higher one scores on the attribute scale, the higher one is expected to score on each of the items in the test that are indicators of the attribute. The function that relates the item score to the scale is the item response function. Mokken [9] and Sijtsma and Molenaar [10] provide technical details about the monotone homogeneity model.

Let the total score be defined as the sum of the scores on the J items in the test. Then, if the data are consistent with the monotone homogeneity model, individuals with a higher total score are expected to also score higher on the attribute [27,28]. Hence, a monotone homogeneity model that fits the anxiety-item data provides a justification for the use of the total score as a measure of anxiety, and likewise for depression.

Mokken scale analysis uses the item scalability coefficient Hj, which expresses the strength of the relation between the scores on item j and the attribute that the total score measures [29]. A high Hj value implies that the item distinguishes well between low scores and high scores on the attribute scale. It can be shown that the monotone homogeneity model implies 0 ≤ Hj ≤ 1. Mokken scale analysis aims at obtaining scales consisting of items with Hj values exceeding a lower bound c (default c = .3). The researcher can specify lower bound c to reflect the minimum strength that he requires the relation of an item with the attribute to have for the item to be admitted to the scale. Hence, items for which Hj ≥ c are admitted to the scale and items for which Hj < c are not admitted. The total-scale coefficient H reflects the discrimination power of the whole scale. Mokken, Lewis, and Sijtsma [30] suggested that H expresses the accuracy of a person ordering by means of the total score. Mokken [9] suggested that .30 ≤ H < .40 defines a weak scale, .40 ≤ H < .50 a medium scale, and H ≥ .50 a strong scale; H < .30 means that the items are unscalable.
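As a minimal illustration, assuming a matrix hads holding the 14 HADS item scores (the object name is ours), the item and total-scale scalability coefficients can be obtained with the mokken package used in this study:

```r
## Sketch: scalability coefficients for a set of item scores; `hads` is an
## assumed 14-column matrix of item scores, not distributed with the article.
library(mokken)

H_out <- coefH(hads)
H_out$Hij  # item-pair scalability coefficients Hij
H_out$Hi   # item scalability coefficients Hj
H_out$H    # total-scale coefficient H
```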

An automated item selection procedure [31] that is part of MSA partitions a set of items, such as the 14 HADS items, into one or more scales if the data permit. The two requirements for a scale are that (1) all inter-item correlations are positive and (2) each Hj value exceeds lower bound c; that is, Hj ≥ c [9,10]. The automated item selection procedure [9] is a bottom-up algorithm that starts with the two items i and j that have the highest, significantly positive Hij value that exceeds lower bound c. In each consecutive step, the procedure adds one item that correlates positively with the already selected items, has an Hj value that exceeds c, and produces the highest H value with the items already selected in the previous steps, given all possible items that are candidates for selection in the present step. The item selection proceeds until there are no items left that satisfy the requirements for inclusion in the scale. If items remain unselected, from these items the procedure may select a second scale, a third scale, and so on, until there are no items left or the items left are unscalable. Sometimes the procedure selects an item that, after completion of the procedure, no longer satisfies the scale requirements due to the items selected later in the procedure. Another problem is that the procedure does not always find the best possible partitioning. The first problem is circumvented, and the second problem is almost always circumvented, by the use of a genetic algorithm [31] that obtains only partitionings that satisfy the scale requirements.

Lower bound c serves as the criterion value that can be varied to study different levels of the hierarchical structure of a psychological attribute. We used the methodology that Hemker, Sijtsma, and Molenaar [32] and Sijtsma and Molenaar [10] recommended to find different dimensionality structures, which entails running the automated item selection procedure several times, starting with minimum c = 0, in each next run using a lower bound c that has increased by .05, and terminating at c = .60 or higher. We used the R package mokken [33] to run the automated item selection procedure and its genetic-algorithm version. To investigate the dimensionality of the HADS, both item selection procedures were run for c = .00, .05, …, .60. We also ran confirmatory MSA on a priori defined scales; that is, the item scalability coefficients and the total-scale coefficient were computed for fixed sets of items without item selection.

The monotonicity assumption was investigated as follows. The item response function of item j is estimated by considering, for each total score on the J − 1 items without item j (also known as the rest score), the mean score of item j. Monotonicity implies that for the next rest score the mean score is at least as large as it was for the previous rest score. Deviations from this expectation signal violations of monotonicity and possible distortions of an ordinal scale for measuring persons. We investigated for each item j whether the mean item score is a nondecreasing function of the rest score [34]. We used the R package mokken [33] to investigate the monotonicity assumption for a single 14-item HADS scale, the 7-item Anxiety and Depression scales, and the three scales Anxiety (A1, A2, A3, A5, and A7), Depression (D1, D2, D3, D4, and D6), and Restlessness (A4, A6, and D7) of the Caci et al. [2] model.
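A condensed sketch of these analyses with the mokken package is given below; the matrix hads, its column names, and the scale groupings are illustrative placeholders rather than the authors' scripts.

```r
## Sketch of the exploratory MSA, confirmatory MSA, and monotonicity checks;
## `hads` and its column names (A1-A7, D1-D7) are assumed for illustration.
library(mokken)

# Exploratory MSA: automated item selection for c = .00, .05, ..., .60,
# with the default search and with the genetic algorithm
lower_bounds  <- seq(0, .60, by = .05)
partitions    <- lapply(lower_bounds, function(lb) aisp(hads, lowerbound = lb))
partitions_ga <- lapply(lower_bounds, function(lb) aisp(hads, lowerbound = lb, search = "ga"))

# Confirmatory MSA: Hj and H for a priori defined scales, without item selection
scales <- list(
  Anxiety      = c("A1", "A2", "A3", "A5", "A7"),
  Depression   = c("D1", "D2", "D3", "D4", "D6"),
  Restlessness = c("A4", "A6", "D7")
)
confirmatory_H <- lapply(scales, function(items) coefH(hads[, items]))

# Monotonicity: mean item score as a function of the rest score
summary(check.monotonicity(hads))
```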
Reliability

We used four methods to estimate the total-score reliability [35]. The methods were coefficient α, coefficient λ2 (both computed using the R package mokken [33]), the greatest lower bound to the reliability (glb; computed using the R package psych [36]), and the Molenaar–Sijtsma (MS) method [10,37] (computed using the R package mokken [33]). Methods α, λ2, and glb are lower bounds to the total-score reliability. Their mutual relationship is α ≤ λ2 ≤ glb ≤ reliability. Coefficient α is the most frequently used estimate, but λ2 and the glb provide estimates closer to the population total-score reliability and may be preferred over α [35]. Note that for small samples, the glb may overestimate the reliability [38]. The MS method was developed in the context of MSA and is a reliability estimator that has smaller bias relative to the reliability than α and λ2 [10,37]. We computed the reliability estimates for a scale containing all 14 HADS items, the 7-item Anxiety and Depression scales, and the three scales of the Caci et al. [2] model.
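As a sketch under the same assumptions (an item-score matrix such as anxiety5 for one of the scales; the name is illustrative), these estimates can be obtained as follows; the particular psych function for the glb is our choice, since the article only names the package.

```r
## Sketch: four reliability estimates for one scale; `anxiety5` is an assumed
## matrix holding the five Anxiety item scores.
library(mokken)
library(psych)

rel <- check.reliability(anxiety5)  # Molenaar-Sijtsma method, alpha, and lambda.2
rel$MS; rel$alpha; rel$lambda.2

# Greatest lower bound (glb) computed from the observed covariance matrix
# (glb.algebraic may additionally require the Rcsdp package)
glb.algebraic(cov(anxiety5))$glb
```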

Table 1
Results from exploratory and confirmatory Mokken scale analysis in the non-clinical sample. The table reports item scalability coefficients Hj for items A1–A7 and D1–D7 and the total-scale coefficient H, for exploratory MSA at lower bounds c = 0, c = .3, and c = .45, and for confirmatory MSA on the Anxiety (Anx), Depression (Depr), and Restlessness (Restl) scales.

Table 3
Exploratory and confirmatory Mokken scale analysis results for a cardiac-patient sample, adapted from Emons et al. [11], Tables 1 and 5. The table reports Hj and H for exploratory MSA at increasing lower bounds c and for confirmatory MSA on the Anxiety (Anx), Depression (Depr), and Restlessness (Restl) scales.
Results

Mokken scale analysis

For each lower bound c, the automated item selection procedure and its genetic-algorithm version yielded the same item partitionings. Table 1 shows the item partitionings for lower bounds c equal to 0, .3, and .45; other c values did not provide additional information. For c = 0, all items were selected in one scale. For lower bound c = .3, the automated item selection procedure produced a single scale containing 11 items; items A6, D5, and D7 were unscalable. For higher c values, the automated item selection procedure found two distinct scales that resembled the shortened Anxiety and Depression scales of the Caci et al. [2] 3-factor model. Based on this result, we used confirmatory MSA and computed the Hj and H coefficients for the a priori identified Anxiety, Depression, and Restlessness scales. Given that two out of three items had Hj values just below .3, the Restlessness items were unscalable; hence, we found convincing 5-item scales for anxiety and depression. In none of the investigated scales – the 14-item scale, the 7-item Anxiety and Depression scales, and the three Caci et al. [2] scales – did we find violations of monotonicity.

Reliability

Table 2 shows the four reliability estimates for the models with one 14-item scale, two 7-item scales measuring anxiety and depression, and the three scales based on the Caci et al. [2] model. The 14-item scale had the highest reliability, and the 5-item Anxiety and Depression scales had higher α, λ2, and MS values than the 7-item Anxiety and Depression scales. The glb values of the 5-item Anxiety scale and the 7-item Anxiety scale were approximately equal, and the glb for the 5-item Depression scale was higher than the glb for the 7-item Depression scale. Hence, the reliability estimates suggest that items A4 and A6 do not contribute to the reliable measurement of anxiety and items D5 and D7 do not contribute to the reliable measurement of depression.

Comparing the non-clinical and cardiac-patient populations

Table 3 shows the exploratory and confirmatory MSA results that Emons et al. [11] obtained. For exploratory MSA, the scales in the non-clinical sample (Table 1) and the cardiac-patient sample (Table 3) were comparable, but for the cardiac-patient sample the dimensionality structure remained intact for higher values of lower bound c.

Table 2
Coefficients α, λ2, the glb, and the Molenaar–Sijtsma (MS) method for one scale, two scales, and three scales (non-clinical sample)

Scale                       α      λ2     glb    MS
One scale (14 items)       .832   .836   .873   .840
Two scales
  Anxiety (7-item)         .773   .776   .823   .784
  Depression (7-item)      .735   .739   .752   .734
Three scales
  Anxiety (5-item)         .780   .783   .801   .796
  Depression (5-item)      .762   .766   .800   .771
  Restlessness (3-item)    .560   .565   .634   .559

For confirmatory MSA, the Hj and H coefficients were also higher in the cardiac-patient sample than in the non-clinical sample. As a result, the Anxiety scale and the Depression scale are stronger scales in the cardiac-patient sample than in the non-clinical sample, and the Restlessness scale satisfied the Mokken scale criteria at the default lower bound of .3 in the cardiac-patient sample but not in the non-clinical sample.

Discussion

Fig. 1 summarizes the MSA results with respect to the HADS' hierarchical structure. At the top level of Fig. 1, all 14 items are selected in a single scale. This one-dimensional solution corresponds to results obtained from Rasch analysis and from MSA for c = 0. At the second level of Fig. 1, 10 items constitute a single psychological-distress scale whereas items A4, A6, D5, and D7 are no longer in the scale. This predominantly one-dimensional solution, in which a few items do not fit, corresponds to MSA for c = .3 in the non-clinical sample. At the third level in Fig. 1, the single psychological-distress scale breaks down into a 5-item Anxiety scale and a 5-item Depression scale. This corresponds to results obtained from exploratory factor analysis and from MSA for c = .45 in the non-clinical sample. Finally, confirmatory Mokken scale analysis and confirmatory factor analysis suggest a weak third 'restlessness' dimension at the bottom level of Fig. 1, which consists of items A4, A6, D7, and possibly D5.

The reliability of the 5-item Anxiety and Depression scales was higher than the reliability of their 7-item versions. This result confirmed the lower measurement quality of the four excluded items. Reliability values of the 5-item scales were smaller than .80, which was due to the small number of items.

Researchers using Rasch-model analysis [39,40] reported having obtained the 14-item general psychological-distress scale that MSA found for lower bound c values close to 0. Moreover, fit assessment of the Rasch model showed some evidence of the Anxiety and Depression scales, but did not reveal the low measurement quality of items A4, A6, D5, and D7 [39,40]. A problem of the Rasch model is that it assumes that all items relate to the attribute scale to the same degree, and not all fit-assessment methods pinpoint differences in degree as a source of misfit (e.g., [41,42]). Alternatively, a researcher may choose to fit the 2-parameter logistic model [43], which assumes that different items relate to the attribute scale to different degrees, and thus may distinguish items relating relatively weakly to the scale from items relating more strongly to the scale.

Exploratory factor analysis [14,17] yielded a 2-factor solution in which all items with index A loaded on the Anxiety factor and all items with index D loaded on the Depression factor.


Fig. 1. Graphical representation of the hierarchical structure of the HADS identified by MSA. (The figure distinguishes items A1, A2, A3, A5, A7, D1, D2, D3, D4, and D6 from items A4, A6, D7, and D5, and indicates the levels retrieved by Rasch analysis and factor analysis.)

Hence, exploratory factor analysis included the low-quality items in the factors, but MSA excluded these items from the two scales because Hj < c. Confirmatory factor analysis [15,16] yielded fit indices that were sensitive to the misfit of the low-quality items A4, A6, D5, and D7. As a result, a third factor had to be defined to obtain an acceptably fitting model. The main difference between the exploratory and the confirmatory factor analysis results was that confirmatory factor analysis identified the low-quality items by means of a misfitting two-factor model. Emons et al. [11] showed that the three item clusters from the Caci et al. [2] 3-factor model were consistent with the three scales found in a cardiac-patient sample.

In non-clinical samples, researchers (e.g., [14,17]) found a 2-dimensional structure, and in cardiac-patient samples, other researchers (e.g., [15,16]) found a 3-dimensional structure. The clusters of items we found using MSA in a non-clinical sample were comparable with the clusters of items that Emons et al. [11] found using MSA in a cardiac-patient sample. However, the Hj and H values were lower in the non-clinical sample. As a result, the Hj values of the items constituting the Restlessness scale were lower than .3 in the non-clinical sample and thus did not exceed the default lower bound. In the cardiac-patient sample, confirmatory MSA identified the Restlessness scale of the Caci et al. [2] 3-factor model because the Hj and H values exceeded .3. We concluded that the different dimensionality structures are due to the lower measurement quality of the HADS in non-clinical samples than in cardiac-patient samples.

Zigmond and Snaith [1] did not intend the HADS to measure restlessness in addition to anxiety and depression. Different dimensionality-assessment methods used in different populations produce consistent dimensionality results for the ten HADS items without the four restlessness items. An important question is whether the four restlessness items cover important aspects of anxiety and depression. This question is difficult to answer. Like many questionnaires, the HADS is not the operationalization of a well-tested and well-founded theory of anxiety and depression from which substantive arguments for the inclusion or exclusion of particular items were derived. Sijtsma [44,45] argued that in the absence of a well-tested theory about the attribute of interest, researchers can only rely on psychometric methods to decide about the dimensionality structure of their item sets. As a result, psychometric arguments rather than theoretical arguments are highly dominant, perhaps too dominant, in instrument construction. A well-established theory about anxiety and depression should guide the operationalization of the measurement instrument [44,45]. The HADS lacks a well-established theoretical foundation and, as a result, the HADS items predominantly define anxiety and depression.

The heavy reliance of researchers on psychometric methods for dimensionality assessment, each of which emphasizes a different level of the hierarchical HADS structure, explains much of the disagreement among different studies about the dimensionality structure of the HADS. Mokken scale analysis reveals the hierarchy in a dimensionality structure better than any of the other methods, and also makes researchers more aware that different dimensionality structures can be part of the same hierarchy. The HADS can have a future but needs to be based on better established and tested anxiety and depression theories. The two 5-item subscales can be an excellent basis for a novel HADS, whereas the four restlessness items may be discarded.

Acknowledgment

The authors thank Johan Denollet for making the Dutch non-clinical sample data available.

References

[1] Zigmond AS, Snaith RP. The Hospital Anxiety and Depression Scale. Acta Psychiatrica Scandinavica 1983;67:361–70.
[2] Caci H, Bayle FJ, Dossios C, Robert P, Boyer P. How does the Hospital Anxiety and Depression Scale measure anxiety and depression in healthy subjects? Psychiatry Research 2003;118:89–99.
[3] Moorey S, Greer S, Watson M, Gorman C, Rowden L, Tunmore R, et al. The factor structure and factor stability of the Hospital Anxiety and Depression Scale in patients with cancer. The British Journal of Psychiatry 1991;158:255–9.
[4] Brennan C, Worrall-Davies A, McMillan D, Gilbody S, House A. The Hospital Anxiety and Depression Scale: a diagnostic meta-analysis of case-finding ability. Journal of Psychosomatic Research 2010;69:371–8.
[5] Herrmann C. International experiences with the Hospital Anxiety and Depression Scale — a review of validation data and clinical results. Journal of Psychosomatic Research 1997;42:17–41.
[6] Bjelland I, Dahl AA, Tangen Haug T, Neckelmann D. The validity of the Hospital Anxiety and Depression Scale: an updated literature review. Journal of Psychosomatic Research 2002;52:69–77.
[7] Cosco TD, Doyle F, Ward M, McGee H. Latent structure of the Hospital Anxiety and Depression Scale: a 10-year systematic review. Journal of Psychosomatic Research 2012;72:180–4.
[8] Coyne JC, Van Sonderen E. No further research needed: abandoning the Hospital and Anxiety Depression Scale (HADS). Journal of Psychosomatic Research 2012;72:173–4.
[9] Mokken RJ. A theory and procedure of scale analysis. The Hague, The Netherlands: Mouton; 1971.
[10] Sijtsma K, Molenaar IW. Introduction to nonparametric item response theory. Thousand Oaks, CA: Sage; 2002.
[11] Emons WHM, Sijtsma K, Pedersen SS. Dimensionality of the Hospital Anxiety and Depression Scale (HADS) in cardiac patients: comparison of Mokken scale analysis and factor analysis. Assessment 2012;19:337–53, http://dx.doi.org/10.1177/107319111038495.
[12] Straat JH, Van der Ark LA, Sijtsma K. Multi-method analysis of the internal structure of the Type D Scale-14 (DS14). Journal of Psychosomatic Research 2012;72:258–65.
[13] Wismeijer AAJ, Sijtsma K, Van Assen MALM, Vingerhoets AJJM. A comparative study of the dimensionality of the self-concealment scale using principal component analysis and Mokken scale analysis. Journal of Personality Assessment 2008;90:323–34.
[14] Andrea H, Bültmann U, Beurskens AJ, Swaen GM, van Schayck CP, Kant IJ. Anxiety and depression in the working population using the HAD Scale — psychometrics, prevalence and relationships with psychosocial work characteristics. Social Psychiatry and Psychiatric Epidemiology 2004;39:637–46.
[15] Hunt-Shanks T, Blanchard C, Reid R, Fortier M, Cappelli M. A psychometric evaluation of the Hospital Anxiety and Depression Scale in cardiac patients: addressing factor structure and gender invariance. British Journal of Health Psychology 2010;15:97–114.
[16] Martin CR, Thompson DR, Barth J. Factor structure of the Hospital Anxiety and Depression Scale in coronary heart disease patients in three countries. Journal of Evaluation in Clinical Practice 2008;14:281–7.
[17] Mykletun A, Stordal E, Dahl AA. Hospital anxiety and depression (HAD) scale: factor structure, item analyses and internal consistency in a large population. The British Journal of Psychiatry 2001;179:540–4.
[18] Kupper N, Denollet J. Social anxiety in the general population: introducing abbreviated versions of SIAS and SPS. Journal of Affective Disorders 2012;136:90–8.
[19] Reise SP, Waller NG, Comrey AL. Factor analysis and scale revision. Psychological Assessment 2000;12:287–97.
[20] Norton S, Sacker A, Done J. Further research needed: a comment on Coyne and van Sonderen's call to abandon the Hospital Anxiety and Depression Scale. Journal of Psychosomatic Research 2012;73:75–6.
[21] Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press; 1960.
[22] Meijer RR, Sijtsma K, Smid NG. Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement 1990;14:283–98.
[23] De Koning E, Sijtsma K, Hamers JHM. Comparing four IRT models when analyzing two tests for inductive reasoning. Applied Psychological Measurement 2002;26:302–20.
[24] Bernaards CA, Sijtsma K. Influence of imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivariate Behavioral Research 2000;35:321–64.
[25] Van Ginkel JR, Van der Ark LA, Sijtsma K. Multiple imputation of item scores in test and questionnaire data, and influence on psychometric results. Multivariate Behavioral Research 2007;42:387–414.
[26] Sijtsma K, Meijer RR. Nonparametric item response theory and related topics. In: Rao CR, Sinharay S, editors. Handbook of statistics: Psychometrics. 1st ed. Amsterdam, The Netherlands: Elsevier, North Holland; 2007.
[27] Grayson DA. Two-group classification in latent trait theory: scores with monotone likelihood ratio. Psychometrika 1988;53:383–92.
[28] Van der Ark LA. Stochastic ordering of the latent trait by the sum score under various polytomous IRT models. Psychometrika 2005;70:283–304.
[29] Van Abswoude AAH, Van der Ark LA, Sijtsma K. A comparative study of test dimensionality assessment procedures under nonparametric IRT models. Applied Psychological Measurement 2004;28:3–24.
[30] Mokken RJ, Lewis C, Sijtsma K. Rejoinder to 'The Mokken scale: A critical discussion'. Applied Psychological Measurement 1986;10:279–85.
[31] Straat JH, Van der Ark LA, Sijtsma K. Comparing optimization algorithms for item selection in Mokken scale analysis. Journal of Classification, in press.
[32] Hemker BT, Sijtsma K, Molenaar IW. Selection of unidimensional scales from a multidimensional item bank in the polytomous Mokken IRT model. Applied Psychological Measurement 1995;19:337–52.
[33] Van der Ark LA. Mokken scale analysis in R. Journal of Statistical Software 2007;20:1–19.
[34] Junker BW, Sijtsma K. Latent and manifest monotonicity in item response models. Applied Psychological Measurement 2000;24:65–81.
[35] Sijtsma K. On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika 2009;74:107–20.
[36] Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research (version 1.2.8), http://cran.r-project.org/web/packages/psych/index.html; 2012.
[37] Van der Ark LA, Van der Palm DW, Sijtsma K. A latent class approach to estimating test-score reliability. Applied Psychological Measurement 2011;35:380–92.
[38] Shapiro A, Ten Berge JMF. The asymptotic bias of minimum trace factor analysis, with applications to the greatest lower bound to reliability. Psychometrika 2000;65:413–25.
[39] Gibbons CJ, Mills RJ, Thornton EW, Ealing J, Mitchell JD, Shaw PJ, et al. Rasch analysis of the Hospital Anxiety and Depression Scale (HADS) for use in motor neurone disease. Health and Quality of Life Outcomes 2011;9:82–9.
[40] Pallant JF, Tennant A. An introduction to the Rasch measurement model: an example using the Hospital Anxiety and Depression Scale (HADS). British Journal of Clinical Psychology 2007;46:1–18.
[41] Molenaar IW. Some improved diagnostics for failure of the Rasch model. Psychometrika 1983;48:49–72.
[42] Glas CAW, Verhelst ND. Testing the Rasch model. In: Fischer GH, Molenaar IW, editors. Rasch models: their foundations, recent developments and applications. New York: Springer; 1995. p. 69–96.
[43] Birnbaum A. Some latent variable models and their use in inferring an examinee's ability. In: Lord FM, Novick MR. Statistical theories of mental test scores. Reading, MA: Addison-Wesley; 1968. p. 397–472.
[44] Sijtsma K. Future of psychometrics: ask what psychometrics can do for psychology. Psychometrika 2012;77:4–20.
[45] Sijtsma K. Psychological measurement between physics and statistics. Theory & Psychology 2012;22:786–809.