J. Psychiat. Res., Vol. 16, pp. 29-39. © Pergamon Press Ltd. 1981. Printed in Great Britain
ASSESSMENT OF AGREEMENT AMONG SEVERAL RATERS FORMULATING MULTIPLE DIAGNOSES

JUAN E. MEZZICH
Western Psychiatric Institute and Clinic, University of Pittsburgh School of Medicine

HELENA C. KRAEMER
Stanford University School of Medicine

DAVID R. L. WORTHINGTON
SRI International

and GERALD A. COFFMAN
Western Psychiatric Institute and Clinic, University of Pittsburgh School of Medicine

(Received 15 April 1980; revised 25 July 1980)
THE FULFILLMENT of the various purposes of diagnostic judgments (i.e. their validity) has as a prerequisite their attainment of satisfactory reliability. The general idea is that an instrument has to yield reproducible and accurate measurements before it can be clinically useful. The classical theory of reliability in behavioral science is based on SPEARMAN's¹ assumption that a score obtained for a given individual is additively composed of a true score and an independent error score. This theory defines reliability as the ratio of true (between-subject) variance to total variance, and was presented by GULLIKSEN² in terms of correlations between parallel tests. The generalizability theory proposed by CRONBACH et al.³ extended the concept of reliability as accuracy of generalization from a set of observed scores to a specified universe of scores, again using a ratio of variances as a measure. It emphasized the importance of identifying hypothesized components of an observed score (e.g. subject variance, assessor variance, criterion variance, etc.) and then of evaluating them by sampling relevant multiple observations. The application of generalizability theory, initially formulated for interval scale data, can be extended to data on any scale. For this purpose, reliability can be defined more generally as the degree to which an observation or diagnostic judgment is free from error, with error not necessarily specified as a ratio of variances. A coefficient of reliability should approach 1 as multiple judgments of the same subject become similar to each other, and approach 0 when there is only chance agreement.
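In symbols, the classical model just described can be summarized as follows (a standard textbook restatement; the notation is ours, not the original paper's):

$$X = T + E, \qquad \text{reliability} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2},$$

where $X$ is the observed score, $T$ the true score, $E$ the independent error, and $\sigma_T^2$, $\sigma_E^2$ their variances; the ratio approaches 1 as error variance vanishes and 0 as error dominates.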
Requests for reprints should be sent to Juan E. Mezzich, Western Psychiatric Institute and Clinic, University of Pittsburgh School of Medicine, 3811 O'Hara St., Pittsburgh, Pennsylvania 15261.
At this time, the most widely accepted criterion for psychiatric case identification and classification is the judgment made by qualified clinicians. On this basis, diagnostic reliability can be formulated as inter-rater agreement. In fact, as clinical judgments usually are the final pathway of the diagnostic process, the relative impact on reliability of various sources or components of variance (e.g. different procedures for information gathering, different diagnostic systems, different training for clinicians) can be evaluated through the assessment of inter-rater agreement for judgments made as these circumstances are systematically varied.

Some diagnostic judgments are made on a nominal or typological scale (e.g. clinical syndrome) and others on a dimensional scale (e.g. level of social functioning). Most multi-axial psychiatric diagnostic systems currently being used or proposed in various parts of the world include both typological and dimensional axes. In this paper, only the issue of inter-rater agreement on a nominal scale, and more specifically agreement among several raters formulating multiple typological diagnoses, is dealt with.

A number of approaches are reported in the literature for the assessment of inter-rater agreement on a nominal scale. These include indices such as the percentage of agreement (based on the ratio of the number of agreements on the considered categories to the total number of possible agreements). This measure ignores chance agreement (two raters who each formulate a diagnosis A 80% of the time would, just by chance, have a percentage of agreement on A of 64%), and it is therefore not a clearly interpretable measure of reliability. Another index sometimes used is a chi-square value computed on a contingency table cross-classifying the diagnoses made by two raters; this measures association between the two raters (in the sense of deviation from random expectations) but not specifically agreement. Free from these shortcomings is the kappa coefficient (k) proposed by COHEN⁴, which represents a proportion of agreement corrected for chance. Its formula is:
$$k = \frac{P_o - P_c}{1 - P_c}, \qquad (1)$$

where $P_o$ stands for the observed proportion of agreement and $P_c$ for the chance proportion of agreement. Its value is 1 when observed agreement is perfect and 0 when it is no different from chance agreement. On the basis of these features, the kappa coefficient is becoming widely recognized as the preferred index for assessing inter-rater agreement on a nominal scale.

In an attempt to provide partial credit for various degrees of inter-rater disagreement, COHEN⁵ proposed a weighted kappa coefficient. Standard errors of kappa and weighted kappa, for the case of two raters, have been given by various authors.⁶⁻⁹ For the development of weighted kappa, levels of disagreement between all possible pairs of DSM-I and DSM-II broad diagnostic categories were judgmentally estimated by two experienced clinicians, from 0 (identical diagnoses) to 9 (extreme disagreement).¹⁰ For comparative purposes, two other clinicians were asked to formulate similar judgments. The correlation between the disagreement levels assigned by these two clinicians to pairs of diagnostic categories was 0.67, while the correlations between each of these clinicians' levels and those used for developing weighted kappa were both 0.59, which may suggest
that the judgments made by the two extra clinicians were somewhat more reliable than those used for weighted kappa. The central problem with weighted kappa seems to be the arbitrary nature of the assigned weights (e.g. a weight of 9 for the disagreement between paranoid schizophrenia and homosexuality and of 7 for that between paranoid schizophrenia and alcohol addiction). Weighting diagnostic disagreements requires assumptions that may not be appropriate to nominal data. Not only do investigators have to decide which disagreements are serious, they must also give a numeric rating to the degree of seriousness, as noted by HELZER et al.¹¹ It is also not clear whether the relationships among the weights should be geometrical or additive.¹² The lack of an obvious or universal rule for assigning weights leaves each group of investigators free to estimate a different set of weights and therefore to compute, from the same diagnostic formulations, different weighted kappa values. This would reduce the comparability of various studies of the same diagnostic system. Furthermore, when new diagnostic systems appear (e.g. ICD-9, DSM-III) new sets of weights are required, which may bear no relation to weights created for other systems.

The kappa coefficient was originally developed to assess inter-rater typological agreement in the basic situation in which two raters formulate one diagnosis each for each patient. FLEISS¹³ extended the procedure to deal with the situation in which more than two raters (but a fixed number of them) are involved in formulating a single diagnosis for each of a sample of patients. He also derived large-sample standard errors of kappa for testing the statistical significance of its difference from zero. Those formulae were later corrected by FLEISS et al.¹⁴ Kappa was also extended by FLEISS and CUZICK¹⁵ to deal with the case of dichotomous judgments made by unequal numbers of raters per subject.

There has been increasing awareness that, within our current nosological understanding, a patient may simultaneously present or experience more than one clinical syndrome (e.g. schizophrenia, alcoholism and mental retardation). In line with this, recent standard diagnostic systems allow and encourage the formulation of several syndromes or entities for each patient. The consideration of multiple diagnosis is incorporated in the currently official and uniaxial ICD-9,¹⁶ in DSM-II,¹⁷ as well as in each of the three typological axes (I. Clinical psychiatric syndromes; II. Personality disorders for adults and specific developmental disorders for children and adolescents; and III. Physical disorders) of DSM-III.¹⁸

Multiple diagnosis presents a new challenge to the assessment of inter-rater agreement. The procedures previously discussed, including the original computation of kappa, are not directly applicable here. FLEISS et al.¹⁹ made a significant attempt to deal with the complexity of the new situation. Their procedure involves several steps and decisions which should be carefully considered. The first basic step consists in the judgmental determination of weights for the disagreements between all possible pairs of diagnostic categories. As discussed above, the introduction of this weighting of disagreements is problematic because it both involves questionable assumptions and opens the option for different groups of investigators to estimate their own sets of weights and consequently reduce the comparability of results across studies.
Another step in their procedure involves summarizing the agreement between two multiple diagnoses by arbitrarily selecting, for each diagnostic category, the lowest of its various values of disagreement with other categories, which tends to inflate the assessed inter-rater agreement.
The potential variability involved in this step is illustrated by the fact that in the example they presented with two multiple diagnoses, their procedure yielded an overall disagreement value of 2.4, which would become 4.6 if full averaging throughout the summarizing process (arguably a more straightforward and fair procedure) were used. The computation of kappa in their proposal is based on the mean observed disagreement value (across patients) and the chance-expected disagreement mean. Their procedure was described for the case of only two multiple-diagnosis ratings made for each patient.

From the above discussion follows the need for new procedures to deal more appropriately with the computation of an unweighted kappa coefficient for the general case in which multiple raters (not necessarily fixed in number) formulate multiple diagnoses for each subject. The difficulty lies in determining what constitutes a single diagnosis and how to quantify agreement between two such diagnoses. One can define the diagnosis as the "primary" category chosen, with agreement determined ignoring all other categories mentioned. This implies that the interpretation and clinical use of, for example, a diagnosis of schizophrenia is the same whether or not alcoholism is also mentioned. Alternatively, one might regard each category chosen as a separate and totally independent diagnosis. Agreement is then assessed separately on each available diagnostic category. This procedure will tend to yield higher kappas for frequently mentioned categories and lower kappas for infrequent categories. One can therefore increase kappa by encouraging clinicians to formulate longer lists.

Neither of these approaches reflects clinical reality; when multiple categories are selected, the set of categories as a whole constitutes the diagnosis. Whether the order of presentation is also important is an open question. Furthermore, two diagnoses (lists of categories) are in total agreement only when, clearly, the same categories appear on each list, not merely when the first-mentioned category agrees or, alternatively, when the same category is either present on or absent from both lists. If order is important, the categories must appear in the same order as well. Complete disagreement occurs only if there is no overlap between the lists. Quantification of agreement between these extremes must systematically give credit for partial agreement.

Three procedures are described here. The first two are based on the consideration of a multiple-diagnosis formulation as a list of non-ordered categories applicable to a patient; they represent alternative conceptualizations of agreement between non-ordered diagnostic lists. The third procedure is based on the assumption that the categories in each list are specifically ranked as primary, secondary, tertiary, etc.

The procedures proposed in this paper for assessing agreement among several raters formulating multiple diagnoses involve (1) the computation of a measure of agreement between two multiple diagnoses for the same subject according to an operationally defined procedure, (2) the computation of mean agreement among all raters of a single subject, (3) the computation of overall mean intrasubject agreement, (4) the determination of a measure of chance agreement, and (5) the computation of a kappa coefficient and assessment of its statistical significance.

PROCEDURE 1: PROPORTIONAL OVERLAP
This procedure defines proportion of agreement between two diagnostic formulations as the ratio of the number of agreements between specific categories over the number
of possible agreements, i.e. the number of different specific categories mentioned in the two diagnostic lists. To illustrate this step, let us consider the following two diagnostic formulations:

Diagnostic formulation 1: substance abuse, schizophrenia, mental retardation
Diagnostic formulation 2: substance abuse, affective disorder
The proportion of agreement between these two diagnostic formulations according to the "proportional overlap" criterion is the ratio of 1 (one agreement, on "substance abuse") over 4 (four possible agreements, corresponding to the four different categories mentioned: "substance abuse", "schizophrenia", "mental retardation", and "affective disorder"). The numerical value of the "proportional overlap" criterion ranges between 0 and 1. It will be 1 only when agreement is perfect (i.e. the lists match), will gradually decrease with progressively larger differences in content and size between the diagnostic formulations, and will be 0 only when there is no overlap between the lists.

Agreement among the several raters for a subject is measured by averaging the proportions of agreement obtained for all pairs of diagnosticians judging that subject. The overall observed proportion of agreement ($P_o$) for the sample of subjects under consideration is the average of the mean proportions of agreement obtained for each of the $N$ subjects in the sample. This procedure handles the situation of a fixed number as well as that of a variable number of raters involved with each subject.

In order to obtain a reliability measure corrected for chance, a proportion of chance agreement ($P_c$) needs to be determined. This is accomplished by computing proportions of agreement between all diagnostic formulations made by all raters for all subjects, and then averaging across them. Using the overall observed and chance proportions of agreement described above, a kappa coefficient can be computed according to Equation 1.

The standard error of kappa can be estimated as follows. Let $S^2(P)$ be the variance of the observed proportions of agreement across subjects and $N$ be the number of subjects. Then, since $P_c$ is a constant,

$$\mathrm{Var}(k) = \mathrm{Var}\left(\frac{P_o - P_c}{1 - P_c}\right) = \frac{\mathrm{Var}(P_o)}{(1 - P_c)^2} = \frac{S^2(P)}{N(1 - P_c)^2},$$

and finally,

$$\mathrm{Se}(k) = \frac{S(P)}{\sqrt{N}\,(1 - P_c)}.$$

The significance of $k$ can be approximately assessed by comparing $k/\mathrm{Se}(k)$ with critical values of a $t$ distribution with $N - 1$ degrees of freedom.
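To make these steps concrete, the sketch below computes the proportional overlap kappa, its standard error, and the approximate t statistic for diagnostic lists coded as category numbers. It is an illustrative reconstruction, not the authors' original program; the function names and the toy data set are ours, and the toy data are only loosely patterned on Table 1, so the printed values will not match the worked example that follows.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def proportional_overlap(list_a, list_b):
    """Number of agreed categories over the number of distinct categories mentioned."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

def kappa_proportional_overlap(subjects):
    """subjects: one entry per subject, each a list of the raters' category lists."""
    # Steps 1-3: mean pairwise agreement within each subject, then across subjects (P_o).
    per_subject = [mean(proportional_overlap(x, y)
                        for x, y in combinations(raters, 2))
                   for raters in subjects]
    p_o = mean(per_subject)
    # Step 4: chance agreement (P_c) = mean agreement over all pairs of formulations,
    # pooling every rater's list for every subject.
    pooled = [formulation for raters in subjects for formulation in raters]
    p_c = mean(proportional_overlap(x, y) for x, y in combinations(pooled, 2))
    # Step 5: kappa, its standard error, and the approximate t statistic (N - 1 df).
    k = (p_o - p_c) / (1 - p_c)
    se = stdev(per_subject) / (sqrt(len(subjects)) * (1 - p_c))
    return k, se, k / se

# Hypothetical toy data: 3 subjects, each rated by 3 clinicians, category codes 1-20.
subjects = [
    [[9, 11], [11, 9, 14], [16, 9]],
    [[16], [16, 14], [12]],
    [[17], [12], [7, 8]],
]
print(kappa_proportional_overlap(subjects))
```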
Example
Let us consider the following diagnostic exercise, in which 30 child psychiatrists made independent diagnoses of 27 child psychiatric cases. Each psychiatrist rated 3 cases, and each case turned out to be rated by 3 or 4 psychiatrists upon completion of the study. Table 1 shows the resulting 90 multiple diagnostic formulations. Each diagnostic formulation presented was composed of up to three broad diagnostic categories taken from Axis I (clinical psychiatric syndromes) of the American Psychiatric Association's new Diagnostic and Statistical Manual of Mental Disorders (DSM-III).

TABLE 1. MULTIPLE DIAGNOSTIC FORMULATIONS FOR 27 CHILD PSYCHIATRIC CASES USING DSM-III AXIS I BROAD CATEGORIES*
Entries are the category numbers defined in the footnote; each row lists one rater position's formulations for cases 1-27 in order.

Rater 1 (cases 1-27): 9,11 | 16 | 17 | 16,13 | 7 | 10 | 7,16 | 1,14 | 5 | 12,13,14 | 13 | 5,18 | 14,13 | 11,16 | 10 | 14,5 | 12 | 20 | 13 | 9,14,10 | 12,11 | 17 | 16,13 | 12 | 13 | 13 | 10,9

Rater 2 (cases 1-27): 11,9,14 | 16,14 | 12 | 13,16,14 | 7,12,13 | 10 | 13 | 13 | 20 | 12,14,13 | 18 | 1,5,18 | 14,7,7 | 14,11,16 | 3,18 | 5,16 | 12,11 | 16 | 14 | 9,11,14 | 11,14 | 12 | 12 | 12 | 20 | 13,16 | 9,10

Rater 3 (cases 1-27): 16,9 | 12 | 7,8 | 16 | 13 | 10 | 16 | 16,13 | 13,14 | 12,11,14 | 16 | 1 | 14,16 | 11,13 | 10,11 | 14 | 12 | 16 | 14 | 10,9 | 11 | 1? | 14 | 16 | 13 | 13 | 9

Rater 4 (the nine cases that received a fourth rating, in case order, beginning with case 1): 11,9 | 14,5 | 13 | 12,17,15 | 13 | 12 | 13 | 16 | 9,10
*1. Organic mental disorders; 2. Substance use disorders; 3. Schizophrenic and paranoid disorders; 4. Schizoaffective disorders; 5. Affective disorders; 6. Psychoses not elsewhere classified; 7. Anxiety, factitious, somatoform, and dissociative disorders; 8. Psychosexual disorders; 9. Mental retardation; 10. Pervasive developmental disorders; 11. Attention deficit disorders; 12. Conduct disorders; 13. Anxiety disorders of childhood or adolescence; 14. Other disorders of childhood or adolescence (speech and stereotyped movement disorders, disorders characteristic of late adolescence); 15. Eating disorders; 16. Reactive disorders not elsewhere classified; 17. Disorders of impulse control not elsewhere classified; 18. Sleep and other disorders; 19. Conditions not attributable to a mental disorder; 20. No diagnosis on Axis I.
According to the proportional overlap procedure, the proportion of agreement between the diagnostic formulations made by clinicians 1 and 2 for case 1 is the ratio of 2 (one agreement on "9. Mental retardation" and one agreement on "11. Attention deficit disorder") over 3 (three possible agreements, corresponding to the three different categories mentioned: 9, 11 and 14). The proportion of agreement on case 1 between the diagnostic formulations of clinicians 1 and 3 is 1/3, between 1 and 4 is 1, between 2 and 3 is 1/4, between 2 and 4 is 2/3, and between 3 and 4 is 1/3. Consequently, the average proportion of agreement for case 1 is 0.54. Following the same computation procedure it can be found that the average proportion of agreement for case 2 is 0.14 and for case 3 is 0. Similarly, average proportions of agreement are computed for each of the remaining 24 cases. Over all 27 cases, the overall mean proportion of agreement ($P_o$) is 0.36 with a standard deviation of 0.24.

In order to estimate chance agreement, a proportion of agreement is computed for each pair of the 90 diagnostic formulations made by all raters for all subjects. The average proportion of chance agreement ($P_c$) turns out to be 0.12. By applying the previously described definition of kappa,

$$k = \frac{0.36 - 0.12}{1 - 0.12} = 0.27.$$

The statistical significance of the obtained kappa value above zero is assessed by assuming that $P_c$ is constant. Then:

$$\mathrm{Se}(k) = \frac{0.24}{\sqrt{27}\,(1 - 0.12)} = 0.05, \qquad t = \frac{0.27}{0.05} = 5.4.$$

For one-tailed critical values with (N - 1) degrees of freedom, where N = 27, the significance of k is p < 0.001.

PROCEDURE 2: INTRACLASS CORRELATION
This procedure assesses agreement among multiple diagnostic lists composed of non-ordered categories through the computation of intraclass correlation coefficients. Given the nominal structure of the scale underlying typological diagnosis, intraclass correlation coefficients are not used here as final reliability indices²⁰ but as intermediate steps leading to the determination of kappa coefficients. The computation begins by defining a K-element vector (where K is the number of available diagnostic categories) for each diagnostic formulation. If a category is mentioned, a one is placed in the appropriate element of the vector; if a category is not mentioned, a zero is placed in the corresponding position. A measure of agreement is obtained by computing an intraclass correlation coefficient for all raters evaluating a given subject. Perfect agreement is obtained if and only if the same diagnostic categories are mentioned by all the raters for a given subject, without regard to order.
Let $e_i$ = intraclass correlation coefficient among the raters of subject $i$ (allows for multiple and variable numbers of raters),
$e_I$ = average of $e_1, e_2, \ldots, e_N$ ($N$ subjects),
$e_T$ = intraclass correlation coefficient between all raters, all subjects.

Then,

$$k = \frac{e_I - e_T}{1 - e_T}.$$

Under the null hypothesis the expected value of kappa ($k$) is zero. At the other extreme, $k = 1$ if and only if there is agreement among the diagnostic formulations for every single subject ($e_i = 1$) and some heterogeneity between subjects ($e_T \neq 1$). The standard error of $k$ can be estimated as follows, when a fixed marginal distribution is assumed and the subject sample is at least of moderate size:

$$\mathrm{Se}(k) = \frac{S_e}{\sqrt{N}\,(1 - e_T)},$$

where $S_e^2$ is the variance of the $e_i$ over subjects. The derivation of this formula is similar to the derivation of the corresponding formula in Procedure 1. The statistical significance of $k$ above zero is tested by comparing $k/\mathrm{Se}(k)$ with the critical values of a $t$ distribution with $N - 1$ degrees of freedom.
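The sketch below shows one way the intraclass correlation procedure could be computed, treating the K diagnostic categories as the classes of a one-way ANOVA and the raters' 0/1 indicator entries as replicate observations. This is our reading of the description above, not the authors' code; in particular, the exact ANOVA layout used to obtain e_T from all raters and all subjects is an assumption, and the names are ours.

```python
from math import sqrt
from statistics import mean, stdev

K = 20  # number of available broad diagnostic categories (assumed, as in the Axis I example)

def indicator_vector(formulation, k=K):
    """K-element 0/1 vector marking the categories (coded 1..K) that are mentioned."""
    mentioned = set(formulation)
    return [1 if c in mentioned else 0 for c in range(1, k + 1)]

def icc_oneway(vectors):
    """One-way ANOVA intraclass correlation across a set of K-element vectors,
    with categories as classes and the vectors' entries as replicate observations."""
    n = len(vectors)                      # number of formulations (raters) compared
    groups = list(zip(*vectors))          # one group of n observations per category
    g = len(groups)
    grand = mean(v for grp in groups for v in grp)
    msb = n * sum((mean(grp) - grand) ** 2 for grp in groups) / (g - 1)
    msw = sum((v - mean(grp)) ** 2 for grp in groups for v in grp) / (g * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

def kappa_intraclass(subjects):
    """subjects: one entry per subject, each a list of the raters' category lists."""
    per_subject = [icc_oneway([indicator_vector(f) for f in raters])
                   for raters in subjects]
    e_i = mean(per_subject)               # mean within-subject ICC (e_I)
    pooled = [indicator_vector(f) for raters in subjects for f in raters]
    e_t = icc_oneway(pooled)              # chance-level ICC over all raters, all subjects (e_T)
    k = (e_i - e_t) / (1 - e_t)
    se = stdev(per_subject) / (sqrt(len(subjects)) * (1 - e_t))
    return k, se, k / se
```

The function assumes every formulation mentions at least one but fewer than K categories, so the ANOVA mean squares are well defined.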
Example

The set of multiple diagnostic formulations made for the 27 child psychiatric cases presented in Table 1 will be used again to illustrate the intraclass correlation procedure. In the diagnostic formulation made by psychiatrist 1 for case 1, a value of 1 is assigned to both the "9. Mental retardation" and "11. Attention deficit disorder" positions in the 20-element vector (where 20 is the number of broad diagnostic categories), and zeros are assigned to the remaining elements of the vector. Intraclass correlations (using one-way ANOVA) are computed between all diagnostic formulations for each subject. For example, the intraclass correlations for cases 1, 2 and 3 are: $e_1 = 0.64$, $e_2 = 0.17$, $e_3 = -0.06$. This procedure is completed for the remaining cases and an average ($e_I$) and standard deviation ($S_e$) are computed. Next, the intraclass correlation between all psychiatrists and all subjects is computed ($e_T$). Then:

$$e_I = 0.41, \quad e_T = 0.09, \quad S_e = 0.28,$$

$$k = \frac{0.41 - 0.09}{1 - 0.09} = 0.35, \qquad \mathrm{Se}(k) = \frac{0.28}{\sqrt{27}\,(1 - 0.09)} = 0.06, \qquad t = \frac{0.35}{0.06} = 5.8.$$

For a one-tailed test with (N - 1) degrees of freedom, where N = 27, the statistical significance of kappa above zero is p < 0.001.

PROCEDURE 3: RANK CORRELATION
This procedure for assessing agreement among several raters formulating multiple diagnoses is based on the assumption that the categories included in a diagnostic formulation
are specifically ranked as primary, secondary, tertiary, etc. Thus, each diagnostic formulation is considered a ranking of the K available categories in a diagnostic system (or in a given axis in the case of the multiaxial model). As in Procedure 2, a K-element vector is defined. For a diagnostic formulation including 3 categories, a 1 is assigned to the element of the vector corresponding to the first category mentioned, a 2 to the element corresponding to the second category mentioned, and a 3 to the element corresponding to the third category. The average of the remaining ranks, that is (K + 4)/2, is assigned to the other K - 3 elements of the vector. The agreement between two ranked diagnostic formulations is measured by the Spearman rank correlation coefficient. Perfect agreement is obtained if and only if two ranked diagnostic formulations are identical in every respect.

Let $r_i$ = average Spearman correlation coefficient between all pairs of raters of subject $i$ (allows for multiple and variable numbers of raters),
$r_I$ = average of $r_1, r_2, \ldots, r_N$ ($N$ subjects),
$r_T$ = average Spearman correlation coefficient among all pairs of raters, all subjects.

Then,

$$k = \frac{r_I - r_T}{1 - r_T}.$$

The assessment of the statistical significance of $k$ above zero is exactly the same as for Procedure 2. For a full mathematical description of the technique, with procedures for improving computational efficiency, see KRAEMER.²¹

Example

The previously described diagnostic exercise (Table 1) will also be used to illustrate the third procedure for inter-rater reliability assessment. In the diagnostic formulation made by psychiatrist 1 for case 1, a rank of 1 is given to "9. Mental retardation", a rank of 2 to "11. Attention deficit disorder", and a rank of 11.5 (the average of the remaining ranks) to each of the other 18 available broad diagnostic categories. These ranks are assigned to their corresponding positions in the K-element vector. This rank assignment procedure is then completed for each of the other 89 diagnostic formulations. The next step is to compute an average Spearman correlation for all pairs of diagnostic formulations corresponding to a given case history. For the first three cases the results are: $r_1 = 0.58$, $r_2 = 0.17$, $r_3 = -0.06$. Similarly, average Spearman correlations are computed for each of the 24 remaining cases and an overall average ($r_I$) and standard deviation ($S_r$) are computed. The last step is to compute the average Spearman correlation between all pairs of psychiatrists over all case histories ($r_T$). The results are:

$$r_I = 0.40, \quad r_T = 0.09, \quad S_r = 0.28,$$

$$k = \frac{0.40 - 0.09}{1 - 0.09} = 0.34, \qquad \mathrm{Se}(k) = \frac{0.28}{\sqrt{27}\,(1 - 0.09)} = 0.06, \qquad t = \frac{0.34}{0.06} = 5.7.$$
For a one-tailed test with (N - 1) degrees of freedom, where N = 27, the significance of kappa above zero is p < 0.001.
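As a companion to the earlier sketches, the following is one way the rank correlation procedure could be coded; Spearman's coefficient with tied ranks is obtained here as the Pearson correlation of the rank vectors. Again, the function names and the exact handling of the pooled baseline r_T are our assumptions rather than the authors' specification.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

K = 20  # number of available diagnostic categories (assumed, as in the Axis I example)

def rank_vector(formulation, k=K):
    """Ranks 1, 2, ... for the categories in the order listed; all unmentioned
    categories share the average of the remaining ranks, e.g. (K + 4)/2 when
    three categories are mentioned."""
    m = len(formulation)
    ranks = [(m + 1 + k) / 2] * k                 # average of ranks m+1, ..., K
    for position, category in enumerate(formulation, start=1):
        ranks[category - 1] = position
    return ranks

def pearson(x, y):
    """Pearson correlation; applied to tied-rank vectors it equals Spearman's rho."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def kappa_rank_correlation(subjects):
    """subjects: one entry per subject, each a list of the raters' ordered category lists."""
    per_subject = [mean(pearson(rank_vector(x), rank_vector(y))
                        for x, y in combinations(raters, 2))
                   for raters in subjects]
    r_i = mean(per_subject)                                          # r_I
    pooled = [rank_vector(f) for raters in subjects for f in raters]
    r_t = mean(pearson(x, y) for x, y in combinations(pooled, 2))    # r_T baseline
    k = (r_i - r_t) / (1 - r_t)
    se = stdev(per_subject) / (sqrt(len(subjects)) * (1 - r_t))
    return k, se, k / se
```

Unlike Procedure 2, this version is sensitive to the order in which categories are listed, since the order determines the ranks assigned.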
DISCUSSION
The three techniques described in this paper are extensions of unweighted kappa to handle the assessment of agreement among multiple raters (not necessarily the same number for each subject) formulating multiple typological diagnoses. The proportional overlap and the intraclass correlation procedures are addressed to the situation in which the multiple diagnosis for a subject is a list of non-ordered categories. The rank correlation procedure is appropriate when the order of the categories is important.

In terms of computational efficiency, the intraclass correlation and the rank correlation methods have about the same requirements for computer resources. The proportional overlap method takes longer to run than either of the other methods but requires less computer memory. For the rank correlation method, KRAEMER²¹ shows that the method can be applied to reliability studies with small sample sizes. Such results are also true for the intraclass correlation method.

To compare the three techniques further, the kappas for Axes II and III of DSM-III were also computed from the same study of diagnostic agreement among child psychiatrists from which the reliability of Axis I was computed. In each of these axes, clinicians made multiple diagnoses for a subject. The kappa values obtained were as follows (Table 2):

TABLE 2

           Proportional overlap   Intraclass correlation   Rank correlation
           procedure              procedure                procedure
Axis I     0.28                   0.35                     0.34
Axis II    0.42                   0.46                     0.45
Axis III   0.61                   0.63                     0.63
Although it appears that the proportional overlap method tends to yield slightly lower reliability values, the differences in kappas between the techniques within each axis are not substantial, and when the standard error is taken into account, the resulting t values (and the consequent significances of the obtained kappas above zero) are highly similar to one another.

SUMMARY
The development and implementation of improved diagnostic systems require careful assessment of their reliability. The kappa coefficient, an index of inter-rater reliability adjusted for chance agreement, is generically very pertinent and useful. However, it is not directly applicable to the currently prevailing situation in which multiple diagnoses are formulated on each diagnostic axis by multiple raters (not necessarily fixed in number). This paper proposes three methodological extensions of kappa specifically addressed to
fill this need. Two procedures (based on the assessment of agreement between diagnostic formulations, or lists, using a proportional overlap and an intraclass correlation criterion, respectively) are proposed for the situation in which a multiple diagnosis involves a list of non-ordered categories applicable to a subject. The third procedure, based on a Spearman rank correlation criterion, applies to the situation in which the categories in the list are specifically ranked as primary, secondary, tertiary, etc. Data from an actual diagnostic exercise are analyzed to illustrate these procedures and their concurring results.

Acknowledgment. Thanks are given to Dr. Ada C. Mezzich for making available the diagnostic exercise data used for illustration purposes in this paper, and to Dr. Rolf Jacob for his contributions during the initial phases of this research.

REFERENCES

1. SPEARMAN, C. Correlation calculated from faulty data. Br. J. Psychol. 3, 271, 1910.
2. GULLIKSEN, H. Theory of Mental Tests. Wiley, New York, 1950.
3. CRONBACH, L. J., GLESER, G., NANDA, H. and RAJARATNAM, N. The Dependability of Behavioral Measurements. Wiley, New York, 1972.
4. COHEN, J. A coefficient of agreement for nominal scales. Educ. Psychol. Measurement 20, 37, 1960.
5. COHEN, J. Weighted kappa: Nominal scale agreement with provision for scale disagreement or partial credit. Psychol. Bull. 70, 213, 1968.
6. EVERITT, B. S. Moments of the statistics kappa and weighted kappa. Br. J. Math. Stat. Psychol. 21, 97, 1968.
7. FLEISS, J. L., COHEN, J. and EVERITT, B. S. Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 72, 323, 1969.
8. CICCHETTI, D. V. and FLEISS, J. L. Comparison of the null distributions of weighted kappa and the C ordinal statistic. Appl. Psychol. Measurement 1, 195, 1977.
9. FLEISS, J. L. and CICCHETTI, D. V. Inference about weighted kappa in the non-null case. Appl. Psychol. Measurement 1, 113, 1978.
10. SPITZER, R. L., COHEN, J., FLEISS, J. L. and ENDICOTT, J. Quantification of agreement in psychiatric diagnosis. Arch. Gen. Psychiat. 17, 83, 1967.
11. HELZER, J. E., ROBINS, L. N., TAIBLESON, M., WOODRUFF, R. A., REICH, T. and WISH, E. D. Reliability of psychiatric diagnosis. Arch. Gen. Psychiat. 34, 129, 1977.
12. BARTKO, J. J. and CARPENTER, W. T. On the methods and theory of reliability. J. Nerv. Ment. Dis. 163, 307, 1976.
13. FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378, 1971.
14. FLEISS, J. L., NEE, J. C. M. and LANDIS, J. R. The large sample variance of kappa in the case of different sets of raters. Psychol. Bull. 86, 974, 1979.
15. FLEISS, J. L. and CUZICK, J. The reliability of dichotomous judgments: Unequal numbers of judges per subject. Appl. Psychol. Measurement 3, 537, 1979.
16. World Health Organization. The International Classification of Diseases, Ninth Revision (ICD-9). World Health Organization, Geneva, 1977.
17. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (2nd ed.), DSM-II. American Psychiatric Association, Washington, D.C., 1968.
18. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (3rd ed.), DSM-III. American Psychiatric Association, Washington, D.C., 1980.
19. FLEISS, J. L., SPITZER, R. L., ENDICOTT, J. and COHEN, J. Quantification of agreement in multiple psychiatric diagnosis. Arch. Gen. Psychiat. 26, 168, 1972.
20. BARTKO, J. J. On various intraclass correlation reliability coefficients. Psychol. Bull. 83, 762, 1976.
21. KRAEMER, H. C. Extension of the kappa coefficient. Biometrics 36, 207, 1980.