Computers and Biomedical Research 22, 415-423 (1989)

PROPOV-K: A FORTRAN Program for Computing a Kappa Coefficient Using a Proportional Overlap Procedure

CHUL W. AHN AND JUAN E. MEZZICH

Western Psychiatric Institute and Clinic, University of Pittsburgh, 3811 O'Hara Street, Pittsburgh, Pennsylvania 15213-2593

Received June 8, 1988

The computer program PROPOV-K allows the computation of an unweighted kappa coefficient for expressing interrater agreement in the general case in which multiple raters (not necessarily fixed in number) formulate a variable number of multiple diagnoses for each subject. PROPOV-K assesses agreement among lists of multiple diagnoses composed of nonordered categories. It calculates a kappa coefficient by estimating the proportion of agreement between two diagnostic formulations as the ratio of the number of agreements between specific categories to the number of different specific categories mentioned in the two diagnostic lists. When multiple raters formulate a variable number of multiple diagnoses for each subject, the use of a kappa coefficient has been limited because no computer program has been generally available. The purpose of this paper is to present a FORTRAN computer program allowing the computation of a kappa coefficient for the case mentioned above and to illustrate its use with examples respectively involving multiple psychiatric and multiple physical diagnoses. © 1989 Academic Press, Inc.

1. INTRODUCTION

One of the important statistics used in medical and behavioral science is the kappa coefficient, which measures interrater agreement on diagnostic judgments. A number of approaches have been reported in the literature for the assessment of interrater agreement on a nominal scale. The simplest measure of agreement is the ratio of the number of agreements on the considered categories to the total number of possible agreements. This measure is not clearly interpretable since it ignores chance agreement. The kappa coefficient was proposed by Cohen (2) as a chance-corrected measure of nominal scale agreement among raters, particularly as applied to problems of the reliability or reproducibility of the diagnostic categorization of patients. The kappa coefficient was originally developed to assess interrater agreement in the situation in which two raters select a single response category for each subject. As more studies on diagnostic reliability are conducted and reported in medical, epidemiological, and behavioral science journals, the kappa statistic, first developed by Cohen for estimating a chance-corrected index of interrater agreement on individual diagnostic categories, is being increasingly used.


The kappa statistic has been gradually extended to cover more complex research designs. Kappa was extended by Fleiss (3) to deal with the case of more than two raters (but a fixed number of them) selecting a single response category. Fleiss et al. (4) and Kraemer (5) extended the procedure to deal with the situation in which more than two raters select multiple response categories for each subject. Even though there has been an increasing need to compute a kappa coefficient in the situation in which multiple raters (not necessarily fixed in number) formulate multiple (a variable number of) diagnoses for each subject, the use of a kappa coefficient has been limited since a computer program has not been generally available. The computer program presented here is designed to allow the computation of an unweighted kappa coefficient for expressing interrater agreement in the general case in which a variable number of raters formulate a variable number of multiple diagnoses for each subject. The algorithm PROPOV-K assesses agreement among multiple diagnostic lists composed of nonordered categories. PROPOV-K calculates a kappa coefficient using the proportion of agreement between two diagnostic formulations, defined as the ratio of the number of agreements between specific categories to the number of different specific categories mentioned in the two diagnostic lists. The process involves the computation of the observed proportion of agreement for each subject; the average and standard deviation of the observed proportion of agreement across subjects; a proportion of chance agreement among all raters and all subjects; and a kappa coefficient along with its t test statistic.

2. METHOD

The proportional overlap procedure defines the proportion of agreement between two diagnostic formulations as the ratio of the number of agreements between specific categories to the number of possible agreements. Suppose that one clinician formulates cocaine abuse, exhibitionism, and dysthymia and the other formulates cocaine abuse and dysthymia. According to the proportional overlap criterion, the proportion of agreement between these two clinicians is the ratio of 2 (two agreements, on cocaine abuse and dysthymia) over 3 (three possible agreements, corresponding to the three different categories mentioned: cocaine abuse, dysthymia, and exhibitionism). The numerical value of the proportional overlap will be 1 if agreement is perfect and 0 if there is no overlap between the lists; it will take a value between 0 and 1 in situations of partial agreement between these extremes. Agreement among the several raters for a subject is measured by averaging the proportions of agreement obtained for all combinations of pairs of raters judging that subject. The overall observed proportion of agreement (Po) for the sample of subjects under consideration is the average of the mean proportions of agreement obtained for each of the N subjects in the sample. The proportion of chance agreement (Pc) is calculated by computing proportions of agreement between all diagnostic formulations made by all raters for all subjects, and then averaging across them.


A kappa coefficient K is computed as

    K = (Po - Pc) / (1 - Pc).    [1]

Let S²(P) be the variance of the observed proportions of agreement across subjects and N be the number of subjects. The standard error of kappa (Kraemer (5)) is given by

    SE(K) = S(P) / ((1 - Pc) N^(1/2)).

Then K/SE(K) approximately has a t distribution with N - 1 degrees of freedom.
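
For illustration, the following short free-form Fortran fragment (our own sketch, separate from the PROPOV-K listing in the Appendix; the numeric category codes are arbitrary stand-ins for the diagnoses in the example above) computes the proportional overlap between two diagnostic lists:

      ! Proportional overlap between two diagnostic lists (a sketch, not PROPOV-K).
      ! Illustrative codes: 1 = cocaine abuse, 2 = exhibitionism, 3 = dysthymia.
      program overlap_demo
        implicit none
        integer, parameter :: n1 = 3, n2 = 2
        integer :: list1(n1) = (/ 1, 2, 3 /)   ! first clinician's formulation
        integer :: list2(n2) = (/ 1, 3 /)      ! second clinician's formulation
        integer :: i, j, nagree, nposs
        real :: p
        nagree = 0                             ! agreements between specific categories
        do i = 1, n1
          do j = 1, n2
            if (list1(i) == list2(j)) nagree = nagree + 1
          end do
        end do
        nposs = n1 + n2 - nagree               ! distinct categories mentioned in the two lists
        p = real(nagree) / real(nposs)         ! 2/3 for this example
        print *, 'proportional overlap =', p
      end program overlap_demo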

3. EXAMPLES

The following two examples illustrate the computation of kappa coefficients using a proportional overlap procedure.

3.1. Example 1

We describe data from a reliability study conducted by Mezzich et al. (6). In this study, 30 child psychiatrists were asked through the mail to make independent diagnoses on 27 child psychiatric case summaries. Each psychiatrist rated 3 cases, and each case turned out to be rated by 3 or 4 psychiatrists upon completion of the study. Table 1 shows the resulting 90 multiple diagnostic formulations. Each diagnostic formulation was composed of up to three broad diagnostic categories taken from Axis I (clinical psychiatric syndromes) of the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders (DSM-III (1)). The proportion of agreement between the diagnostic formulations made by clinicians 1 and 2 for case 1 is the ratio of 2 (one agreement on "9, Mental retardation" and one agreement on "11, Attention deficit disorder") over 3 (3 possible agreements, corresponding to the 3 different categories mentioned: 9, 11, and 14). Following the same computation procedure, the proportion of agreement on case 1 between the diagnostic formulations of clinicians 1 and 3 is 1/3, between 1 and 4 is 1, between 2 and 3 is 1/4, between 2 and 4 is 2/3, and between 3 and 4 is 1/3. Thus, the average proportion of agreement for case 1 is 0.54. Using the same steps as above, we can show that the average proportion of agreement for case 2 is 0.14, for case 3 is 0, etc. Over all 27 cases, the overall mean proportion of agreement (Po) is 0.36 with a standard deviation of 0.24. In order to compute chance agreement, a proportion of agreement is computed for each pair of the 90 diagnostic formulations made by all raters for all subjects. The average proportion of chance agreement (Pc) is 0.12. Table 2 presents computer output from the analysis of the above data set.

TABLE 1

MULTIPLE DIAGNOSTIC FORMULATIONS FOR 27 CHILD PSYCHIATRIC CASES USING DSM-III AXIS I BROAD CATEGORIES

                                Raters
Cases      1             2              3              4
  1        9, 11         11, 9, 14      16, 9          11, 9
  2        16            16, 14         12             14, 5
  3        17            12             7, 8           13
  4        16, 13        13, 16, 14     16
  5        7             7, 12, 13      13
  6        10            10             10
  7        7, 16         13             16
  8        1, 14         13             16, 13
  9        5             20             13, 14
 10        12, 13, 14    12, 14, 13     12, 11, 14
 11        13            18             16
 12        5, 18         1, 5, 18       1
 13        14, 13        14, 7          14, 16
 14        11, 16        14, 11, 16     11, 13
 15        10            3, 18          10, 11
 16        14, 5         5, 16          14
 17        12            12, 11         12
 18        20            16             16
 19        13            14             14
 20        9, 14, 10     9, 11, 14      10, 9
 21        12, 11        11, 14         11
 22        17            12             12             12, 17, 15
 23        16, 13        17             14             13
 24        12            12             16             12
 25        13            20             13             13
 26        13            13, 16         13             16
 27        10, 9         9, 10          9              9, 10

Note. 1, organic mental disorders; 2, substance use disorders; 3, schizophrenic and paranoid disorders; 4, schizoaffective disorders; 5, affective disorders; 6, psychoses not elsewhere classified; 7, anxiety, factitious, somatoform, and dissociative disorders; 8, psychosexual disorders; 9, mental retardation; 10, pervasive developmental disorders; 11, attention deficit disorders; 12, conduct disorders; 13, anxiety disorders of childhood or adolescence; 14, other disorders of childhood or adolescence, speech and stereotyped movement disorders, disorders characteristic of late adolescence; 15, eating disorders; 16, reactive disorders not elsewhere classified; 17, disorders of impulse control not elsewhere classified; 18, sleep and other disorders; 19, conditions not attributable to a mental disorder; 20, no diagnosis on Axis I.
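
The case 1 computation described above can be reproduced with a short sketch of our own (free-form Fortran, unlike the fixed-form Appendix listing; the pairwise loops mirror those of the PROPOV subroutine):

      ! Average proportional overlap for case 1 of Table 1 (a sketch, not PROPOV-K).
      program case1_demo
        implicit none
        integer :: lists(3,4)             ! up to 3 diagnoses per rater; 0 marks unused slots
        integer :: nd(4), i, j, l, m, nagree, nposs, npairs
        real :: total
        lists = 0
        lists(1:2,1) = (/ 9, 11 /)        ! rater 1
        lists(1:3,2) = (/ 11, 9, 14 /)    ! rater 2
        lists(1:2,3) = (/ 16, 9 /)        ! rater 3
        lists(1:2,4) = (/ 11, 9 /)        ! rater 4
        do j = 1, 4
          nd(j) = count(lists(:,j) > 0)   ! number of diagnoses in each formulation
        end do
        total = 0.0
        npairs = 0
        do j = 1, 3                       ! all pairs of raters
          do l = j + 1, 4
            nagree = 0
            do i = 1, nd(j)
              do m = 1, nd(l)
                if (lists(i,j) == lists(m,l)) nagree = nagree + 1
              end do
            end do
            nposs = nd(j) + nd(l) - nagree
            npairs = npairs + 1
            total = total + real(nagree) / real(nposs)
          end do
        end do
        print *, 'average overlap for case 1 =', total / real(npairs)   ! about 0.54
      end program case1_demo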

TABLE 2

KAPPA COEFFICIENTS FOR AXIS I BROAD DIAGNOSTIC CATEGORIES

Calculation of kappa coefficient using a proportional overlap method

Overall observed proportion of agreement = 0.3596
Chance proportion of agreement = 0.1179
Kappa coefficient = 0.2740
Standard dev. of kappa = 0.0529
t Statistic for testing kappa = 5.1784
Degrees of freedom for test statistic = 26
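
The statistics in Table 2 follow directly from the reported proportions, as the following minimal sketch of our own shows (the 27 subjects and the rounded standard deviation 0.24 are taken from Section 3.1; the small discrepancies from the tabled standard error and t statistic are due to that rounding):

      ! Recompute the Table 2 statistics from the reported proportions (a sketch).
      program check_table2
        implicit none
        real :: po, pc, sp, kappa, se, t
        integer :: n
        po = 0.3596                             ! overall observed proportion of agreement
        pc = 0.1179                             ! chance proportion of agreement
        sp = 0.24                               ! std. dev. of observed proportions (Section 3.1)
        n  = 27                                 ! number of subjects
        kappa = (po - pc) / (1.0 - pc)          ! Eq. [1]: 0.2740
        se = sp / ((1.0 - pc) * sqrt(real(n)))  ! standard error of kappa
        t  = kappa / se                         ! t with n - 1 = 26 degrees of freedom
        print *, 'kappa =', kappa, '  SE =', se, '  t =', t
      end program check_table2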

TABLE 3

MULTIPLE DIAGNOSTIC FORMULATIONS FOR THE FIRST 10 PHYSICAL DISORDER CASES USING DSM-III AXIS III BROAD CATEGORIES

                    Raters
Cases      1             2
  1        13, 7, 9      7, 9
  2        19            19
  3        13            13, 8
  4        2             10, 3
  5        6             6
  6        3, 6, 8       13, 7, 3
  7        6             13, 6
  8        6, 13         6
  9        7             7
 10        9, 10         9, 10, 7

Note. 1, infectious and parasitic diseases; 2, neoplasms; 3, endocrine, nutritional, and metabolic disorders; 4, diseases of blood and blood-forming organs; 5, mental disorders; 6, diseases of the nervous system and sense organs; 7, diseases of the circulatory system; 8, diseases of the respiratory system; 9, diseases of the digestive system; 10, diseases of the genito-urinary system; 11, complications of pregnancy, childbirth, and the puerperium; 12, diseases of the skin and subcutaneous tissue; 13, diseases of the musculoskeletal system and connective tissue; 14, congenital abnormalities; 15, certain conditions originating in the perinatal period; 16, symptoms, signs, and ill-defined conditions; 17, injury and poisoning; 18, factors influencing health status; 19, deferred or no diagnosis.


TABLE 4

KAPPA COEFFICIENT FOR AXIS III BROAD DIAGNOSTIC CATEGORIES

Axis III broad diagnostic categories

Overall observed proportion of agreement = 0.6368
Chance proportion of agreement = 0.2401
Kappa coefficient = 0.5220
Standard dev. of kappa = 0.0571
t Statistic for testing kappa = 9.1389
Degrees of freedom for test statistic = 101

3.2. Example 2

Participants in the study were selected from those persons presenting for care at the Western Psychiatric Institute and Clinic (WPIC) of the University of Pittsburgh. The present study was confined to patients who were over 18 years of age at the time of evaluation. Patients were approached by a clinical psychologist or a nurse clinician in order to explain the protocol and obtain consent for participation. One hundred two patients consented and received care at WPIC. In order to obtain interrater reliability data, the 102 study patients were jointly assessed by two independent evaluation teams, each composed of a psychiatric nurse or master's-level psychologist and a faculty psychiatrist, using the interviewer-observer paradigm. The two teams jointly interviewed the patient but independently completed the study forms and multiaxial formulation. Table 3 shows the multiple diagnostic formulations of physical disorders for the first 10 of the 102 patients undergoing a face-to-face evaluation. The proportion of agreement on case 1 is the ratio of 2 (one agreement on "7, Diseases of the circulatory system" and one agreement on "9, Diseases of the digestive system") over 3 (3 possible agreements, corresponding to the 3 different categories mentioned: 7, 9, and 13). Using the same steps as above, the proportion of agreement for case 2 is 1, for case 3 is 1/2, etc. Over all 102 cases, the overall average proportion of agreement (Po) is 0.64. The average proportion of chance agreement (Pc) is 0.24. Table 4 shows the computer output for kappa coefficients for Axis III (physical disorders) broad diagnostic categories.

APPENDIX: KAPPA PROGRAM LISTING

C
C     THIS IS A DRIVER PROGRAM FOR PROPOV-K WHICH COMPUTES AN UNWEIGHTED
C     KAPPA COEFFICIENT FOR THE GENERAL CASE IN WHICH MULTIPLE RATERS
C     (NOT NECESSARILY FIXED IN NUMBER) FORMULATE MULTIPLE RATINGS FOR
C     EACH SUBJECT.
C
      DIMENSION NRATER(500),RATE(3,500)
      DIMENSION TITLE(20)
      REAL KAPPA
      INTEGER DF
      CHARACTER*10 INFILE,OUTFILE
C
C     SPECIFY THE INPUT DATA FILE
C
      WRITE(6,13)
13    FORMAT(1X,' INPUT DATA FILE = ',$)
      READ(5,'(A10)') INFILE
      OPEN(UNIT=11,FILE=INFILE,STATUS='OLD')
C
C     SPECIFY THE OUTPUT DATA FILE
C
      WRITE(6,14)
14    FORMAT(1X,' OUTPUT DATA FILE = ',$)
      READ(5,'(A10)') OUTFILE
      OPEN(UNIT=12,STATUS='NEW',FILE=OUTFILE)
C
C     READ THE TITLE OF THE ANALYSIS (MAXIMUM OF 80 CHARACTERS)
C
      READ(11,284) TITLE
      WRITE(6,285) TITLE
      WRITE(12,285) TITLE
284   FORMAT(20A4)
285   FORMAT(1X,20A4)
C
C     NSUBJ  : NUMBER OF SUBJECTS, MAXIMUM OF 500
C              (FOR EXAMPLE, NUMBER OF PATIENTS OR CASES)
C     NAVAILR: NUMBER OF AVAILABLE RATINGS, MAXIMUM OF 20
C              (FOR EXAMPLE, NUMBER OF AVAILABLE DIAGNOSTIC CATEGORIES)
C     NRATE  : NUMBER OF RATINGS WHICH CAN BE MADE FOR EACH SUBJECT
C              BY EACH RATER
C
      READ(11,*) NSUBJ,NAVAILR,NRATE
      WRITE(6,287) NSUBJ,NAVAILR,NRATE
287   FORMAT(1X,'NO. OF SUBJECTS = ',I3,/,1X,
     - 'NUMBER OF AVAILABLE RATINGS = ',I3,/,1X,
     - 'NUMBER OF RATINGS MADE = ',I3)
C
C     READ THE NUMBER OF RATERS FOR EACH SUBJECT
C
      READ(11,*) (NRATER(I),I=1,NSUBJ)
      NTOTAL=0
      DO I=1,NSUBJ
         NTOTAL=NTOTAL+NRATER(I)
      ENDDO
C
C     READ RATINGS OF ALL SUBJECTS MADE BY ALL RATERS
C
      READ(11,*) ((RATE(I,J),I=1,NRATE),J=1,NTOTAL)
C
      CALL PROPOV(NSUBJ,NAVAILR,NRATE,NRATER,NTOTAL,RATE,OBSAG,
     - EXPAG,KAPPA,STD,T,DF)
C
C     WRITE STATISTICS
C
      WRITE(12,851) OBSAG           ! OBSERVED PROPORTION OF AGREEMENT
851   FORMAT(20X,'OVERALL OBSERVED PROPORTION OF AGREEMENT = ',F10.4/)
      WRITE(12,950) EXPAG
950   FORMAT(20X,'CHANCE PROPORTION OF AGREEMENT = ',F10.4)
      WRITE(12,1000) KAPPA
1000  FORMAT(16X,'KAPPA COEFFICIENT = ',F10.4)
      WRITE(12,852) STD
852   FORMAT(19X,'STANDARD DEV. OF KAPPA = ',F10.4)
      WRITE(12,6111) T,DF
6111  FORMAT(17X,'T STATISTIC FOR TESTING KAPPA = ',F10.4,
     - /,17X,'DEGREES OF FREEDOM FOR TEST STATISTIC = ',I5)
      STOP
      END
C
C     THE SUBROUTINE PROPOV COMPUTES A KAPPA COEFFICIENT
C     USING A PROPORTIONAL OVERLAP METHOD.
C
      SUBROUTINE PROPOV(NSUBJ,NAVAILR,NRATE,NRATER,NTOTAL,RATE,OBSAG,
     - EXPAG,KAPPA,STD,T,DF)
      INTEGER DF
      REAL KAPPA
      DIMENSION RATE(NRATE,NTOTAL),NRATER(NSUBJ),AVGAG(500),
     - SUMAG(500),PROP(500,500),NUMBER(500),CHANCE(50000)
C
C     NUMBER OF DIAGNOSES MADE
C
      JTOTAL=0
      DO K=1,NSUBJ
         IF(K.NE.1) THEN
            JTOTAL=JTOTAL+NRATER(K-1)
         ENDIF
         JUP=JTOTAL+NRATER(K)
         DO J=JTOTAL+1,JUP
            NUMBER(J)=0
            DO I=1,NRATE
               IF(RATE(I,J).GT.0) THEN
                  NUMBER(J)=NUMBER(J)+1
               ENDIF
            ENDDO
         ENDDO
      ENDDO
C
C     COMPUTE THE OBSERVED PROPORTION OF AGREEMENT FOR EACH SUBJECT
C
      JTOTAL=0
      DO K=1,NSUBJ
         NCOUNT=0
         IF(K.NE.1) THEN
            JTOTAL=JTOTAL+NRATER(K-1)
         ENDIF
         JUP=JTOTAL+NRATER(K)-1
         DO J=JTOTAL+1,JUP
            DO L=J+1,JUP+1
               NAGREE=0
               DO I=1,NUMBER(J)
                  DO M=1,NUMBER(L)
                     IF(RATE(I,J).EQ.RATE(M,L)) THEN
                        NAGREE=NAGREE+1
                     ENDIF
                  ENDDO
               ENDDO
               NCOUNT=NCOUNT+1
               NPOSS=NUMBER(J)+NUMBER(L)-NAGREE
               PROP(K,NCOUNT)=FLOAT(NAGREE)/FLOAT(NPOSS)
            ENDDO
         ENDDO
         SUMAG(K)=0.
         DO J=1,NCOUNT
            SUMAG(K)=SUMAG(K)+PROP(K,J)
         ENDDO
         AVGAG(K)=SUMAG(K)/NCOUNT   ! OBSERVED PROP. OF AGREEMENT
      ENDDO
C
      SUMOB=0.
      DO K=1,NSUBJ
         SUMOB=SUMOB+AVGAG(K)
      ENDDO
      OBSAG=SUMOB/NSUBJ             ! OVERALL OBSERVED PROP. OF AGREEMENT
      SUMVAR=0.
      DO K=1,NSUBJ
         SUMVAR=SUMVAR+(AVGAG(K)-OBSAG)**2
      ENDDO
      VAR=SUMVAR/(NSUBJ-1)          ! VARIANCE OF OBSERVED PROP. OF AGREEMENT
C
C     COMPUTE THE CHANCE PROPORTION OF AGREEMENT
C
      DO J=1,NTOTAL
         NUMBER(J)=0
         DO I=1,NRATE
            IF(RATE(I,J).GT.0) THEN
               NUMBER(J)=NUMBER(J)+1
            ENDIF
         ENDDO
      ENDDO
C
      KCOUNT=0
      DO J=1,NTOTAL-1
         DO L=J+1,NTOTAL
            KAGREE=0
            DO I=1,NUMBER(J)
               DO M=1,NUMBER(L)
                  IF(RATE(I,J).EQ.RATE(M,L)) THEN
                     KAGREE=KAGREE+1
                  ENDIF
               ENDDO
            ENDDO
            KCOUNT=KCOUNT+1
            KPOSS=NUMBER(J)+NUMBER(L)-KAGREE
            CHANCE(KCOUNT)=FLOAT(KAGREE)/FLOAT(KPOSS)
         ENDDO
      ENDDO
C
C     COMPUTE THE CHANCE PROPORTION OF AGREEMENT, KAPPA COEFFICIENT,
C     STANDARD ERROR OF KAPPA COEFFICIENT AND T-STATISTIC
C
      SUM=0.
      DO J=1,KCOUNT
         SUM=SUM+CHANCE(J)
      ENDDO
      EXPAG=SUM/KCOUNT              ! CHANCE PROP. OF AGREEMENT
      KAPPA=(OBSAG-EXPAG)/(1.-EXPAG)              ! KAPPA COEFF.
      STD=SQRT(VAR)/(1.-EXPAG)/SQRT(FLOAT(NSUBJ)) ! STAN. ERROR OF KAPPA COEFF.
      T=KAPPA/STD                   ! T-STATISTIC FOR A KAPPA COEFFICIENT
      DF=NSUBJ-1                    ! DEGREES OF FREEDOM FOR T-STATISTIC
      RETURN
      END
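
The paper does not reproduce a sample input file, but the READ statements in the driver imply the following layout (a hypothetical file of our own for the first three Table 3 cases: one title line; then NSUBJ, NAVAILR, and NRATE; then the number of raters per subject; then each formulation padded with zeros to NRATE entries):

      AXIS III EXAMPLE, FIRST THREE CASES
      3 19 3
      2 2 2
      13 7 9   7 9 0
      19 0 0  19 0 0
      13 0 0  13 8 0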

REFERENCES

1. AMERICAN PSYCHIATRIC ASSOCIATION. "Diagnostic and Statistical Manual of Mental Disorders," 3rd ed. (DSM-III). American Psychiatric Association, Washington, DC, 1980.
2. COHEN, J. A coefficient of agreement for nominal scales. Educ. Psych. Measure. 20, 37 (1960).
3. FLEISS, J. L. Measuring nominal scale agreement among many raters. Psych. Bull. 76, 378 (1971).
4. FLEISS, J. L., SPITZER, R. L., ENDICOTT, J., AND COHEN, J. Quantification of agreement in multiple psychiatric diagnosis. Arch. Gen. Psych. 26, 168 (1972).
5. KRAEMER, H. C. Extension of kappa coefficients. Biometrics 36, 207 (1981).
6. MEZZICH, J. E., KRAEMER, H. C., WORTHINGTON, D. R., AND COFFMAN, G. A. Assessment of agreement among several raters formulating multiple diagnoses. J. Psych. Res. 16, 29 (1981).