Accepted Manuscript

Specific agreement on dichotomous outcomes in the situation of more than two raters

Henrica C.W. de Vet, PhD, Rieky E. Dikmans, MD, Iris Eekhout, PhD

PII: S0895-4356(16)30837-X
DOI: 10.1016/j.jclinepi.2016.12.007
Reference: JCE 9294
To appear in: Journal of Clinical Epidemiology
Received Date: 30 April 2016
Revised Date: 31 October 2016
Accepted Date: 1 December 2016
Manuscript Number: JCE-16-330R2
Article Type: Original Article
Corresponding Author: Henrica C.W. de Vet, PhD
Corresponding Author's Institution: EMGO Institute, VU University Medical Center
Phone: +31 20 4446014; Fax: +31 20 4446775; E-mail: [email protected]
First Author: Henrica C.W. de Vet, PhD
Order of Authors: Henrica C.W. de Vet, PhD; Rieky E. Dikmans, MD; Iris Eekhout, PhD
Affiliation: Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, De Boelelaan 1089A, 1081 HV Amsterdam, Netherlands
Manuscript Region of Origin: NETHERLANDS
Abstract

Objective: For assessing inter-rater agreement, the concepts of observed agreement and specific agreement have been proposed. The situation of two raters and dichotomous outcomes has been described previously, whereas multiple raters are often involved. We aim to extend these measures to more than two raters and to examine how to calculate agreement estimates and 95% confidence intervals (CIs).

Study design and setting: As an illustration we used a reliability study in which 4 plastic surgeons classified photographs of the breasts of 50 women after breast reconstruction as 'satisfied' or 'not satisfied'. In a simulation study we checked the hypothesized sample size for calculation of 95% CIs.

Results: For m raters, all m(m-1)/2 pairwise tables were summed. The discordant cells were then averaged before observed and specific agreement were calculated. The total number (N) in the summed table is m(m-1)/2 times larger than the number of subjects (n); in the example, N = 300 compared with n = 50 subjects rated by m = 4 raters. A correction of n√(m-1) was appropriate to find 95% CIs comparable to bootstrapped CIs.

Conclusion: The concepts of observed agreement and specific agreement can be extended to more than two raters with a valid estimation of the 95% CIs.

Running title: Specific agreement for more than two raters.

Key words: observed agreement; specific agreement; confidence intervals; continuity correction; Fleiss correction.
Introduction

Clinicians interested in observer variation pose questions such as "Is my diagnosis in agreement with that of my colleagues?" (inter-observer) and "Would I obtain the same result if I repeated the assessment?" (intra-observer). To express the level of intra-rater or inter-rater agreement, many textbooks on reliability and agreement studies [1-4] recommend Cohen's kappa as the most adequate measure. Cohen introduced kappa as a coefficient of agreement for categorical outcomes [5]. However, unsatisfactory situations occur when a high level of agreement is accompanied by a low kappa value. These counterintuitive results have led to many adjustments and extensions of kappa [6], e.g. to adjust for bias or prevalence [7]. Other authors recommend presenting a number of accompanying measures to enable correct interpretation [8], or propose alternative parameters such as Gwet's AC1 coefficient [4].
Cohen's kappa has been frequently recommended and used to assess agreement within and between raters classifying characteristics of patients. A 2013 paper advised against the use of Cohen's kappa [9] because kappa is not a measure of agreement, but a measure of reliability [9,10]. When interested in inter-rater agreement in clinical practice, the most relevant question concerns agreement between raters, and more specifically 'what is the probability that colleagues will provide the same answer'. It was argued that, by adjusting the observed agreement for the expected agreement as Cohen's kappa does, an agreement measure (observed agreement) is turned into a reliability measure. Reliability measures are less informative in clinical practice [9].

It was therefore recommended to use the proportion of observed agreement or the proportion of specific agreement instead of Cohen's kappa. The proportion of specific agreement distinguishes agreement on positive scores from agreement on negative scores, which might have different implications in clinical practice.
The simple situation of two raters and two categories is presented in Table 1, resulting in a 2 by 2 table with four cells (a, b, c and d). The observed agreement is defined as (a + d) / (a + b + c + d).

-------------------- Table 1 here ------------------------------
The specific agreement on a positive score, known as the positive agreement (PA), is calculated with the following formula:

PA = 2a / (2a + b + c), which is the same as PA = a / (a + (b + c)/2).

Specific agreement on a negative score, the negative agreement (NA), is calculated using the formula:

NA = 2d / (2d + b + c), which is the same as NA = d / (d + (b + c)/2).

The inclusion of both discordant cells (b and c) in the formula accounts for the fact that these numbers might be different, in which case their average value is used.
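In code, these formulas amount to only a few lines. The following R sketch is purely illustrative (the function name and the example counts are ours, not taken from the study's analysis):

# Observed and specific agreement for two raters and a dichotomous outcome.
# a = both raters positive, d = both raters negative, b and c = discordant cells.
agreement_2x2 <- function(a, b, c, d) {
  c(observed = (a + d) / (a + b + c + d),
    positive = 2 * a / (2 * a + b + c),
    negative = 2 * d / (2 * d + b + c))
}

# Hypothetical counts: 40 concordant positive, 20 concordant negative, 6 + 4 discordant.
agreement_2x2(a = 40, b = 6, c = 4, d = 20)
# observed 0.857, positive 0.889, negative 0.800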
The concept of specific agreement was introduced by Dice in 1945 [11] and revitalized by Cicchetti and Feinstein [12]. However, it has rarely been applied in the medical literature. This may be due to its restricted application to the situation of two raters and dichotomous response options, whereas often multiple raters are involved. We will therefore extend the measures of observed agreement and specific agreement to include situations of more than two raters. We will provide formulas to calculate estimates and 95% confidence intervals (CIs) for the proportions of agreement and specific agreement. Subsequently, we will address the interpretation of the results, for example with respect to systematic differences between raters. In the discussion we will provide background information on some of the decisions we have made.
Methods

Description of example

The illustrative example data set is from a study by Dikmans and colleagues [13] [data in Appendix A.1] and is based on photographs of the breasts of 50 women after breast reconstruction. The photographs were independently scored by 5 surgeons, who rated the quality of the reconstruction on a 5-point ordinal scale varying from 'very dissatisfied' to 'very satisfied'. In this paper we use the data of 4 surgeons, because one surgeon had some missing values. For this paper the satisfaction scores were dichotomized into satisfied (S; scores 4 and 5) and not satisfied (not-S; scores 1, 2 and 3).
Specific agreement with more than two raters

To assess agreement in a situation with two raters, the score of one rater has to be compared only to that of the other rater. With three raters, three comparisons are possible: rater 1 can be compared to rater 2 and to rater 3, and the agreement between raters 2 and 3 can be assessed. The formula to calculate the number of comparisons for m raters is m(m-1)/2. The agreement question can be generalized to: given that one rater (of the m raters) scores positive, what is the probability of a positive score by the other raters; and the same question can obviously also be asked for a negative score.
Statistics
To obtain an agreement parameter for m raters, all pairwise tables (i.e. m(m-1)/2) are summed. After summing these tables, the b and c cells are averaged. So, placement of the value in the b cell or the c cell is arbitrary. In other words, it should not matter whether we put the scores of rater 1 horizontally or vertically in the 2 by 2 table. By averaging the values in the b and c cells of the summed table we get the same result, even if all higher values are in the b cell and all lower values in the c cell. Subsequently the observed agreement and specific agreement are calculated.
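As an illustration of this procedure, the R sketch below (our own code, not the authors' analysis script) builds all m(m-1)/2 pairwise 2 by 2 tables from an n x m matrix of dichotomous ratings coded 1 = positive and 0 = negative, sums them, averages the discordant cells, and applies the two-rater formulas:

# Observed and specific agreement for m raters based on the summed pairwise tables.
specific_agreement <- function(ratings) {
  m <- ncol(ratings)
  a <- b <- c_cell <- d <- 0
  for (i in 1:(m - 1)) {
    for (j in (i + 1):m) {          # all m(m-1)/2 pairs of raters
      a <- a + sum(ratings[, i] == 1 & ratings[, j] == 1)
      b <- b + sum(ratings[, i] == 1 & ratings[, j] == 0)
      c_cell <- c_cell + sum(ratings[, i] == 0 & ratings[, j] == 1)
      d <- d + sum(ratings[, i] == 0 & ratings[, j] == 0)
    }
  }
  bc <- (b + c_cell) / 2            # averaged discordant cells of the summed table
  c(observed = (a + d) / (a + b + c_cell + d),
    positive = 2 * a / (2 * a + 2 * bc),
    negative = 2 * d / (2 * d + 2 * bc))
}

Applied to the 50 by 4 matrix of dichotomized surgeon scores, the summed table contains N = 6 * 50 = 300 pairs of scores.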
The confidence interval for the proportion of agreement can be obtained using the simple normal approximation for an interval for proportions (see Appendix A.2 for the formulas and explanations). When more than two raters are involved we add up multiple 2 by 2 tables. The summed table then contains a higher total number (N = 6 * n = 300) than the number of subjects (n = 50) multiplied by the number of ratings (4), and the sample size used in the formula has to be adjusted. We assumed that a correction of n√(m-1) would be appropriate. In a simulation study, we checked the resulting interval against the bootstrapped 95% CI. Our simulation contained situations where a sample of 100 subjects was rated by 4 raters, which resulted in a corrected sample size for the CI formulas of n√(m-1) = 100 * √(4-1). The prevalence of the categories was varied between 50%, 70%, 80% and 90%. The overall agreement was set to reach about 0.80. We repeated the analyses for 8 and 12 raters. Each condition was simulated 500 times [14] and simulations were performed in R statistical software [15].
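To illustrate the sample-size correction, the sketch below computes a normal-approximation 95% CI using n√(m-1) in place of n. The continuity-corrected Wald form used here is a common textbook choice and only approximates the exact formulas of Appendix A.2 and A.5 (the Fleiss correction for the upper limit is not reproduced here):

# 95% CI for an agreement proportion p, with corrected sample size n * sqrt(m - 1).
agreement_ci <- function(p, n, m, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  n_eff <- n * sqrt(m - 1)                                   # corrected sample size
  half <- z * sqrt(p * (1 - p) / n_eff) + 1 / (2 * n_eff)    # continuity correction
  c(lower = max(0, p - half), upper = min(1, p + half))
}

# Observed agreement of 0.747 for 50 subjects rated by 4 surgeons:
agreement_ci(p = 0.747, n = 50, m = 4)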
Results

Calculation of observed and specific agreement

With four surgeons as raters it is possible to present six 2 by 2 tables (m(m-1)/2 = 4x3/2 = 6), representing the agreement between surgeons: 1 vs 2; 1 vs 3; 1 vs 4; 2 vs 3; 2 vs 4; 3 vs 4. The question of agreement with four surgeons can be answered by adding up the cells of these six 2 by 2 tables.

---------------------------- Table 2 here ------------------------------------
Calculating the proportions of observed and specific agreement for Table 2C results in an observed agreement of 0.747; the proportion of specific agreement is 0.756 for the 'satisfied' scores and 0.736 for the 'not-satisfied' scores (for formulas and calculations see Appendix A.3).
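In terms of the cells of Table 2C (a = 118, b = c = 38, d = 106, N = 300), these proportions follow directly from the formulas given in the Introduction:

observed agreement = (118 + 106) / 300 = 0.747
PA = 2 x 118 / (2 x 118 + 38 + 38) = 236 / 312 = 0.756
NA = 2 x 106 / (2 x 106 + 38 + 38) = 212 / 288 = 0.736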
Calculation of confidence intervals

For the observed agreement we calculated the 95% CI departing from the total number of subjects (n = 50). As the specific agreement is calculated based on the positive scores ('satisfied') or negative scores ('not satisfied'), the sample size used for calculation of the 95% CI in our example (see Table 2C) is 26 (= 156/300 * 50) and 24 (= 144/300 * 50), respectively.
The confidence intervals calculated with sample size n√(m-1) corresponded to the bootstrapped confidence intervals for 500 iterations [16]. A table with these results is presented in Appendix A.4. The results for the situation of 8 and 12 raters were similar (data not presented). The 95% CIs were quite close to the bootstrapped 95% CIs, indicating that n√(m-1) is an adequate correction for the sample size. We also see that the Fleiss correction [17] is necessary for the upper limit. (The formulas to obtain the lower and upper limits of the 95% CI with continuity correction, both with and without the Fleiss correction, are presented in Appendix A.5.)
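The bootstrap comparison can be reproduced along the following lines; this is our own sketch of a standard nonparametric bootstrap over subjects, reusing the specific_agreement() function sketched in the Methods, and not the authors' simulation code:

# Percentile bootstrap 95% CIs: resample subjects (rows) with replacement and
# recompute the observed, positive and negative agreement in each iteration.
bootstrap_agreement <- function(ratings, iterations = 500) {
  est <- replicate(iterations, {
    rows <- sample(nrow(ratings), replace = TRUE)
    specific_agreement(ratings[rows, , drop = FALSE])
  })
  apply(est, 1, quantile, probs = c(0.025, 0.975))   # one interval per measure
}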
Dependency on prevalence

The proportions of specific agreement are dependent on the prevalence of the scores. As the same discordant cells (b and c) are related to both the positive scores and the negative scores, it is clear that when the prevalence of positive scores is above 50%, the positive agreement will be higher than the negative agreement, and vice versa.
In the case of Cohen's kappa, the effect of prevalence seems counterintuitive to researchers. If the prevalence of positive scores is > 90%, the proportion of observed agreement is often high, say 0.95. In this case one would expect a high kappa value. However, the expected agreement will also be high, i.e. more than 0.80 [(0.9 x 0.9) + (0.1 x 0.1)], and the resulting kappa value will be low. Note that in clinical practice there is no such thing as chance agreement. The probability that a colleague agrees with a positive score of a first rater will be larger if the prevalence of the positive score is higher.
We considered adjusting for the effect of prevalence by calculating the proportion of agreement when the prevalence is 50%. In Table 3A we present the results for the example that was used in Table 2B. When the prevalence is 50% (as in Table 3B), the a and d cells will be equal. If we then relate the values in the b and c cells to the positive and negative scores, these will be equal. Note that in that case the specific agreement has the same value as the total observed agreement.
--------------------------------Table 3 here ------------------------------------------------
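The relation between Tables 3A and 3B can be read directly from the cell values: the total numbers of concordant and discordant scores are kept, but each is divided equally over the two categories, so that a = d = (118 + 106) / 2 = 112 and b = c = (34 + 42) / 2 = 38. The specific agreement then becomes PA = NA = 2 x 112 / (2 x 112 + 76) = 224 / 300 = 0.747, which is indeed equal to the total observed agreement.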
Systematic differences between raters

Table 2 shows, based on the different numbers in the b and c cells, that some surgeons have more positive scores than other surgeons. Given a sufficient number of raters, these systematic errors average out when the results of all pairs of raters are combined. In the case of large systematic errors, it is informative to point out these differences. This can be done, for example, by presenting the values of the proportion of absolute agreement between pairs and the prevalences of positive scores. In Table 2 the values of observed agreement run from 0.68 (surgeons 2 vs 3) to 0.80 (surgeons 2 and 3 vs 4), and the prevalence of positive scores varies from 44% (surgeons 2 and 4) to 52% (surgeons 1 and 3).
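Such differences are easy to tabulate; the helper below (an illustrative sketch, not code from the study) returns the observed agreement and the prevalence of positive scores for every pair of raters:

# For each pair of raters: observed agreement and prevalence of positive scores.
pairwise_summary <- function(ratings) {
  m <- ncol(ratings)
  out <- NULL
  for (i in 1:(m - 1)) {
    for (j in (i + 1):m) {
      agree <- mean(ratings[, i] == ratings[, j])
      prev  <- mean(c(ratings[, i], ratings[, j]))   # proportion of positive scores in the pair
      out <- rbind(out, data.frame(pair = paste(i, "vs", j), observed = agree, prevalence = prev))
    }
  }
  out
}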
Discussion
In this paper we have presented a method to easily obtain the proportions of agreement and specific agreement between multiple raters. The provided confidence interval formula enables calculation of the uncertainty around the agreement. The aim of the measures presented is to help clinicians in practice. There are a number of remarks to be made about the methods we have proposed. These concern the summation of the pairwise 2 by 2 tables, the calculation of the 95% CI, and the averaging of the values in the corresponding discordant cells.
From a conceptual point of view it seems sensible to add up all pairwise 2 by 2 tables to obtain an overall table, which forms the basis for calculating the proportions of total agreement and specific agreement. We compared this strategy to methods for calculating a kappa value for a dichotomous scoring if more than two raters are involved. The kappa value calculated for the overall table corresponds to the value of kappa as calculated by Light [18]. The Light kappa for more than two observers is obtained by calculating the kappa values for all 2 by 2 tables for pairs of observers and then averaging the result.
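For comparison, Light's kappa can be computed by averaging the pairwise Cohen's kappa values; a compact sketch (ours, for dichotomous ratings coded 0/1) is:

# Cohen's kappa for one pair of dichotomous ratings.
kappa_2x2 <- function(x, y) {
  po <- mean(x == y)                                       # observed agreement
  pe <- mean(x) * mean(y) + mean(1 - x) * mean(1 - y)      # expected (chance) agreement
  (po - pe) / (1 - pe)
}

# Light's kappa: the mean of the kappa values over all m(m-1)/2 pairs of raters.
light_kappa <- function(ratings) {
  pairs <- combn(ncol(ratings), 2)
  mean(apply(pairs, 2, function(p) kappa_2x2(ratings[, p[1]], ratings[, p[2]])))
}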
To calculate the 95% CI around the observed agreement and specific agreement, the sample size to be used is n√(m-1). Our simulations showed that the Fleiss correction is only needed for the upper limit. Note that in agreement studies the lower limit (indicating how poor the agreement can be) is often more important than the upper limit of the confidence interval.
The formula for specific agreement uses the average of the values in the discordant cells (cells b and c). However, different values in these discordant cells of a 2 by 2 table are known to lead to slightly higher kappa values, i.e. when there are systematic differences between the raters [19]. By averaging we ignore potential systematic differences between raters. Presentation of the magnitude of systematic differences is therefore important.
The clinical perspective
Researchers have become used to kappa values. They like the single value for inter-observer agreement, together with the appraisal system as proposed by Landis and Koch [20] or Fleiss [17]. They experience problems, however, when the prevalence of abnormalities becomes low or high, as kappa may have a counterintuitively low value in these situations. It is often stated that kappa cannot be interpreted in these situations. Vach [21] argued that one cannot blame kappa for what it is supposed to do, i.e. adjusting for chance agreement, and chance agreement is high when the prevalence of (ab)normalities becomes high, leading to low kappa values. Researchers sometimes sample patients specifically for a reliability study in order to obtain approximately equal numbers in each category. However, from a clinical perspective this is not realistic, and it hampers the generalizability of the results to clinical practice.
The largest criticism of using the proportion of agreement is that it does not correct for chance agreement, and that it is therefore overestimated [22]. Note that chance agreement is a non-issue when we look at inter-observer agreement from a clinical perspective. Knowledge of the probability that colleagues will provide the same answer does not reveal (nor is it important) whether this is chance agreement or not. One is merely interested in this probability. When the prevalence of abnormalities is high, the probability that colleagues agree will be higher; when the prevalence is low, this probability will be lower, because the same values of the discordant cells (b and c cells) are related to the concordant positive cells and the concordant negative cells (a and d cells).
For example, in Appendix A.6 we present an example with a high prevalence of 86% 'satisfied' scores (258/300) and 14% 'not-satisfied' scores (42/300). In that case the 38 discordant scores would have been related to 220 concordant 'satisfied' scores and to only 4 concordant 'not-satisfied' scores, resulting in a proportion of positive agreement of 0.853 and a negative agreement of 0.095. The clinical application would be that if one surgeon scores 'not satisfied', it is worthwhile to ask the opinion of a second surgeon. In case of a 'satisfied' score the probability that the other surgeon agrees is 85.3%, so it is probably not worth the effort of involving a second surgeon. It is this dual information that makes specific agreement especially useful for clinical practice.
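In the notation used above, the calculation behind this example is simply PA = 220 / (220 + 38) = 0.853 and NA = 4 / (4 + 38) = 0.095: the 38 discordant scores weigh lightly against the many concordant 'satisfied' scores, but heavily against the few concordant 'not-satisfied' scores.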
Strengths and limitations
In our example the photographs were originally scored on an ordinal level and dichotomized afterwards. This may have affected the outcomes with regard to the level of agreement. However, in this paper the focus is on the parameters used to express the agreement and not on the level of agreement among the surgeons. The example was merely used as an illustration for the proposed method. Moreover, the example does not contain any missing values, which is unrealistic in many situations. In a future study we will therefore focus on how to deal with missing values in agreement studies.
Conclusion
Specific agreement as described previously for the situation of two raters can also be used in the situation of more than two observers.
References

1. Dunn G. Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. Oxford University Press, New York NY, 1989.
2. Shoukri MM. Measures of Interobserver Agreement. Chapman & Hall/CRC, Boca Raton FL, 2004.
3. Lin L, Hedayat AS, Wu W. Statistical Tools for Measuring Agreement. Springer, New York NY, 2012.
4. Gwet KL. Handbook of Inter-Rater Reliability: A Definitive Guide to Measuring the Extent of Agreement Among Raters. 3rd edition. Advanced Analytics, LLC, Gaithersburg MD, 2012.
5. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20: 37-46.
6. Kappa coefficients: a critical appraisal. http://www.john-uebersax.com/stat/kappa.htm#procon (accessed October 18, 2016).
7. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993; 46(5): 423-9.
8. Lantz CA, Nebenzahl E. Behaviour and interpretation of the kappa statistic: resolution of the two paradoxes. J Clin Epidemiol 1996; 49(4): 431-4.
9. de Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL. Clinicians are right not to like Cohen's κ. BMJ 2013; 346: f2125. doi:10.1136/bmj.f2125.
10. Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hrobjartsson A, Roberts C, Shoukri M, Streiner DL. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol 2011; 64(1): 96-106.
11. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945; 26(3): 297-302.
12. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990; 43(6): 551-8.
13. Dikmans REG, Nene L, de Vet HCW, Mureau MAM, Bouman MB, Ritt MJPF, Mullender MG. The aesthetic items scale: a tool for the evaluation of aesthetic outcome after breast reconstruction. Plastic and Reconstructive Surgery Global Open, in press.
14. Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med 2006; 25(24): 4279-92.
15. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. http://www.r-project.org/.
16. Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1986; 1(1): 54-75. http://projecteuclid.org/euclid.ss/1177013815.
17. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. 3rd edition. John Wiley & Sons, Hoboken NJ, 2003.
18. Light RJ. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol Bull 1971; 76(5): 365-77.
19. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problem of two paradoxes. J Clin Epidemiol 1990; 43(6): 543-9.
20. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159-74.
21. Vach W. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol 2005; 58(7): 655-61.
22. Hallgren KA. Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol 2012; 8(1): 23-34.
Table 1: Two by two table with dichotomous scores of two raters

                        Rater X
                        Positive   Negative   Total
Rater Y   Positive      a          b          a + b
          Negative      c          d          c + d
          Total         a + c      b + d      a + b + c + d
Table 2: Pairwise comparison of surgeons' scores and a summation table.

Table 2A: Pairwise comparison of raters' scores: six 2 by 2 tables, one for each pair of surgeons (Surg 1 vs 2; Surg 1 vs 3; Surg 1 vs 4; Surg 2 vs 3; Surg 2 vs 4; Surg 3 vs 4), each based on the 50 photographs.
Table 2B: Summed table

                        Surg X
                        S      Not S   Total
Surg Y   S              118    42      160
         Not S          34     106     140
         Total          152    148     300

Table 2C: Summed table with averaged b and c cells

                        Surg X
                        S      Not S   Total
Surg Y   S              118    38      156
         Not S          38     106     144
         Total          156    144     300
Table 3: Positive and negative agreement for the situation that the prevalence of positive scores is not equal to the prevalence of negative scores (Table 3A), and the situation that the prevalence of both positive and negative scores is 50% (Table 3B).

Table 3A
                        Rater X
                        S      Not S   Total
Rater Y   S             118    34      152
          Not S         42     106     148
          Total         160    140     300

Table 3B
                        Rater X
                        S      Not S   Total
Rater Y   S             112    38      150
          Not S         38     112     150
          Total         150    150     300