Specific agreement on dichotomous outcomes in the situation of more than two raters

Henrica C.W. de Vet, PhD; Rieky E. Dikmans, MD; Iris Eekhout, PhD

Journal of Clinical Epidemiology (2017)
PII: S0895-4356(16)30837-X
DOI: 10.1016/j.jclinepi.2016.12.007
Reference: JCE 9294

Received: 30 April 2016; Revised: 31 October 2016; Accepted: 1 December 2016


Manuscript Number: JCE-16-330R2
Article Type: Original Article

Corresponding Author: Henrica C.W. de Vet, PhD, EMGO Institute, VU University Medical Center. Phone: +31 20 4446014; Fax: +31 20 4446775; E-mail: [email protected]

Order of Authors: Henrica C.W. de Vet, PhD; Rieky E. Dikmans, MD; Iris Eekhout, PhD
Affiliation: Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, De Boelelaan 1089A, 1081 HV Amsterdam, Netherlands

Manuscript Region of Origin: NETHERLANDS

Abstract

Objective: For assessing inter-rater agreement, the concepts of observed agreement and specific agreement have been proposed. They have been described for the situation of two raters and dichotomous outcomes, whereas often multiple raters are involved. We aim to extend these concepts to more than two raters and to examine how to calculate agreement estimates and 95% confidence intervals (CIs).

Study design and setting: As an illustration we used a reliability study in which 4 plastic surgeons classified photographs of the breasts of 50 women after breast reconstruction as 'satisfied' or 'not satisfied'. In a simulation study we checked the hypothesized sample size for the calculation of 95% CIs.

Results: For m raters, all m(m-1)/2 pairwise tables were summed. Then the discordant cells were averaged before the observed and specific agreements were calculated. The total number (N) in the summed table is m(m-1)/2 times larger than the number of subjects (n): in the example, N = 300, compared to n = 50 subjects rated by m = 4 raters. A correction of n√(m-1) was appropriate to obtain 95% CIs comparable to bootstrapped CIs.

Conclusion: The concepts of observed agreement and specific agreement can be extended to more than two raters with a valid estimation of the 95% CIs.

Running title: Specific agreement for more than two raters.

Key words: observed agreement; specific agreement; confidence intervals; continuity correction; Fleiss correction.

Introduction

Clinicians interested in observer variation pose questions such as "Is my diagnosis in agreement with that of my colleagues?" (inter-observer) and "Would I obtain the same result if I repeated the assessment?" (intra-observer). To express the level of intra-rater or inter-rater agreement, many textbooks on reliability and agreement studies [1-4] recommend Cohen's kappa as the most adequate measure. Cohen introduced kappa as a coefficient of agreement for categorical outcomes [5]. However, unsatisfactory situations occur when a high level of agreement is accompanied by a low kappa value. These counterintuitive results have led to many adjustments and extensions of kappa [6], e.g. to adjust for bias or prevalence [7]. Other authors recommend presenting a number of accompanying measures to enable correct interpretation [8], or propose alternative parameters such as Gwet's AC1 coefficient [4].

Cohen's kappa has been frequently recommended and used to assess agreement within and between raters classifying characteristics of patients. A 2013 paper advised against the use of Cohen's kappa [9] because kappa is not a measure of agreement but a measure of reliability [9,10]. When interested in inter-rater agreement in clinical practice, the most relevant question concerns agreement between raters, and more specifically 'what is the probability that colleagues will provide the same answer'. It was argued that, by adjusting the observed agreement for the expected agreement, as Cohen's kappa does, an agreement measure (observed agreement) is turned into a reliability measure. Reliability measures are less informative in clinical practice [9].

It was therefore recommended to use the proportion of observed agreement or the proportion of specific agreement instead of Cohen's kappa. The proportion of specific agreement distinguishes agreement on positive scores from agreement on negative scores, which might have different implications in clinical practice.

The simple situation of two raters and two categories is presented in Table 1, resulting in a 2 by 2 table with four cells (a, b, c and d). The observed agreement is defined as (a + d)/(a + b + c + d).

-------------------- Table 1 here ------------------------------

The specific agreement on a positive score, known as the positive agreement (PA), is calculated with the following formula:

PA = 2a / (2a + b + c), which is the same as PA = a / (a + (b + c)/2).

Specific agreement on a negative score, the negative agreement (NA), is calculated using the formula:

NA = 2d / (2d + b + c), which is the same as NA = d / (d + (b + c)/2).

The inclusion of both discordant cells (b and c) in the formula accounts for the fact that these numbers might be different, in which case their average value is used.
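To make these formulas concrete, the short sketch below computes the observed, positive and negative agreement from the four cells of a 2 by 2 table. It is written in R (the software the authors used for their simulations [15]); the function name and layout are ours, not part of the paper.

```r
# Observed, positive and negative agreement for a 2 by 2 table of two raters:
# a = both positive, d = both negative, b and c = the two discordant cells.
agreement_2x2 <- function(a, b, c, d) {
  total <- a + b + c + d
  c(observed = (a + d) / total,
    positive = 2 * a / (2 * a + b + c),
    negative = 2 * d / (2 * d + b + c))
}
```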


The concept of specific agreement was introduced by Dice in 1945 [11] and revitalized by Cicchetti and Feinstein [12]. However, it has rarely been applied in the medical literature. This may be due to its restricted application to the situation of two raters and dichotomous response options, whereas often multiple raters are involved. We will therefore extend the measures of observed agreement and specific agreement to situations with more than two raters. We will provide formulas to calculate estimates and 95% confidence intervals (CIs) for the proportions of agreement and specific agreement. Subsequently, we will address the interpretation of the results, for example with respect to systematic differences between raters. In the discussion we will provide background information on some of the decisions we have made.

Methods

Description of example

The illustrative example data set is from a study by Dikmans and colleagues [13] (data in Appendix A.1) and is based on photographs of the breasts of 50 women after breast reconstruction. The photographs were independently scored by 5 surgeons, who rated the quality of the reconstruction on a 5-point ordinal scale varying from 'very dissatisfied' to 'very satisfied'. In this paper we use the data of 4 surgeons because one surgeon had some missing values. For this paper the satisfaction scores were dichotomized into satisfied (S; scores 4 and 5) and not satisfied (not-S; scores 1, 2 and 3).

Specific agreement with more than two raters

To assess agreement in a situation with two raters, the score of one rater has to be compared only to that of the other rater. With three raters, three comparisons are possible: rater 1 can be compared to rater 2 and to rater 3, and the agreement between raters 2 and 3 can be assessed. The formula to calculate the number of comparisons for m raters is m(m-1)/2. The agreement question can be generalized to: given that one rater (of the m raters) scores positive, what is the probability of a positive score by the other raters? The same question can obviously also be asked for a negative score.

Statistics

To obtain an agreement parameter for m raters, all m(m-1)/2 pairwise tables are summed. After summing these tables, the b and c cells are averaged, so placement of a value in the b cell or the c cell is arbitrary. In other words, it should not matter whether we put the scores of rater 1 horizontally or vertically in the 2 by 2 table. By averaging the values in the b and c cells of the summed table we get the same result, even if all higher values are in the b cell and all lower values in the c cell. Subsequently, the observed agreement and specific agreement are calculated.
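The summation step can be sketched as follows, assuming the ratings are stored as an n x m matrix of dichotomous scores (1 = positive/satisfied, 0 = negative/not satisfied), one row per subject and one column per rater; the function names are ours.

```r
# Sum all m(m-1)/2 pairwise 2 by 2 tables and average the discordant cells.
summed_pairwise_table <- function(ratings) {
  m <- ncol(ratings)
  tab <- matrix(0, 2, 2, dimnames = list(c("pos", "neg"), c("pos", "neg")))
  for (i in 1:(m - 1)) {
    for (j in (i + 1):m) {
      x <- ratings[, i]
      y <- ratings[, j]
      tab["pos", "pos"] <- tab["pos", "pos"] + sum(x == 1 & y == 1)
      tab["pos", "neg"] <- tab["pos", "neg"] + sum(x == 1 & y == 0)
      tab["neg", "pos"] <- tab["neg", "pos"] + sum(x == 0 & y == 1)
      tab["neg", "neg"] <- tab["neg", "neg"] + sum(x == 0 & y == 0)
    }
  }
  # Averaging the discordant cells makes the orientation of each pair irrelevant.
  bc <- (tab["pos", "neg"] + tab["neg", "pos"]) / 2
  tab["pos", "neg"] <- bc
  tab["neg", "pos"] <- bc
  tab
}

# Observed and specific agreement from the summed table,
# reusing the agreement_2x2() sketch given earlier.
agreement_from_table <- function(tab) {
  agreement_2x2(a = tab["pos", "pos"], b = tab["pos", "neg"],
                c = tab["neg", "pos"], d = tab["neg", "neg"])
}
```

Applied to the 50 x 4 matrix of dichotomized scores of the example, this procedure should reproduce the summed table with averaged discordant cells shown in Table 2C.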


The confidence interval for the proportion of agreement can be obtained using the simple normal approximation for an interval for proportions (see Appendix A.2 for the formulas and explanations). When more than two raters are involved we add up multiple 2 by 2 tables. The summed table then contains a higher total number (N = 6 * n = 300) than the number of subjects (n = 50) multiplied by the number of ratings (4), and the sample size used in the formula has to be adjusted. We assumed that a correction of n√(m − 1) would be appropriate. In a simulation study, we checked the resulting interval against the bootstrapped 95% CI. Our simulation contained situations where a sample of 100 subjects was rated by 4 raters, which resulted in a corrected sample size for the CI formulas of n√(m − 1) = 100 * √(4 − 1). The prevalence of the categories was varied between 50%, 70%, 80% and 90%. The overall agreement was set to reach about 0.80. We repeated the analyses for 8 and 12 raters. Each condition was simulated 500 times [14] and simulations were performed in R statistical software [15].
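The comparison made in the simulation study can be sketched as follows, under our own simplifying assumptions: a plain normal-approximation interval computed with the corrected sample size n√(m − 1), set against a percentile bootstrap over subjects. This is not the authors' appendix code, and it omits the continuity and Fleiss corrections of Appendix A.5.

```r
# 95% CI for the observed agreement using the corrected sample size n*sqrt(m-1).
ci_observed_agreement <- function(ratings, conf = 0.95) {
  n <- nrow(ratings); m <- ncol(ratings)
  p <- unname(agreement_from_table(summed_pairwise_table(ratings))["observed"])
  n_eff <- n * sqrt(m - 1)                       # corrected sample size
  z  <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(p * (1 - p) / n_eff)
  c(estimate = p, lower = p - z * se, upper = p + z * se)
}

# Percentile bootstrap CI, resampling subjects (rows) with replacement.
bootstrap_ci_observed <- function(ratings, B = 500, conf = 0.95) {
  boot <- replicate(B, {
    idx <- sample(nrow(ratings), replace = TRUE)
    unname(agreement_from_table(
      summed_pairwise_table(ratings[idx, , drop = FALSE]))["observed"])
  })
  quantile(boot, probs = c((1 - conf) / 2, 1 - (1 - conf) / 2))
}
```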


Results

Calculation of observed and specific agreement

With four surgeons as raters it is possible to present six 2 by 2 tables (m(m-1)/2 = 4*3/2 = 6), representing the agreement between surgeons: 1 vs 2; 1 vs 3; 1 vs 4; 2 vs 3; 2 vs 4; 3 vs 4. The question of agreement with four surgeons can be answered by adding up the cells of these six 2 by 2 tables.

---------------------------- Table 2 here ------------------------------------

Calculating the proportions of observed and specific agreement for Table 2C results in an observed agreement of 0.747; the proportion of specific agreement is 0.756 for the 'satisfied' scores and 0.736 for the 'not-satisfied' scores (for formulas and calculations see Appendix A.3).
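These values can be checked directly from the cells of Table 2C (a = 118, b = c = 38, d = 106) with the sketch function given earlier; the same cell counts also give the effective sample sizes used below for the CIs of the specific agreements.

```r
# Check of the reported values using the cells of Table 2C.
agreement_2x2(a = 118, b = 38, c = 38, d = 106)
# approximately 0.747 (observed), 0.756 (positive) and 0.736 (negative)

# Effective sample sizes for the specific-agreement CIs (n = 50 subjects):
n <- 50
n * 156 / 300   # 'satisfied' scores: 26
n * 144 / 300   # 'not satisfied' scores: 24
```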


Calculation of confidence intervals

For the observed agreement we calculated the 95% CI departing from the total number of subjects (n = 50). As the specific agreement is calculated based on the positive scores ('satisfied') or negative scores ('not satisfied'), the sample size used for the calculation of the 95% CI in our example (see Table 2C) is 26 (= 156/300 * 50) and 24 (= 144/300 * 50), respectively.

The confidence intervals calculated with sample size n√(m − 1) corresponded to the bootstrapped confidence intervals based on 500 iterations [16]. A table with these results is presented in Appendix A.4. The results for the situation of 8 and 12 raters were similar (data not presented). The 95% CIs were quite close to the bootstrapped 95% CIs, indicating that n√(m − 1) is an adequate correction for the sample size. We also see that the Fleiss correction [17] is necessary for the upper limit. (The formulas to obtain the lower and upper limits of the 95% CI with continuity correction, both with and without the Fleiss correction, are presented in Appendix A.5.)


Dependency on prevalence

The proportions of specific agreement depend on the prevalence of the scores. As the same discordant cells (b and c) are related to both the positive and the negative scores, it is clear that when the prevalence of positive scores is above 50%, the positive agreement will be higher than the negative agreement, and vice versa.


In the case of Cohen’s kappa, the effect of prevalence seems counterintuitive to

137

researchers. If the prevalence of positive scores is > 90%, the proportion of observed

138

agreement is often high: say 0.95. In this case one would expect a high kappa value.

139

However, the expected agreement will also be high, i.e. more than 0.80 [(0.9 x 0.9) + (0.1

140

x 0.1)], and the resulting kappa value will be low. Note that in clinical practice there is no

141

such thing as chance agreement. The probability that a colleague agrees with a positive

142

score of a first rater will be larger if the prevalence of the positive score is higher.


We considered adjusting for the effect of prevalence by calculating the proportion of agreement when the prevalence is 50%. In Table 3A we present the results for the example that was used in Table 2B. When the prevalence is 50% (as in Table 3B), the a and d cells will be equal. If we then relate the values in the b and c cells to the positive and negative scores, these will be equal. Note that in that case the specific agreement has the same value as the total observed agreement.
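This property is easy to verify with the earlier sketch function, using the cells of Table 3B (a = d = 112, b = c = 38):

```r
agreement_2x2(a = 112, b = 38, c = 38, d = 112)
# observed, positive and negative agreement are all 224/300 = 0.747
```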

--------------------------------Table 3 here ------------------------------------------------


Systematic differences between raters

Table 2 shows, based on the different numbers in the b and c cells, that some surgeons give more positive scores than others. Given a sufficient number of raters, these systematic errors average out when the results of all pairs of raters are combined. In the case of large systematic errors, it is informative to point out these differences. This can be done, for example, by presenting the proportion of absolute agreement for each pair of raters and the prevalence of positive scores. In Table 2 the observed agreement runs from 0.68 (surgeons 2 vs 3) to 0.80 (surgeons 2 and 3 vs 4), and the prevalence of positive scores varies from 44% (surgeons 2 and 4) to 52% (surgeons 1 and 3).
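A rough sketch of how such pairwise summaries can be produced from the same n x m matrix of dichotomous ratings (function names are ours, not from the paper):

```r
# Observed agreement for every pair of raters.
pairwise_agreement <- function(ratings) {
  pairs <- combn(ncol(ratings), 2)
  apply(pairs, 2, function(p) mean(ratings[, p[1]] == ratings[, p[2]]))
}

# Prevalence of positive scores per rater, to reveal systematic differences.
positive_prevalence <- function(ratings) colMeans(ratings)
```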


Discussion

In this paper we have presented a method to easily obtain the proportions of agreement and specific agreement between multiple raters. The provided confidence interval formula enables calculation of the uncertainty around the agreement. The aim of the measures presented is to help clinicians in practice. There are a number of remarks to be made about the methods we have proposed. These concern the summation of the pairwise 2 by 2 tables, the calculation of the 95% CI, and the averaging of the values in the corresponding discordant cells.


From a conceptual point of view it seems sensible to add up all pairwise 2 by 2 tables to obtain an overall table, which forms the basis for calculating the proportions of total agreement and specific agreement. We compared this strategy to methods for calculating a kappa value for dichotomous scores when more than 2 raters are involved. The kappa value calculated for the overall table corresponds to the value of kappa as calculated by Light [18]. The Light kappa for more than two observers is obtained by calculating the kappa values for all 2 by 2 tables for pairs of observers and then averaging the result.
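For comparison, a minimal sketch of Light's approach under the same data layout: compute Cohen's kappa for every pair of raters and average the results. This is our own small implementation for dichotomous scores, not the authors' code.

```r
# Cohen's kappa for two columns of dichotomous (0/1) scores.
cohen_kappa <- function(x, y) {
  po <- mean(x == y)                                        # observed agreement
  pe <- mean(x) * mean(y) + (1 - mean(x)) * (1 - mean(y))   # expected agreement
  (po - pe) / (1 - pe)
}

# Light's kappa: the average of all pairwise kappa values.
light_kappa <- function(ratings) {
  pairs <- combn(ncol(ratings), 2)
  mean(apply(pairs, 2, function(p) cohen_kappa(ratings[, p[1]], ratings[, p[2]])))
}
```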


To calculate the 95% CI around the observed agreement and the specific agreement, the sample size to be used is n√(m − 1). Our simulations showed that the Fleiss correction is only needed for the upper limit. Note that in agreement studies the lower limit (indicating how poor the agreement can be) is often more important than the upper limit of the confidence interval.


The formula for specific agreement uses the average of the values in the discordant cells (cells b and c). However, different values in these discordant cells in a 2 by 2 table are known to lead to slightly higher kappa values, i.e. when there are systematic differences between the raters [19]. By averaging, we ignore potential systematic differences between raters. Presentation of the magnitude of systematic differences is therefore important.

The clinical perspective

Researchers have become used to kappa values. They like the single value for inter-observer agreement, together with the appraisal system proposed by Landis and Koch [20] or Fleiss [17]. They experience problems, however, when the prevalence of abnormalities becomes low or high, as kappa may have a counterintuitively low value in these situations. It is often stated that kappa cannot be interpreted in these situations. Vach [21] argued that one cannot blame kappa for doing what it is supposed to do, i.e. adjusting for chance agreement; chance agreement is high when the prevalence of (ab)normalities becomes high, leading to low kappa values. Researchers sometimes sample patients specifically for a reliability study in order to obtain approximately equal numbers in each category. However, from a clinical perspective this is not realistic and it hampers the generalizability of the results to clinical practice.

The largest criticism of using the proportion of agreement is that it does not correct for chance agreement, and that it is therefore overestimated [22]. Note that chance agreement is a non-issue when we look at inter-observer agreement from a clinical perspective. Knowledge of the probability that colleagues will provide the same answer does not reveal whether this is chance agreement or not, nor is that important; one is merely interested in this probability. When the prevalence of abnormalities is high, the probability that colleagues agree will be higher; when the prevalence is low, this probability will be lower, because the same values of the discordant cells (the b and c cells) are related to the concordant positive cells and the concordant negative cells (the a and d cells).

For example, in Appendix A.6 we present an example with a high prevalence of 86% 'satisfied' scores (258/300) and 14% 'not-satisfied' scores (42/300). In that case the 38 discordant scores would have been related to both 220 'satisfied' and 4 'non-satisfied' scores, resulting in a proportion of positive agreement of 0.853 and a negative agreement of 0.095. The clinical application would be that if one surgeon scores 'not satisfied', it is worthwhile to ask the opinion of a second surgeon. In case of a 'satisfied' score the probability that the other surgeon agrees is 85.3%, so it is probably not worth the effort of involving a second surgeon. It is this dual information that makes specific agreement especially useful for clinical practice.


Strengths and limitations

In our example the photographs were originally scored on an ordinal level and dichotomized afterwards. This may have affected the outcomes with regard to the level of agreement. However, in this paper the focus is on the parameters used to express the agreement and not on the level of agreement among the surgeons. The example was merely used as an illustration of the proposed method. Moreover, the example does not contain any missing values, which is unrealistic in many situations. In a future study we will therefore focus on how to deal with missing values in agreement studies.


Conclusion

Specific agreement, as described previously for the situation of two raters, can also be used in the situation of more than two observers.

References

1. Dunn G. Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. New York, NY: Oxford University Press; 1989.
2. Shoukri MM. Measures of Interobserver Agreement. Boca Raton, FL: Chapman & Hall/CRC; 2004.
3. Lin L, Hedayat AS, Wu W. Statistical Tools for Measuring Agreement. New York, NY: Springer; 2012.
4. Gwet KL. Handbook of Inter-Rater Reliability: A Definitive Guide to Measuring the Extent of Agreement Among Raters. 3rd edition. Gaithersburg, MD: Advanced Analytics, LLC; 2012.
5. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20: 37-46.
6. Kappa coefficients: a critical appraisal. http://www.john-uebersax.com/stat/kappa.htm#procon (accessed October 18, 2016).
7. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993; 46(5): 423-9.
8. Lantz CA, Nebenzahl E. Behaviour and interpretation of the k statistic: resolution of the two paradoxes. J Clin Epidemiol 1996; 49(4): 431-4.
9. De Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL. Clinicians are right not to like Cohen's κ. BMJ 2013; 346: f2125. doi:10.1136/bmj.f2125.
10. Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hrobjartsson A, Roberts C, Shoukri M, Streiner DL. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol 2011; 64(1): 96-106.
11. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945; 26(3): 297-302.
12. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990; 43(6): 551-8.
13. Dikmans REG, Nene L, de Vet HCW, Mureau MAM, Bouman MB, Ritt MJPF, Mullender MG. The aesthetic items scale: a tool for the evaluation of aesthetic outcome after breast reconstruction. Plastic and Reconstructive Surgery Global Open, in press.
14. Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med 2006; 25(24): 4279-92.
15. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. http://www.r-project.org/.
16. Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1986; 1(1): 54-75. http://projecteuclid.org/euclid.ss/1177013815.
17. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. 3rd edition. Hoboken, NJ: John Wiley & Sons; 2003.
18. Light RJ. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol Bull 1971; 76(5): 365-77.
19. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problem of two paradoxes. J Clin Epidemiol 1990; 43(6): 543-9.
20. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159-74.
21. Vach W. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol 2005; 58(7): 655-61.
22. Hallgren KA. Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol 2012; 8(1): 23-34.

Table 1: Two by two table with dichotomous scores of two raters

                           Rater X
                     Positive   Negative   Total rater X
Rater Y   Positive       a          b          a+b
          Negative       c          d          c+d
          Total rater Y a+c        b+d       a+b+c+d

Table 2: Pairwise comparison of surgeons' scores and a summation table.

Table 2A: Pairwise comparison of raters' scores (each pairwise 2 by 2 table is shown with its four cell counts)

Surg 1 vs 2   Surg 1 vs 3   Surg 1 vs 4   Surg 2 vs 3   Surg 2 vs 4   Surg 3 vs 4
  17   9        19   7        18   8        16   6        17   5        19   7
   5  19         7  17         4  20        10  18         5  23         3  21

Table 2B: Summed table

                       Surg X
                  S     Not S   Total X
Surg Y   S       118      42      160
         Not S    34     106      140
         Total Y 152     148      300

Table 2C: Summed table with averaged b and c cells

                       Surg X
                  S     Not S   Total X
Surg Y   S       118      38      156
         Not S    38     106      144
         Total Y 156     144      300

Table 3: Positive and negative agreement for the situation that the prevalence of positive scores is not equal to the prevalence of negative scores (Table 3A), and the situation that the prevalence of both positive and negative scores is 50% (Table 3B).

Table 3A

                       Rater X
                  S     Not S   Total X
Rater Y  S       118      34      152
         Not S    42     106      148
         Total Y 160     140      300

Table 3B

                       Rater X
                  S     Not S   Total X
Rater Y  S       112      38      150
         Not S    38     112      150
         Total Y 150     150      300