The Clavien-Dindo Classification of Surgical Complications is Not a Statistically Reliable System for Grading Morbidity in Pediatric Urology


Moira E. Dwyer,* Joseph T. Dwyer, Glenn M. Cannon, Jr., Heidi A. Stephany, Francis X. Schneck and Michael C. Ost

From the Department of Urology, Division of Pediatric Urology, Children’s Hospital of Pittsburgh, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania, and de la Torre Research, New York, New York (JTD)

Purpose: Although the Clavien-Dindo classification of surgical complications has been evaluated using adult surgical patients, it is being applied to pediatric populations. We hypothesized that this instrument is not well suited to children and sought to determine the reliability of the tool in a pediatric urological population.

Materials and Methods: We replaced adult surgical cases in the “Survey to Assess Acceptability and Reproducibility of the Classification” from the original Clavien-Dindo study with pediatric urology cases and mimicked original study methods. The survey was distributed with the REDCap (Research Electronic Data Capture) tool, and Krippendorff α coefficients of reliability were calculated from the responses.

Results: There were 51 respondents and 40 complete responses. The Krippendorff α coefficient of reliability for the Clavien-Dindo classification (α = 0.487) did not achieve the minimum level of acceptable agreement (α = 0.667) with the pediatric urological cases, even when the disability suffix (α = 0.266) was excluded from the analysis (α = 0.632). The accuracy of the grading system with the pediatric urological surgical cases when excluding the disability suffix (410 of 550, 75%) was significantly less than the accuracy had been with the original adult cases (1,816 of 2,016, 90%, p <0.0001). While 89% of respondents (32 of 36) thought the system was appropriate for adults, only 49% (17 of 35) found it appropriate for children (p <0.001).

Conclusions: The Clavien-Dindo classification of surgical complications is not a reliable tool for use in pediatric urology, where its accuracy is significantly decreased compared to adult surgical cases. Further study is needed to determine if findings are similar across all pediatric surgical groups.

Abbreviations and Acronyms
C-DC = Clavien-Dindo classification

Accepted for publication September 1, 2015.
No direct or indirect commercial incentive associated with publishing this article.
The corresponding author certifies that, when applicable, a statement(s) has been included in the manuscript documenting institutional review board, ethics committee or ethical review board study approval; principles of Helsinki Declaration were followed in lieu of formal ethics committee approval; institutional animal care and use committee approval; all human subjects provided written informed consent with guarantees of confidentiality; IRB approved protocol number; animal approved project number.
* Correspondence: 435 Franklin Ave., Pittsburgh, Pennsylvania 15221 (telephone: 607-4355717; FAX: 412-692-7939; e-mail: jurology@moiradwyer.com).

Key Words: intraoperative complications, morbidity, mortality, pediatrics, postoperative complications

0022-5347/16/1952-0001/0 THE JOURNAL OF UROLOGY® © 2015 by AMERICAN UROLOGICAL ASSOCIATION EDUCATION AND RESEARCH, INC.

http://dx.doi.org/10.1016/j.juro.2015.09.071 Vol. 195, 1-5, February 2016 Printed in U.S.A.

Dochead: Pediatric Urology

FLA 5.4.0 DTD  JURO12945_proof  17 October 2015  4:14 pm  EO: JU-15-1150

Within this decade there has been a strong movement to report surgical complications according to a standardized classification system.1 One such system is the Clavien-Dindo classification of surgical complications, introduced in 2004 (supplementary figure, www.jurology.com).2 The 5-year experience with this instrument was published in 2009 with the declaration, “The classification of surgical complications has reached its goal, and can be recommended in its current form for use in retrospective and prospective studies.”3 The subsequent endorsement of the European Association of Urology guidelines panel helped make the Clavien-Dindo classification the most commonly used standardized system in urology.1,4 The Clavien-Dindo classification similarly began gaining traction among pediatric urological surgeons, who referenced the classification system in the abstracts of 22 pediatric urological publications from 2008 to 2014. However, the Clavien-Dindo classification was developed using a cohort of 6,336 adult surgical patients and no children. Although its application to pediatric urological cases has been defended with the argument that the Clavien-Dindo classification “has been well validated,”5 it has not been assessed for accuracy or reliability in a pediatric population. We hypothesized that this therapeutic consequences based system is not well suited to children, and sought to assess the statistical reliability of the Clavien-Dindo classification in pediatric urology patients.

MATERIALS AND METHODS
We mimicked the survey based method that Dindo et al originally used to evaluate the instrument in the adult population.2 We emailed the classification of surgical complications and clinical examples of complication grades tables (supplementary figure) with an altered version of the “Survey to Assess Acceptability and Reproducibility of the Classification”2 (supplementary Appendix, www.jurology.com) to 28 American pediatric urology fellowship directors and 95 other pediatric urologists and pediatric urology fellows within the United States. Imitating the original instructions to obtain “replies from surgeons at various levels of training, respectively, at the junior level (intern to second year), at the senior resident level and from experienced surgeons,”2 recipients were asked to disseminate the survey to other pediatric urologists and trainees, including urological interns and residents.

Study data were collected and managed using REDCap tools hosted at the University of Pittsburgh Medical Center.6 REDCap is a secure, Web based application designed to support data capture for research studies, providing 1) an intuitive interface for validated data entry, 2) audit trails for tracking data manipulation and export procedures, 3) automated export procedures for seamless data downloads to common statistical packages and 4) procedures for importing data from external sources.

Respondents were asked to assign the correct C-DC grade and disability status, ie presence of a specific, persistent, disabling complication, to 14 written cases in 2 steps, and to grade cases according to novel morbidity and mortality categorization systems.7 The adult morbidity cases were all replaced with actual pediatric urological cases from our institution (supplementary Appendix). This substitution led to 1 less grade IIIa and IVa case each, which left no grade IIIa cases. All other cases matched the C-DC grade and disability suffix published with the original survey. A hypothetical case replaced the single surgical mortality case. The age range was changed from 20 to 86 years (median 49) to 7 months to 17 years (median 3.5 years). The part of the original survey that focused on personal judgments was expanded with questions about the perceived appropriateness of the C-DC and novel grading systems for pediatric and adult populations, the preferred grading system for communication of morbidity and mortality in the pediatric urological community, and respondent demographics (supplementary Appendix).

Our primary outcome, inter-rater reliability, was determined through the calculation of Krippendorff α coefficients (irr Package, version 0.84, method: nominal, cran.r-project.org/web/packages/irr/irr.pdf) and bootstrapped Krippendorff α coefficients (αθ, kripp.boot Package, version 0.2, 95% CI iterations: 1,000, method: nominal, seed: 2,116,133, www.rdocumentation.org/packages/kripp.boot/functions/kripp.boot), where 0.667 is the minimum accepted level of agreement, good agreement is 0.800 and perfect agreement is 1.000.8,9 Additional statistical analyses were performed using R statistical programming software, version 3.0.2-1 (R Project for Statistical Computing, www.R-project.org) and consisted of two-way ANOVA, Fisher exact probability test and Welch 2-sample t-test. Any p value of 0.05 or less was considered statistically significant.

RESULTS
There were 51 responses and 40 respondents who categorized all 14 pediatric cases. The table displays self-reported demographics. Respondents correctly classified 64% of cases (270 of 421), which is significantly lower than the 90% accuracy rate yielded by the C-DC with the original adult survey (1,816 of 2,016, p <0.0001). When accuracy only required correct identification of the C-DC grade and not the disability status, the pediatric urology rate (75%, 410 of 550 cases) was still significantly lower than the adult rate (p <0.0001). On multivariate analysis accuracy of the C-DC with pediatric urology cases proved independent of the case and level of respondent training/experience (p = 0.80; p = 0.50 for C-DC grade only).

Of the respondents 81% (29 of 36) answered positively to the question, “Do you think that this classification is reproducible?” Although this rate is statistically no different from the 91% (131 of 144) who perceived reproducibility in the adult survey, the inter-rater reliability among those who took the pediatric urology survey was 0.487. The entire confidence interval of the predicted agreement coefficient for the greater population (αθ = 0.471, median 0.472, 95% CI 0.363-0.568) was below the minimally acceptable level of agreement (see figure). Agreement regarding disability status was poor (α = 0.266, αθ = 0.228, median 0.259, 95% CI 0.010-0.439) but when it was excluded, the reliability coefficient for C-DC grade alone among respondents was only 0.632. The uppermost portion of the 95% confidence interval for the predicted agreement coefficient did extend above 0.667 (αθ = 0.615, median 0.620, 95% CI 0.484-0.732, see figure). While 89% of respondents (32 of 36) to the pediatric urology survey believed that the C-DC was appropriate for adults, only 49% (17 of 35) believed it was appropriate for pediatrics (p <0.0003). Additional findings are presented in the supplementary table (www.jurology.com).

Respondent demographics

No. level of training/experience (%):
  Department of urology chair, division of pediatric urology chief or
    pediatric urology fellowship program director                         10 (20)
  Pediatric urology attending in practice 7 yrs or more                   10 (20)
  Pediatric urology attending in practice less than 7 yrs                 9 (18)
  Urology resident who had rotated on pediatric urology service           6 (12)
  Urology resident who had never rotated on pediatric urology service     3 (6)
  Pediatric urology fellow                                                1 (2)
  Adult urology attending                                                 1 (2)
  Unreported                                                              11 (22)
No. practice type (%):
  Academic                                                                10 (20)
  In training                                                             10 (20)
  Private                                                                 9 (18)
  Publicly funded                                                         6 (12)
  Retired less than 3 yrs                                                 3 (6)
  Unreported                                                              1 (2)

Figure. Predicted agreement coefficient or inter-rater reliability of Clavien-Dindo classification of surgical complications for entire population that grades pediatric urological morbidity and mortality (1,000 iterations).

DISCUSSION
The Clavien-Dindo classification of surgical complications was introduced in its current form in 2004.2 Widespread adoption and implementation was based on the original assessments of the accuracy and reproducibility of the tool, and fostered in urology by the general endorsement of the European Association of Urology guidelines panel.4 Despite acknowledgement that the C-DC was evaluated in a general surgery population (adults from surgical disciplines including urology),5 it is being used to report procedural outcomes in the pediatric and pediatric urological communities. A novel scoring system for the prediction of major postoperative complications in children has been designed on the application of the C-DC to pediatric surgical cases,10 and evaluated through the application of the C-DC to pediatric cases.11 We hypothesized that the inherent and immutable differences between adult and pediatric populations make it premature to categorically apply the conclusions of the adult surgical study to children, and we set out to assess the statistical reliability and accuracy of the C-DC in pediatric urology patients to determine if it is actually valid for use in this population.

Reliability, a pillar of validity, is the degree of nonrandom agreement between a representative sample of independent raters who are applying identical grading guidelines to an identical set of cases. A reliable system is not affected by variations in the extraneous circumstances of the measuring process,12 such as the idiosyncratic habits, biases and levels of knowledge specific to each rater, a necessary characteristic of a standardized and valid coding process.
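
As a concrete illustration of nonrandom agreement, the following is a minimal Python sketch of a nominal Krippendorff α and a simple bootstrap around it. It is not the irr or kripp.boot code named in the Materials and Methods. The function names, the ratings and the choice to resample whole cases with replacement are assumptions made for illustration only, and kripp.boot may resample differently.

```python
import random
from collections import Counter
from itertools import permutations


def alpha_nominal(units):
    """Nominal Krippendorff alpha.

    `units` is a list of cases; each case is the list of grades it received
    (raters who skipped a case are simply left out of that list).
    Returns None when alpha is undefined (no variation in the grades).
    """
    pairs = Counter()  # coincidence matrix o[(c, k)]
    for grades in units:
        m = len(grades)
        if m < 2:
            continue  # a case graded once carries no agreement information
        for c, k in permutations(grades, 2):  # ordered pairs of different raters
            pairs[(c, k)] += 1.0 / (m - 1)
    totals = Counter()  # n_c: marginal total for each grade
    for (c, _k), weight in pairs.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in pairs.items() if c != k)
    expected = n * n - sum(v * v for v in totals.values())
    if expected == 0:
        return None
    return 1.0 - (n - 1) * observed / expected


def bootstrap_alpha(units, iterations=1000, seed=0, threshold=0.667):
    """Percentile 95% interval for alpha, resampling cases with replacement.

    Also returns the share of bootstrap samples above `threshold`.
    Assumption: case-level resampling, chosen here for simplicity.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(iterations):
        sample = [rng.choice(units) for _ in units]
        value = alpha_nominal(sample)
        if value is not None:  # skip degenerate resamples
            draws.append(value)
    draws.sort()
    lower = draws[int(0.025 * (len(draws) - 1))]
    upper = draws[int(0.975 * (len(draws) - 1))]
    share_above = sum(v > threshold for v in draws) / len(draws)
    return lower, upper, share_above


# Hypothetical ratings: 12 cases, each graded by 2 or 3 raters.
ratings = [
    ["II", "II"], ["I", "I"], ["IIIb", "IIIb"], ["IIIb", "IIIb", "IVa"],
    ["IVa", "IVa", "IVa"], ["I", "IIIb"], ["II", "II"], ["I", "I"],
    ["I", "I"], ["IIIb", "IIIb"], ["IIIb", "IIIb"], ["IIIb", "IVa"],
]

alpha = alpha_nominal(ratings)
print(round(alpha, 3))  # 0.691
print(bootstrap_alpha(ratings, iterations=1000, seed=0))
```

In this toy data set 9 of 12 cases are graded identically by every rater, yet α is only about 0.69, because the coefficient discounts the agreement that would be expected by chance given how often each grade was used.
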
Dindo et al recognized the importance of reliability, stating, “Data on outcome must be obtained in a standardized and reproducible manner to allow comparison among different centers, between different therapies and within a center over time.”2 However, they noted that the rate of accurate responses to their survey did not linearly depend on the level of respondent training or the nation in which the survey was taken, and deemed sufficient the percentage of respondents (91%, 131 of 144) who positively answered their yes-or-no question, “Do you think that this classification is reproducible?” Although these methods are not used as measures of statistical reliability elsewhere in the literature, the authors concluded that the C-DC “constitutes a ... reproducible approach for comprehensive surgical outcome assessment.”2 de la Rosette et al calculated moderate agreement among urologists grading adult percutaneous nephrolithotomy cases using the Fleiss kappa test (κ = 0.457),13 which has no generally accepted measure of significance, is a controversial scale, and has sensitivity to the number of respondents and grading options.14 Therefore, we elected to calculate reliability using Krippendorff α, a highly rigorous and robust statistical measure of reliability “regardless of the number of observers, levels of measurement, sample sizes and presence or absence of missing data.”12 In fact, this coefficient is defended as the “standard measure of reliability.”15 With Krippendorff α, data with a reliability coefficient of less than 0.667 should be discarded. Although 81% of our respondents thought the C-DC was reproducible, the Krippendorff α coefficient was 0.487. The αθ, or predicted coefficient in the population our cohort of respondents represented, and its entire 95% confidence interval were unacceptable. Agreement about the disability status was so poor that we recalculated reliability in its absence (see figure). While the resulting coefficient was unacceptable among our respondents, 19.4% of bootstrapped samples were above 0.667. Nevertheless, 0 of 1,000 samples reached good agreement (0.800). We were unable to obtain the data to calculate Krippendorff α for the original C-DC survey. Data from the pediatric urological survey should not be used to decry or herald the reliability of the C-DC in adult surgical patients.

The second pillar of validity is accuracy. The percentage of respondents who assigned the correct C-DC grades was 90% for the original adult survey but decreased to 64% when questions were replaced with pediatric urology cases (p <0.0001). It is not known whether the original definition of an accurate response demanded the correct assignment of grade and disability status, but when disability status was excluded from the analysis, the accuracy for the pediatric urology cases remained significantly lower than it had been with the adult cases (75%, p = 0.0001). We had significantly fewer respondents who perceived the C-DC to be simple, logical and useful in comparison to the respondents who took the adult survey (p = 0.001 to 0.02).
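
The proportion comparisons above can be reproduced in outline with a two-sided Fisher exact test. The Materials and Methods list the Fisher exact probability test among the analyses but do not state which comparison used which test, so applying it to these counts is an assumption. The function name is hypothetical, and the sketch treats every response as independent, which ignores the fact that each respondent graded several cases.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-7  # guards against floating point ties
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )


# Accuracy: 270 of 421 correct (pediatric survey) vs 1,816 of 2,016 (adult survey).
p_accuracy = fisher_exact_two_sided(270, 421 - 270, 1816, 2016 - 1816)

# Opinion: 17 of 35 found the C-DC appropriate for children vs 32 of 36 for adults.
p_opinion = fisher_exact_two_sided(17, 35 - 17, 32, 36 - 32)

print(p_accuracy < 0.0001, p_opinion < 0.01)
```

Because the same respondents answered both opinion questions, a paired test could arguably be more appropriate for the second comparison. The sketch shows only the mechanics of the unpaired exact test.
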
Curiously when we isolated a cohort of surgeons who actively care for adult patients from our respondents, we found no statistical difference between their perceptions about the simplicity, logic and usefulness of the C-DC and the perceptions of the adult surgeons and trainees who responded to the original survey. Despite this phenomenon, when comparing those who operate on adults with those who operate on children, statistically equivalent proportions perceived the C-DC to be inappropriate for children and appropriate for adults (supplementary table). Furthermore, statistically equivalent proportions of these groups preferred the novel pediatric morbidity and mortality grading systems to the C-DC for use in pediatric urology.

Our findings indicate that the C-DC is not a valid tool for use in pediatric urology, but survey design could have influenced some of our other results. For example each morbidity case was followed by a separate question regarding disability status. While respondents could progress without answering this question, the hard stop may have solicited more positive responses than a single menu of options would have. We found 13 choices per case to appear clunky and daunting, and believe that our design likely represents the underlying stepwise decision making process of applying the C-DC to real-world cases in which the grade and the disability status are logically separate judgments. Clavien et al stated that the introduction of the disability status “often raised the question of which of the conditions may qualify for this suffix,”3 and we suspect that the notably poor inter-rater reliability for disability status is likely owing to an inherent challenge.

Sampling, researcher and respondent bias may have influenced responses to the opinion based part of the survey. Sampling bias typically results in a sample that is more homogeneous than the population of interest. Therefore, if truly present, the level of reliability among the true population may be lower than our estimate, which would strengthen our conclusion. On the other hand, observer expectancy effect could have contributed to the lower rate of respondents who marked that the C-DC is appropriate for children (49%) without decreasing the rate of respondents who marked that the system is appropriate for adults (89%, p = 0.0003). Regardless of these opinions, the objective findings reveal that the C-DC is inappropriate for pediatric urology patients given the inadequate statistical values reflecting at least 1 of the 2 vital components of validity.

Additional detail was provided with most pediatric urological morbidity cases than had been with the adult cases. This factor does not detract from an assessment of inter-rater reliability among the cohort, since all respondents were provided with the same level of detail.
Furthermore, because the grading system is intended for real-life scenarios, our inclusion of information derived from real-life events should have approximated reality more closely and introduced only extraneous circumstances that a reliable system must be able to overcome. However, this approach meant that there was no C-DC grade IIIa case in the pediatric urology survey. While one could argue that decreasing the variance among coding units could reduce estimates of Krippendorff α, this is an empirical question. Our survey design followed the recommendation that reliability research select coding units that exhibit enough variance to cover all or most values for coding through stratified rather than random sampling. Additionally a coding process must be able to discriminate among the population to which it is applied, and the absence of grade IIIa cases from our survey is representative of the observed variance in a pediatric population given that invasive procedures are almost always performed with the child under sedation or anesthesia. Such discrepancies in adult and pediatric practice models, and thus in coding unit variance, stem from the fundamental differences between the 2 surgical populations, highlighting why a coding system based on therapeutic consequences that was designed for or tested in adults should be evaluated separately in the pediatric population.

CONCLUSIONS
The Clavien-Dindo classification of surgical complications is a grading system that was deemed to be accurate and reproducible after testing with adult surgical patients. We replicated the survey based method originally used to assess the tool but substituted pediatric urology cases. The system was significantly less accurate in pediatric urological patients than it had been in adult patients, and it was shown statistically to have suboptimal reliability for grading morbidity in this population despite respondent perceptions to the contrary. These findings suggest that the Clavien-Dindo classification in its current form should not be used to grade morbidity in the pediatric urological population. While we suspect that the results of this study are explained by differences between adult and pediatric patients, causation was not investigated. Further study of the Clavien-Dindo classification in a pediatric general surgery population is needed before conclusions should be extended beyond pediatric urological patients. Efforts are being made to develop a superior option for the standardized reporting of surgical complications in pediatric urological and all pediatric surgical patients.

REFERENCES

1. Yoon PD, Chalasani V and Woo HW: Use of Clavien-Dindo classification in reporting and grading complications after urological surgical procedures: analysis of 2010 to 2012. J Urol 2013; 190: 1271.

2. Dindo D, Demartines N and Clavien PA: Classification of surgical complications: a new proposal with evaluation in a cohort of 6336 patients and results of a survey. Ann Surg 2004; 240: 205.

3. Clavien PA, Barkun J, de Oliveira ML et al: The Clavien-Dindo classification of surgical complications: five-year experience. Ann Surg 2009; 250: 187.

4. Mitropoulos D, Artibani W, Graefen M et al: Reporting and grading of complications after urologic surgical procedures: an ad hoc EAU guidelines panel assessment and recommendations. Eur Urol 2012; 61: 341.

5. Ozden E, Mercimek MN, Yakupoglu YK et al: Modified Clavien classification in percutaneous nephrolithotomy: assessment of complications in children. J Urol 2011; 185: 264.

6. Harris PA, Taylor R, Thielke R et al: Research electronic data capture (REDCap): a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 2009; 42: 377.

7. Dwyer ME, Stephany HA, Cannon GM Jr et al: Assessment of the Unplanned Postoperative Morbidity in Children (UPMC) Scoring System for use in pediatric urology. Presented at annual meeting of Northeastern Section of American Urological Association, Amelia Island, Florida, November 13-15, 2014.

8. Krippendorff K: Content Analysis: An Introduction to Its Methodology, 2nd ed. Thousand Oaks, California: Sage Publications Inc 2004.

9. Riffe D, Lacy S and Fico FG: Analyzing Media Messages: Using Quantitative Content Analysis in Research, 2nd ed. New York: Taylor and Francis 2005.

10. Weinberg AC, Huang L, Jiang H et al: Perioperative risk factors for major complications in pediatric surgery: a study in surgical risk assessment for children. J Am Coll Surg 2011; 212: 768.

11. Wood G, Barayan G, Sanchez DC et al: Validation of the pediatric surgical risk assessment scoring system. J Pediatr Surg 2013; 48: 2017.

12. Krippendorff K: Agreement and information in the reliability of coding. Commun Methods Meas 2011; 5: 93.

13. de la Rosette JJ, Opondo D, Daels FP et al: Categorisation of complications and validation of the Clavien score for percutaneous nephrolithotomy. Eur Urol 2012; 62: 246.

14. Gwet KL: Kappa coefficient: a review. In: Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement among Multiple Raters, 3rd ed. Gaithersburg, Maryland: Advanced Analytics LLC 2012; chapt 2, pp 15-25.

15. Hayes AF and Krippendorff K: Answering the call for a standard reliability measure for coding data. Commun Methods Meas 2007; 1: 77.