Reliability Assessment of International Grading System for Vesicoureteral Reflux
Charles B. Metcalfe, Andrew E. MacNeily and Kourosh Afshar*
From the Department of Urologic Sciences (CBM, AEM, KA), and School of Population and Public Health (KA), University of British Columbia, Vancouver, Canada
Abbreviations and Acronyms
VCUG = voiding cystourethrogram
VUR = vesicoureteral reflux
Study received institutional research ethics board approval.
* Correspondence: Division of Pediatric Urology, British Columbia Children’s Hospital, Suite K0-134, 4480 Oak St., Vancouver, British Columbia, Canada V6H 3V4 (telephone: 604-875-2734; FAX: 604-875-2721; e-mail:
[email protected]).
Purpose: The International Reflux Committee proposed a grading system for vesicoureteral reflux in 1985 which has been used extensively in everyday practice and research studies. Despite widespread use, based mainly on face validity, the interrater and intrarater reliability of this tool are not known. A tool cannot be considered valid unless it is reliable. Therefore, we estimated the interrater and intrarater reliability of the international grading system for vesicoureteral reflux.
Materials and Methods: A series of 28 voiding cystourethrogram studies was selected. The images were assembled in an electronic presentation in random fashion. Four pediatric radiologists, 5 pediatric urologists and 4 senior urology residents graded the studies. The images were then shuffled in a random fashion and re-rated after 7 days (total 728 observations). Cohen weighted kappa statistics were used to determine interrater and intrarater reliability. Subgroup analysis was then performed comparing the variability among the 3 groups of raters and different grades.
Results: The average interrater reliability was 0.53 (95% CI 0.52–0.55, p <0.0001). Agreement in subgroups was 0.61 for urologists, 0.59 for residents and 0.56 for radiologists. The lowest agreement was shown in grade III (0.36) and the highest in grade I (0.98). The intrarater reliability was 0.86 (95% CI 0.77–0.95, p <0.001).
Conclusions: The international grading system for vesicoureteral reflux shows low interrater reliability for moderate degrees of vesicoureteral reflux whereas the intrarater reliability is high. Modification of this system may improve its reproducibility.
Key Words: vesico-ureteral reflux, classification, reproducibility of results
www.jurology.com
VESICOURETERAL reflux is a common pediatric condition referring to the retrograde flow of urine from the bladder to the upper urinary tract (ureters and collecting system).1 The prevalence of reflux has been estimated to be approximately 0.4% to 1.8% in normal healthy children.2 VUR is usually diagnosed during the investigation of urinary tract infection or prenatal hydronephrosis. Currently the standard test for diagnosing VUR is the voiding cystourethrogram, which is also used to classify the severity of reflux. The international VUR grading system has been the most widely accepted tool to classify VUR for more than 25 years. The system was adopted by clinicians since it seemed valid and reliable on the surface (ie it possessed good face validity). In this study we assess interobserver and intra-observer reliability of the international VUR grading system. Our hypothesis was that the international grading system has acceptable interrater reliability for low and high grade reflux but not for moderate degrees of the condition.
0022-5347/12/1884-1490/0 THE JOURNAL OF UROLOGY® © 2012 by AMERICAN UROLOGICAL ASSOCIATION EDUCATION AND RESEARCH, INC. http://dx.doi.org/10.1016/j.juro.2012.02.015 Vol. 188, 1490-1492, October 2012 Printed in U.S.A.
RELIABILITY ASSESSMENT OF VESICOURETERAL REFLUX GRADING
METHODS AND MATERIALS
The protocol was approved by the University of British Columbia institutional research ethics board. A total of 28 VCUG studies were selected from the radiology library of a single institution (British Columbia Children’s Hospital). All identifying information was removed to preserve patient confidentiality and to avoid bias in case a viewer already knew the case. The images ranged from grade I to grade V, with an emphasis placed on moderate grades. The initial VCUG grading at the time of selection was 2 grade I, 9 grade II, 7 grade III, 5 grade IV and 5 grade V. We included more moderate degrees of VUR according to our a priori hypothesis, expecting more variability in this group. These images were assembled in a PowerPoint® presentation in a random fashion. The presentation began with a slide depicting and explaining the grading system. The presentation was viewed and graded by 4 pediatric radiologists, 5 pediatric urologists and 4 senior urology residents. The images were then shuffled and reassembled in a random fashion and reassessed by the same participants between 7 and 10 days later. All the ratings were done independently. The grading of the studies was recorded and assembled. Kappa statistics were used to calculate intrarater and interrater reliability. To evaluate the effects of a less complex system on interrater reliability we performed 2 simulation analyses, recalculating kappa statistics after combining grades II and III and then after combining grades III and IV.
DISCUSSION
In 1985 the International Reflux Committee proposed a system of 5 grades to classify VUR by way of VCUG.3 This international grading system was designed as a hybrid of the Heikel and Parkkulainen4 system and that of Dwoskin and Perlmutter.5 Heikel and Parkkulainen described the grades of reflux depending on the extent of ureteral and pelvic dilatation, while Dwoskin and Perlmutter based their system on calyceal changes such as blunting and clubbing during the VCUG. The reliability and validity of the international grading system have yet to be formally assessed in detail despite its widespread use in research and day-to-day clinical practice. It is not unusual to find disagreement between individuals when it comes to grading VUR. We hypothesized that this disagreement is more pronounced for moderate degrees of VUR. Inaccurate classification makes comparison of different clinical series difficult. In addition, the application of clinical study findings to routine clinical decision making may be compromised if the classification system proves to be unreliable. A prime example is the generation of guidelines based on a grading system.6 By definition a clinical instrument cannot be valid unless it is reliable.7 There are several types of reliability indices. For a classification system that is based on visual inspection and subjective opinion the degrees of agreement among observers (interrater) and within each observer for the same observation (intrarater) are the most pertinent. We used de-identified digital images of refluxing units from patients seen in our pediatric urology clinic. No clinical history was provided to the observers. The participants were all physicians familiar with VCUGs, including pediatric radiologists, pediatric urologists and senior urology residents. The second rating was done 7 to 10 days later to minimize recall bias.
The degree of agreement may be expressed as a proportion of observations with the same interpretation (concordance) or as kappa statistics. We elected to use only kappa statistics since they adjust the results for chance agreement. There are no set criteria for an optimal kappa. Nevertheless, most authors view a value of 0 to 0.4 as poor, 0.4 to 0.6 as fair, 0.6 to 0.8 as substantial and greater than 0.8 as almost perfect agreement.8 The average interrater reliability in this study was fair at 0.53. When subgroup analysis was performed based on the observers' type of training, urologists had the highest agreement (kappa 0.61) (see table). However, such small observed differences among groups are probably not clinically relevant. We also performed a subgroup analysis based on reflux grades. The highest agreements were seen at the extremes of the scale (grade I kappa 0.98 and
RESULTS
A total of 728 observations were performed. Mean interrater reliability was 0.53 (95% CI 0.52–0.55, p <0.0001). The lowest agreement was in moderate reflux, grade III, at 0.36 and the highest was for grade I at 0.98. Mean intrarater reliability was 0.86 (95% CI 0.77–0.95, p <0.001). Subgroup analysis showed the pediatric urologists had a mean interrater reliability of 0.61, senior urology residents 0.59 and the pediatric radiologists 0.56. The table shows the results with corresponding 95% CI. Combining grades II and III increased the agreement level to 0.79, and combining grades III and IV increased it to 0.63.
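To make the kappa figures and the grade-collapsing simulation concrete, the following is a minimal sketch of a weighted Cohen kappa for 2 raters on an ordinal scale. The linear weighting scheme, the function names and the 10 example ratings are assumptions made for illustration only; the article does not state which weights were used and none of the numbers below are study data.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weights="linear"):
    """Weighted Cohen kappa for two raters (sketch; weighting is an assumption).

    `categories` must list the ordinal categories in ascending order.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must grade the same non-empty set of studies")
    k = len(categories)
    if k < 2:
        raise ValueError("at least 2 categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions: observed[i][j] = share of studies graded i by A and j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)          # 0 for identical grades, 1 for opposite ends
        return d if weights == "linear" else d * d

    obs = sum(disagreement(i, j) * observed[i][j] for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - obs / exp


def collapse(ratings, merge):
    """Map every grade in `merge` onto the lowest grade of that set."""
    target = min(merge)
    return [target if r in merge else r for r in ratings]


# Hypothetical grades (1 = grade I ... 5 = grade V) from two raters; not study data.
rater_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
rater_b = [1, 2, 3, 2, 3, 4, 4, 3, 5, 5]

kappa_full = weighted_kappa(rater_a, rater_b, [1, 2, 3, 4, 5])

# Simulate a simplified system in which grades II and III form one category.
merged = {2, 3}
kappa_merged = weighted_kappa(
    collapse(rater_a, merged), collapse(rater_b, merged), [1, 2, 4, 5]
)

print(f"5-grade kappa: {kappa_full:.2f}, grades II+III merged: {kappa_merged:.2f}")
```

In this toy data set most disagreements fall between grades II and III, so merging those 2 grades raises kappa, which mirrors the direction of the effect reported above.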
Subgroup analysis

                            Agreement (kappa)   Lower 95% CI   Upper 95% CI
Type of training:
  Pediatric urologists            0.61              0.55           0.67
  Pediatric radiologists          0.56              0.47           0.65
  Urology residents               0.59              0.51           0.67
Grade:
  I                               0.98              0.96           1.00
  II                              0.52              0.50           0.54
  III                             0.36              0.34           0.38
  IV                              0.40              0.38           0.42
  V                               0.72              0.70           0.74
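As an illustration, the interpretation bands quoted in the Discussion (0 to 0.4 poor, 0.4 to 0.6 fair, 0.6 to 0.8 substantial, greater than 0.8 almost perfect) can be applied to the tabulated values. The bands share their end points, so the sketch below assumes that a value exactly on a boundary belongs to the higher band, except 0.8, which stays substantial because the text says "greater than 0.8". Treating negative values as poor is also an assumption. The function name is hypothetical.

```python
def interpret_kappa(kappa):
    """Label a kappa value using the bands quoted in the Discussion.

    Boundary handling (0.4 and 0.6 counted in the higher band) is an
    assumption; the source bands overlap at their end points.
    """
    if kappa > 0.8:
        return "almost perfect"
    if kappa >= 0.6:
        return "substantial"
    if kappa >= 0.4:
        return "fair"
    return "poor"


# Kappa values copied from the subgroup analysis table.
table = {
    "Pediatric urologists": 0.61,
    "Pediatric radiologists": 0.56,
    "Urology residents": 0.59,
    "Grade I": 0.98,
    "Grade II": 0.52,
    "Grade III": 0.36,
    "Grade IV": 0.40,
    "Grade V": 0.72,
}

labels = {name: interpret_kappa(value) for name, value in table.items()}
for name, label in labels.items():
    print(f"{name}: {table[name]:.2f} ({label})")
```

Grade IV sits exactly on the 0.4 boundary, so its label depends entirely on the boundary convention chosen here.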
Figure. Left VUR with variable calyceal morphology
grade V kappa 0.72) (see table). The poorer agreement on moderate grades may stem from the difficulty of judging calyceal dilation. An example of this issue is when some of the calyces are dilated and some have a normal anatomical contour. It is not clear how an observer would assign a grade in this situation (see figure). The same argument may apply to the degree of ureteral tortuosity.
Our findings differ from those published by Craig et al.9 They showed excellent agreement in the diagnosis and grading of VUR among 3 pediatric radiologists. Our study was different in several aspects. We included other specialists in the observer pool as opposed to only radiologists. Although Craig et al reported the kappa statistic for the whole group, there were no data on the agreement per individual grade as shown in our study. Their methods for calculating kappa were also different. Craig et al used 2 observer comparisons and repeated this 3 times (observer 1 vs 2, 2 vs 3 and 1 vs 3).9 We used a multiobserver matrix combining all the data in a single analysis.
To show the effects of a simplified system on interrater reliability we performed 2 simulation analyses. We first collapsed grades II and III and then grades III and IV into single categories. The kappa changed to 0.79 and 0.63, respectively. This shows that the highest variation is in differentiating grades II and III. Although simplifying the grading system may increase the interrater reliability, this is only 1 side of the coin. Any such modification should then be tested for validity regarding specific purposes such as outcome prediction and response to therapy.
We were able to show excellent intrarater agreement. The average agreement was 0.86. The intrarater agreements among different training types were close at 0.86 for pediatric urologists, 0.83 for radiologists and 0.88 for urology residents.
It is important to note that voiding cystourethrography as a diagnostic test has its own inaccuracies. For example, repeating the fill and empty cycle multiple times will result in the diagnosis of reflux more often than a single cycle VCUG.10 The assessment of these types of variability is beyond the scope of this study. Nevertheless, the quality of the study can affect the interpretation. For example, patient movements may blur the calyceal details and interfere with the visual interpretation of the rater. We tried to diminish this problem by selecting high quality images. In practice this type of technical issue may further increase the variability of grade assignment.
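The difference between repeated 2-observer comparisons and a single multiobserver analysis, as discussed above in relation to Craig et al, can be sketched as follows. The article does not specify the multiobserver statistic, so Fleiss kappa is used here purely as an assumed stand-in; it is unweighted and treats grades as nominal categories. The function names and the 6 example studies are hypothetical and are not study data.

```python
from collections import Counter
from itertools import combinations


def fleiss_kappa(ratings_by_study):
    """Fleiss kappa: one agreement statistic for all raters at once (unweighted)."""
    n_studies = len(ratings_by_study)
    n_raters = len(ratings_by_study[0])
    if n_raters < 2 or any(len(s) != n_raters for s in ratings_by_study):
        raise ValueError("every study needs the same number (>= 2) of ratings")

    totals = Counter()
    agreement_sum = 0.0
    for study in ratings_by_study:
        counts = Counter(study)
        totals.update(counts)
        # Proportion of rater pairs that agree on this study.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_studies
    p_expected = sum((t / (n_studies * n_raters)) ** 2 for t in totals.values())
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


def cohen_kappa(a, b):
    """Unweighted Cohen kappa for one pair of raters."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def mean_pairwise_kappa(ratings_by_study):
    """Average Cohen kappa over every pair of raters (1 vs 2, 1 vs 3, 2 vs 3, ...)."""
    n_raters = len(ratings_by_study[0])
    per_rater = [[study[r] for study in ratings_by_study] for r in range(n_raters)]
    kappas = [cohen_kappa(per_rater[i], per_rater[j])
              for i, j in combinations(range(n_raters), 2)]
    return sum(kappas) / len(kappas)


# Hypothetical grades for 6 studies, each graded by the same 3 raters.
studies = [[1, 1, 1], [2, 2, 3], [3, 2, 3], [3, 4, 4], [5, 5, 5], [2, 3, 3]]

multi = fleiss_kappa(studies)
pairwise = mean_pairwise_kappa(studies)
print(f"multi-rater kappa: {multi:.3f}, mean pairwise kappa: {pairwise:.3f}")
```

The 2 approaches need not give the same value on the same data, which is one reason kappa figures from studies that used different calculation methods should be compared with caution.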
CONCLUSIONS
The international grading system for vesicoureteral reflux demonstrates high intrarater reliability. The interrater reliability for moderate grades of VUR (II-IV) is low, the lowest being grade III. Simplifying the grading system may allow for higher interrater reliability and, thus, enhanced performance.
REFERENCES
1. Nagler EV, Williams G, Hodson EM et al: Interventions for primary vesicoureteric reflux. Cochrane Database Syst Rev 2011; 6: CD001532.
2. Bailey R: Vesicoureteric reflux in healthy infants and children. In: Reflux Nephropathy. Edited by J Hodson and P Kincaid-Smith. New York: Masson 1979.
3. Lebowitz RL, Olbing H, Parkkulainen KV et al: International system of radiographic grading of vesicoureteric reflux. International Reflux Study in Children. Pediatr Radiol 1985; 15: 105.
4. Heikel PE and Parkkulainen KV: Vesico-ureteric reflux in children. A classification and results of conservative treatment. Ann Radiol (Paris) 1966; 9: 37.
5. Dwoskin JY and Perlmutter AD: Vesicoureteral reflux in children: a computerized review. J Urol 1973; 109: 888.
6. Management and Screening of Primary Vesicoureteral Reflux in Children: AUA Guideline (2010). Available at www.auanet.org/content/guidelines-and-quality-care/clinical-guidelines.cfm?sub=vur2010. Accessed September 2011.
7. Streiner DL and Norman GR: Basic concepts. In: Health Measurement Scales. Oxford: Oxford University Press 2003; pp 4–13.
8. Viera AJ and Garrett JM: Understanding interobserver agreement: the kappa statistic. Fam Med 2005; 37: 360.
9. Craig JC, Irwig LM, Christie J et al: Variation in the diagnosis of vesicoureteric reflux using micturating cystourethrography. Pediatr Nephrol 1997; 11: 455.
10. Jequier S and Jequier JC: Reliability of voiding cystourethrography to detect reflux. AJR Am J Roentgenol 1989; 153: 807.