Outcomes/Epidemiology/Socioeconomics
An Operative Performance Rating System for Urology Residents

Aaron Benson, Stephen Markwell, Tobias S. Kohler and Thomas H. Tarter*

From the Department of Surgery, Southern Illinois University School of Medicine (AB, SM, TSK), Springfield, and Cancer Care Specialists of Central Illinois, S.C. (THT), Decatur, Illinois
Purpose: An operative performance rating system for urology residents was developed for 6 sentinel urological procedures. We tested the reliability, validity and feasibility of the operative performance rating system for urology residents.

Materials and Methods: The operative performance rating system for each procedure contained a 3-point case difficulty scale, 4 to 6 procedure specific items, 3 general items and an overall performance item. A Likert scale of 1 to 5 was used for each item. A single video/audio record of each procedure was evaluated by the faculty. Single item interrater agreement was measured by comparing the observed variance with the random measurement error variance. Resident operative performance evaluations were completed online. Internal consistency reliability was measured using Cronbach α. Overall scale scores by resident training postgraduate year level were compared using 1-way ANOVA.

Results: Faculty evaluation of video/audio records showed an interrater agreement range of 0.71 to 0.92. Faculty evaluations of resident operative performance demonstrated an internal consistency reliability range of 0.91 to 0.95. Significant differences in overall scale scores between postgraduate year levels were noted for 3 of the 6 procedures (p ≤0.0016).

Conclusions: An operative performance rating system for urology residents is feasible using an Internet based resident management system. Interrater agreement and internal consistency reliability meet threshold limits for checklist evaluation instruments. The operative performance rating system can discriminate among postgraduate year levels of resident training. A validated operative performance rating system can offer residents immediate, objective feedback on surgical performance and enable program directors to monitor progress in resident operative performance.

Key Words: urology; surgical procedures, operative; internship and residency; educational measurement; competency-based education

Abbreviations and Acronyms:
Cysto/Stent = cystoscopy and indwelling ureteral stent placement or replacement
OPRS = operative performance rating system
OSATS = Objective Structured Assessment of Technical Skill
PGY = resident training postgraduate year
RRC = Urology Residency Review Committee
Scope/Stone = ureteroscopic removal of distal ureteral calculus
TRUSP-Bx = transrectal ultrasound guided biopsy of prostate
TURBT = transurethral resection of bladder tumor
TURP = transurethral resection of prostate

Submitted for publication March 27, 2012. Supplementary material for this article can be obtained at http://www.siumed.edu/surgery/urology/oprs.html.
* Correspondence: Cancer Care Specialists of Central Illinois, S.C., 210 W. McKinley Ave., Suite 1, Decatur, Illinois 62526.
Currently, urology resident proficiency in the operating room is usually based on case logs and end of rotation evaluations (global assessments) by attending urologists.1 Case logs are intended to measure the amount of resident operative experience but they do not provide an assessment of resident operative performance. End of rotation evaluations are limited by several factors, including 1) the provision of a generalized evaluation of overall resident clinical performance rather than individual evaluations of specific skills; 2) the absence of instruments to measure performance in specific operative procedures; 3) reliance on evaluator memory, which may be subject to the errors of memory loss, selective recall and recent history (recent events are more apt to
influence the evaluation); and 4) lack of specific, immediate and objective feedback to residents. We developed and tested the reliability, validity and feasibility of an OPRS for urology residents.
MATERIALS AND METHODS

An OPRS was developed for each of 6 sentinel urological operative procedures: 1) TURP, 2) TURBT, 3) scrotal surgery, 4) TRUSP-Bx, 5) Cysto/Stent and 6) Scope/Stone. The 6 procedures were selected from a review of resident case logs from July 2007 through June 2008 as those performed by the majority of the urology faculty, to decrease the variation in rater stringency or leniency. The OPRS for each procedure included 4 to 6 procedure specific items, 3 general items and an overall performance item. A Likert rating scale of 1 to 5, with 5 representing excellent, was used for each procedure specific and general item. Item specific descriptive anchors were included for points on each scale (5 = excellent, 3 = good and 1 = poor). Procedure specific items for each OPRS instrument were developed from the literature to identify the technical aspects of a procedure that can affect patient outcomes, and were refined using faculty focus group discussion.

One video/audio record of each procedure was produced and reviewed under ideal conditions by each faculty member. Evaluation forms were completed to measure internal consistency reliability using Cronbach α. Interrater agreement was determined using the formula r = 1 − (S²/σ_E²), where S² represents the observed variance and σ_E² the random measurement error variance.2,3 For a uniform null distribution on a 5-point scale, σ_E² = (5² − 1)/12 = 2.

The OPRS instrument for each of the 6 sentinel procedures was incorporated into Internet based resident management software. After a sentinel procedure was performed, the responsible faculty member received an e-mail invitation to complete an evaluation. After the evaluation was completed, the faculty member received 0.5 faculty incentive credit points through the faculty incentive program developed and used in the Department of Surgery at Southern Illinois University.4

The internal consistency reliability of completed resident evaluations was determined using Cronbach α. Correlations of the overall scale score with the overall performance item score, the procedure specific item score and the general item score were measured using Pearson product-moment correlations. Overall scale scores were compared between PGY levels using 1-way ANOVA.
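For concreteness, the two reliability statistics used here can be computed directly. The following is a minimal sketch, not the study's analysis code; the function names, array layout and example ratings are illustrative assumptions.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_observations x n_items) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)      # variance of each item
    total_variance = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

def interrater_agreement(item_ratings, scale_points=5):
    """Single item agreement r = 1 - S^2 / sigma_E^2 (references 2 and 3).
    sigma_E^2 is the variance of a uniform null distribution on an A-point
    scale, (A^2 - 1) / 12, which equals 2.0 for a 5-point Likert scale."""
    s2 = np.var(item_ratings, ddof=1)                 # observed variance across raters
    sigma_e2 = (scale_points ** 2 - 1) / 12
    return 1 - s2 / sigma_e2

# Hypothetical example: 3 raters scoring a 5-item form, and 6 raters
# scoring one item of one video record (not the study data).
form_ratings = [[3, 4, 4, 3, 4],
                [4, 4, 5, 4, 4],
                [3, 3, 4, 3, 4]]
print(round(cronbach_alpha(form_ratings), 2))             # 0.94
print(round(interrater_agreement([4, 4, 3, 4, 5, 4]), 2)) # 0.8
```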
RESULTS

One of each of the 6 sentinel procedures was video recorded. Each video recorded procedure was evaluated by 5 or 6 faculty members using the appropriate OPRS instrument. Internal consistency reliability was 0.75 for Cysto/Stent, 0.90 for TRUSP-Bx, 0.88 for TURBT, 0.95 for Scope/Stone, 0.56 for TURP and 0.77 for scrotal surgery. The overall scale score to individual item correlation was 0.97 for Cysto/Stent, 0.80 for TRUSP-Bx, 0.71 for TURBT, 0.98 for Scope/Stone, 0.46 for TURP and 0.65 for scrotal surgery. Interrater agreement ranged from 0.72 for Cysto/Stent to 0.92 for scrotal surgery (see table).

On the faculty evaluations of resident operative performance, internal consistency reliability was 0.92 for Cysto/Stent, 0.92 for TRUSP-Bx, 0.91 for TURBT, 0.95 for Scope/Stone, 0.92 for TURP and 0.95 for scrotal surgery. The overall performance item-to-overall scale score correlation was 0.91 for Cysto/Stent, 0.94 for TRUSP-Bx, 0.91 for TURBT, 0.97 for Scope/Stone, 0.94 for TURP and 0.95 for scrotal surgery (see table).

Increasing PGY training level was associated with significantly higher overall scale scores for Cysto/Stent (PGY 1 <2, 3 and 4; p = 0.0001), TRUSP-Bx (PGY 1 and 2 <3 and 5; p = 0.0002) and TURBT (PGY 2 <3 and 5; p = 0.0016). There were no significant differences in overall performance between PGY levels for Scope/Stone, TURP or scrotal surgery, the procedures with the fewest completed faculty evaluations. The figure shows the difference in mean overall performance rating by PGY level for the 6 sentinel procedures.

[Figure. Mean overall scale scores compared by PGY level (PGY 1 to 5; rating scale 1.00 to 5.00) using ANOVA: Cysto/Stent p = 0.0001, TRUSP-Bx p = 0.0002, TURBT p = 0.0016, Scope/Stone p = 0.1759, TURP p = 0.1975, scrotal surgery p = 0.2103.]
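As a sketch of the construct validity comparison, a one-way ANOVA across PGY groups can be run as below. The group scores are hypothetical illustrations, not the study data.

```python
from scipy import stats

# Hypothetical overall scale scores grouped by PGY level (illustration only).
scores_by_pgy = {
    "PGY1": [2.4, 2.8, 3.0, 2.6],
    "PGY2": [3.1, 3.4, 2.9, 3.3],
    "PGY3": [3.8, 4.0, 3.6, 4.1],
    "PGY4": [4.2, 4.4, 4.0, 4.3],
}

# One-way ANOVA: do mean overall scale scores differ across PGY levels?
f_statistic, p_value = stats.f_oneway(*scores_by_pgy.values())
print(f"F = {f_statistic:.2f}, p = {p_value:.4f}")
```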
DISCUSSION

In 1999 the Accreditation Council for Graduate Medical Education (ACGME) endorsed 6 general competencies for resident training. To help programs assess the specific competencies expected of residents, in 2000 the ACGME Outcomes Project and the American Board of Medical Specialties published the Toolbox of Assessment Methods,5 which suggests the best methods of evaluating the specific required skills for each competency. With respect to the patient care competency, MacNeily et al reported that the largest share of urology resident time (28%) is spent in the operating room.6 A national survey of urology program directors regarding ACGME regulations identified inadequate validated instrumentation as a barrier to implementing the core competencies.1 As suggested in the ACGME/American Board of Medical Specialties toolbox,5 the best methods of evaluating medical procedures in the patient care competency are 1) simulations and models, and 2) checklists.

In 1994 Winckel et al reported the results of an intraoperative checklist instrument known as the structured technical skills assessment form, which showed high interrater reliability and significant differences between junior and senior resident performance, an indication of construct validity.7
Table. Faculty evaluations of a single video/audio record of each procedure and of resident operative performance

Video/audio record:
Procedure        No. raters  Overall scale score (mean ± SD)  Cronbach α  Performance item (mean ± SD)  Pearson correlation with overall scale score  Interrater agreement  No. ratings 2/3/4/5*
Cysto/Stent      6           3.19 ± 0.49                      0.75        2.83 ± 0.75                   0.97                                          0.72                  2/3†/1/—
TRUSP-Bx         6           3.78 ± 0.66                      0.90        3.50 ± 0.55                   0.80                                          0.85                  —/3/3/—
TURBT            5           4.27 ± 0.45                      0.88        4.00 ± 0.71                   0.71                                          0.80                  —/1/3†/1
Scope/Stone      6           3.25 ± 0.60                      0.95        3.00 ± 0.63                   0.98                                          0.80                  1/4†/1/—
TURP             5           3.86 ± 0.32                      0.56        3.60 ± 0.55                   0.46                                          0.88                  —/2/3†/—
Scrotal surgery  6           3.88 ± 0.47                      0.77        3.83 ± 0.41                   0.65                                          0.92                  —/1/5†/—

Operative performance:
Procedure        No. evaluations  Overall scale score (mean ± SD)  Cronbach α  Pearson correlations with overall scale score (procedure specific/general/overall performance items)
Cysto/Stent      42               3.79 ± 0.72                      0.92        0.97/0.94/0.91
TRUSP-Bx         33               3.55 ± 0.61                      0.92        0.97/0.96/0.94
TURBT            38               3.88 ± 0.69                      0.91        0.98/0.94/0.91
Scope/Stone      17               4.32 ± 0.72                      0.95        0.96/0.94/0.97
TURP             16               3.86 ± 0.76                      0.92        0.99/0.95/0.94
Scrotal surgery  29               4.17 ± 0.74                      0.95        0.96/0.96/0.95

* Counts of overall performance ratings of 2, 3, 4 and 5; no procedure had a rating of 1.
† Plurality rating (a plurality occurred in 5 of 6 procedures). Interrater agreement was measured using r = 1 − (S²/σ_E²).
Subsequently, Martin et al developed OSATS for application in a surgical skills laboratory setting using bench models and living animals.8 OSATS contains procedure specific checklists and a global rating scale, and showed high reliability and construct validity. OSATS was applied simultaneously at the University of Southern California and Northwestern University, demonstrating the feasibility of central administration and delivery to multiple sites.9 Global rating scales have high interprocedure reliability and construct validity. Although procedure specific checklists do not improve reliability or validity, results of a study of OSATS applied to carotid endarterectomy indicated that procedure specific checklists are more useful for residents while global ratings are more useful for experienced surgeons.10,11 OSATS has been applied to gynecologic procedures, and the European Board of Vascular Surgery has incorporated OSATS into its fellowship examination.12,13

Since the development of OSATS in general surgery and its application to other surgical fields, various assessment instruments have been developed for anatomical models, cadavers, computer simulators and video recordings of operations. Although simulators, computer animated devices and video recordings can provide a valuable adjunct for assessing resident surgical skill, they are not uniformly applied across programs, can be expensive and do not evaluate operating room performance. Also, feasibility can be limited by demands on faculty time.
In 2005 Larson et al reported the results of an assessment tool, the OPRS, which was developed at our institution to evaluate resident performance in the operating room.14 Instruments developed for 6 general surgery operations included 4 to 6 procedure specific items with a 5-point Likert rating scale and 4 general items modified from the OSATS global rating scale, including tissue handling, time and motion, flow and overall performance. Internal consistency reliability ranged from 0.70 to 0.95, and construct validity was noted since ratings improved in 5 of 6 procedures with increasing resident training level.14 Subsequent studies of the OPRS emphasized the need for multiple resident performance evaluations by multiple evaluators to correct for differences in rater stringency or leniency and for the degree of rater involvement in the operative procedure.15,16

Operative performance assessment tools have since been developed for general surgery,17 obstetrics and gynecology,18 dentistry,19 gastroenterology,20 otolaryngology,21,22 vascular surgery23 and urology.24 Unlike the OPRS and the laboratory based OSATS, most assessment tools are specific to a single operative procedure. In urology Maizels et al were the first to report an objective assessment of resident performance, for pediatric orchiopexy.24 The operation was taught to residents using computer enhanced visual learning in 11 separate steps or skills.
Faculty surgeons then rated the residents on each skill using a 5-point Likert scale, and the raw score was multiplied by a case difficulty score of 1 to 5. The performance evaluation was reviewed with the resident after each orchiopexy. During the 22-month study period, covering 166 orchiopexies, the average weighted score increased more than 50% to best performance (p <0.0001). Of the 24 residents who performed more than a single orchiopexy, 23 (96%) showed improvement over the entry score, 14 (58%) showed 50% improvement and 8 (33%) showed 100% improvement.
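The weighted scoring arithmetic can be sketched as follows. How Maizels et al combined the 11 skill ratings before weighting is not stated here, so the averaging step in this illustration is an assumption.

```python
def weighted_orchiopexy_score(skill_ratings, case_difficulty):
    """Weighted score in the style of Maizels et al (reference 24):
    per-skill Likert ratings (1 to 5) are combined, then multiplied by a
    case difficulty factor of 1 to 5. Averaging the skill ratings is an
    assumption of this sketch, not a detail given in the text."""
    assert 1 <= case_difficulty <= 5
    mean_rating = sum(skill_ratings) / len(skill_ratings)
    return mean_rating * case_difficulty

# 11 skills rated on a 5-point scale for one moderately difficult case.
ratings = [4, 3, 5, 4, 4, 3, 4, 5, 4, 3, 4]
print(round(weighted_orchiopexy_score(ratings, case_difficulty=3), 2))  # 11.73
```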
We developed an OPRS for urology residents using a sentinel procedure format according to previously published methods.14 The 6 procedures selected were those most frequently performed by the majority of the faculty, to decrease the variation in rater stringency or leniency. Three general items modified from the OSATS global rating scale (respect for tissue, time and motion, and operation flow) were included in each OPRS instrument. The procedure specific items were developed from the literature to identify the critical aspects of a procedure that can affect patient outcomes, and were refined using faculty focus group discussion. The number of procedure specific items was limited to 4 to 6 because increasing the number of rating items has little effect on reliability, whereas increasing the number of evaluations has a greater effect.25
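One standard way to see why additional evaluations help reliability more than additional items per form is the Spearman-Brown prediction formula for averaged parallel observations. This sketch is illustrative and is not claimed to be the analysis in reference 25.

```python
def spearman_brown(r_single, n):
    """Predicted reliability when averaging n parallel observations,
    each with single-observation reliability r_single."""
    return n * r_single / (1 + (n - 1) * r_single)

# Reliability grows quickly with the number of independent evaluations:
for n in (1, 2, 4, 8):
    print(n, round(spearman_brown(0.5, n), 2))  # 0.5, 0.67, 0.8, 0.89
```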
A single video/audio record of each procedure was produced and evaluated by 5 or 6 faculty raters under ideal conditions to measure internal consistency reliability and interrater agreement. The table shows an internal consistency reliability of Cronbach α ≥0.75 for each procedure except TURP. Correlation of the overall scale score with the overall performance score was ≥0.71 for every procedure except TURP and scrotal surgery. The distribution of overall performance scores showed a plurality in each procedure except prostate biopsy. Interrater agreement ranged from 0.72 to 0.92.

The faculty completed a total of 175 resident operative performance evaluations online using the resident management software. Internal consistency reliability was excellent (range 0.91 to 0.95), indicating a high intercorrelation of the items of each instrument. Correlations between overall scale scores and procedure specific item scores, general item scores and overall performance item scores were similarly high, another indication of high reliability.

To be valid, an assessment instrument must show that it measures an anticipated outcome. As a measure of construct validity, we compared overall scale scores by resident training level. Significant differences were observed between PGY levels for cystoscopy and ureteral stent placement (p = 0.0001), prostate biopsy (p = 0.0002) and TURBT (p = 0.0016). The other 3 procedures had the fewest completed evaluations; although trends were demonstrated, they did not achieve statistical significance.

A limitation of our study was the small number of faculty evaluations completed for some operative procedures during the 1-year study period. Another limitation was not including an estimate of the amount of faculty involvement on the assessment forms; the importance of estimating faculty involvement was new information published after our study data were compiled.16

We report in this pilot study that an OPRS can be developed for urology residents that is reliable, valid and feasible. Feasibility is improved with a system of e-mail invitations to the faculty to complete evaluations online, which takes minutes.
We believe that a faculty incentive program can improve faculty compliance.4 To further improve compliance, a smartphone application for resident operative evaluations has been developed at Southern Illinois University.

The next logical step in the development of an OPRS for urology residents is a multi-institutional trial. At least 2 OPRS instruments could be developed, with the help of expert consultants, in each of the 6 specialty domains defined by the RRC. If enough instruments are developed, program directors can tailor the OPRS to their particular programs so as to have multiple raters for each procedure.

An OPRS for urology residents can have significant advantages for residents, faculty, program directors, the RRC of the ACGME and the American Board of Urology. Residents can be provided with immediate, objective, specific feedback on surgical performance using reliable, valid instruments. As in the study by Maizels et al,24 faculty could review with their residents the assessment of the procedure specific skills that impact outcomes. Program directors would be able to identify specific strengths or deficiencies in resident operative skill. Results of the general surgery OPRS indicate that the real value to program directors is monitoring progress and identifying outliers. The advantage to the RRC is objective evidence of surgical skill monitoring. Furthermore, with enough multi-institutional data, the RRC could more accurately determine the number of particular cases that residents must perform to demonstrate competence. The advantage to the American Board of Urology is demonstrating to the public that urology resident operative skill is objectively monitored and that graduating urological surgeons have demonstrated competence in the operating room.
CONCLUSIONS

Results show that an OPRS for urology residents can be developed that is reliable, valid and feasible. An OPRS for urology residents can have benefits for residents, faculty, program directors and other stakeholders in urology resident operative skill.
REFERENCES

1. Joyner BD, Siedel K, Stoll D et al: Report of the national survey of urology program directors: attitudes and actions regarding the Accreditation Council on Medical Education regulations. J Urol 2005; 174: 1961.
2. James LR, Demaree RG and Wolf G: rwg: an assessment of within-group interrater agreement. J Appl Psychol 1993; 78: 306.
3. Liao SC, Hunt EA and Chen W: Comparison of inter-rater reliability and inter-rater agreement in performance assessment. Ann Acad Med Singapore 2010; 39: 613.
4. Williams RG, Dunnington GL and Folse JR: The impact of a program for systematically recognizing and rewarding academic performance. Acad Med 2003; 78: 156.
5. Toolbox of Assessment Methods: A Product of the Joint Initiative, Version 1.1. ACGME Outcomes Project. Accreditation Council for Graduate Medical Education and American Board of Medical Specialties, September 2000. Available at http://www.chd.ubc.ca/files/file/instructor-resources/Evaluationtoolbox.pdf. Accessed January 15, 2009.
6. MacNeily AE, Nguan C, Haden K et al: Implementation of a PDA based program to quantify resident in-training experience. Can J Urol 2003; 10: 1885.
7. Winckel CP, Reznick RK, Cohen R et al: Reliability and construct validity of a structured technical skills assessment form. Am J Surg 1994; 167: 423.
8. Martin JA, Regehr G, Reznick R et al: Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997; 84: 273.
9. Ault G, Reznick R, MacRae H et al: Exporting a technical skills evaluation technology to other sites. Am J Surg 2001; 182: 254.
10. Regehr G, MacRae H, Reznick RK et al: Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998; 73: 993.
11. Beard JD, Choksy S and Khan S: Assessment of operative competence during carotid endarterectomy. Vascular Society of Great Britain and Ireland. Br J Surg 2007; 94: 726.
12. Goff BA, Lentz GM, Lee D et al: Development of an objective structured assessment of technical skills for obstetric and gynecology residents. Obstet Gynecol 2000; 96: 146.
13. Memon MA: Assessing the surgeon's technical skills: analysis of the available tools. Acad Med 2010; 85: 869.
14. Larson JL, Williams RG, Ketchum J et al: Feasibility, reliability, and validity of an operative performance rating system for evaluating surgery residents. Surgery 2005; 138: 640.
15. Kim MJ, Williams RG, Boehler ML et al: Refining the evaluation of operating room performance. J Surg Educ 2009; 66: 352.
16. Chen XP, Williams RG, Sanfey HA et al: How do supervising surgeons evaluate guidance provided in the operating room? Am J Surg 2012; 203: 44.
17. Wohaibi EM, Earle DB, Ansanitis FE et al: A new web-based operative skills assessment tool effectively tracks progression in surgical residents' performance. J Surg Educ 2007; 64: 333.
18. Chou B, Bowen CW, Handa VW: Evaluating the competency of gynecology residents in the operating room: validation of a new assessment tool. Am J Obstet Gynecol 2008; 199: 571.e1.
19. Evans AW, Aghabeigi B, Leeson RM et al: Assessment of surgeon competency to remove mandibular third molar teeth. Int J Oral Maxillofac Surg 2002; 31: 434.
20. Vassiliou MC, Kaneva PA, Poulose BK et al: Global Assessment of Gastrointestinal Endoscopic Skills (GAGES): a valid measurement tool for technical skills in flexible endoscopy. Surg Endosc 2010; 24: 1834.
21. Stack BC, Siegel E, Bodenner D et al: A study of resident proficiency with thyroid surgery: creation of a thyroid-specific tool. Otolaryngol Head Neck Surg 2010; 142: 856.
22. Francis HW, Masood H, Laeeq K et al: Defining milestones toward competency in mastoidectomy using a skills assessment paradigm. Laryngoscope 2010; 120: 1417.
23. Doyle JD, Webber EM and Sidhu RS: A universal global rating scale for the evaluation of technical skills in the operating room. Am J Surg 2007; 193: 551.
24. Maizels M, Yerkes EB, Macejko A et al: A new computer enhanced visual learning method to train urology residents in pediatric orchiopexy: a prototype for Accreditation Council for Graduate Medical Education documentation. J Urol 2008; 180: 1814.
25. Williams RG, Verhulst S, Colliver JA et al: Assuring the reliability of resident performance appraisals: more items or more observations? Surgery 2005; 137: 141.