The American Journal of Surgery 193 (2007) 233–236
Association for Surgical Education
Is it live or is it Memorex? Student oral examinations and the use of video for additional scoring
Kenneth Burchard, M.D.a,b,*, Horace Henriques, M.D.a,b, Daniel Walsh, M.D.a,b, Debra Ludingtonb, Pamela A. Rowland, Ph.D.a, Donald S. Likosky, Ph.D.a,b
a Dartmouth Medical School, Hanover, NH, USA
b Department of Surgery, Dartmouth-Hitchcock Medical Center, One Medical Center Dr., Lebanon, NH 03756, USA
* Corresponding author. Tel.: +1-603-650-7903; fax: +1-603-650-8030. E-mail address: [email protected]
Manuscript received April 4, 2006; revised manuscript October 6, 2006
doi:10.1016/j.amjsurg.2006.10.011
Abstract
Background: Oral examination interrater consistency has been questioned, supporting the use of at least paired examiners and consensus grading. The scheduling flexibility of video recording allows more examiners to score performances. The purpose of this study was to compare live performance scores with video performance scores to assess interrater differences and the effect on grading.
Methods: A total of 283 consecutive, structured, videotaped 30-minute examinations were reviewed. A 5-point Likert scale ranked problem solving (2 cases), verbal skills, and nonverbal skills. Paired parametric and nonparametric analyses tested for differences.
Results: Live performance scores were higher for verbal skills, nonverbal skills, and total scores. Video performance scores were higher for problem solving for the first presented case. The largest difference (.29 Likert point) was in nonverbal skills.
Conclusions: The minor yet statistically significant differences in several scores did not actually affect student grades. The use of video recording is sufficiently reliable to be continued and advocated. © 2007 Excerpta Medica Inc. All rights reserved.
Keywords: Oral examination; Interrater reliability; Clerkship; Student; Competency; Video recording
In 2003, the Accreditation Council for Graduate Medical Education outlined general competencies that residents must achieve during training and that are assessed in the accreditation process. This has influenced medical school faculty, and several schools have developed similar sets of capabilities that all medical students should acquire (Table 1). The oral examination is a tool that effectively measures such competencies as clinical problem solving, verbal communication, and nonverbal communication [1]. However, throughout the 20th century the interrater consistency of oral examination scoring has been questioned, especially when examinees are performing poorly [2–7]. Because interrater difficulties may exist, a more reliable method is to use multiple examiners, similar to the format used by the American Board of Surgery. Unfortunately, the size of a medical school class, the frequency of such examinations (at least several times a year), and the constraints of
surgical practice make a process like that used by the Board of Surgery impractical for medical school faculty. Video recording has been used to allow multiple raters to rank performances in a variety of evaluation circumstances, such as physical examinations, Objective Structured Clinical Examinations, and oral examinations [8–10]. For the past decade, the evaluation and grading of students at Dartmouth Medical School have included an oral examination that has been scored by a live reviewer (LR) as well as a video reviewer (VR). The purpose of this study was to compare live with video oral examination performance scores to determine whether this evaluation method is indeed consistent between raters, and to determine whether any features of the examination process demand improvement.

Methods

Clerkship grading at Dartmouth Medical School uses a numeric scoring system that ranks ward performance (60% of the grade; maximum score, 45), written examination (20% of the grade; maximum score, 15), and an oral
examination (20% of the grade; maximum score, 15). All scores are rounded to the nearest whole number. The oral examination is a structured, 30-minute encounter in which 2 case scenarios are presented: the first is based on scheduled, faculty-led case studies; the second is based on the medical student's patient log. A 5-point Likert scale was used to rank 4 domains: problem solving for each case (2 domains), and verbal and nonverbal skills (1 domain each) (Table 2). The subcomponents of each of these domains (ie, diagnostic approach, management approach, and so forth) were averaged; therefore, the maximum score for any examination was 20. For grading purposes the domain scores for case 1 and case 2 were averaged, resulting in a maximum grade score of 15. This being understood, we chose to separate the case 1 and case 2 average domain scores for the comparisons described later.

A total of 283 consecutive examinations were accrued over 5 years (2001–2005). Four faculty rated performances, with one (K.B.) attending most live evaluations (84%) and a second (D.W.) evaluating most of the video performances (74%). The mean values of the individual domain scores were compared with paired t tests, and the median values were compared with the Wilcoxon signed-rank test to evaluate the equality of the matched pairs [11]. The differences between scores (VR − LR) were calculated for graphic illustration.

Results

All values are listed as means/medians unless otherwise stated.

Problem solving for case 1

The VR score was higher for problem solving for case 1 (3.75/3.70 vs 3.65/3.66; P = .005). Differences were principally within 1 point (Fig. 1).

Problem solving for case 2

There was no difference in problem solving for case 2 (3.83/4.00 vs 3.83/4.00; P = .608). Differences were principally within 1 point (Fig. 2).
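To make the paired comparison concrete, the following sketch shows how the analysis described in the Methods (paired t test on means, Wilcoxon signed-rank test on medians, and VR − LR differences) could be carried out. It is an illustration only: the score arrays are hypothetical, the scipy and numpy libraries are assumed, and this is not the authors' actual code.

```python
# Illustrative sketch of the paired analysis described in Methods.
# The score arrays are hypothetical; scipy/numpy are assumed to be available.
import numpy as np
from scipy import stats

live_scores = np.array([3.50, 4.00, 3.70, 4.20, 3.90, 4.50, 3.30, 4.10])   # LR domain scores (hypothetical)
video_scores = np.array([3.80, 4.25, 3.55, 4.30, 4.12, 4.45, 3.65, 3.90])  # VR domain scores (hypothetical)

# Paired t test compares the means of the matched scores.
t_stat, t_p = stats.ttest_rel(video_scores, live_scores)

# Wilcoxon signed-rank test compares the matched pairs nonparametrically.
w_stat, w_p = stats.wilcoxon(video_scores, live_scores)

# VR - LR differences, as plotted in Figs. 1-4.
differences = video_scores - live_scores
within_one_point = np.mean(np.abs(differences) <= 1)

print(f"paired t test: P = {t_p:.3f}")
print(f"Wilcoxon signed-rank: P = {w_p:.3f}")
print(f"fraction of differences within 1 point: {within_one_point:.2f}")
```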
Table 2
Oral examination scoring

I. Problem solving (cases 1 and 2)
A. Knowledge about disease process (Poor 1 2 3 4 5 Excellent)
B. Diagnostic approach to disease process (1 2 3 4 5)
C. Management approach to disease process (1 2 3 4 5)
D. Average score for the 3 components _______

II. Verbal communication skills
A. Speech disturbances ("and ums," "you know") (Frequent 1 2 3 4 5 Rare)
B. Hesitation (1 2 3 4 5)
C. Level of vocabulary (Poor 1 2 3 4 5 Excellent)
Average score for the 3 components _______

III. Nonverbal communication
A. Direct eye contact (Poor 1 2 3 4 5 Excellent)
B. Professional dress (1 2 3 4 5)
C. Distracting behaviors (object manipulation; rocking; leg, arm, and hand movements) (Frequent 1 2 3 4 5 Rare)
D. Engagement in conversation (attentive, listens, takes turns) (Poor 1 2 3 4 5 Excellent)
Average score for the 4 components _______

Total score: (case 1 score + case 2 score)/2 + verbal score + nonverbal score
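As an illustration of the scoring arithmetic in Table 2 and the Methods, a minimal sketch follows. The subcomponent ratings are hypothetical, and the code is not part of the original study; it simply shows the averaging of subcomponents and the 20-point examination versus 15-point grade totals described above.

```python
# Illustrative sketch of the Table 2 scoring arithmetic (hypothetical ratings).
def domain_score(ratings):
    """Average the 1-5 Likert ratings for one domain's subcomponents."""
    return sum(ratings) / len(ratings)

case1 = domain_score([4, 3, 4])         # knowledge, diagnostic approach, management approach (case 1)
case2 = domain_score([4, 4, 5])         # knowledge, diagnostic approach, management approach (case 2)
verbal = domain_score([4, 4, 5])        # speech disturbances, hesitation, vocabulary
nonverbal = domain_score([5, 5, 4, 4])  # eye contact, dress, distracting behaviors, engagement

# Maximum possible examination score is 20 (four domains scored 1-5).
exam_total = case1 + case2 + verbal + nonverbal

# For grading, the two case scores are averaged, so the maximum grade score is 15.
grade_total = (case1 + case2) / 2 + verbal + nonverbal

print(f"examination total (max 20): {exam_total:.2f}")
print(f"grade score (max 15): {grade_total:.2f}")
```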
Verbal skills

The LR score was higher for verbal skills (4.17/4.25 vs 3.99/4.00; P < .001). Differences were principally within 1 point (Fig. 3).
Table 1
General competencies for medical students

Brown University Medical School
Effective communication
Basic clinical skills
Using basic science to guide therapy
Diagnosis, management, and prevention
Lifelong learning
Self-awareness, self-care, and personal growth
The social and community context of health care
Moral reasoning and ethical judgment
Problem solving

Dartmouth Medical School
Medical knowledge
Clinical skills
Communication skills
Professionalism
Practice-based learning and improvement
Systems-based practice
Fig. 1. Differences between VR and LR scores for case 1. The x-axis shows the difference of the individual scores. The y-axis shows the fraction of scores that were in this difference range. Most of the differences were within 1 point, although differences larger than 1 point were more frequent when the video score was higher.
Fig. 2. Differences between VR and LR scores for case 2. The x-axis shows the difference of the individual scores. The y-axis shows the fraction of scores that were in this difference range. Most of the differences were within 1 point.
Fig. 4. Differences between VR and LR scores for nonverbal skills. The x-axis shows the difference of the individual scores. The y-axis shows the fraction of scores that were in this difference range. Most of the differences were within 1 point, but the LR score was more frequently the higher of the two.
Nonverbal skills

The LR score was higher for nonverbal skills (4.37/4.66 vs 4.06/4.00; P < .001). Differences were principally within 1 point (Fig. 4).
Total scores

The LR score was higher for total scores (15.86/16.00 vs 15.22/15.90; P < .001). In a review of the scoring data, we did not find any instance in which using only the LR or VR score, rather than the average of the 2 scores, would have changed the surgical grade for a student.

Fig. 3. Differences between VR and LR scores for verbal skills. The x-axis shows the difference of the individual scores. The y-axis shows the fraction of scores that were in this difference range. Most of the differences were within 1 point.

Comments

Attention to the uniqueness of an oral examination for the evaluation of trainees in medicine has fostered the evolution of a tool that once focused on factual recall into one that now emphasizes clinical problem-solving skills [2,12–14]. In addition, growing attention to the structure of this tool has resulted in decreased variability in the clinical problems, the instruction of examiners, and the evaluation instruments [5,14–17]. Despite these improvements, concerns about the validity and reliability of the oral examination as an assessment tool continue to be addressed, especially when such an instrument is used for summative evaluation and/or board certification [5,6,18].

Interrater consistency is enhanced when multiple examiners simultaneously evaluate a performance [19]. The American Board of Surgery Oral Examination uses 2 trained examiners for each of three 30-minute structured examinations (total, 6 examiners), a technique in keeping with the stature of the board-certification process. However, constraints on medical school faculty make scheduling multiple examiners for frequent live encounters impractical. Therefore, strategies that allow more than 1 examiner to score an examination, yet impose fewer scheduling constraints, are practical to consider, as long as interrater performance is not influenced markedly by the method used.

The present study showed that there was little actual summative difference in the ranking of the oral examination when one examiner viewed the performance live and another reviewed a video of the examination. This was especially evident in the median scores for all of the components. Only the difference in the nonverbal communication skill ranking (8%) was noticeable. This study principally compared the evaluations of 2 faculty members (K.B. and D.W.). One interpretation of this result is that these 2 particular surgeons rank students very similarly, regardless of the venue. Another interpretation is that the 2 venues obscured differences that would have been evident if both examiners had ranked either the live or the video performance. A third interpretation is that the rankings represent truly independent evaluations by experienced examiners and that similar results would be expected from any similarly experienced faculty. Only further research might distinguish among these and other possible interpretations.

Additional research could determine whether nonverbal scoring parameters should be even more structured. For instance, should a male in a dark suit with a bland tie, white shirt, well-shined shoes, and no name tag be the expectation for a
professional dress score of 5? Should the addition of a name tag knock the score down to 4? It is likely that more thorough guidelines for scoring such nonverbal features will reduce interrater differences.

What about the poor performers? Our analysis did not allow particular attention to students with lower scores, and it is possible that interrater discrepancies would have been larger for this subgroup. This also demands further study.

What about using the video examination as an educational tool? The Department of Surgery offers each student the opportunity to review the video performance, but not as a clerkship requirement. Such reviews have been shown to be effective [8]. Unfortunately, too few students have sought this experience for us to comment, suggesting that a required review would be necessary to have any educational impact.

As we review the results of this study from a practical, student-grading standpoint, 2 commonly quoted statements come to mind:

"There are three kinds of lies: lies, damned lies, and statistics." Attributed to Disraeli in Mark Twain, Autobiography, 1924 [20].

"For a difference to be a difference, it has to make a difference." Usually attributed to Gertrude Stein, without documentation.
Despite the statistically significant differences between the examiners, the overall effect did not impact the grading process. We conclude that the scoring of video recordings of students' oral examinations is associated with sufficient interrater similarity to be continued at Dartmouth Medical School as a fair and flexible assessment tool.

References

[1] Rowland-Morin PA, Burchard KW, Garb JL, et al. Influence of effective communication by surgery students on their oral examination scores. Acad Med 1991;66:169–71.
[2] Barnes EJ, Pressey SL. Educational research and statistics. School and Society 1929;30:719–22.
[3] Levine HG, McGuire CH. The validity and reliability of oral examinations in assessing cognitive skills in medicine. J Educ Meas 1970;7:63–74.
[4] Kelley PR, Matthews JH, Schumacher CF. Analysis of the oral examination of the American Board of Anesthesiology. J Med Educ 1971;46:982–8.
[5] Wade TP, Andrus CH, Kaminski DL. Evaluations of surgery resident performance correlate with success in board examinations. Surgery 1993;113:644–8.
[6] Burchard KW, Rowland-Morin PA, Coe NP, et al. A surgery oral examination: interrater agreement and the influence of rater characteristics. Acad Med 1995;70:1044–6.
[7] Colton T, Peterson OL. An assay of medical students' abilities by oral examination. J Med Educ 1967;42:1005–14.
[8] Paul S, Dawson KP, Lamphere JH, et al. Video recording feedback: a feasible and effective approach to teaching history-taking and physical examination skills in undergraduate paediatric medicine. Med Educ 1998;32:332–6.
[9] Burchard KW, Rowland PA, Berman NB, et al. Clerkship enhancement of interpersonal skills. Am J Surg 2005;189:643–6.
[10] Kozol R, Giles M, Voytovich A. The value of videotape in mock oral board examinations. Curr Surg 2004;61:511–4.
[11] Wilcoxon F. Individual comparisons by ranking methods. Biometrics 1945;1:80–3.
[12] McGuire CH. The oral examination as a measure of professional competence. J Med Educ 1966;41:267–74.
[13] Evans LR, Ingersoll RW, Smith EJ. The reliability, validity, and taxonomic structure of the oral examination. J Med Educ 1966;41:651–7.
[14] Anastakis D, Cohen R, Reznick RK. The structured oral examination as a method for assessing surgical residents. Am J Surg 1991;162:67–70.
[15] Quattlebaum TG, Darden PM, Sperry JB. In-training examinations as predictors of resident clinical performance. Pediatrics 1989;84:165–72.
[16] Allison R, Katona C. Audit of oral examinations in psychiatry. Med Teach 1992;14:383–9.
[17] Schwiebert P, Davis A. Increasing inter-rater agreement on a family medicine clerkship oral examination—a pilot study. Fam Med 1993;25:182–5.
[18] Schubert A, Tetzlaff JE, Tan M, et al. Consistency, inter-rater reliability, and validity of 441 consecutive mock oral examinations in anesthesiology: implications for use as a tool for assessment of residents. Anesthesiology 1999;91:288–98.
[19] Daelmans HE, Scherpbier AJ, Van Der Vleuten CP, et al. Reliability of clinical oral examinations re-examined. Med Teach 2001;23:422–4.
[20] Twain M. Mark Twain's Autobiography. New York: Harper and Brothers, 1924.