Education
Selection of obstetrics and gynecology residents on the basis of medical school performance

Jeffrey G. Bell, MD, Ioanna Kanellitsas, MD, and Lynn Shaffer, MS
Columbus, Ohio

From the Department of Obstetrics and Gynecology, Riverside Methodist Hospitals. Supported by the Medical Research Foundation, Riverside Methodist Hospitals.

OBJECTIVE: The purpose of this study was to determine whether United States Medical Licensing Examination scores during medical school predict residents' in-training examination scores and whether other criteria of medical student performance correlate with the faculty's subjective evaluation of resident performance.
STUDY DESIGN: United States Medical Licensing Examination step I and II scores for 24 residents were compared with their scores on in-training examinations. Faculty evaluated 20 graduated residents by rating both their cognitive and noncognitive clinical performance. Scores from these evaluations were compared with several criteria of their medical school performance. Statistical analysis for all comparisons was linear regression.
RESULTS: United States Medical Licensing Examination scores correlated positively with in-training examination scores. United States Medical Licensing Examination scores, honor grades in student clinical rotations, and student interview scores did not correlate with the faculty evaluation of resident performance.
CONCLUSION: Standardized tests of medical student cognitive function predict the resident's performance on standardized tests. Selection criteria that are based on other medical school achievements do not necessarily correlate with overall performance as residents in obstetrics and gynecology. (Am J Obstet Gynecol 2002;186:1091-4.)
Key words: Resident, selection criteria, resident performance, evaluation
Residency program directors share a common goal of selecting the most capable medical students for postgraduate training. Criteria for selection usually include performance on the United States Medical Licensing Examination (USMLE), grades in clinical rotations, class rank, and interviews with faculty members. Unfortunately, these criteria do not always predict clinical performance during residency. One explanation for a discrepancy in performance is the distinction between cognitive function and noncognitive skills (such as surgical dexterity, clinical judgment, patient rapport, and work ethic). Evaluation of noncognitive skills is more subjective and less reproducible than testing cognitive function. Some studies have shown that faculty ward evaluations of residents, which traditionally rate both cognitive and noncognitive performance, are inflated and do not correlate with other measures of
performance (such as in-training examinations and oral examinations).1,2

The educational dilemma is that faculty members evaluate resident physicians by a subjective assessment of both their cognitive and noncognitive skills. A peer's overall impression of another physician's ability is mostly a subjective appraisal. This subjective evaluation is the basis for a program director's recommendation for a resident after graduation. Thus, the ability of a program director to identify criteria of a medical student's personal attributes, common sense, and technical skills, in addition to cognitive function, is important in the selection of candidates who will perform well as residents.

Although research into the association between the academic measures of medical students and the performance measures of residents dates back more than 3 decades, we found minimal reference to this type of research for obstetrics and gynecology residencies. Because resident performance ratings by faculty members and program directors vary significantly among specialties, each specialty must assess the relationship between medical school and its own residency program.3,4 The goals of this study were (1) to determine whether USMLE scores correlate with in-training examination results from all 4 years of residency training and (2) to compare our selection criteria (USMLE scores, medical school clinical grades, and faculty interviews of applicants) with an overall assessment of residency performance. These comparisons might assist in the prediction of a resident's performance on the basis of medical school performance.
Table I. Linear regression for USMLE scores versus CREOG scores

CREOG examination    USMLE I slope    USMLE I P value    USMLE II slope    USMLE II P value
PGY 1                0.6432           .0015              0.4972            .0004
PGY 2                0.4623           .0078              0.4882            .0001
PGY 3                0.5427           .0231              0.6654            .0001
PGY 4                0.2955           .2967              0.4791            .0482
Methods

The study population consisted of obstetrics and gynecology residents who graduated from or trained at Riverside Methodist Hospitals between 1995 and 1999. The following hypotheses were examined: (1) USMLE scores will predict scores on the in-training examinations that are provided by the Council on Resident Education in Obstetrics and Gynecology (CREOG) for postgraduate years (PGYs) 1 through 4, (2) CREOG examination scores will not correlate with the faculty members' subjective evaluation of resident performance, and (3) medical student composite scores of selection criteria will correlate positively with the faculty members' evaluation of resident performance.

When students apply to our program, each candidate receives a composite score that is based on the following criteria: (1) interviews with 3 of our faculty members, who have access to grades, dean's letters, and letters of recommendation; (2) the number of honors (A grades) in 5 core clinical rotations (medicine, family practice, surgery, pediatrics, and obstetrics-gynecology); and (3) USMLE scores (steps I and II). The interview is a minimally structured, global assessment of the student; the interviewer rates the student's academic achievement, personality, and work ethic. The interview score is a subjective rating between 1 and 10 (10 being the highest). The 3 faculty interview scores are averaged into 1 rating for each student. USMLE scores have a national mean of 200 and an SD of 20 points. These 3 major criteria are combined into a composite score by the formula: number of honors + ([USMLE score - 200]/20) + ([interview score/2] - 4). The composite score weights a change in board score of 20 points, or a change in interview score of 2 points, as equal to 1 clinical honor grade. For example, a student who achieved 1 clinical honor, scored 220 on the USMLE, and received an interview rating of 10 would have a composite score of 3.
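For concreteness, the composite score arithmetic described above can be written as a short calculation. The following is a minimal sketch in Python; the function name and the second example applicant are illustrative additions, not part of the study.

```python
def composite_score(num_honors: int, usmle_score: float, interview_score: float) -> float:
    """Composite applicant score: honors + (USMLE - 200)/20 + (interview/2 - 4).
    A 20-point USMLE difference or a 2-point interview difference carries the
    same weight as one clinical honor grade."""
    return num_honors + (usmle_score - 200) / 20 + (interview_score / 2 - 4)

# Worked example from the text: 1 honor, USMLE 220, interview rating 10 -> 3.0
print(composite_score(1, 220, 10))   # 3.0
# Illustrative second applicant: 3 honors, USMLE 240, interview rating 8
print(composite_score(3, 240, 8))    # 3 + 2 + 0 = 5.0
```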
We developed a resident performance evaluation for faculty members to use in rating residents in 4 categories: clinical judgment and acumen, patient rapport, surgical ability, and work ethic. Each question was scored from 1 to 5 (a score of 1 represented poor performance and 5 represented superior performance). The evaluation consisted of 11 questions about resident clinical decision making, 5 questions about patient rapport, 6 questions about surgical skills, and 4 questions about work ethic. So that each of the 4 categories had equal weight, we averaged the scores within each category, giving a score of 1 to 5 per category and a total possible score of 20 for the evaluation. The total score on the evaluation provides a numeric representation of both cognitive and noncognitive aspects of resident performance.

Twenty-nine faculty members completed evaluations on 20 residents who had graduated from our program. These faculty members had no knowledge of the composite scores that the residents had received as medical students. The initial question on the evaluation asked how well the faculty member knew the resident; only evaluations with responses of "very well" and "extremely well" were used for analysis.

To confirm both the content and construct validity of the evaluation, we used standard methods,5,6 which included the Spearman rank correlation test to assess the relationships among individual components and external measures. We evaluated content validity by obtaining comments from faculty members at another teaching institution. We assessed 3 aspects of construct validity. The first was a comparison of survey questions with an outside measure, which we chose to be CREOG scores: rank correlation testing showed that 6 questions that deal with the resident's ability to diagnose and make appropriate treatment decisions correlated positively with CREOG scores, whereas 6 of 7 questions regarding the resident's work ethic and responsiveness to family concerns, patient comfort, and cultural issues showed no correlation with CREOG scores. The second was factor validity, which determined that questions of a similar nature correlate with each other; all of these correlations were significant. The third was discriminant ability, in this case the ability to distinguish between high and low performers, as demonstrated by the distribution of performance scores.

Step I and II USMLE scores were available for 24 residents who had either completed the residency program or were current residents. These scores were compared with CREOG in-training examination scores for each PGY level: 24 PGY1 results, 20 PGY2 results, 16 PGY3 results, and 12 PGY4 results.
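A brief sketch may help make the scoring and validity testing concrete. The question counts per category and the 1-to-5 scale come from the description above; the response values, category scores, and CREOG scores below are invented for illustration, and scipy's spearmanr merely stands in for whatever statistical software was actually used for the Spearman rank correlation.

```python
from statistics import mean
from scipy.stats import spearmanr

# Question counts per category, as described above (each question scored 1-5).
CATEGORIES = {"clinical acumen": 11, "patient rapport": 5,
              "surgical ability": 6, "work ethic": 4}

def evaluation_total(responses):
    """Average the responses within each category, then sum the four category
    means, giving a total between 4 and 20."""
    return sum(mean(responses[category]) for category in CATEGORIES)

# Illustrative evaluation of one resident (values invented).
one_resident = {
    "clinical acumen": [4, 5, 4, 4, 3, 5, 4, 4, 4, 5, 4],
    "patient rapport": [4, 4, 5, 4, 4],
    "surgical ability": [4, 4, 3, 4, 5, 4],
    "work ethic": [5, 4, 4, 5],
}
print(round(evaluation_total(one_resident), 1))

# Illustrative construct-validity check: rank correlation between category
# scores and an external measure such as CREOG scores (values invented).
category_scores = [4.1, 3.8, 4.3, 3.9, 4.5]
creog_scores = [205, 198, 222, 210, 231]
rho, p_value = spearmanr(category_scores, creog_scores)
print(f"Spearman rho = {rho:.2f}, P = {p_value:.3f}")
```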
Table II. Resident performance evaluation scores

Category            Mean    SD     Range
Clinical acumen     4.0     0.4    3.2-4.7
Patient rapport     3.9     0.5    3.1-4.7
Surgical ability    4.1     0.4    3.5-4.8
Work ethic          3.9     0.6    2.6-4.9
Total               15.9    1.5    12.6-19.0
Linear regression analysis assessed the ability of USMLE scores to predict CREOG scores. Linear regression also analyzed the association between the composite score that the students received when they applied to the program and the resident performance evaluation score, as well as the association between the CREOG examination scores and the resident performance score. We considered probability values <.05 to be significant. Our study had 85% power to find a significant linear trend with a slope of 1.0 between the composite score and the faculty evaluation score; that is, for every 1-point increase in composite score, the resident performance score would increase by 1 point.7

Results

By linear regression, scores on USMLE I significantly predicted performance on PGY 1-3 CREOG examinations (P < .05), and scores on USMLE II significantly predicted performance on CREOG examinations in all 4 years (P < .05; Table I). USMLE I results did not correlate with PGY4 CREOG scores.

Medical student composite scores ranged from 0.4 to 6.0, with a mean of 3.7 and an SD of 1.6. Total scores for resident performance ranged from 12.6 to 19.0 for the 20 residents who were evaluated (Table II). The distribution of resident evaluation scores is shown in the Fig.

Linear regression analysis showed that the CREOG examination scores did not correlate with resident performance as measured by the faculty evaluation scores, with probability values that ranged from 0.48 for CREOG PGY1 to 0.09 for CREOG PGY3. Linear regression analysis also showed no association between medical student composite scores and resident performance scores (slope = 0.03). Further regression analysis compared the individual components of the composite score (number of honors, interview score, and USMLE scores) with the resident performance score. These analyses did not identify any specific relationships.
Fig. Distribution of resident evaluation scores.
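For readers who want to reproduce this kind of analysis, the following is a minimal sketch of a simple linear regression of CREOG scores on USMLE scores, of the sort summarized in Table I. The paired scores are invented, and scipy's linregress stands in for the statistical package actually used; a P value below .05 for the slope corresponds to the study's significance criterion.

```python
from scipy.stats import linregress

# Invented paired scores for one PGY level (illustration only).
usmle_step1  = [198, 210, 187, 225, 204, 216, 193, 230, 201, 219]
creog_scores = [200, 215, 192, 226, 208, 221, 197, 235, 205, 224]

result = linregress(usmle_step1, creog_scores)
print(f"slope = {result.slope:.4f}, P = {result.pvalue:.4f}")
# A slope with P < .05 would be read as USMLE scores predicting CREOG
# performance, mirroring the slope/P value pairs reported in Table I.
```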
Comment

In this study, USMLE I and II scores predicted performance on CREOG examinations in all PGYs, with the exception of USMLE I, which did not show a relationship to the PGY4 CREOG. Previously, Spellacy8 in 1985 also reported a correlation between national board scores and PGY3 CREOG percentile scores. Other investigators have shown that national board scores predict in-training examination scores for both anesthesiology and orthopedic residents.9 Because both the USMLE and CREOG examinations measure only cognitive function, the correlation between the 2 tests is not surprising.

One possible explanation for why the USMLE I scores did not predict PGY4 CREOG examination scores may stem from the notion that USMLE I more heavily emphasizes the basic sciences, whereas CREOG examinations deal more with clinical knowledge. Perhaps, over the several years between USMLE I and the PGY4 CREOG, residents become more proficient at taking tests on clinical material, and basic science tests become less predictive of that ability. The increasing probability value over the 4 years of residency (Table I) indicates a weakening of the relationship between the USMLE I and CREOG scores as each year passes. Other studies have indicated that the length of time between performance measurements obscures the relationship between the measures.10,11

Also expected was the finding that CREOG examination scores in our program did not correlate with resident performance scores on the faculty evaluations. This result is consistent with our view that a distinction exists between cognitive skills (as measured by the CREOG examination) and noncognitive skills (as measured by the faculty evaluation).

Our final hypothesis was that the medical student composite score would significantly predict the faculty's evaluation of resident performance. Not only did the composite scores not correlate with the resident performance scores, but the individual components of the composite score (USMLE scores, clinical honors as students, and interview scores) also did not correlate with resident performance. Presently, rating noncognitive attributes is more subjective than objective. We hypothesized that medical student evaluations that are partly subjective and partly objective (such as clinical grades and interviews) would correlate with a subjective evaluation of a student's performance as a resident. Gonnella and Hojat4 found a moderately significant relationship between medical student grades and obstetric-gynecology resident performance as judged by program directors, but the resident assessment occurred at the end of PGY1. In addition, a program director's evaluation may be biased by knowledge of the resident's medical school achievements.
The composite score may not have predicted resident performance simply because the demands of medical school may be dissimilar to the rigors of residency. Academically proficient medical students may not perform as expected under the stress of residency education; conversely, medical students with sound humanistic qualities and strong work habits may emerge as exceptional physicians during a demanding residency. Performances expected of residents might not be mere extensions of those expected of medical students.12

One might argue that this study found no relationship between medical school achievement and resident performance because our resident performance evaluation was not the proper measurement. The question is whether any subjective evaluation of performance will correlate with subjective or objective criteria of medical school achievement. Acknowledging that our resident performance evaluation survey is subjective, we nevertheless tested its content and construct validity by standard methods. Furthermore, the distribution of resident performance scores shows a fairly normal frequency distribution (Fig). Even though our study had the statistical power to discern 1-point differences in resident performance scores on the basis of the medical student composite scores, the composite scores did not distinguish between those residents who performed well above or well below the mean. Yindra et al3 found that the relationship between academic achievement in medical school and resident performance was strongest at the extremes of performance (ie, very weak or very strong performances). Our results do not support that argument.

Our study shows that we cannot accurately predict the total performance of our residents from our current evaluation of their medical school achievements. Although standardized tests of cognitive skills seem to be reproducible and predictable from 1 year to the next, the assessment of both clinical acumen and noncognitive skills during medical school is difficult. We currently do not have selection criteria that can predict the noncognitive performances of our residents.
The "best" students do not always make the "best" residents, and sometimes the "average" students excel as residents. Using selection criteria from medical school achievements to formulate a rank list that predicts performance in residency may be possible for some specialties (such as pediatrics),13 but it may not be possible for others (such as emergency medicine14) or for obstetrics-gynecology, as we found. We agree with those who advise that not only should each specialty investigate the specialty's effect on the relationship between medical school achievement and residency performance but also that each training institution should analyze its own expectations and standards of performance.11 We will continue to develop our selection criteria and to assess our residents' performances in the hope of identifying a meaningful relationship.

REFERENCES
1. Lawrence PF, Nelson EW, Cockayne TW. Assessment of medical student fund of knowledge in surgery. Surgery 1985;97:745-9.
2. Schwartz RW, Donnelly MB, Sloan DA, Johnson SB, Strodel WE. Assessing senior residents' knowledge and performance: an integrated evaluation program. Surgery 1994;116:634-7.
3. Yindra KJ, Rosenfeld PS, Donnelly MB. Medical school achievements as predictors of residency performance. J Med Educ 1988;63:356-63.
4. Gonnella JS, Hojat M. Relationship between performance in medical school and postgraduate competence. J Med Educ 1983;58:679-85.
5. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. New York: Oxford University Press; 1989.
6. McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. New York: Oxford University Press; 1996.
7. Kraemer HC, Thiemann S. How many subjects? Statistical power analysis in research. Newbury Park (CA): Sage Publications; 1987. p. 59-67.
8. Spellacy WN. The use of national board scores in selecting residents for obstetrics and gynecology. Am J Obstet Gynecol 1985;153:605-7.
9. Ronai AK, Golmon ME, Shanks CA, Schafer MF, Brunner EA. Relationship between past academic performance and results of specialty in-training examinations. J Med Educ 1984;59:341-4.
10. Markert R. The relationship of academic measures in medical school to performance after graduation. Acad Med 1993;68:S31-4.
11. Gonnella JS, Hojat M, Erdmann JB, Veloski JJ. Epilogue. Acad Med 1993;68:S79-87.
12. Arnold L, Willoughby TL. The empirical association between student and resident physician performances. Acad Med 1993;68:S35-40.
13. Weiss JC, Lawrence JS. The predictive validity of a resident selection system. J Med Syst 1988;12:129-34.
14. Balentine J, Gaeta T, Spevack T. Evaluating applicants to emergency medicine residency programs. 1998;17:131-4.