Use of National Board test questions to evaluate student performance in obstetrics and gynecology

William N. P. Herbert, M.D., William C. McGaghie, Ph.D., and George B. Forsythe, M.A.

Chapel Hill, North Carolina
The evaluation of student performance in clinical obstetrics and gynecology is frequently based on results of the National Board Examination, Part II. An Obstetrics and Gynecology subtest of this examination was studied to determine its value as a measure of this clinical experience. The clinical usefulness of each question and the distribution of questions among subjects within the specialty (Normal Obstetrics, Reproductive Endocrinology and Infertility, etc.) were determined independently by faculty in obstetrics and gynecology. In addition, the influence of the type of question (e.g., multiple-choice, matching), the category of material, and the clinical usefulness of each question on student test performance was studied. Eighty-six percent of the questions were judged to be indispensable, highly useful, or moderately useful. Abnormal Obstetrics, Gynecology, and Reproductive Endocrinology and Infertility accounted for 70% of the questions, with the remaining questions distributed among Normal Obstetrics, Population and Family Planning, and Gynecologic Oncology. Student test performance was not significantly influenced by the type of question format or the category of material but was related to the level of clinical usefulness. Overall, these results, which are based on ratings from five faculty members and a single class of medical students at one medical school, indicate that the Obstetrics and Gynecology subtest of the National Board Examination, Part II, is a reasonable measure of clinical experience in this field. (AM. J. OBSTET. GYNECOL. 147:73, 1983.)
Examinations developed by the National Board of Medical Examiners are widely used to evaluate medical students. The 1982-1983 Curriculum Directory of the Association of American Medical Colleges reports that 57 of 126 (45%) United States medical schools require students to pass the Part I examination for promotion to the clinical years. In addition, 44 (35%) medical schools have established a passing score on the Part II examination as a graduation requirement. The National Board of Medical Examiners has repeatedly stated that its nationally standardized examinations may not match the curricular characteristics of all medical schools and asserts that individual medical schools are responsible for the intramural use of its examinations.¹, ² At the University of North Carolina (UNC), the posttest developed by the Association of Professors of Obstetrics and Gynecology is given to third-year medical students upon completion of the clerkship in obstetrics and gynecology, but the results are not considered in evaluating student performance.
From the Department of Obstetrics and Gynecology and the Office of Research and Development for Education in the Health Professions, University of North Carolina School of Medicine.
Received for publication February 9, 1983. Revised April 4, 1983. Accepted April 12, 1983.
Reprint requests: William N. P. Herbert, M.D., Department of Obstetrics and Gynecology, University of North Carolina School of Medicine, 214 MacNider Building 202H, Chapel Hill, North Carolina 27514.
To be promoted to the fourth year, students must pass the National Board Examination, Part II, which is administered at the end of the third year of medical school.

Medical faculties that use National Board examinations for student evaluation make several assumptions about these tests. First, they assume that the test questions are consistent with the educational goals their students are expected to achieve and that the tests contain questions representative of the material the students have experienced. Second, they assume that the test questions are unbiased and that the scores that result are appropriate for reaching decisions about students. This implies that National Board examinations are not susceptible to "testmanship" and that individual score differences reflect genuine variation in levels of knowledge of the material.

Research designed to test these assumptions has produced conflicting results. Two studies have demonstrated a close match of National Board examinations with medical curricula.³, ⁴ Several other studies in the basic medical sciences⁵⁻⁷ and in clinical cancer education⁸, ⁹ have failed to confirm fully the assumed match between National Board test content and medical teaching goals. In regard to student performance as it relates to question format, studies of National Board test questions have suggested that differences among the three basic question formats¹ (multiple-choice, matching, and multiple true-false) can influence student performance.¹⁰, ¹¹
Table I. Faculty ratings of the clinical usefulness of the test questions

Scale value   Definition          Absolute frequency   Percentage
5             Indispensable               22               15.9
4             Highly useful               63               45.7
3             Moderately useful           34               24.6
2             Slightly useful             17               12.3
1             Not useful                   2                1.4
Total                                    138               99.9
The present study of an Obstetrics and Gynecology subtest of the National Board Examination, Part II, was undertaken to address three questions. First, do the test questions measure clinically useful material? Second, what is the distribution of the questions among various topics within the specialty? Third, is medical student performance influenced by the type of question, the category of material, or the clinical usefulness of particular questions?
Material and methods

Five full-time, Board-certified faculty members of the Department of Obstetrics and Gynecology at UNC participated in the study. The divisions of Gynecologic Oncology, Reproductive Endocrinology and Infertility, Maternal-Fetal Medicine, and General Obstetrics and Gynecology were represented. Each faculty member independently evaluated all 149 questions contained in the April, 1981, version of the Obstetrics and Gynecology subtest of the National Board Examination, Part II. This subtest, along with five other National Board subtests, was administered in the summer of 1982 to 118 UNC third-year medical students to assess their readiness for promotion to the fourth year of medical school. Eleven test questions were deleted by the National Board during the scoring procedure, presumably after review of the content and item statistics.¹ The remaining 138 questions are the object of analysis in this study.

The participating faculty members evaluated each question in three ways. First, each question was coded according to six categories of material: Normal Obstetrics, Abnormal Obstetrics, Gynecology, Population and Family Planning, Gynecologic Oncology, and Reproductive Endocrinology and Infertility. Second, each question was coded according to its format (multiple-choice, matching, or multiple true-false). Third, each question was rated in terms of its "usefulness for clinical medicine" on a five-point scale (5 = indispensable, 4 = highly useful, 3 = moderately useful, 2 = slightly useful, 1 = not useful).
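The three codes attached to each question can be pictured as a small record per item. The sketch below is our own illustration of such a coding scheme, not material from the study; the records shown are hypothetical, not taken from the coding sheets.

    # A minimal sketch, not from the paper: one way to represent each
    # question's three codes and tally the category distribution.
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class QuestionCoding:
        number: int       # question number on the subtest
        category: str     # one of the six content categories
        fmt: str          # "multiple-choice", "matching", or "multiple true-false"
        usefulness: int   # 1 (not useful) through 5 (indispensable)

    codings = [  # hypothetical records
        QuestionCoding(1, "Abnormal Obstetrics", "multiple-choice", 4),
        QuestionCoding(2, "Gynecology", "matching", 3),
        QuestionCoding(3, "Normal Obstetrics", "multiple-choice", 5),
    ]

    print(Counter(c.category for c in codings))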
The difficulty index for each test question was used as the dependent variable for this study. This index, which represents the proportion of examinees within a group that answers each question correctly, was adjusted according to the method described by Jensen¹² and cast on a scale ranging from 200 to 800 with a mean of 500 and an SD of 100. Higher values on this scale represent increasingly difficult questions. An adjusted difficulty index was derived for each question on the basis of responses from the 118 UNC medical students and from a national sample of medical students for which the sample size was unknown. Means and SDs for the adjusted difficulty indexes of the questions in each category of content by question format were also calculated.

The reliability of the faculty ratings was assessed according to procedures discussed by Winer.¹³ Reliability values can range from 0.00 to 1.00; the higher the number, the greater the consistency among the judges. The adjusted local and national data were analyzed in separate analyses of variance to determine whether question format, content category, or an interaction of the two factors was responsible for variation in question difficulty. Finally, correlation coefficients were computed between the adjusted local and national difficulty indexes for the questions and the sum of the clinical usefulness ratings. The correlation between local and national question difficulty was also calculated.
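For concreteness, the sketch below illustrates the two computations just described on hypothetical data. It is our own illustration, not the study's procedure: the rescaling shown is a plain linear standardization to a mean of 500 and an SD of 100 (Jensen's adjustment itself is not reproduced), and the reliability estimate is the ANOVA-based reliability of the mean of k judges' ratings of the kind discussed by Winer.

    # A minimal sketch, not the authors' code. Data are hypothetical.
    import numpy as np

    def scaled_difficulty(p_correct):
        """Rescale proportion-correct values to mean 500, SD 100, with
        higher values marking harder questions. A plain linear
        standardization, standing in for Jensen's adjustment."""
        p = np.asarray(p_correct, dtype=float)
        z = (p - p.mean()) / p.std()
        return 500.0 - 100.0 * z   # fewer correct answers -> higher index

    def reliability_of_mean_rating(ratings):
        """ANOVA-based reliability of the mean of k judges' ratings
        (rows = questions, columns = judges)."""
        x = np.asarray(ratings, dtype=float)
        n, k = x.shape
        grand = x.mean()
        ms_items = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
        resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
        ms_resid = (resid ** 2).sum() / ((n - 1) * (k - 1))
        return (ms_items - ms_resid) / ms_items

    # Six hypothetical questions: proportion answering correctly, and
    # usefulness ratings (1-5) from four judges.
    print(scaled_difficulty([0.85, 0.70, 0.55, 0.90, 0.60, 0.75]).round())
    print(round(reliability_of_mean_rating([[5, 4, 5, 4],
                                            [3, 3, 4, 3],
                                            [2, 1, 2, 2],
                                            [4, 4, 5, 4],
                                            [3, 2, 3, 3],
                                            [5, 5, 4, 5]]), 2))

On such a scale, a question answered correctly by fewer examinees lands above 500, which matches the convention that higher values represent more difficult questions.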
Results

The clinical usefulness ratings of the Obstetrics and Gynecology test questions are shown in Table I. The three highest rating categories (indispensable, highly useful, and moderately useful) accounted for 119 of the 138 questions (86%). Only two of the questions (1.4%) were judged to have no clinical usefulness.

The distribution of test questions according to content category and question format is given in Table II. There was no significant disagreement among the faculty judges over assignment of questions to the content categories. Three content categories (Abnormal Obstetrics, Gynecology, and Reproductive Endocrinology and Infertility), having 31, 36, and 30 questions, respectively, accounted for 70% of the total examination. Normal Obstetrics accounted for 20 questions (14%), Gynecologic Oncology for 18 (13%), and Population and Family Planning for three (2%). Of the 138 questions, the simple multiple-choice format was used for 111 (80%); the remaining 27 questions (20%) were of the matching or multiple true-false format.

Table II also presents means and SDs for the adjusted local and national difficulty indexes for each combination of test content and question format. Multiple-choice questions on Population and Family Planning were apparently the least challenging for medical students both nationally and at UNC.
Table II. Content coverage, distribution of questions, and difficulty indexes for Obstetrics and Gynecology subtest of the 1982 National Board Examination, Part II*

                                             Multiple-choice questions             Matching and multiple true-false questions
Category                                     n    UNC M (SD)    National M (SD)    n    UNC M (SD)    National M (SD)    Total questions
Normal Obstetrics                            13   442 (74)      457 (72)           7    429 (72)      449 (57)            20 (14%)
Abnormal Obstetrics                          24   454 (73)      462 (61)           7    454 (59)      476 (56)            31 (22%)
Gynecology                                   30   428 (81)      430 (72)           6    462 (49)      472 (38)            36 (26%)
Population and Family Planning                3   420 (109)     415 (104)          0      0 (0)         0 (0)              3 (2%)
Gynecologic Oncology                         15   463 (62)      475 (50)           3    469 (81)      491 (70)            18 (13%)
Reproductive Endocrinology and Infertility   26   465 (65)      480 (64)           4    508 (82)      514 (55)            30 (22%)
Total questions                             111 (80%)                             27 (20%)                               138

*Because of adjustment, the lower the mean score, the more often the questions were answered correctly. Decimals are omitted.
Matching and multiple true-false questions on Reproductive Endocrinology and Infertility were the most difficult for both student groups. Cautious interpretation is needed, however, because of the small number of questions in each category.

The reliability of the faculty ratings of clinical usefulness was 0.75 when all five faculty judges were considered. However, in contrast with the other four judges, one faculty member rated the questions much higher in clinical usefulness and did not discriminate among individual questions. For this reason, the ratings of the four consistent faculty judges were used to assess clinical usefulness. Elimination of the ratings from the inconsistent faculty judge raised the reliability to 0.81.

The univariate analyses of variance were based on the UNC and national data contained in Table II. Neither analysis revealed statistically significant differences (at the 0.05 level) in test question difficulty attributable to question format, content category, or an interaction of the two factors. However, the influence of question content approached statistical significance (p = 0.08) for the national data.

The aim of the last analysis was to determine whether a correlation existed between the four judges' ratings of question usefulness (summed across judges) and the difficulty of each question. For the UNC data, r = -0.25 (p < 0.01); for the national data, r = -0.23 (p < 0.01). Because higher difficulty indexes denote harder questions, these negative correlations indicate that more useful questions were more often answered correctly. The correlation between the question difficulty indexes derived from UNC students and the national sample was quite high (r = 0.91, p < 0.0001).
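The direction of these correlations follows from the scale convention; a toy example (hypothetical numbers, not the study data) makes the sign concrete.

    # Illustrative only; the six points are invented. Because a higher
    # difficulty index means a harder question, a negative correlation
    # says that questions rated more useful were answered correctly
    # more often.
    import numpy as np

    usefulness_sum = np.array([18, 15, 9, 17, 11, 19])      # summed over four judges
    difficulty = np.array([430, 470, 540, 450, 510, 420])   # adjusted indexes

    r = np.corrcoef(usefulness_sum, difficulty)[0, 1]
    print(f"r = {r:.2f}")   # negative, matching the sign of the reported values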
Comment

These data clearly indicate that the UNC faculty members believe most of the Obstetrics and Gynecology test
questions address clinically useful concepts and principles. This suggests that the Obstetrics and Gynecology subtest is more consistent with clinical educational goals than past studies have shown for National Board examinations used for student evaluation in the UNC basic science curriculum.⁵, ⁶

The distribution of questions by category (Table II) may not conform with the undergraduate teaching goals of all departments. For example, a department's teaching faculty could give increased attention to Normal Obstetrics and Population and Family Planning, with concomitant reductions in Abnormal Obstetrics, Gynecology, and Reproductive Endocrinology and Infertility. This acknowledges that the clinical curriculum is not nationally standardized. Other departments may have different, yet equally valid, objectives for medical student education and evaluation. It is important to emphasize that the content categories used in this study were derived at UNC; they were not based on the National Board test outline. Also, the distribution of questions on future examinations is unlikely to be identical to the distribution reported in this article.

Neither the type of question nor the category of material seems to influence question difficulty significantly. This contrasts with past research¹⁰, ¹¹ and suggests that this test contained questions of equivalent difficulty across the six content areas and for the different question formats.

Most faculty members would expect students to answer questions correctly more often if the material were thought to be clinically useful, and the present analysis supports this notion: faculty views about the clinical utility of test material have a statistically significant correlation with student test performance.
It is noteworthy that faculty members may vary in their judgment of the clinical usefulness of individual test questions, as illustrated by the fact that one of the five faculty members in this study differed significantly from the other four in this assessment.

Numerous problems are associated with drawing firm conclusions from a single, retrospective investigation. Because the design of National Board examinations is not subject to local control, test research with prospective designs is not possible. Matters such as the way in which the questions were coded, the potential for disagreement among faculty members, and the unequal distribution of questions among the six content areas also require cautious interpretation of results.¹⁴

In summary, the results of this investigation indicate that the Obstetrics and Gynecology subtest succeeds as one measure of clinical knowledge, as judged by the faculty of one medical school. Most of the test material was rated clinically useful. Questions were not evenly distributed among the six areas within the specialty. Student performance was not significantly influenced by the format or content of the questions but was related to faculty evaluations of question usefulness.
We acknowledge the assistance of Drs. Edward H. Bishop, Lamar E. V. Ekbladh, Luther M. Talbert, and Leslie A. Walton of the Department of Obstetrics and Gynecology, University of North Carolina School of Medicine, and of Dr. Paul Kelley of the National Board of Medical Examiners in this investigation.
REFERENCES
1. Hubbard, J. P.: Measuring medical education: The tests and the experience of the National Board of Medical Examiners, ed. 2, Philadelphia, 1978, Lea & Febiger.
2. National Board of Medical Examiners: 1981 Annual Report, Philadelphia, 1982.
3. Kennedy, W. B., Kelly, P. R., and Hubbard, J. P.: The relevance of National Board Part I Examinations to Medical School Curricula, Philadelphia, 1970, National Board of Medical Examiners.
4. Garrard, J., McCollister, R. J., and Harris, I.: Review by medical teachers of a certification examination: Rationale, method, and application, Med. Educ. 12:421, 1978.
5. McGaghie, W. C., Burford, H. J., and Harward, D. H.: Content representativeness and student performance on National Board Part I special subject examinations, Ann. Conf. Res. Med. Educ. 19:15, 1980.
6. McGaghie, W. C., Burford, H. J., and Harward, D. H.: External and internal tests of preclinical learning: Content representativeness, item educational emphasis, and student achievement, Ann. Conf. Res. Med. Educ. 20:115, 1981.
7. Wile, M. Z.: External examinations for internal evaluation: The National Board Part I test as a case, J. Med. Educ. 53:92, 1978.
8. Ruckdeschel, J. C., Lea, J. W., Brown, S., and Horton, J.: Content bias in the neoplastic-related items of the National Board of Medical Examiners Part II examination, Med. Pediatr. Oncol. 10:269, 1982.
9. Goetzel, R. Z., Croen, L. G., Lan, S., and Bases, R. E.: A content validity analysis of neoplastic items of the National Board of Medical Examiners Part II examination, Med. Pediatr. Oncol. 10:413, 1982.
10. Skakun, E. N., Nanson, E. M., Kling, S., and Taylor, W. C.: A preliminary investigation of three types of multiple choice questions, Med. Educ. 13:91, 1979.
11. Sarnacki, R. E.: The effects of test-wiseness in medical education, Eval. Health Prof. 4:207, 1981.
12. Jensen, A. R.: Bias in Mental Testing, New York, 1980, Free Press.
13. Winer, B. J.: Statistical Principles in Experimental Design, ed. 2, New York, 1971, McGraw-Hill Book Company.
14. Appelbaum, M. I., and Cramer, E. M.: Some problems in the nonorthogonal analysis of variance, Psychol. Bull. 81:335, 1974.