APPLYING GENERALIZABILITY THEORY TO HIGH-STAKES OBJECTIVE STRUCTURED CLINICAL EXAMINATIONS IN A NATURALISTIC ENVIRONMENT

Douglas M. Lawson, DC, MSc a
ABSTRACT

Purpose: The purpose of this project was to determine whether generalizability theory could be successfully applied to a high-stakes licensure objective structured clinical examination as part of its normal administrative procedures and whether the analysis could yield useful information with regard to sources of variance.

Methods: The anonymized data received from the Canadian Chiropractic Examining Board for its June 2005 Clinical Skills Examination were analyzed with generalizability theory. Variance components were estimated with SPSS 11.5 (SPSS Inc, Chicago, Ill) as partially nested data. The data included 182 candidates, 43 raters, 40 standardized patient actors, and 18 individual cases.

Results: Internal consistency estimates (Cronbach α) were .86 for day 1 and .91 for day 2. The α estimates for stations averaged .68 for day 1 and .74 for day 2. The generalizability coefficient for the day 1 examination was .65, and for the day 2 examination it was .42. G-coefficients for stations averaged .63 for day 1 and .74 for day 2. On day 1, the raters contributed 7% of the variance, and on day 2, the raters contributed 8%.

Conclusions: Generalizability theory can contribute to the understanding of sources of variance and provide direction for the improvement of individual stations. The size of the rater variance in a station may also indicate the need for increased training in that station or the need to make the scoring checklist clearer and more definitive. Generalizability theory, however, must be applied cautiously, and it requires careful selection of the floating raters and rigorous training of the raters in each station. (J Manipulative Physiol Ther 2006;29:463-467)

Key Indexing Terms: Education, Professional; Chiropractic; Competence, Professional; Generalizability Theory; OSCE
a Graduate Student, University of Calgary, Faculty of Medicine, Medical Education Research Centre, Calgary, Alberta, Canada.
Submit requests for reprints to: Douglas M. Lawson, DC, MSc, 4832 26 Ave NE, Calgary, Alberta, Canada T1Y 1C9 (e-mail: [email protected]).
Paper submitted August 18, 2005; in revised form November 29, 2005.
0161-4754/$32.00 Copyright © 2006 by National University of Health Sciences. doi:10.1016/j.jmpt.2006.06.009

Objective structured clinical examinations (OSCEs) generally consist of a series of individual stations in which specific skills are evaluated.1,2 The mix of station content is typically based on a curriculum outcome study or job analysis.3 There may be a series of up to 20 stations, and all candidates must attend each station. The time in each station is set by a number of factors, including the specific skills to be demonstrated in the station. It is common for each station to include a candidate, 1 or more raters, and a standardized patient (SP). Raters use a checklist for scoring that may include a number of rating scales. The rating scale may be dichotomous (0 = not performed, 1 = performed),
a 3-point scale (0 = not performed, 1 = attempted but not performed correctly, and 2 = performed correctly), or a more extensive rating scale. Some OSCEs also include a global rating scale for each station (eg, 0 = outright fail, 1 = borderline fail, 2 = borderline pass, 3 = outright pass, 4 = exceptional pass). With large high-stakes examinations, it is common to have parallel tracks of stations, where raters evaluate only the candidates within their track. There are a number of sources of variance on an OSCE, including different candidate abilities, different rater severity/leniency measures, different SP severity/leniency measures, and different station difficulties. Sources of variance are of particular concern when all raters and SPs do not evaluate all candidates. The effect of differences in rater stringency/leniency, or the "hawk and dove" effect, on candidate scores has been previously reported.4-7 Generalizability theory (G-theory) concerns itself with the variance of the components of a measure and their estimation.8,9 As a quick example, a doctor with 1 rectal alcohol thermometer will know a patient's temperature. A doctor with 2 thermometers will never quite be sure, and the patient will not be amused. Generalizability theory has an advantage over other types of reliability estimates
(test-retest, interrater, internal consistency) because it can decompose the variation in the measure of interest (candidate scores) into the various components simultaneously (the G-study). This information can then be used to compute changes to estimates of reproducibility if the makeup of the components is changed (perhaps adding more stations or more examiners per station). The initial study is the generalizability study, or G-study, which consists of specifying all the sources of variation that are believed to affect the measure of interest (the observed score). These are then included as crossed or nested factors in an analysis of variance (ANOVA) factorial analysis. In a crossed design, all raters observe all candidates over all stations. In a nested design, raters are nested within stations, and each rater observes only a portion of the candidates and only 1 station. The ANOVA estimates a variance component for each of the sources of variation specified in the model. The follow-up study is the decision study, or D-study. In the decision study, it is possible to imagine what would happen if the makeup of the examination were changed. For example, what would happen if each station were scored by 2 raters, or if the number of stations were increased?

Typical measures of reliability for OSCEs are somewhat unsatisfactory because they do not take into account the variance of the individual components of the examination. Although candidate variance is expected, and is in fact what we want to measure, the variance due to raters or SPs is error variance, and the more error variance reported as part of a candidate's measure, the less confidence we have in that measure. If G-theory can be successfully applied, then the amount and percentage of variance due to candidates, raters, and SPs can be estimated. Such estimation can assist in designing future examinations. For example, a decision can be made as to whether it is better to increase the number of stations or to increase the number of raters in a station. Newble and Swanson10 demonstrated in 1988 that when finite resources (funding or rater availability) are a limiting factor in increasing the number of stations on an OSCE, it is better to increase the number of stations with 1 rater per station than it is to have 2 raters per station. Such feedback from a G-theory analysis is useful for planning OSCEs.

One of the difficulties in applying G-theory is that, for most cases, the data must be fully crossed. This means that all examiners rate all candidates over all stations. There are practical limits to this; without them, it would be possible to have the same 10 raters complete score sheets on all candidates across all stations. Because of the impracticality of such a scenario, G-theory is typically applied in a pilot study (limited number of cases and raters), and the pilot study is then generalized to the real world. Any testing environment struggles to be a reflection of the real world. The further constraint of a fully crossed study, in which all raters are compared, complicates the environment even more.
In 2002, it was reported that rater performance was best evaluated by standardized candidates and not by quality assurance examiners; flaws in rater conduct tend to disappear when raters know someone is looking over their shoulder.11 It seems likely that raters in a fully crossed study will behave differently from raters functioning in a room without observation.

In 2001, Wass et al12 compared the long case to OSCEs. Only some of the observations included 2 raters, with most having only 1 rater. Their data were analyzed in 2 groupings, 1 a fully nested design (1 rater per case) and 1 partially crossed (with 2 raters per case). They calculated reliabilities using G-theory and explored the use of long cases vs OSCEs and test length (number of cases). They found that 3.5 hours of testing time using 10 cases was no worse and no better than OSCEs in producing reliable measures. Their study did not consider SPs as a source of variance.

The purpose of this research project was to determine whether G-theory can be applied in a more naturalistic environment. If G-theory can be successfully applied, then what are the variances of the different components of the examination, and can this information be used to improve the examinations?
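To make the G-study/D-study sequence concrete, the sketch below works through the textbook case of a balanced, fully crossed candidate × rater design with made-up scores (none of these numbers come from the examination analyzed here). It estimates the candidate, rater, and residual variance components from the ANOVA mean squares and then, as a D-study, projects the generalizability coefficient for different numbers of raters per station.

```python
import numpy as np

# Hypothetical fully crossed design: every rater scores every candidate once.
# Rows = candidates (p), columns = raters (r); values are station percentage scores.
scores = np.array([
    [72.0, 68.0, 75.0],
    [55.0, 60.0, 58.0],
    [81.0, 78.0, 84.0],
    [63.0, 59.0, 66.0],
    [90.0, 88.0, 86.0],
])
n_p, n_r = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# ANOVA mean squares for the crossed p x r design (no replication).
ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
ss_res = np.sum((scores - person_means[:, None] - rater_means[None, :] + grand) ** 2)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# G-study: variance components from the expected mean squares.
var_res = ms_res                  # candidate x rater interaction confounded with error
var_p = (ms_p - ms_res) / n_r     # candidate ability (the variance we want to measure)
var_r = (ms_r - ms_res) / n_p     # rater severity/leniency ("hawks and doves")

# D-study: project the coefficients if the number of raters per station changes.
for n_raters in (1, 2, 3):
    rel_g = var_p / (var_p + var_res / n_raters)               # relative decisions
    abs_g = var_p / (var_p + (var_r + var_res) / n_raters)     # absolute decisions
    print(f"{n_raters} rater(s): relative G = {rel_g:.2f}, absolute G = {abs_g:.2f}")
```

These closed-form expressions hold only for the balanced, fully crossed case; the examination described below was partially nested and unbalanced, which is why the analysis relies on a general linear model rather than these formulas.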
METHODS

Subjects
Data were supplied by the Canadian Chiropractic Examining Board (CCEB) and consisted of anonymized data for its June 2005 Clinical Skills Examination. The data received from the CCEB were coded in such a way that it was impossible to track or identify an individual candidate, rater, or SP. The research project received ethics approval from the CCEB.
Test Format
The Clinical Skills Examination consisted of a 10-station OSCE administered over 2 days to 182 candidates. There were 2 cycles a day, and candidates in the morning cycle could not contaminate candidates in the afternoon cycle. Eight different stations were used on the second day, with the remaining 2 being common to both days. There were 43 raters involved and 40 SPs. From the pool of raters, experienced raters were selected to float from station to station. The 3 floating raters moved from track to track and station to station to link the data. All SPs rotated between the tracks as the cycles changed so that rater measures could be separated from SP measures. Each station had a checklist for computer scoring of the data. Each checklist consisted of a series of 3-point rating scales (0 = not performed, 1 = performed but not correctly, 2 = performed correctly), one 5-point rating scale for professionalism, and one 10-point rating scale for overall technique. Both the professionalism scale and the overall technique scale were anchored at the borderline pass and borderline fail levels. Stations 1 and 7 were case history stations, 2 and 8 were physical examination stations, and 5 and 6 were combined history and physical examination stations. At the end of each of these stations, candidates were required to communicate a diagnosis or differential diagnosis and a plan of management. Stations 3 and 9 were differential physical examination stations with multiple cases. Station 4 was an informed consent station, and station 10 was a chiropractic treatment station.

Table 1. Example ANOVA table for both days (3-factor model: raters, SPs, and stations); dependent variable = candidate scores

Source      Type III sum of squares    df    Mean square
Model              1,185,481.23       149      7956.25
Candidate             12,002.45        60       200.04
Rater                   2042.42        33        61.89
SP                      1064.18         8       133.02
Station                  250.87         2       125.44
Error                   4666.60        91        51.28
Total              1,190,147.83       240

Table 2. Example variance estimate table for both days (3-factor model: raters, SPs, and stations); dependent variable = candidate scores

Component    Variance estimate    Percentage (%)
Candidate          76.82               42.26
Rater               2.84                1.56
SP                 35.16               19.34
Station            15.69                8.63
Error              51.28               28.21
Total             181.78              100.00
Statistics
The data were collected in 2 versions. Data set I included only those observations for which there were 2 raters per observation. Data set II included all observations, with the floating raters removed. Descriptive statistics of the 2 data sets were first compared. Data set II was then analyzed for reliability, by station and by total examination, using Cronbach α (a measure of internal consistency) to ensure that the data were of sufficient quality. Cronbach α was calculated with all checklist items as separate observations and with station totals only as separate observations. Data set II was not used for the generalizability study but to confirm that the smaller data set I was reasonable and representative of data set II. The variance of the components for data set I was estimated using SPSS 11.5 (SPSS Inc, Chicago, Ill). Data set I was analyzed as a 1-facet generalizability study, with raters as the facet (all-random model), for each station and for each day. Data set I was also analyzed as a 2-facet generalizability study, with raters and SPs as facets (partially nested, all-random model), for each day. Finally, data set I was analyzed as a 3-facet generalizability study, with raters, SPs, and stations as facets (partially nested model), for the combination of days. The generalizability studies were conducted by estimating the variance due to each component. The component variance was estimated by using a general linear main-effects model, ANOVA, and the type III sums of squares method. The percentage of variance attributed to each component was then calculated.
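The exact SPSS runs cannot be reproduced from the anonymized description alone, but the following sketch illustrates the same kind of general linear main-effects model with type III sums of squares using Python and statsmodels. The data frame, its column names (score, candidate, rater, sp, station), and the synthetic scores are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Synthetic stand-in for data set I: one row per scored observation.
n_obs = 400
df = pd.DataFrame({
    "candidate": rng.integers(0, 60, n_obs).astype(str),
    "rater": rng.integers(0, 30, n_obs).astype(str),
    "sp": rng.integers(0, 9, n_obs).astype(str),
    "station": rng.integers(0, 3, n_obs).astype(str),
})
df["score"] = 70 + rng.normal(0, 8, n_obs)  # placeholder percentage scores

# Main-effects general linear model; Sum contrasts keep type III tests meaningful.
model = smf.ols(
    "score ~ C(candidate, Sum) + C(rater, Sum) + C(sp, Sum) + C(station, Sum)",
    data=df,
).fit()

# Type III sums of squares, analogous to the SPSS GLM output summarized in Table 1.
print(anova_lm(model, typ=3))
```

From output of this kind, variance components are obtained through the expected mean squares (the step performed by SPSS's variance-components routine), and each component is then expressed as a percentage of the total, as in Table 2.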
RESULTS
The comparison of data set I to data set II (multiple raters/single raters) for day 1 mean percentage scores for all stations and the overall examination was as follows: (1) 64%/67%; (2) 63%/63%; (3) 75%/80%; (4) 70%/75%; (5) 65%/66%; (6) 61%/65%; (7) 76%/75%; (8) 63%/70%; (9) 81%/82%; (10) 71%/70%; day totals, 69%/72%. Day 2 mean percentage scores for all stations and the overall examination were (1) 78%/76%; (2) 61%/68%; (3) 81%/83%; (4) 61%/73%; (5) 65%/69%; (6) 57%/64%; (7) 71%/66%; (8) 70%/66%; (9) 68%/77%; (10) 72%/72%; day totals, 69%/71%. Day 1 SDs for all stations and the overall examination were (1) 7%/11%; (2) 12%/15%; (3) 14%/13%; (4) 15%/14%; (5) 13%/11%; (6) 10%/11%; (7) 14%/10%; (8) 9%/14%; (9) 13%/11%; (10) 12%/11%; day totals, 13%/6%. Day 2 SDs for all stations and the overall examination were (1) 11%/11%; (2) 17%/14%; (3) 10%/12%; (4) 10%/13%; (5) 11%/11%; (6) 19%/14%; (7) 8%/12%; (8) 8%/13%; (9) 14%/14%; (10) 9%/14%; day totals, 14%/7%.

The reliability for the day 1 examination (Cronbach α) was .86 (all items included) or .64 (station totals only). The reliability for the day 2 examination was .91 (all items included) or .73 (station totals only).

Data set I (stations with multiple raters only) for day 1 yielded the following generalizability coefficients for stations and the overall examination (1-facet, all-random model): (1) 0.39; (2) 0.81; (3) 0.34; (4) 0.89; (5) 0.49; (6) 0.91; (7) 0.51; (8) 0.68; (9) 0.42; (10) 0.73; day total, 0.65. Day 2 yielded the following generalizability coefficients for stations and the overall examination (1-facet, all-random model): (1) 0.25; (2) 0.80; (3) 0.61; (4) 0.63; (5) 0.52; (6) 0.94; (7) 0.66; (8) 0.48; (9) 0.98; (10) 0.03; day total, 0.42. The 2-facet generalizability study (raters and SPs) yielded a generalizability coefficient of .59 for day 1 (Saturday) and .16 for day 2 (Sunday). The 3-facet generalizability study for the combined days (raters, SPs, and stations) yielded a generalizability coefficient of .34.

Tables 1 and 2 are examples of the 22 ANOVA tables created by the analysis. Table 1 is the midpoint of the variance estimates and provides a typical ANOVA table for the 3-factor model. The dependent variable is candidate scores, and the 3 factors are raters, SPs, and stations. Table 2 gives the actual variance estimates for the candidate component and the other factors, expressed as both raw scores and percentages of variance.
Because the generalizability coefficient is the amount of variance explained by the facet of interest (candidate ability) divided by the total variance in the system (candidate, the other factors, and error), the generalizability coefficient and the percentage of total variance explained by differences in candidate ability are the same measure. Table 3 is a summary table of the percentage of variance explained by the single-facet generalizability model (candidate, rater, and error) for each day. It should be noted that the stations for each of the days were different cases, and station 1 from day 1 cannot be compared with station 1 on day 2. The percentage of variance explained for the 2-facet model (raters, SPs, and error) for day 1 was candidates = 59%, raters = 7%, SPs = 1%, and error = 33%. The percentage of variance explained for the 2-facet model for day 2 was candidates = 16%, raters = 3%, SPs = 73%, and error = 8%. The percentage of variance explained for the 3-facet model (raters, SPs, and stations) for both days was candidates = 42%, raters = 2%, SPs = 19%, stations = 9%, and error = 28%.

Table 3. Percentage of variance explained (by component, station, and day total)

            Station:   1    2    3    4    5    6    7    8    9   10   Total
Day 1
  Candidate           39   81   34   89   49   91   51   68   41   73     65
  Rater               34    1   25    0    0    2   28    0   21   15      6
  Errors              27   18   41   11   51    6   21   32   38   12     29
Day 2
  Candidate           25   80   61   63   52   94   66   48   98    3     42
  Rater               55    0   32   32   34    1   18   16    0    0      8
  Errors              20   20    7    5   14    5   16   36    2   97     50

Data are expressed as percentages.
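The short calculation below reworks the Table 2 variance components to make that equivalence explicit: dividing each component by their total reproduces the percentage column, and the candidate share (roughly 42%) is the proportion of total variance attributable to differences in candidate ability.

```python
# Variance-component estimates from Table 2 (3-facet model, both days).
components = {
    "candidate": 76.82,
    "rater": 2.84,
    "sp": 35.16,
    "station": 15.69,
    "error": 51.28,
}
total = sum(components.values())  # 181.79, matching the reported total of 181.78

for name, value in components.items():
    print(f"{name:>9}: {100 * value / total:6.2f}%")  # reproduces the percentage column

candidate_share = components["candidate"] / total
print(f"candidate share of total variance = {candidate_share:.2f}")  # ~0.42
```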
DISCUSSION
Examinations from days 1 and 2 exhibited high measures of internal consistency when all items were included in the estimation and fair measures when only station totals were included. The comparison between data sets I and II revealed that the floating raters had higher variability in their scoring than the rater pool (the SD for total scores was twice that of the rater pool). The small sample size for the floating raters contributes to this difference in SD. The small sample size (12 observations per station) and the higher variance of rater performance call into question the generalizability coefficients that were estimated. Day 2 was the most troublesome day, with station 10 being highly questionable. A comparison of the means and SDs for day 2 suggests that data set I for day 2 is not representative. Because the data set I data for day 2 are not representative, conclusions should not be drawn from the day 2 analysis. The analysis of day 2 is included for completeness of reporting.

D-studies could be used to imagine what the change in station reliability would be if more than 1 rater were in each station, especially in the problem stations with high rater variance. Alternatively, and as a guide to the examination administration, the percentage of variance explained by the various components could be used to flag those stations for which additional training may need to occur, or for which the scoring key allows too much interpretation. Rather than using the variance as a method of estimating changes in the examination reliability by increasing the number of examiners in each station, the information might better be used to flag those stations for which quality control measures need to be put into place. The 2-facet generalizability coefficient for day 1 was quite reasonable but very poor for day 2. The G-coefficient for each station, or the percentage of variance explained by candidate performance, accurately describes the quality of the measures for each station. Stations with high G-coefficients do not require remedial efforts. Stations with lower G-coefficients should be examined to determine whether the poor quality of candidate measures is due to statements in the scoring keys that are somewhat arbitrary, whether additional rater training is required, or whether the station should be scored by more than 1 rater.

This research project attempted to create a more naturalistic environment in which to apply G-theory to high-stakes OSCEs. Generalizability theory could also be applied to a more formative assessment with multiple raters, dozens of students, and 1 station/case. Such a reduced scenario would provide information about the performance of raters, but not the variance due to the case or SP. The scenario would also be more artificial because raters would know that their scoring was being compared with that of many others, unless raters scored by videotape. Further research could be performed in this area.
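As a rough sketch of the kind of D-study the discussion envisions, the function below treats a station's single-facet percentages from Table 3 as variance components and projects the candidate share of variance if that station were scored by more than 1 rater. Dividing both the rater and error components by the number of raters is the usual absolute-error adjustment for a raters-only facet; it is a simplification for illustration, not a calculation reported in this article.

```python
def projected_candidate_share(candidate: float, rater: float, error: float,
                              n_raters: int) -> float:
    """Project the candidate share of variance for a station scored by n_raters raters.

    Treats the single-facet percentages (candidate, rater, error) as variance
    components and divides the non-candidate components by the number of raters.
    """
    return candidate / (candidate + (rater + error) / n_raters)

# Day 1, station 1 from Table 3: candidate 39%, rater 34%, errors 27%.
for n in (1, 2, 3):
    print(n, round(projected_candidate_share(39, 34, 27, n), 2))
# 1 rater  -> 0.39 (the observed value)
# 2 raters -> 0.56
# 3 raters -> 0.66
```

Whether such a gain justifies a second rater, rather than spending the same resources on additional stations or on rater training, is the kind of trade-off highlighted by Newble and Swanson10 and by the quality-control suggestion above.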
CONCLUSIONS
The attempt to reduce the artificiality of the usual fully crossed generalizability study by including floating raters during the course of a high-stakes OSCE may at first glance have solved some of the problems of artificiality, but it created others because of the small sample size in each station. Day 1 yielded very useful information regarding the reliability of each station and the percentage of variance borne by candidates, raters, and SPs. This information could be used to determine which stations might benefit from having an extra rater, an increased training emphasis, or a review and clarification of the scoring keys. If day 2 were not reported, the usefulness of this pilot project could be overestimated. The day 2 data did, however, exist, and their failure to provide meaningful information reflects the need for careful selection and training of the floating raters and of the raters in general. The results of this research project point out the care that must be taken when applying G-theory to OSCEs and the guiding nature of its results. Expertise is required to ensure that the results are not misapplied.
Practical Applications
• G-theory can be used in a naturalistic environment to estimate sources of variance.
• Both raters and SPs contribute to error variance, but raters contribute much more (8% vs 2%).
• G-theory must be used as a guide and be cautiously applied.
ACKNOWLEDGMENT
This research project was supported and funded in part by the Canadian Chiropractic Examining Board, but does not necessarily reflect its policies or opinions.
REFERENCES
1. Harden RM, Stevenson M, Downie WW, Wilson GM. Assessment of clinical competence using objective structured examination. BMJ 1975;1:447-51.
2. Harden RM. What is an OSCE? Med Teach 1988;10:19-22.
3. Harden RM. Twelve tips for organizing an objective structured clinical examination (OSCE). Med Teach 1990;12:259-64.
4. Harasym PH, Crutcher R, Lawson D. Interviewer's leniency/stringency effects on candidate scores. Association of American Medical Colleges, Research in Medical Education Conference; 2004 Nov 5-10; Boston. Washington (DC): Association of American Medical Colleges; 2004. p. 36.
5. Lawson DM, Harasym P. Adjusting for examiners who are hawks and doves on a "high stake" OSCE. 22nd Annual Conference for Generalists in Medical Education; 2002 Nov 9-10; San Francisco, CA. Washington (DC): Generalists in Medical Education; 2002. p. 17.
6. Lawson DM. Use of the multifacet Rasch model to adjust for the error variance due to the examiner stringency/leniency effects in OSCEs. Calgary: University of Calgary, Faculty of Graduate Studies; 2003.
7. Lawson D, Harasym P. The contribution of standardized patients to error variance in candidate scores on a high stakes objective structured clinical examination. Educ Méd 2004;7:166.
8. Brennan RL. Introduction. Generalizability theory. New York: Springer-Verlag, Inc; 2001. p. 4.
9. VanLeeuwen DM, Barnes MD, Pase M. Generalizability theory: a unified approach to assessing the dependability (reliability) of measurements in the health sciences. J Outcome Meas 1998;2:302-25.
10. Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Med Educ 1988;22:325-34.
11. Lawson D, Harasym P. Using standardized candidates to identify stringent/lenient examiners and to correct for examiner and standardized patient performance. J Chiropr Educ 2003;17:19.
12. Wass V, Jones R, van der Vleuten C. Standardized or real patients to test clinical competence? The long case revisited. Med Educ 2001;35:321-5.