Radiography (2002) 8, 203–210 doi:10.1016/radi.2002.0386
Presence of bias in radiographer plain film reading performance studies

S. Brealey, BSc, Research Fellow*, A. J. Scally, BSc MSc, Lecturer†, and N. B. Thomas, BSc, MBBS, FRCR, Consultant Radiologist‡

*Department of Health Sciences, University of York, York YO1 5DD, U.K.; †Division of Radiography, University of Bradford, Bradford BD5 0BB, U.K.; ‡X-ray Department A, North Manchester General Hospital, Manchester M8 5RB, U.K.

KEY WORDS: image interpretation; biases; skill mix; study design.

(Received 2 January 2002; revised 4 July 2002; accepted 28 September 2002)
Purpose: To raise awareness of the frequency of bias that can affect the quality of radiographer plain film reading performance studies.

Methods: Studies that assessed radiographer(s) plain film reading performance were located by searching electronic databases and grey literature, hand-searching journals, personal communication and scanning reference lists. Thirty studies were judged eligible from all data sources.

Results: A one-way analysis of variance (ANOVA) demonstrated no statistically significant difference (P=0.25) in the mean proportion of biases present in diagnostic accuracy (0.37), performance (0.42) and outcome (0.44) study designs. Pearson's correlation coefficient showed no statistically significant linear association between the proportion of biases present for the three different study designs and the year that the study was performed. The frequency of biases in film and observer selection and in application of the reference standard was quite low. In contrast, many biases were present concerning independence of film reporting and comparison of reports for concordance.

Conclusions: The findings indicate variation in the presence of bias in radiographer plain film reading performance studies. The careful consideration of bias is an essential component of study quality and hence of the validity of the evidence base used to underpin radiographic reporting policy.

© 2002 The College of Radiographers. Published by Elsevier Science Ltd. All rights reserved.
INTRODUCTION

The NHS and Community Care Act (1990) produced a climate in which the traditional methods of delivering health care were challenged, resulting in the blurring of professional boundaries [1]. Since then there has been an increase in the reporting of Accident and Emergency (A&E) films by radiographers [2, 3], which is affecting the delivery of services on a national scale.

When assessing radiographers' film reading performance it is important to be aware of the main threats to study validity. These studies involve observers (e.g. radiographers) interpreting a sample of films. An arbiter (i.e. a health care professional) then judges whether the reports made by the radiographers are concordant with a reference standard (e.g. consultant radiologist). The resulting data are then used to calculate performance statistics such as sensitivity and specificity. However, the environment in which a radiographer is assessed (i.e. controlled conditions or clinical practice) can affect, for example, film selection, choice of reference standard and method of analysis [4]. Table 1 describes three different plain film reading performance study designs. Those that assess observers reporting in controlled conditions, such as radiographers under examination conditions, are called diagnostic accuracy studies.
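As a minimal illustration of how such statistics are derived (the counts below are hypothetical and not taken from any of the studies reviewed here), sensitivity and specificity follow directly from a 2 x 2 table of observer reports against the reference standard:

```python
# Minimal illustration (hypothetical counts, not data from the review): once an
# arbiter has classified each report against the reference standard, film reading
# performance reduces to a 2x2 table of true/false positives and negatives.
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from a 2x2 concordance table."""
    sensitivity = tp / (tp + fn)  # abnormal films correctly reported as abnormal
    specificity = tn / (tn + fp)  # normal films correctly reported as normal
    return sensitivity, specificity

# Hypothetical example: 100 films, 40 abnormal according to the reference standard.
sens, spec = sensitivity_specificity(tp=36, fp=5, fn=4, tn=55)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")  # 0.90, 0.92
```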
Table 1  Plain film reading performance study designs

Diagnostic accuracy
  Description: To assess the film reading performance of one (or more) groups of observers in controlled (ideal) conditions.
  Example: Radiographers reporting on a validated bank of films as part of a postgraduate course.

Diagnostic performance
  Description: To assess the film reading performance of one group of observers during clinical practice.
  Example: An audit of radiographers' film reading performance.

Diagnostic outcome
  Description: To assess the film reading performance of two (or more) groups of observers during clinical practice.
  Example: A comparison of radiographers' and casualty officers' film reading performance.

In diagnostic accuracy studies a mix of normal and abnormal films is carefully selected, with the abnormalities covering a range of pathologies, body areas and degrees of conspicuousness. A robust reference standard, such as a double- or triple-blind consultant radiological report, is developed against which the observers' reports are compared. Diagnostic performance studies monitor the progress of one group of observers interpreting a consecutive series of films during clinical practice, compared perhaps with a single consultant radiologist as the reference standard. Studies that compare radiographers with other professional groups such as casualty officers are called diagnostic outcome studies, because the assumption is that the group with the highest reporting accuracy will contribute more to improving patient outcome.

Biases that are a threat to study validity when assessing diagnostic test performance [5–7] have been applied to film reading performance studies as five subgroups: film selection; observer selection; application of the reference standard; measurement of results; and independence of interpretations. A previous paper describes in detail what these biases are, how they occur, how they can affect estimates of film reading performance and how they can be minimized [8]. This paper serves to further raise awareness of these biases in the context of radiographer plain film reading performance studies by presenting the frequency with which they occur. This involves investigating changes in the presence of bias over time and heterogeneity of bias in the different study designs. For the more prevalent biases, an explanation of why they occurred in a study and the importance of avoiding or minimizing them is discussed.
METHODS

Data sources

Four methods were used to identify eligible studies: electronic sources, hand-searching journals, personal communication and scanning reference lists. These methods have been described in more detail elsewhere [9].
Study selection

Studies were eligible for inclusion if they assessed radiographer(s) plain film reading performance compared with a reference standard and included appropriate statistics (e.g. sensitivity, specificity). Studies were excluded on the following criteria: not performed during the specified time frame, non-English language, non-UK, insufficient data to apply the selection criteria, case studies, visual search strategy studies and abstracts later published as papers. To minimize duplication of data when studies were re-published, only the original paper was used. Methods for minimizing 'reviewer bias' in the selection of studies have also been discussed previously [9].
Data extraction

Thirty studies were judged eligible. Table 2 lists the questions used to extract data. When a bias was not present it was scored as (A), when unclear as to its presence as (B), when present as (C) and when not applicable as (N/A). Some biases were only applicable to certain study designs. Spectrum and film-filtering bias were only relevant to diagnostic accuracy studies. Centripetal, popularity, population, film-filtering and film-selection bias were only relevant to diagnostic performance and outcome studies.
Table 2  Questions asked when assessing presence of bias
Film selection
+ Was an attempt made to include a case mix based on criteria like prevalence of disease, severity and range of disease type, and pertinent body areas (spectrum bias)?
+ Were criteria stated for those films eligible for inclusion in the study (film-filtering bias)?
+ Is the establishment(s) where the study was undertaken stated (centripetal bias)?
+ Is the establishment from where the patients were referred stated (popularity bias)?
+ Was a series of films over a suitable time period included, or a valid random sample of films selected (population bias)?
+ Were criteria stated for those films eligible for inclusion in the study (film-filtering bias)?
+ Were all eligible films interpreted (film-selection bias)?

Observer selection
+ Was an appropriate group of observers selected (observer-cohort bias)?
+ Were groups of observers matched according to relevant characteristics, e.g. number of years' experience in the profession/relevant speciality (observer-cohort-comparator bias)?

Application of the reference standard
+ Were all the films interpreted by the observers also interpreted by the reference standard (verification bias)?
+ Was the observers' report used to decide whether the reference standard is applied (work-up bias)?
+ Was the observers' report used to generate the reference standard (incorporation bias)?

Measurement of results
+ Was appropriate radiological review used (disease-progression bias)?
+ Are equivocal reports included (indeterminate results)?
+ Are all films and clinical details available for verification (loss to follow-up bias)?
+ Was a subsample of the same films interpreted by different observers (inter-observer variability)?
+ Was a subsample of the films re-interpreted by the observers at a later date (intra-observer variability)?
+ Was a subsample of the reports compared by independent arbiters (inter-arbiter variability)?
+ Was a subsample of reports compared by the same arbiter at a later date (intra-arbiter variability)?

Independence of interpretations
+ Were the observers blind to the reference standard report (observer-review bias)?
+ Was the reference standard blind to the observers' reports (reference-standard-review bias)?
+ Did all observers interpret the films independently (observer bias)?
+ Did all observers interpret the same or a similar set of films (observer-comparator bias)?
+ Did all observers only have access to plain films (co-image bias)?
+ Was the arbiter one of the observers or the reference standard (arbiter-review bias)?
+ Was the arbiter blind to whether the report was made by an observer or the reference standard (arbiter bias)?
+ Did the arbiter judge whether reports agreed without access to the films (film-access bias)?
+ Did both cohorts interpret the same films independently (cohort-comparator bias)?
+ Did both cohorts have similar access to other films (co-image-comparator bias)?
+ Was the arbiter blind to which observers were responsible for the different reports (arbiter-comparator bias)?
Finally, cohort-comparator, co-image-comparator and arbiter-comparator bias were not applicable to diagnostic performance studies.

Data extraction was completed by personal communication for ten studies. This involved SB visiting the investigator responsible or completing the checklist by telephone. For the remaining studies the other two reviewers (AS and NT) independently appraised ten studies each. All reviewers were familiar with many of the studies, so no blinding to publication details was attempted. Any discordance between reviewers was resolved by discussion. Compared with AS and NT respectively, SB had 72% and 71% agreement and moderate kappa scores of 0.53 [95% CI: 0.44, 0.62] and 0.58 [95% CI: 0.50, 0.66].
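For readers unfamiliar with these agreement statistics, the sketch below shows how percentage agreement and Cohen's kappa could be computed for two reviewers scoring checklist items as A/B/C/N/A. The ratings are invented for illustration, and the confidence intervals quoted above would require an additional step (e.g. the standard error formula for kappa, or bootstrapping).

```python
# Minimal sketch (not the authors' code): percentage agreement and Cohen's kappa
# for two reviewers scoring each checklist item as A, B, C or N/A.
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of items on which the two reviewers gave the same score."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two reviewers."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    categories = set(r1) | set(r2)
    expected = sum((c1[c] / n) * (c2[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings for ten checklist items from two reviewers
# (as_ has a trailing underscore because "as" is a Python keyword).
sb  = ["A", "A", "C", "B", "A", "C", "C", "A", "N/A", "B"]
as_ = ["A", "C", "C", "B", "A", "C", "A", "A", "N/A", "B"]
print(f"agreement = {percent_agreement(sb, as_):.2f}, "
      f"kappa = {cohens_kappa(sb, as_):.2f}")  # 0.80 and about 0.71 here
```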
RESULTS

Figure 1 illustrates graphically the proportion of biases present for diagnostic accuracy (n=11), performance (n=11) and outcome (n=8) study designs by the year the study was performed, to demonstrate changes in the presence of bias in such studies over time. Although there is some variation in the mean proportion of biases present in diagnostic accuracy (0.37), performance (0.42) and outcome (0.44) studies, the one-way ANOVA in Table 3 demonstrates that this was not statistically significant (P=0.25). Pearson's correlation coefficient also showed no statistically significant linear association between the year the study was performed and the proportion of biases present in diagnostic accuracy (r=0.02; P=0.96), performance (r=0.27; P=0.43) and outcome (r=0.27; P=0.52) studies. This is evidence that there is no overall difference in the presence of bias between study designs and that there has been no change over time in the presence of bias in such studies. However, the small number of studies for each design meant that all tests had low power. Tables 4 to 6 demonstrate variation in the presence of these biases within and between study designs. Heterogeneity in the presence of individual biases for each study design, explanations for this and solutions to eliminate bias are now discussed.

Figure 1  Scattergram of presence of bias for different study designs by year of study performed.
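Analyses of this form can be reproduced in outline with standard statistical software. The sketch below uses SciPy with invented per-study bias proportions and years (not the actual values underlying Table 3 or Figure 1) purely to show the shape of the one-way ANOVA and Pearson's correlation.

```python
# Sketch of the type of analysis reported above (illustrative data only, not the
# review's actual per-study bias proportions or study years).
from scipy import stats

# Hypothetical proportion of applicable biases present in each study, by design.
accuracy    = [0.30, 0.42, 0.35, 0.28, 0.45, 0.33, 0.40, 0.38, 0.36, 0.41, 0.39]
performance = [0.44, 0.39, 0.46, 0.41, 0.38, 0.47, 0.40, 0.43, 0.42, 0.45, 0.37]
outcome     = [0.46, 0.41, 0.48, 0.43, 0.40, 0.47, 0.44, 0.42]

# One-way ANOVA: does the mean proportion of biases differ between designs?
f_stat, p_anova = stats.f_oneway(accuracy, performance, outcome)
print(f"ANOVA: F = {f_stat:.3f}, P = {p_anova:.3f}")

# Pearson correlation: association between year of study and proportion of
# biases present within one design (years below are invented for the example).
years = [1985, 1987, 1990, 1992, 1993, 1995, 1996, 1997, 1998, 1999, 2000]
r, p_corr = stats.pearsonr(years, accuracy)
print(f"Pearson: r = {r:.2f}, P = {p_corr:.2f}")
```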
DISCUSSION

Diagnostic accuracy studies

There were only eleven diagnostic accuracy studies. These studies involve the careful selection of films to assess radiographers' potential to report and, as Table 4 shows, spectrum bias was not present. Film-filtering bias was only present in two studies. Except for one study there was neither observer-cohort bias nor observer-cohort-comparator bias. Similarly, because the aim was to measure the true film reading accuracy of radiographers, it was important that an independent reference standard report was generated for each film. Indeed, Table 4 shows that verification, work-up and incorporation bias were absent.

In the subgroup measurement of results, intra- and inter-observer/arbiter-variability biases were present in all but one study. These biases concern the reproducibility of decision-making by the same observer or arbiter on different occasions, or by different observers or arbiters at the same time. Consciously varying one's confidence threshold allows an observer to achieve almost any estimate of sensitivity or specificity. Similarly, variation in decision-making by the same arbiter or by different arbiters could affect indices of performance. However, because diagnostic accuracy studies are conducted under strict conditions, it is unlikely that radiographers' performance would vary dramatically on different occasions. Furthermore, because films have been carefully selected on the criteria described above, there are probably no equivocal reports to complicate the process of arbitration. Therefore, although the potential for observer/arbiter variability should ideally be assessed, it is of less importance for these studies.
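The effect of a shifting confidence threshold can be made concrete with a toy example (hypothetical confidence scores, not study data): the same set of reports yields very different sensitivity and specificity depending on where the observer sets the threshold for calling a film abnormal.

```python
# Toy illustration (hypothetical scores): the same observer, applying different
# confidence thresholds to identical films, yields very different sensitivity
# and specificity estimates.
def sens_spec_at_threshold(scores, truth, threshold):
    """Call a film abnormal when the observer's confidence >= threshold."""
    calls = [s >= threshold for s in scores]
    tp = sum(c and t for c, t in zip(calls, truth))
    tn = sum((not c) and (not t) for c, t in zip(calls, truth))
    fn = sum((not c) and t for c, t in zip(calls, truth))
    fp = sum(c and (not t) for c, t in zip(calls, truth))
    return tp / (tp + fn), tn / (tn + fp)

truth  = [True, True, True, True, False, False, False, False]  # reference standard
scores = [0.9, 0.7, 0.55, 0.35, 0.6, 0.4, 0.2, 0.1]            # observer confidence

for thr in (0.3, 0.5, 0.8):
    sens, spec = sens_spec_at_threshold(scores, truth, thr)
    print(f"threshold {thr}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```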
Table 3  One-way analysis of variance in the presence of bias

Source of variation    df    Sums of squares    Mean squares    F        P
Between groups          2    0.02569            0.01285         1.448    0.253
Within groups          27    0.239              0.00887
Total                  29
Table 4  Number of diagnostic accuracy studies in which a bias was present
(A = bias not present; B = unclear; C = bias present; N/A = not applicable)

Bias                            A    B    C    N/A
Spectrum                       11    0    0     0
Film filtering                  8    2    1     0
Observer cohort                10    0    1     0
Observer-cohort comparator      3    0    0     8
Verification                   11    0    0     0
Work-up                        11    0    0     0
Incorporation                  11    0    0     0
Disease progression             0    0    0    11
Indeterminate results          10    1    0     0
Loss to follow-up               0    0    0    11
Inter-observer variability      0    0   10     1
Intra-observer variability      0    0   11     0
Inter-arbiter variability       0    0   11     0
Intra-arbiter variability       1    0    9     1
Observer review                11    0    0     0
Reference standard review      11    0    0     0
Observer                        9    1    1     0
Observer comparator             3    0    0     8
Co-image                       10    1    0     0
Arbiter review                  9    1    1     0
Arbiter                         0    3    8     0
Film access                     1    9    1     0
Cohort comparator               7    0    0     4
Co-image comparator             7    0    0     4
Arbiter comparator              1    0    5     5
For independence of interpretations, biases arising from the observers and the reference standard not reporting films blind to each other's reports (i.e. observer-review bias and reference-standard-review bias), and from observers not reporting the same or a similar set of films independently (i.e. observer bias and observer-comparator bias), were absent. This is because the reference standard reports were generated prior to the observers reporting, and the study objective was to assess the film reading accuracy of individual observers. In contrast, other biases such as arbiter, film-access and arbiter-comparator bias could not be discounted in nearly every study. Arbiter bias occurs when the arbiter is aware of whether the report was made by an observer or the reference standard. Arbiter-comparator bias occurs if two (or more) cohorts are compared and the arbiter is aware of which cohort made the reports. Knowledge of who was responsible for which report, particularly when more than one cohort was being assessed, could influence the arbiter's judgement and subsequently distort the indices of performance.
Table 5  Number of diagnostic performance studies in which a bias was present
(A = bias not present; B = unclear; C = bias present; N/A = not applicable)

Bias                            A    B    C    N/A
Centripetal                    10    0    1     0
Popularity                     11    0    0     0
Population                      9    0    2     0
Film filtering                  9    0    2     0
Film selection                 10    0    1     0
Observer cohort                11    0    0     0
Observer-cohort comparator     11    0    0     0
Verification                   10    0    1     0
Work-up                        10    0    1     0
Incorporation                   7    0    4     0
Disease progression            11    0    0     0
Indeterminate results           6    1    2     2
Loss to follow-up               8    3    0     0
Inter-observer variability      0    0   11     0
Intra-observer variability      0    0    9     2
Inter-arbiter variability       0    0   10     1
Intra-arbiter variability       0    0    9     2
Observer review                10    0    1     0
Reference standard review       4    0    7     0
Observer                        7    3    0     1
Observer comparator             1    1    0     9
Co-image                       11    0    0     0
Arbiter review                  0    3    8     0
Arbiter                         0    1   10     0
Film access                     2    3    6     0
These biases could easily be avoided by blinding the arbiter to who made the report. Film-access bias is present if the arbiter has access to the sample of films when judging whether the reports being compared are concordant. The arbiter's own interpretation of the films can inappropriately influence the decision about whether reports agree. Furthermore, the arbiter's judgement could be affected by an incorrect report when viewing the films [8].
Diagnostic performance studies

Similar to diagnostic accuracy studies, Table 5 shows that biases were generally absent in film or observer selection and in the application of the reference standard, except for incorporation bias, which was present in four of eleven studies. This bias occurred because, when a radiographer's report disagreed with a consultant radiologist's, a second consultant radiologist reported the film.
Table 6  Number of diagnostic outcome studies in which a bias was present
(A = bias not present; B = unclear; C = bias present; N/A = not applicable)

Bias                            A    B    C    N/A
Centripetal                     8    0    0     0
Popularity                      8    0    0     0
Population                      5    0    3     0
Film filtering                  7    0    1     0
Film selection                  4    0    4     0
Observer cohort                 8    0    0     0
Observer-cohort comparator      0    0    0     8
Verification                    5    0    3     0
Work-up                         5    0    3     0
Incorporation                   7    1    0     0
Disease progression             0    0    0     8
Indeterminate results           6    2    0     0
Loss to follow-up               5    1    1     1
Inter-observer variability      0    1    7     0
Intra-observer variability      1    0    7     0
Inter-arbiter variability       0    0    8     0
Intra-arbiter variability       0    0    8     0
Observer review                 8    0    0     0
Reference standard review       2    1    5     0
Observer                        1    3    0     4
Observer comparator             1    0    0     7
Co-image                        8    0    0     0
Arbiter review                  2    2    4     0
Arbiter                         0    2    6     0
Film access                     0    2    6     0
Cohort comparator               3    1    1     3
Co-image comparator             8    0    0     0
Arbiter comparator              0    2    6     0
If the second radiologist's report agreed with the radiographer and not the first radiologist, the radiographer's report was judged correct. Thus the status of the reference standard was partly affected by the radiographer's report. If there is such uncertainty in the validity of the reference standard report, it might be more useful not to measure accuracy but to assess radiographers' concordance with a consultant radiologist's report (i.e. inter-observer variability). During clinical practice various factors can affect performance, such as disturbances, time of day, time constraints and length of reporting sessions [10–12]. Therefore, it is also useful to demonstrate that a radiographer can report consistently (i.e. intra-observer variability). Diagnostic performance studies also involve radiographers reporting on a valid sample of films from clinical practice, which may include inferior quality films or cases with differential diagnoses. This means an assessment of arbiter variability is necessary to ensure that decisions about discordance between reports are consistent. However, Table 5 shows that, when eligible, all studies were susceptible to intra- or inter-observer/arbiter-variability bias.

Independence is important, as observers' performance can be dramatically influenced by prior knowledge of another report [13, 14]. The degree to which the assessment is intended to be pragmatic determines whether radiographers should have access to previous relevant reports, other colleagues' advice, or even discussion with patients. In diagnostic performance studies, which aim to measure the performance of radiographers in clinical practice, independence in film interpretation is relevant to most studies. However, reference-standard-review bias was present in seven of eleven studies. In three studies the reference standard had access to the radiographers' reports, and in one study a red dot system affected the reference standard report. Furthermore, arbiter-review bias was present in eight of eleven studies, as the radiographer or the reference standard was also the arbiter. In ten of eleven studies arbiter bias was also present, as the arbiter was not blind to who made the report, and in six of eleven studies film-access bias was present, as the arbiter had access to the films when judging reports for concordance.
Diagnostic outcome studies

Unlike the previous two study designs, biases concerning film selection and application of the reference standard were present in a few studies. Table 6 shows that in three of eight studies population bias was present due to, for example, the limited time frame during which films were reported. Moreover, film-selection bias occurred in four of eight studies, as observers did not interpret all eligible films. In two studies observers left some proformas incomplete, and in another two studies films were not dotted as part of a 'red dot' system because of uncertainty, lack of time or forgetfulness. No study adequately demonstrated whether eligible films that were interpreted differed from those that were not in terms of demographic (e.g. age, gender) or clinical (e.g. body area) variables.
In three of eight studies verification bias was present, as the reference standard did not interpret all films reported by the observers. Indeed, the reference standard only interpreted films when the reports made by the two groups (i.e. radiographers versus radiologists) were discordant. Hence, work-up bias was also present. The statistical consequence is an overestimation of the two professions' film reading performance, as omission of the reference standard denies the potential for identifying other false negatives/positives. These biases may be avoided if the observers' reports are not known before the reference standard is applied. It could be argued that when studies compare trained radiographers with radiologists, the objective is not to measure the true film reading accuracy of the two groups but the degree of concordance. Therefore, verification and/or work-up bias is less important than when present in diagnostic accuracy or performance studies. However, similar to the other designs, there was generally no assessment of intra- or inter-observer-variability bias. Furthermore, despite the need to assess intra- or inter-arbiter-variability bias for the same reasons described for diagnostic performance studies, this was not done. Finally, most of these studies were susceptible to biases involving independent decision-making. In five of eight studies reference-standard-review bias was present, as the reference standard had prior knowledge of the observers' reports. In six of eight studies the arbiter was also either an observer or the reference standard (arbiter bias), knew which report was made by whom (arbiter-comparator bias) and had access to the films when judging reports for concordance (film-access bias). These biases were sometimes a consequence of the reference standard only interpreting films after it had been decided that the reports made by the radiographers and radiologists were discordant.
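The overestimation produced by verifying only discordant reports can be illustrated with a small simulation (invented prevalence and error rates, not data from any of the reviewed studies): errors shared by both groups are never re-read and so remain invisible, inflating apparent accuracy above true accuracy.

```python
# Toy simulation (invented numbers) of the verification/work-up bias described
# above: a third reading is obtained only when the two groups' reports disagree,
# so films on which both groups make the same mistake are never flagged as errors.
import random

random.seed(1)
N = 1000
films = [random.random() < 0.2 for _ in range(N)]  # True = truly abnormal

def report(truth, error_rate=0.15):
    """An observer's report: correct except for a fixed error rate."""
    return truth if random.random() > error_rate else not truth

radiographer = [report(t) for t in films]
radiologist  = [report(t) for t in films]

# True accuracy of the radiographers against the real state of each film.
true_acc = sum(r == t for r, t in zip(radiographer, films)) / N

# Apparent accuracy when errors are only detectable on discordant films:
# concordant reports are assumed correct because no third reading is triggered.
detected_errors = sum(
    r != t for r, rad, t in zip(radiographer, radiologist, films) if r != rad
)
apparent_acc = (N - detected_errors) / N

print(f"true accuracy     = {true_acc:.3f}")
print(f"apparent accuracy = {apparent_acc:.3f}")  # higher: shared errors stay hidden
```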
CONCLUSIONS

A combination of biases and methodological factors influences the quality of film reading performance studies. The results of this review demonstrate some variation in the presence of individual biases within and between study designs. In general, studies selected a valid sample of films and observers. However, biases sometimes occurred in the application of the reference standard, which suggests that in these studies radiographers' film reading performance may have been overestimated. Studies in all groups did not consider intra- or inter-observer/arbiter-variability bias. Demonstrating that observers and arbiters can produce reliable decisions is desirable. However, this is not so important if it is reasonable to assume that inconsistency in decision-making is random, rather than a systematic bias being present that could distort performance characteristics. When applicable, independence-of-interpretation biases were also frequently present in studies conducted in clinical practice. This is probably the most important type of bias that can distort film reading performance estimates, and yet it is easy to eliminate by blinding people to the information that can consciously or subconsciously affect their decision-making. The presence of these biases in studies does not necessarily mean that they are invalid, but their results should be interpreted with some caution. Whether studies aim to assess radiographers' reporting accuracy under controlled conditions (diagnostic accuracy), to monitor their performance during training (diagnostic performance), or to substitute for other professionals' reporting role (diagnostic outcome), they should be conducted with as much scientific rigour as is feasible. Improving awareness of these biases so that they can be eliminated should help improve study quality and thus the evidence used to inform practice, policy and research.

ACKNOWLEDGEMENTS

We are profoundly grateful for the assistance of the authors of the studies included in the review, without whose help this paper would not have been possible, and also for the comments made by the two referees.
REFERENCES

1. The College of Radiographers. Role Development in Radiography. London: College of Radiographers, 1996.
2. Price RC, Le Masurier SB, High L, Miller LR. Changing times: a national survey of extended roles in diagnostic radiography. Br J Radiol 1999; 72(S): 7.
3. Paterson AM. Role Development—Towards 2000: A Survey of Role Developments in Radiography. London: College of Radiographers, 1995.
4. Brealey S. Measuring the effects of image interpretation: an evaluative framework. Clin Radiol 2001; 56: 341–7.
5. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987; 6: 411–23.
6. Irwig L, Tosteson ANA, Gatsonis C et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994; 120: 667–76.
7. Jaeschke R, Guyatt G, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA 1994; 271: 389–91.
8. Brealey S, Scally AJ. Bias in plain film reading performance studies. Br J Radiol 2001; 74: 307–16.
9. Brealey S, Scally AJ, Thomas NB. Methodological standards in radiographer plain film reading performance studies. Br J Radiol 2002; 75: 107–13.
10. Gale AG, Murray D, Millar K, Worthington BS. Circadian variation in radiology. In: Gale AG, Johnson F, eds. Theoretical and Applied Aspects of Eye Movement Research. Amsterdam: North Holland, 1984.
11. Bryan S, Weatherburn G, Roddie M, Keen J, Muris N, Buxton MJ. Explaining variation in radiologists' reporting times. Br J Radiol 1995; 68: 854–61.
12. Robinson PJA, Fletcher JM. Clinical coding in radiology. Imaging 1994; 6: 133–42.
13. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine, 2nd edn. London: Little, Brown and Company, 1991.
14. Aideyan UO, Berbaum K, Smith WL. Influence of prior radiological information on the interpretation of radiographic examinations. Acad Radiol 1995; 2: 205–8.