Journal of Clinical Epidemiology 58 (2005) 649–650
COMMENTARY
The problem of imperfect reference standards
S.D. Walter
Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario L8N 3Z5, Canada
Accepted 10 November 2004
Bertrand et al. [1] are to be congratulated for their interesting exploration of the properties of the Hui–Walter (HW) method in comparison to analyses using a reference standard. Ideally, such a comparison would be performed using a reference standard that is completely error free. One would then have direct evidence on the accuracy of the HW estimates, which are based on a latent class analysis and do not require a gold standard measurement.

Unfortunately, Bertrand et al. used pathology data as the reference standard in their example and, as their own citations have indicated, pathology data can be far from perfect. Kappa values of 0.05 to 0.74 are cited, which suggest that pathology data have only reasonable reliability, at best. The lower end of this range represents agreement among pathologists that is barely better than chance: with κ defined as (p_o − p_e)/(1 − p_e), where p_o is the observed and p_e the chance-expected agreement, a κ of 0.05 means the observed agreement exceeds chance by only 5% of the maximum possible improvement. By adopting pathology as a supposedly trustworthy reference method, in spite of the obvious problems posed by its low reliability, one will necessarily ascribe potential failings to competing methods, such as the latent class method, even though some of the discrepancies between methods could well be due to the inaccuracy of the reference standard.

Despite the lack of validation for their reference standard, the authors are nevertheless quite prepared to make subjective judgments concerning its likely error rates, and accordingly they deem false-positive rates of 34% or false-negative rates of 42% to be “unlikely in practice” for pathology. Such claims seem to fly in the face of what the literature is telling us.

Bertrand et al.’s implication that using a reference standard is, in some sense, better than any alternative approach colors their interpretation of the results. For instance, in their section dealing with possible reference test errors, they conclude that “the pathology findings defined a markedly different classification than the latent classification.” A reasonable alternative interpretation is that the classifications are in fact much the same, but the accuracy of observation is poor in either or both of the radiology and pathology data.

In further investigations to elucidate the differences between the reference diagnosis and the latent classes, Bertrand et al. calculated values of latent (L) sensitivity and specificity, parameters that were estimated in subgroups of the data with positive and negative pathology findings, respectively.
Once again, this approach demonstrates the authors’ subtle bias in favor of pathology, and furthermore it sets up a situation in which the HW method is likely to perform badly. To see this, imagine that there is, in truth, moderate or good agreement between the reference diagnosis and the latent classes. Restricting the analysis to the subgroup of subjects with positive pathology findings would then imply that the prevalence of the positive latent category is very high; similarly, restricting it to subjects with negative pathology would imply a very low prevalence of the positive latent classification. In fact, in the extreme case where the reference and latent categories agree completely, one would not be able to demonstrate that fact by subgroup analyses, because the latent class method would fail due to degeneracy in the data (all the subjects would then be truly positive or truly negative in a given subgroup). Even if the agreement is good but not perfect, the latent class model will be unstable if only a small minority of subjects have the positive (or negative) latent classification.

A common misunderstanding about the latent class model is that it necessarily requires an assumption of conditional independence between the observations. Whether this assumption is actually required depends on the format of the available data. As Bertrand et al. correctly describe, the assumption is needed if there are only three observers, because the model is then saturated and all of the available degrees of freedom are taken up with estimating the prevalence, sensitivity, and specificity values. If additional observers are available (as in Bertrand et al.’s example, where there are four), then further degrees of freedom remain. Bertrand et al. used the extra degrees of freedom in their data to test the assumption of conditional independence with an overall goodness-of-fit test, and found no convincing evidence to reject it. They recognized that this test has potentially low power, so an alternative that could be pursued is to fit specific additional parameters representing the correlations between observers. With binary outcomes and four observers, six extra degrees of freedom are available, which would allow all six pairwise correlations between observers to be included before the saturated model is reached.
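For concreteness, the degrees-of-freedom accounting behind that statement can be sketched as follows (in my own notation, not necessarily that of Bertrand et al.). Under the two-class model with conditional independence, the probability of a response pattern $(y_1, \dots, y_4)$, where $y_j = 1$ denotes a positive reading by observer $j$, is
\[
P(y_1, \dots, y_4) = \pi \prod_{j=1}^{4} S_j^{y_j} (1 - S_j)^{1 - y_j} + (1 - \pi) \prod_{j=1}^{4} (1 - C_j)^{y_j} C_j^{1 - y_j},
\]
where $\pi$ is the prevalence and $S_j$ and $C_j$ are the sensitivity and specificity of observer $j$. The $2^4 = 16$ possible response patterns supply $16 - 1 = 15$ degrees of freedom, while the model uses $1 + 4 + 4 = 9$ parameters, leaving $15 - 9 = 6$, exactly the number of pairwise dependence terms, $\binom{4}{2} = 6$.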
Fitting such correlation parameters would have the advantage of providing quantitative estimates of the degree of dependence between observers, rather than simply testing the null hypothesis of independence with possibly low power.

It should be noted that the R method (based on use of the pathology data as a reference standard) proposed by Bertrand et al. does not explicitly recognize or take into account possible correlations among observers. They claim that there was “complete independence” in the radiology and pathology data. However, a clear distinction must be made between the observers working separately and in physical isolation, and statistical dependence between the errors in their responses. Even though the pathologists may work separately from one another, and even independently of the radiologists, there may still be correlated errors in the data. Such correlations can arise for a number of reasons, including heterogeneity in the material being examined (e.g., a spectrum of easily recognized cases vs. more difficult, borderline cases), or similar reading behavior arising from common clinical training or experience.

Bertrand et al. use a method due to Gart and Buck [2,3] to correct the values of R sensitivity and R specificity. Somewhat strangely, their use of the Gart and Buck method relies on obtaining an accurate estimate of R prevalence (based on the positive pathology findings); the R sensitivity and R specificity are then corrected using assumed (and unverifiable) values of the pathology sensitivity and specificity. Given that the observed prevalence in the pathology data will itself be biased by measurement error (a feature of the data that is itself under investigation), this appears to be a rather circular argument. It seems problematic to accept the accuracy of one part of the pathology data (the prevalence, whose empirical estimate is known to be biased) and then use it to make corrections to other summaries of the data (the sensitivity and specificity), which are assumed to be in error.
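To make the structure of such a correction concrete, the following sketch (my own notation and implementation; I have not verified that it reproduces the exact formulas applied by Bertrand et al.) shows one correction of this general type, in which assumed values of the reference sensitivity and specificity are used, under conditional independence of the errors, to back-calculate the prevalence and the accuracy of the index reading:

# A minimal sketch of a Gart-and-Buck-style correction: given the observed 2x2
# cross-classification of an index reading T against an imperfect reference R,
# and *assumed* values for the reference's sensitivity and specificity,
# back-calculate the prevalence and the index reading's sensitivity and
# specificity, assuming the errors of T and R are conditionally independent.

def corrected_accuracy(n_pp, n_pn, n_np, n_nn, se_ref, sp_ref):
    """n_pp = #(T+, R+), n_pn = #(T+, R-), n_np = #(T-, R+), n_nn = #(T-, R-);
    se_ref, sp_ref = assumed sensitivity and specificity of the reference."""
    n = n_pp + n_pn + n_np + n_nn
    youden_ref = se_ref + sp_ref - 1.0  # must be positive for the correction
    if youden_ref <= 0:
        raise ValueError("reference must be better than chance")

    # Corrected prevalence, derived from the apparent reference-positive proportion
    q = (n_pp + n_np) / n
    prev = (q + sp_ref - 1.0) / youden_ref

    # Corrected sensitivity and specificity of the index reading
    p_pp, p_pn, p_np, p_nn = n_pp / n, n_pn / n, n_np / n, n_nn / n
    se_t = (sp_ref * p_pp - (1.0 - sp_ref) * p_pn) / (prev * youden_ref)
    sp_t = (se_ref * p_nn - (1.0 - se_ref) * p_np) / ((1.0 - prev) * youden_ref)
    return prev, se_t, sp_t

# Illustrative (hypothetical) counts and assumed reference error rates
print(corrected_accuracy(n_pp=40, n_pn=10, n_np=5, n_nn=45, se_ref=0.90, sp_ref=0.95))

With sparse cells, or with optimistic assumptions about the reference, such corrections can easily yield estimates outside the admissible range; this echoes the small-sample concerns discussed next.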
One message from Bertrand et al.’s results is the strong possibility of small-sample bias in the HW method. I believe it is well known that the maximum likelihood approach used in the HW method will often perform badly in small-sample situations, especially if a substantial number of data cells have very small or zero frequencies, and especially if boundaries of the parameter space are encountered. Bertrand et al.’s findings reinforce the need for caution in interpreting maximum likelihood estimates and their standard errors in such situations. An area for future research would be to devise exact methods of inference for this kind of data, rather than relying on the approximate large-sample behavior of the usual maximum likelihood estimates.

It would also be useful to repeat Bertrand et al.’s investigation with other data sets. Although many of their conclusions may be valid for the particular radiology and pathology data at hand, it is less clear whether they can be generalized to other data of the same type or to other clinical applications. Further experience is required with somewhat larger samples, in different clinical areas, and preferably in situations where the reference standard is indeed a close approximation to an error-free gold standard.

Although I enjoyed much of Bertrand et al.’s comparison of the HW and reference standard methods, I found myself in disagreement with the title of their paper, and with the authors’ conclusion that the HW latent class method is more appropriate for analyzing agreement than for assessing diagnostic accuracy. The fundamental products of a latent class analysis are estimates of observer sensitivity and specificity, together with the prevalence; measures of agreement are not obtained, at least not directly.

It is a useful exercise to examine how the HW estimates change as the panel of observers is modified, and in particular to examine the effect of observer disagreement on the estimates of test accuracy. Bertrand et al. did this in their example, and interestingly they found that one observer was frequently discordant with the others and had a substantial effect on the conclusions. Lacking a true gold standard, this type of information can provoke a discussion of whether the discordant observer is simply less accurate than the others, or is perhaps using a different clinical paradigm and so recognizing cases of disease that the others are missing (or eliminating false positives identified as cases by the other observers, or both). This could then lead to useful improvements in quality control in radiographic reading.

In summary, although I recognize that Bertrand et al. have considered several possible interpretations of the discrepancies between the latent class and reference diagnosis methods, their particular stance and example may have biased them subtly against the latent class approach. First, the use of an example with many sparse data cells will almost certainly create a situation in which the latent class approach with maximum likelihood might be expected to perform badly. Second, the adoption of pathology as a reference standard, in the face of its dubious reliability, will again bias the analysis against the latent class approach. I suggest that the latent class method, when used appropriately with suitable data, remains a useful technique for the evaluation of diagnostic accuracy. The investigation by Bertrand et al. has appropriately drawn attention to some of the issues to be considered in the use of the latent class method, but it does not provide sufficient evidence on which the method might be dismissed as a suitable technique for the analysis of diagnostic test data.

References

[1] Bertrand P, Bénichou J, Grenier P, Chastang C. Hui and Walter’s latent-class reference-free approach may be more useful than diagnostic performance in assessing agreement. J Clin Epidemiol 2005;58:689–701 (this issue).
[2] Gart JJ, Buck AA. Comparison of a screening test and a reference test in epidemiologic studies. I. Indices of agreement and their relation to prevalence. Am J Epidemiol 1966;83:586–92.
[3] Gart JJ, Buck AA. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. Am J Epidemiol 1966;83:593–602.