Statistics in the pathology laboratory: characteristics of diagnostic tests


Pathology (2001) 33, February, pp. 93–95

MARIANNE B. EMPSON

Department of Public Health and Community Medicine and Department of Clinical Immunology, Westmead Hospital, and Department of Clinical Immunology, Auckland Hospital, Auckland, New Zealand

Summary

Sensitivity, specificity and receiver operating characteristic (ROC) curves all provide information about the ability of a diagnostic test to provide useful information in the assessment of disease. They are discussed in this review along with the importance of estimates of precision.

Key words: Sensitivity, specificity, receiver operating characteristic curves, diagnostic tests, precision.

Abbreviations: ROC, receiver operating characteristic curves; ANA, antinuclear antibodies; SLE, systemic lupus erythematosus; HIV, human immunodeficiency virus; 95%CI, 95% confidence interval.

Accepted 4 August 2000

INTRODUCTION

In recent times the quality of clinical research, particularly with respect to interventions, has reached new heights. Recognition of the effect of bias and the reasons for it has resulted in the randomised controlled trial becoming almost standard in intervention studies. In contrast, diagnostic test research continues to lack rigour in study design, process, analysis and reporting. The aim of this review series is to look at the appropriate methods of reporting and analysing diagnostic tests, both from a research point of view and in the day-to-day clinical setting.

The first in this series of three is a brief review of the characteristics of diagnostic tests and how they should be reported and interpreted. Sensitivity, specificity, ROC curves and precision will be described. The second review in the series will explore test interpretation, including reference ranges, predictive values and likelihood ratios. A third review will address methods of comparing diagnostic tests. It is hoped that by understanding these basic concepts, both the performance of research and the interpretation of diagnostic tests will improve, although statistics comprise only a small part of the process required.

SENSITIVITY AND SPECIFICITY

The sensitivity of a test is the proportion of people with the target disorder in whom the test result is positive (true positives). Specificity refers to the proportion of people without the target disorder in whom the test result is negative (true negatives). The false-negative rate is the proportion of people with the disorder in whom the test is negative (1 − sensitivity). The false-positive rate is the proportion of people without the disorder in whom the test is positive (1 − specificity). Neither the sensitivity nor the specificity provides information on the likelihood of disease in the presence of a positive or negative test. Table 1 demonstrates how the proportions are calculated.

Sensitivity and specificity are independent of prevalence. For example, the sensitivity of the antinuclear antibody (ANA) for systemic lupus erythematosus (SLE) will be the same if performed in a community practice where the prevalence of SLE may be 5:10 000 population or in a tertiary centre immunology clinic where the prevalence may be 1:10 population. However, if these populations differ in the severity of disease, the sensitivity and specificity of the test may be affected. This spectrum bias results in greater sensitivity in those with more severe disease and greater specificity in normal controls. Hence it is important when considering the diagnostic characteristics of a test to ensure that the study population includes a broad spectrum of disease severity, a range of similar diseases and normal controls.

ANA serology provides a good example of spectrum bias. It is highly sensitive for SLE, the sensitivity approaching 100% with the newer substrates. Subjects with severe disease have very high levels of antibody and are easy to detect. Those with milder disease will often have lower levels that are less easy to detect and occasionally may even be missed by the less experienced reader. The specificity of the ANA is highly dependent on the subjects having the test. Those with an underlying autoimmune disease, or even a first-degree relative with an autoimmune disease, are more likely to have a positive ANA in the absence of SLE. This will have the effect of lowering the specificity of ANA for SLE. In contrast, false positives are less likely (although they do still occur) in healthy subjects, and in this group the specificity will be greater.

TABLE 1  2 × 2 table demonstrating how the proportions are calculated

                                Reference standard
Test result                     Disease present       Disease absent
Positive (disease present)      True positive (a)     False positive (b)
Negative (disease absent)       False negative (c)    True negative (d)

Sensitivity = a/(a + c); Specificity = d/(b + d)
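The calculations in Table 1 can be sketched in a few lines of Python. This is a minimal illustration only: the class name `TwoByTwo` and all counts are hypothetical and are not taken from any study cited here. The second table multiplies the disease-absent column by 100 to mimic a lower-prevalence setting with the same disease spectrum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2 x 2 table, using the cell letters of Table 1."""

    a: int  # true positives  (test positive, disease present)
    b: int  # false positives (test positive, disease absent)
    c: int  # false negatives (test negative, disease present)
    d: int  # true negatives  (test negative, disease absent)

    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    def false_negative_rate(self) -> float:
        return 1.0 - self.sensitivity()

    def false_positive_rate(self) -> float:
        return 1.0 - self.specificity()


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
clinic = TwoByTwo(a=90, b=20, c=10, d=180)
# Same test, same disease spectrum, far more non-diseased subjects.
community = TwoByTwo(a=90, b=2000, c=10, d=18000)

for name, table in (("clinic", clinic), ("community", community)):
    print(f"{name}: sensitivity = {table.sensitivity():.2f}, "
          f"specificity = {table.specificity():.2f}")
```

Both hypothetical tables give the same sensitivity and specificity because each proportion is calculated within a single column of the table, which is why neither depends on prevalence.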

ISSN 0031–3025 printed/ISSN 1465–3931 online/01/010093–03 © 2001 Royal College of Pathologists of Australasia. DOI: 10.1080/00313020120034966


The actual proportions of true positives and true negatives detected by a test are arbitrarily set according to the position of the cut-off above or below which a result becomes negative or positive. There is a trade-off between the sensitivity and specificity in determining the position of this cut-off such that, as the sensitivity increases, the specificity decreases. This will be more obvious later in this review when ROC curves are discussed.

The major determinant of the cut-off point is the underlying purpose of the diagnostic test and the actual disease it is detecting. Where it is felt to be extremely important to detect all disease, sensitivity will need to be as high as possible. The screening of blood by the blood bank for HIV is an example where the consequences of not detecting a case of HIV are felt to be so great as to justify highly expensive measures with maximal sensitivity. The associated decrease in specificity is not deemed to be a major consequence as all positive results are then assessed by the Western blot, a highly specific follow-up test performed on the same blood sample, which can distinguish the true positives from the false positives.

In contrast, screening programs for the detection of cervical cancer need to be sensitive, but the potential for false positives needs to be limited. All subjects with a positive result will require further testing to determine whether the result was a true positive, and this carries with it anxiety for the subject and additional cost. In this situation a decision on an acceptable false-positive and false-negative rate needs to be made, and this may be done with the assistance of an economic analysis which looks at the costs to the health system and the individual subject.

In tests used for the diagnosis of disease where there is a clinical suspicion, i.e., outside the screening arena, a similar decision is made as to what is an acceptable false-positive and false-negative rate. Alternatively, a normal range is supplied and the general assumption is that a test within the normal range is normal and one outside this is abnormal. As will be discussed in a later article, this is a simplistic interpretation which can be very misleading.
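The trade-off between sensitivity and specificity as the cut-off moves can be illustrated with a small sketch. The assay values are hypothetical, and the sketch assumes that higher values indicate disease and that a result at or above the cut-off is called positive; the function name `sens_spec_at_cutoff` is introduced only for this example.

```python
def sens_spec_at_cutoff(diseased, healthy, cutoff):
    """Return (sensitivity, specificity) when value >= cutoff is called positive."""
    true_pos = sum(v >= cutoff for v in diseased)
    true_neg = sum(v < cutoff for v in healthy)
    return true_pos / len(diseased), true_neg / len(healthy)


# Hypothetical assay values (arbitrary units); not data from any cited study.
diseased = [3.1, 4.0, 4.4, 5.2, 5.9, 6.3, 7.0, 7.8, 8.5, 9.1]
healthy = [1.0, 1.6, 2.2, 2.9, 3.3, 3.8, 4.1, 4.7, 5.5, 6.0]

for cutoff in (3.0, 4.5, 6.0, 7.5):
    sens, spec = sens_spec_at_cutoff(diseased, healthy, cutoff)
    print(f"cut-off {cutoff:.1f}: sensitivity {sens:.1f}, specificity {spec:.1f}")
```

With these made-up values, raising the cut-off from 3.0 to 7.5 lowers the sensitivity from 1.0 to 0.3 while the specificity rises from 0.4 to 1.0. A screening setting such as the HIV example would favour the low cut-off.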

PRECISION

An estimate of the precision of the sensitivity and specificity should always be provided. As sensitivity and specificity are proportions derived from a sample of the underlying population, there is a degree of imprecision that is dependent on the sample size. As with any statistical assessment, the precision increases, i.e., the sample estimate approaches the population or true value, as the sample size increases.

The 95% confidence interval (95%CI) as a measure of precision is not yet as widely used in diagnostic test assessments as in other areas of the medical literature. It provides the upper and lower boundaries of a range of values that includes the point estimate from the sample. Intervals constructed in this way will contain the true population value 95% of the time. There is no reason why such information should not be provided for diagnostic tests. On the contrary, without this information the reader can be grossly misled by an imprecise estimate of sensitivity or specificity which has been derived from a small sample size.1

A recently published example of the use of 95%CIs was a paper exploring antifilaggrin antibodies in rheumatoid arthritis.2 The results were reported for two thresholds; at the lower threshold the sensitivity for rheumatoid arthritis was 57% (95%CI 50–64%) while the specificity was 93% (95%CI 90–97%).2 In contrast, a study investigating the characteristics of a new ELISA for coeliac disease, tissue transglutaminase IgA antibody, reported the sensitivity and specificity as 95% and 94%, respectively, without the 95%CI.3 The sample sizes were 136 with disease and 207 without.

Confidence intervals can be calculated in a similar manner to any proportion using an estimate of the variance based on the binomial distribution. Exact confidence intervals can be obtained from most statistical packages, but alternatively an approximation can be obtained using the following formula: 95%CI = $p \pm 1.96\sqrt{pq/n}$, where $p$ is either sensitivity or specificity, $q$ is $1 - p$ and $n$ is the number of subjects with the disease (sensitivity) or the number of subjects without the disease (specificity).1,4 Therefore, in the latter example,3 the 95%CIs for sensitivity and specificity respectively can be calculated as $0.95 \pm 1.96\sqrt{0.95 \times 0.05/136}$ = 0.91–0.99 and $0.94 \pm 1.96\sqrt{0.94 \times 0.06/207}$ = 0.91–0.97. In this example the confidence intervals can be seen to be quite narrow because of the reasonably large sample sizes.

In contrast, a study of the characteristics of alternative serological diagnostic tests for untreated coeliac disease, including IgA antigliadin antibodies, reported sensitivity and specificity as 52% and 94%, respectively. The sample sizes were much smaller, with 31 subjects with active coeliac disease and 32 without coeliac disease.5 Using the normal approximation to the binomial distribution, the 95%CI for sensitivity can be calculated to be 34–70%. The 95%CI for specificity using the same approximation is 86–102%. Clearly this is not possible and demonstrates the problem of using this approximation when the sample size is small, i.e., < 10, or the proportion is close to 0 or 1. To use the normal approximation, $n \times p$ (number of true positives or negatives) and $n \times (1 - p)$ (number of false negatives or positives) should be greater than 5. In all the prior examples this criterion is fulfilled, including, in this last case, the sensitivity. Since the specificity of IgA antigliadin antibodies is so high and the sample size so small, there are only two false positives. The approximation when applied to this proportion is clearly inappropriate and a more accurate method to calculate the 95%CI is necessary. Such methods are beyond the scope of this review but can be found in Armitage and Berry.6 Alternatively, exact confidence intervals can be obtained from the Geigy Scientific Tables.7 For the above example, the 95% confidence interval for the specificity of 94% is 79–99%.

This illustrates that when the sample size for assessing sensitivity or specificity is small, the point estimate alone can be misleading, as the true sensitivity in this case may lie anywhere between 34% and 70%, and the true specificity between 79% and 99%.
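The intervals quoted above can be reproduced with a short sketch. The function `wald_ci` implements the normal approximation given in the text. The function `exact_ci` is an assumption on my part: it computes a Clopper-Pearson interval by bisection on binomial tail probabilities, which is one common way of obtaining an "exact" interval and is not necessarily the method used to compile the Geigy tables. The counts of 30 true negatives out of 32 are inferred from the statement that there were only two false positives.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal approximation: p +/- z * sqrt(p * (1 - p) / n)."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _bisect_root(f, iterations=60):
    """Root on [0, 1] of a function f that increases from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x 'successes' out of n (assumed method)."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _bisect_root(
        lambda p: 1.0 - _binom_cdf(x - 1, n, p) - alpha / 2.0)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _bisect_root(
        lambda p: alpha / 2.0 - _binom_cdf(x, n, p))
    return lower, upper


print("tTG sensitivity, n=136:", wald_ci(0.95, 136))
print("tTG specificity, n=207:", wald_ci(0.94, 207))
print("AGA sensitivity, n=31: ", wald_ci(0.52, 31))
print("AGA specificity, n=32: ", wald_ci(0.94, 32))  # upper limit exceeds 1
print("AGA specificity, exact:", exact_ci(30, 32))
```

The normal approximation for the small-sample specificity returns an upper limit above 1, whereas the exact calculation stays within 0 to 1 and gives an interval of roughly 79% to 99%, in line with the tabulated value quoted in the text.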

RECEIVER OPERATING CHARACTERISTIC CURVES

These curves graphically demonstrate the relationship between sensitivity and specificity and can be used to compare different tests, as will be discussed in a later article. The curve is obtained by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity). A range of points are obtained by using a range of cut-offs to determine positivity. It demonstrates well the trade-offs between the sensitivity and specificity. Figure 1 demonstrates the sensitivity and specificity of the IgG1 and IgG index in the diagnosis of multiple sclerosis.8 As the sensitivity increases so too does the rate of false positives. However, the better the test, the higher up it will be in the left hand corner where the sensitivity will be greater for a given specificity, as demonstrated by the IgG1 index curve. Alternatively, a test which is of no value would produce a straight diagonal line between zero and one.

ROC curves can be used to decide the most appropriate threshold or cut point for a test. By visualising the trade-off over a range of cut-offs this can be easier than just looking at the numbers, but the final decision is no less subjective. It can be seen from Fig. 1 that the IgG1 index has a very high specificity (> 95%) for a sensitivity of 80%. Beyond this sensitivity, there is a rapid drop off in specificity with little gain in sensitivity.8 In this case the final decision with respect to the cut-off will be tempered by the clinical requirements for the test. Whether a reasonably high sensitivity with high specificity is more acceptable than a very high sensitivity and low specificity will be determined by the clinical setting, as discussed earlier.

The area under the ROC curve can be calculated and used as an estimate of the test's diagnostic accuracy. It is equal to the probability that someone with disease will have a greater test result than someone without.9

Fig. 1 Receiver operating curve for IgG and IgG1 index for the diagnosis of multiple sclerosis. Redrawn from Ref. 8. $\text{IgG index} = \dfrac{\text{CSF IgG}/\text{serum IgG}}{\text{CSF albumin}/\text{serum albumin}}$

SUMMARY

In this review diagnostic test characteristics have been discussed. The need for estimates of precision has been stressed and examples used to highlight the reasons for this. The next review will discuss the various ways that test results can be reported and interpreted. This will include reference ranges, predictive values and likelihood ratios.

Address for correspondence: M.E., Department of Clinical Immunology, Level 2 Building 31, Auckland Hospital, Park Rd, Grafton, Auckland, New Zealand. E-mail: [email protected]
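As a worked illustration of the ROC construction described in this review, the sketch below sweeps a cut-off over hypothetical assay values, collects the (1 − specificity, sensitivity) points, and estimates the area under the curve in two ways: by the trapezoidal rule, and as the proportion of diseased/non-diseased pairs in which the diseased subject has the higher result. The data and function names are invented for the example, and it assumes that higher values indicate disease.

```python
def roc_points(diseased, healthy):
    """(false-positive rate, sensitivity) pairs, one per candidate cut-off."""
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]  # cut-off above every value: no result is positive
    for cutoff in cutoffs:
        sensitivity = sum(v >= cutoff for v in diseased) / len(diseased)
        false_pos_rate = sum(v >= cutoff for v in healthy) / len(healthy)
        points.append((false_pos_rate, sensitivity))
    return points  # the lowest cut-off gives the final point (1.0, 1.0)


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(diseased, healthy):
    """Probability that a diseased subject scores higher (ties count one half)."""
    wins = sum((d > h) + 0.5 * (d == h) for d in diseased for h in healthy)
    return wins / (len(diseased) * len(healthy))


# Hypothetical assay values (arbitrary units); not data from any cited study.
diseased = [3.1, 4.0, 4.4, 5.2, 5.9, 6.3, 7.0, 7.8, 8.5, 9.1]
healthy = [1.0, 1.6, 2.2, 2.9, 3.3, 3.8, 4.1, 4.7, 5.5, 6.0]

points = roc_points(diseased, healthy)
print(f"AUC (trapezoid) = {auc_trapezoid(points):.2f}")
print(f"AUC (pairwise)  = {auc_pairwise(diseased, healthy):.2f}")
```

The two estimates agree for these values, which is the equivalence noted by Altman and Bland.9 A test of no value would give an area close to 0.5, corresponding to the diagonal line.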

References

1. Harper R, Reeves B. Reporting of precision of estimates for diagnostic accuracy: a review. Br Med J 1999; 318: 1322–3.
2. Vincent C, de Keyser F, Masson-Bessiere C, Sebbag M, Veys E, Serre G. Anti-perinuclear factor compared with the so called "antikeratin" antibodies and antibodies to human epidermis filaggrin, in the diagnosis of arthritides. Ann Rheum Dis 1999; 58: 42–8.
3. Sulkanen S, Halttunen T, Laurila K, Kolho K, Korponay-Szabo I, Sarnesto A, et al. Tissue transglutaminase autoantibody enzyme-linked immunosorbent assay in detecting coeliac disease. Gastroenterology 1998; 115: 1322–8.
4. Obuchowski N. Sample size calculations in studies of test accuracy. Statist Methods Med Res 1998; 7: 371–92.
5. Lerner A, Kumar V, Iancu T. Immunological diagnosis of childhood coeliac disease: comparison between antigliadin, antireticulin and antiendomysial antibodies. Clin Exp Immunol 1994; 95: 78–82.
6. Armitage P, Berry G. Statistical methods in medical research, 3rd edn. Oxford: Blackwell Scientific Publications, 1994; 120–2.
7. Lentner C, editor. Geigy Scientific Tables, Vol. 2, 8th edn. Basle, Switzerland: Ciba-Geigy Limited, 1982; 89–102.
8. Hasan B. Intrathecal immunoglobulin production in multiple sclerosis. PhD Thesis 1990; University of Sydney.
9. Altman DG, Bland JM. Diagnostic tests 3: receiver operating characteristic plots. Br Med J 1994; 309: 188.