Optimal use of literature knowledge to improve the bayesian diagnosis of coronary artery disease


J Clin Epidemiol Vol. 42, No. 11, pp. 1041-1047, 1989. Printed in Great Britain. All rights reserved.

0895-4356/89 $3.00 + 0.00. Copyright © 1989 Pergamon Press plc

OPTIMAL USE OF LITERATURE KNOWLEDGE TO IMPROVE THE BAYESIAN DIAGNOSIS OF CORONARY ARTERY DISEASE

ROBERT DETRANO*

Department of Medicine, Division of Cardiology, University of California, Irvine, CA 92717 and Veterans Administration Medical Center, Long Beach, CA 90822, U.S.A.

(Received in revised form 8 May 1989)

Abstract-Bayes' theorem with the independence assumption is applied to a test sample of 141 subjects, using two sets of test sensitivities and specificities. The first set is derived by averaging over literature reports on the accuracy of the exercise electrocardiogram, exercise thallium scintigraphy, and cardiac fluoroscopy. The second set of indices is derived by applying multivariate regression to the technical, population, and methodologic attributes obtained from the same literature by the use of meta-analysis. The meta-analytically corrected sensitivities and specificities resulted in a significant improvement in the discriminatory power of the Bayes model (area under the ROC curve increased, p < 0.01). However, the corrected model was not as accurate as a data-derived logistic regression model of the same test variables. Meta-analysis may be useful for modest improvement in the accuracy of literature-derived Bayesian models for predicting disease probabilities.

Bayes' theorem; Disease prediction; Meta-analysis; Coronary artery disease; Sensitivity; Specificity

INTRODUCTION

Bayes' theorem has been used to estimate coronary disease probabilities [1-3]. The theorem gives the post-test probability of disease based on clinical and noninvasive test information:

$$P_1 = \frac{Se \, P_0}{Se \, P_0 + (1 - P_0)(1 - Sp)} \tag{1}$$

Here Se represents the sensitivity of a noninvasive test result. This is the frequency of a positive test result in a population of patients who have disease defined by coronary angiography. Sp is the specificity of this test result, or the frequency of a negative test result in a population of patients who do not have coronary disease. $P_0$ is the patient's probability of disease before the test is done. $P_0$ can be based on clinical information, which usually includes the patient's age, sex and chest pain type; or it can be based on clinical information and the result of noninvasive tests done and analyzed prior to the test in question.

The formula above is valid and will produce accurate probabilities only if the test sensitivity and specificity are accurate, and if these indices are independent of the information which resulted in the pre-test probability $P_0$ in the diseased or nondiseased populations. The use of inaccurate sensitivities and specificities and the failure to satisfy the independence condition will cause errors in the probabilities resulting from the theorem's application. That is, the average estimate for disease probability in a group of patients with a given set of clinical symptoms and test results will not be equal to the observed prevalence of disease in that group of patients when the above equation is applied. We have produced evidence that the literature-based Bayesian method described above does indeed result in such errors [4].

*All correspondence should be addressed to: Robert Detrano, M.D., Cardiology 111-C, V. A. Medical Center, 5901 E. 7th Street, Long Beach, CA 90822, U.S.A. (Tel. 213-494-5486)

Accurate sensitivities and specificities can be computed from large groups of patients in the population of interest (the test population) when the disease-defining procedure (coronary angiography) and the noninvasive test are both performed in these groups. Since such large groups of patients are rarely available, investigators [1, 2] have derived sensitivities and specificities by averaging over literature reports of testing done on varied populations according to testing methods which also varied. Since the joint distribution of clinical and test results is almost never available from such combined literature groups, the assumption of independence is necessary. Literature sensitivities and specificities vary widely because of different patient populations studied, different testing methods used, different methods used to analyze test results, and because of angiographic referral and test review bias [5]. For example, the exercise thallium scintigram, a commonly used noninvasive test, has a reported sensitivity which varies from 63 to 98% [6].

We shall consider two aspects of accuracy: discriminatory power and reliability [7, 8]. Discriminatory power is a measure of a model's ability to separate diseased from non-diseased patients. It depends on the ordering or ranking of the probabilities, not on their absolute numerical values. We shall define reliability as the closeness of the probability values to the observed disease prevalence in subjects with similar clinical and test data.

Since testing is done under different circumstances than those of the "literature average", errors will be incurred by using literature mean sensitivities and specificities. Fortunately, most literature reports contain information concerning patient population, testing methods and interpretation. Furthermore, careful review of literature reports can result in qualitative estimates of test review and referral bias.
Knowledge of this information in a clinical test situation should allow for the correction of the literature mean sensitivities and specificities for the specific test and patient conditions at hand. Meta-analysis using multivariate regression allows for such a correction as long as sufficient data are reported in the individual literature reports.

The present report describes the results of an attempt to apply meta-analytic techniques to derive relationships between literature-reported sensitivities and specificities, and other literature-reported attributes concerning patient characteristics, testing methods, methods of analysis, and bias. These relationships between sensitivities, specificities and the other literature-derived results are then used to correct the mean sensitivities and specificities from the literature to the actual patient characteristics and testing conditions in a test population which is different from those used by the investigators reporting their results in the literature.
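As a minimal sketch of how equation (1) is used, the Python below computes a post-test probability for one test and then chains two tests, with each post-test probability serving as the pre-test probability for the next test. This chaining is only valid under the independence assumption discussed above. The function names are hypothetical, and the branch for a negative test result is the standard complementary form of Bayes' theorem, which the text does not write out. The numeric inputs are illustrative values taken from the paper's Tables 1 and 2.

```python
def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Equation (1) for a positive result.

    The negative-result branch is the complementary Bayes form
    (an addition for illustration, not written out in the paper).
    """
    if positive:
        true_pos = sensitivity * pre_test
        false_pos = (1.0 - specificity) * (1.0 - pre_test)
        return true_pos / (true_pos + false_pos)
    false_neg = (1.0 - sensitivity) * pre_test
    true_neg = specificity * (1.0 - pre_test)
    return false_neg / (false_neg + true_neg)


def sequential_post_test(pre_test, tests):
    """Chain tests: each posterior becomes the prior for the next test.

    `tests` is a list of (sensitivity, specificity, positive) tuples.
    Assumes the tests are conditionally independent given disease status.
    """
    probability = pre_test
    for sensitivity, specificity, positive in tests:
        probability = post_test_probability(
            probability, sensitivity, specificity, positive
        )
    return probability


# Illustrative inputs: a man aged 50-59 with atypical angina (Table 1),
# pooled literature indices for ST depression and thallium (Table 2).
pre_test = 0.589
after_ecg = post_test_probability(pre_test, 0.667, 0.774, positive=True)
after_both = sequential_post_test(
    pre_test,
    [(0.667, 0.774, True),    # exercise ST depression: positive
     (0.848, 0.844, False)],  # thallium perfusion defect: absent
)
print(round(after_ecg, 3), round(after_both, 3))
```

A positive exercise electrocardiogram raises the probability and a subsequent negative thallium scan lowers it again; the order of the tests does not change the final value under the independence assumption.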

METHODS

The probability models

Derivation. Bayes' theorem as described by formula (1) above requires that pre-test probabilities be known before the first test is done. We have used the pre-test probabilities derived from the literature by Diamond and Forrester [1], which depend on age, sex and chest pain type. Chest pain is classed according to the four categories previously used [3]. These categories are:

(1) Typical angina pectoris: pain that occurs in the anterior thorax, neck, shoulders, jaw or arms and is precipitated by exertion and relieved within 20 min by rest.
(2) Atypical angina: pain in one of the above locations which is either not precipitated by exertion or not relieved by rest within 20 min.
(3) Nonanginal pain: pain which is not located in any of the above places, or if it is so located is not related to exertion and lasts less than 10 seconds or longer than 30 minutes.
(4) No pain.

The pre-test probabilities are given in Table 1.

Table 1. Pre-test probability for coronary artery disease

Age        Typical angina   Atypical angina   Nonanginal pain   Asymptomatic
Men
60-69      0.943            0.671             0.281             0.123
50-59      0.920            0.589             0.215             0.097
40-49      0.873            0.461             0.141             0.055
30-39      0.697            0.218             0.052             0.019
Women
60-69      0.906            0.544             0.186             0.075
50-59      0.794            0.324             0.084             0.032
40-49      0.552            0.133             0.028             0.010
30-39      0.258            0.042             0.008             0.003

The first model used, which we shall call the uncorrected Bayesian model, involves the application of equation (1) above using pooled sensitivities and specificities from 147 literature reports on exercise electrocardiography [9], 56 literature reports on exercise thallium scintigraphy [6], and 13 literature reports on cardiac fluoroscopy [10]. All of these reports concern investigations of the relationship between noninvasive test results and coronary angiographic results. These sensitivities and specificities are contained in Table 2.

Table 2. Sensitivities and specificities used in Bayes' theorem (weighted averages)

Test result                                         Sensitivity   Specificity
Exercise induced ST depression                      0.667         0.774
Post exercise thallium perfusion defect             0.848         0.844
Fluoroscopic detection of coronary calcification    0.587         0.820

The corrected Bayesian method is similar to the above except that instead of using literature averages for sensitivities and specificities, the following relationships are used:

$$Se_j = \frac{\exp(f_j)}{1 + \exp(f_j)}, \quad \text{where} \quad f_j = \sum_i C_{ij} X_{ij},$$

for the sensitivity of test j and a similar expression for specificity. The weighting coefficients $C_{ij}$ and the variables $X_{ij}$ refer to the various laboratory specific, technical, population and methodologic factors for test j. These equations give the sensitivity and specificity in a particular laboratory for a particular test in a particular group of patients. The weighting coefficients are derived by the application of stepwise linear regression, using the logits of the sensitivities and specificities as dependent variables and the lab specific factors as independent variables. This is explained in greater detail in the Appendix.

For reference, we also calculated a six variable discriminant function by applying logistic regression to the clinical and test data of 162 subjects referred for angiography. These subjects were selected by randomly dividing a group of 303 patients into two groups: a training group of 162, and a test group of 141 subjects who were used to test the models. These 303 patients were without prior myocardial infarction and had been consecutively referred for cardiac catheterization at the Cleveland Clinic Foundation. They have been described in detail elsewhere [3]. All 303 subjects underwent three noninvasive tests: exercise electrocardiography with measurement of the exercise-induced ST depression, exercise thallium scintigraphy, and cardiac cinefluoroscopy. The discriminant function has the form

$$DF = \frac{\exp(L)}{1 + \exp(L)}$$

where $L = B_0 + B_1$ (age) $+ B_2$ (sex) $+ B_3$ (chest pain class) $+ B_4$ (ST depression with exercise) $+ B_5$ (thallium perfusion defect) $+ B_6$ (number of fluoroscopically detected coronary calcifications). The details of its derivation are contained in Ref. [3].

Application. The test population consisted of persons referred for cardiac catheterization at the Cleveland Clinic Foundation who had neither history nor electrocardiographic evidence of prior myocardial infarction. After subtracting the 162 subjects used for derivation of the discriminant function, the remaining 141 comprised the test group. Pre-test probabilities were obtained by referring to the patients' age, sex and chest pain type. The same pre-test probabilities were used for both Bayesian models and were obtained from the tables of Diamond and Forrester [1]. By sequentially applying the Bayesian and corrected Bayesian models using the sensitivity and specificity of the exercise electrocardiogram, then those of the exercise thallium scintigram, and finally those of cardiac fluoroscopy, post-test probabilities can be calculated. These probabilities are then sorted in ascending order and the following analyses of discrimination and reliability are made. Similarly, the discriminant function derived from the same population was applied to the data of each of the 141 patients.

Comparison and determination of accuracy. One method of assessing the discriminatory power of these probabilities is by examining their receiver operating characteristic (ROC) curves [11]. These curves are constructed by plotting the sensitivity of probability cutpoints against their false positive rates. It should be emphasized that while the area under the curve is a good measure of the discriminatory power of a model, it does not measure the model's reliability [3]. The method of Hanley and McNeil [12] was used to compare the two ROC curves' discriminatory power.
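The ROC construction described above can be sketched as follows: every distinct probability is used as a cutpoint, sensitivity is plotted against the false positive rate, and the area is taken by the trapezoidal rule. This is an illustrative sketch only; the function names and the six-patient data set are hypothetical, and the Hanley and McNeil comparison of correlated areas is not reproduced here.

```python
def roc_points(probabilities, diseased):
    """Return (false positive rate, sensitivity) at every probability cutpoint.

    `diseased` holds 1 for angiographic disease and 0 otherwise.
    A patient is called positive when the probability is >= the cutpoint.
    """
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(probabilities), reverse=True):
        tp = sum(1 for p, d in zip(probabilities, diseased) if p >= cut and d)
        fp = sum(1 for p, d in zip(probabilities, diseased) if p >= cut and not d)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical post-test probabilities and disease status for six patients.
probabilities = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
diseased = [1, 1, 0, 1, 0, 0]

auc = area_under_curve(roc_points(probabilities, diseased))

# Squaring preserves the ranking, so the area is unchanged: discriminatory
# power depends on ordering, not on the absolute probability values.
auc_squared = area_under_curve(
    roc_points([p * p for p in probabilities], diseased)
)
print(auc, auc_squared)
```

The second computation illustrates why the area under the curve cannot measure reliability: a model whose probabilities are all too high or too low can have exactly the same area as a well-calibrated one.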


Reliability can be better assessed by dividing the sorted probabilities into quintiles and calculating the expected probability and observed prevalence in each quintile. An accurate model should produce expected probabilities that are numerically close to the observed prevalence. Such an analysis was done by Cornfield for the Framingham data [13].

The accuracy of the probabilities can be further assessed using specialized indices called reliability statistics. The simplest of these is the difference between the angiographic disease prevalence and the mean disease probability of all subjects in the test group. This quantity is standardized by dividing it by its standard deviation. This "overestimation statistic" gives an overall assessment of the fit of observed to expected probabilities and can be assumed standard normal [14]. Models which produce excessively high probabilities will be represented by overestimation statistics which are positive numbers. Those which produce probabilities which are too low will have negative overestimation statistics.

The overestimation statistic measures whether predictions agree with observed data on the average. If the probability estimates cluster around an average which happens to be the same as the disease prevalence, the overestimation statistic will be zero, though the distribution of probabilities may not be accurate at all. Hilden [14] has derived an index which will be numerically small (absolute value < 1.0) when the probability estimates are close to those expected, and will be negative and numerically large (< -1) otherwise. Mathematically, Hilden's index is:

$$HI = \frac{1}{n} \sum_{i} \left\{ P_{id(i)} - E^{*}\left[P_{id(i)}\right] \right\}$$

where n is the number of subjects, $P_{id(i)}$ is the estimate $P_i$ if disease is present, or $1 - P_i$ if not, and $E^{*}[P_{id(i)}]$ is the sum $P_i^2 + (1 - P_i)^2$. As for the overestimation statistic, Hilden's index is standardized by dividing it by its standard deviation. This index HI, though somewhat more obscure to the unfamiliar, does not suffer from the potential hazard of the overestimation statistic described above. Since this index is the mean of a measurement $\{P_{id(i)} - E^{*}[P_{id(i)}]\}$ assumed to be normally distributed, comparisons in the same population can be made using the paired t test with n - 1 degrees of freedom (df).

RESULTS

Discrimination

The ROC curves for the discriminant function and for the Bayesian and corrected Bayesian models are illustrated in Fig. 1. The areas under the curves were 0.909, 0.746 and 0.822 respectively (p < 0.01 for all differences). Discrimination by the corrected Bayesian method was thus intermediate between that of the discriminant function and the uncorrected Bayesian model.

Fig. 1. Receiver operating characteristic curves of sensitivity vs false positive rate for probabilities generated by application of the six-variable discriminant function (Disc. Function), Bayes' theorem (Bayes), and those generated by using Bayes' theorem with meta-analytical correction (Corrected Bayes).

Reliability

Figures 2 and 3 show the fit of expected to observed disease prevalence according to the two models. Both models produce a poor fit at intermediate probability. The corrected Bayesian model is more reliable in the uppermost quintile.

Fig. 2. Expected vs observed probabilities in ascending quintiles of probabilities generated by using Bayes' theorem.

Fig. 3. Expected vs observed probabilities in ascending quintiles of probabilities generated by using Bayes' theorem with meta-analytical correction.

The overestimation statistics for the discriminant function, the Bayesian and corrected Bayesian models were 0.06, 5.43 and 2.30 respectively. Since these are assumed to be standard normal, these differences are significant (paired t test p < 0.01 for all differences between these three values). Since the positive sign indicates overestimation of disease probability, both Bayesian models significantly overestimate; the corrected model, however, is significantly more reliable than the uncorrected model. The Hilden indices of reliability for the discriminant function, the Bayesian and corrected Bayesian models were -0.53, -6.64 and -2.14 (p < 0.01 with n - 1 df for all differences). These indices confirm the same order of reliability: discriminant function > corrected Bayesian > uncorrected Bayesian.
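The reliability measures defined in the Methods (quintile comparison, overestimation statistic, and Hilden's index) can be sketched in Python as below. This is a hedged illustration, not the paper's code: the function names and the data are hypothetical, Hilden's index is shown unstandardized, and the standard deviation used for the overestimation statistic is an assumption (the binomial variance implied by the model's own probabilities), since the text refers to Ref. [14] for the standardization without giving a formula.

```python
import math


def quintile_table(probabilities, diseased):
    """Mean expected probability and observed prevalence in ascending fifths.

    Requires at least five subjects so that no fifth is empty.
    """
    ranked = sorted(zip(probabilities, diseased))
    n = len(ranked)
    rows = []
    for k in range(5):
        group = ranked[k * n // 5:(k + 1) * n // 5]
        expected = sum(p for p, _ in group) / len(group)
        observed = sum(d for _, d in group) / len(group)
        rows.append((expected, observed))
    return rows


def overestimation_statistic(probabilities, diseased):
    """(predicted cases - observed cases) / SD; positive means overestimation.

    ASSUMPTION: SD is sqrt(sum p(1-p)), the binomial standard deviation
    under the model. The paper does not state the formula it used.
    """
    excess = sum(probabilities) - sum(diseased)
    sd = math.sqrt(sum(p * (1.0 - p) for p in probabilities))
    return excess / sd


def hilden_index(probabilities, diseased):
    """Unstandardized HI: mean of P_id(i) - E*[P_id(i)] over all subjects."""
    terms = []
    for p, d in zip(probabilities, diseased):
        p_id = p if d else 1.0 - p                  # probability given to the truth
        expected = p * p + (1.0 - p) * (1.0 - p)    # E*[P_id(i)] = P^2 + (1-P)^2
        terms.append(p_id - expected)
    return sum(terms) / len(terms)


# Hypothetical model that assigns 0.9 to everyone while half are diseased.
flat_probs = [0.9] * 10
flat_truth = [1, 0] * 5

z = overestimation_statistic(flat_probs, flat_truth)
hi = hilden_index(flat_probs, flat_truth)
table = quintile_table([i / 10 for i in range(1, 11)], [0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
print(z, hi, table)
```

In the hypothetical flat model the overestimation statistic is positive and the index is negative, which matches the sign conventions stated in the text.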

DISCUSSION

The corrected Bayesian model is modestly though significantly more accurate than the uncorrected model. This increase in accuracy includes both improved discriminatory power and improved reliability.

Limitations

Though these results are somewhat encouraging, we retain some reservations as to the widespread applicability of these corrections. The meta-analyses necessary to derive the correction are long and painstaking. The body of literature on a given test must be large and the individual reports must be relatively complete. Our review of the voluminous literature on the exercise electrocardiogram revealed that this completeness was unfortunately lacking. Methodologic problems in research design can also produce errors in reported results. We have attempted to correct for these errors in the final model. However, the quantity of these biases and therefore of appropriate correction remains uncertain. This lack of data caused a poor correlation between the multivariate models derived and the sensitivities and specificities reported.

Insufficient numbers of reports were found on the relationship between exercise-induced ST depression other than 1 mm and angiographic results. This made it impossible to correct sensitivity and specificity for 2 mm or 0.5 mm ST depression. Diamond [15] has illustrated the increased information obtained by increasing the number of test result categories. Such an improvement might surpass the relatively modest increase in accuracy demonstrated in this report. Despite these limitations and the preliminary nature of our work, our results should encourage others to investigate similar analyses.

Acknowledgements-The editorial assistance of Maggie Meyer and the computational help of Marta Juhasz are graciously acknowledged.

REFERENCES

1. Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of coronary artery disease. N Engl J Med 1979; 300: 1350-1358.
2. Diamond GA, Forrester JS. Probability of coronary artery disease. Circulation 1982; 65: 641-642.
3. Detrano R, Leatherman J, Salcedo EE, Yiannikas J, Williams G. Bayesian analysis versus discriminant function analysis: their relative utility in the diagnosis of coronary disease. Circulation 1986; 73: 970-977.
4. Detrano R, Guppy K, Abbassi N, Janosi A, Sandhu S, Froelicher V. Reliability of Bayesian probability analysis for predicting coronary artery disease in a veterans hospital. J Clin Epidemiol 1988; 41: 599-605.
5. Philbrick J, Horwitz R, Feinstein A. Methodologic problems of exercise testing for coronary artery disease: groups, analysis and bias. Am J Cardiol 1980; 46: 807-812.
6. Detrano R, Janosi A, Lyons K, Marcondes G, Abbassi N, Froelicher V. Factors affecting the sensitivity and specificity of a diagnostic test: the exercise thallium scintigram. Am J Med 1988; 84: 699-710.
7. Pryor D, Harrell F Jr, Lee K, Califf R, Rosati R. Estimating the likelihood of significant coronary artery disease. Am J Med 1983; 75: 771-780.
8. Russek E, Kronmal G, Fisher LD. The effect of assuming independence in applying Bayes' theorem to risk estimation and classification in diagnosis. Comput Biomed Res 1983; 16: 537-552.
9. Gianrossi R, Detrano R, Mulvihill D, Lehmann K, Dubach P, Colombo A, McArthur D, Froelicher V. Exercise-induced ST depression in the diagnosis of coronary artery disease: a meta-analysis. Circulation 1989; 80: 87-98.
10. Gianrossi R, Detrano R, Colombo A, Froelicher V. Cardiac fluoroscopy for the diagnosis of coronary artery disease: a meta-analytical review. Unpublished.
11. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978; 8: 283-298.
12. Hanley J, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983; 148: 839-843.
13. Cornfield J. Joint dependence of risk of coronary heart disease on serum cholesterol and systolic blood pressure: a discriminant function analysis. Fed Proc 1962; 21: 58-63.
14. Hilden J, Habbema JDF, Bjerregard J. Measurement of performance in probabilistic diagnosis. Meth Inform Med 1978; 17: 227-237.
15. Diamond G, Hirsch M, Forrester J. Application of information theory to clinical diagnostic testing. Circulation 1981; 63: 915-921.
16. Frane J. Missing Data and BMDP: Some Pragmatic Approaches. BMDP Technical Report No. 45; BMDP Statistical Software.

APPENDIX

Derivation of Corrected Bayesian Model

The literature on the diagnostic accuracy of the exercise electrocardiogram, the exercise thallium scintigram, and cardiac fluoroscopy was reviewed. The details of these reviews are published elsewhere [6, 9, 10] with the lists of variables recorded and the frequency of missing data for each variable recorded from the literature reports. Variables were selected as candidates for multivariate analysis on the basis of their clinical importance and the availability of data. That is, a variable was rejected if more than 50% of its data was missing from the reports. Candidate variables for the three tests are presented in Table A1.

Table A2 gives the results of application of stepwise linear regression to the logit functions defined by

$$\log\left[\frac{Se}{1 - Se}\right] \quad \text{and} \quad \log\left[\frac{Sp}{1 - Sp}\right]$$

as the dependent variables, and the variables in Table A1 as independent candidate variables. A significance level of 0.15 was used for entry or deletion from the calculation. Before executing this regression, the method of Frane et al. [16] was used to estimate the values of missing data. This method employs multivariate linear regression to estimate missing values. The use of this imputational method for potentially important variables with missing data was considered preferable to the alternative approach of deleting these variables from the calculation. Indeed, for the exercise induced ST depression, all studies reviewed had at least one variable with missing data.

Table A3 contains the values of the variables used in the correction formulas for the test group of 303 subjects. Table 2 shows the corrected sensitivities and specificities for the 303 subjects. These corrected indices were applied in Bayes' theorem (equation 1) to arrive at the literature corrected Bayesian model.
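The correction step can be illustrated with a short Python sketch: an index is moved to the logit scale, a fitted linear predictor is evaluated for the conditions of the local laboratory, and the result is transformed back to a probability. The coefficient values, the variable names, and the 0/1 coding below are hypothetical and chosen only for illustration. The paper does not state how each variable in Table A2 was coded, so this sketch does not attempt to reproduce the paper's corrected values.

```python
import math


def logit(p):
    """Log odds, the dependent variable used in the stepwise regression."""
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a linear predictor back to a sensitivity or specificity."""
    return math.exp(x) / (1.0 + math.exp(x))


def corrected_index(intercept, coefficients, values):
    """Evaluate intercept + sum(C * X) and return the corrected index.

    `coefficients` and `values` are dicts keyed by variable name.
    """
    linear = intercept + sum(
        coefficients[name] * values[name] for name in coefficients
    )
    return inverse_logit(linear)


# HYPOTHETICAL coefficients and 0/1 coding (not taken from Table A2).
coefficients = {"tomography_used": 0.5, "blind_reading": -0.2}
local_conditions = {"tomography_used": 0, "blind_reading": 1}

corrected = corrected_index(1.0, coefficients, local_conditions)

# Round trip on a pooled value from Table 2 to show the two transforms agree.
round_trip = inverse_logit(logit(0.667))
print(corrected, round_trip)
```

Working on the logit scale keeps every corrected sensitivity and specificity strictly between 0 and 1, whatever the values of the linear predictor.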

Table A1. Candidate variables entered in multivariate analysis to determine the correction formula for the corrected Bayesian model (numbers in [brackets] are the percent of reports with missing data for that item; [?] marks a value that is illegible in the source)

Exercise induced ST depression (> 1 mm)
132 reports for sensitivity, 122 reports for specificity

(1) Treatment of equivocal tests as normal, abnormal, or their exclusion. [0]
(2) Percent angiographic stenosis defining disease. [3]
(3) Exercise protocol (treadmill vs bicycle ergometer). [3]
(4) Treatment of upsloping ST depression. [5]
(5) Measurement at J point vs 80 msec thereafter. [23]
(6) Number of ECG leads. [9]
(7) Use of pre-exercise hyperventilation. [0]
(8) Reason for terminating exercise. [21]
(9) Exclusion of subjects with prior infarction, left ventricular hypertrophy, bundle branch blocks, resting ST abnormalities. [0]
(10) Exclusion of subjects on digitalis, beta receptor blocking agents, nitrates. [0]
(11) Percent men in study group. [24]
(12) Mean patient age in study group. [31]
(13) Comparison of this test with a proposed "better" test. [?]
(14) Adequate definition of study group. [0]
(15) Presumed absence of work-up bias. [0]
(16) Blind reading of exercise test and angiogram. [0]

Exercise thallium defect
60 reports for sensitivity, 57 reports for specificity

(1) Exercise protocol. [5]
(2) Percent angiographic stenosis defining disease. [0]
(3) Use of tomographic imaging. [0]
(4) Use of background subtraction. [2]
(5) Automation of scintigram reading. [2]
(6) Exclusion of subjects with prior infarction, and those on beta receptor blocking agents. [3]
(7) Exclusion of others to avoid false positive results. [?]
(8) Percent men in study group. [21]
(9) Mean age of study group. [33]
(10) Blind reading of scintigram and angiogram. [0]
(11) Presumed absence of work-up bias. [0]
(12) Thallium dose. [5]

Cardiac fluoroscopy (any calcified vessel)
13 reports for sensitivity and specificity

(1) Percent angiographic stenosis defining disease. [0]
(2) X-ray tube current and voltage used. [38]
(3) Use of cine film. [31]
(4) Number of views used. [31]
(5) Exclusion of patients with prior infarction. [0]
(6) Percent men in study group. [31]
(7) Mean age of study group. [31]
(8) Presumed absence of work-up bias. [0]
(9) Blind reading of fluoroscopy and angiogram. [0]


Table A2. Variables selected by stepwise regression for the sensitivities and specificities of the three tests. Coefficients are in parentheses

Exercise induced ST depression

Sensitivity:
Intercept (0.580)
Equivocal result considered normal (-0.310)
Use of treadmill protocol (-0.230)
ST measured at J point (-0.332)
Pre-exercise hyperventilation used (0.310)
Patients on digitalis excluded (0.187)
Comparison with a "better" test (-0.205)
Study group well defined (0.205)

Specificity:
Intercept (0.389)
Percent men (0.00885)
Prior infarction excluded (-0.258)
Patients on nitrates excluded (-0.252)
Left bundle branch block excluded (0.154)
Blind reading of angiogram (0.212)
Upsloping ST segments abnormal (-0.230)
Hyperventilation used (-0.561)
Exercise terminated with angina

Thallium defect

Sensitivity:
Intercept (2.483)
Exclusions to avoid false positives (0.140)
Blind reading of angiogram (-0.193)
Use of tomography (0.519)
Automatic scan reading (0.162)
Thallium dose (-0.640)
Percent men (0.0131)

Specificity:
Intercept (5.547)
Mean age (-0.0711)
Presumed absence of work-up bias (0.337)

Fluoroscopic coronary calcification

Sensitivity:
Intercept (2.162)
Exclusion of patients with prior infarction (0.359)
Blind fluoroscopic reading (-0.443)

Specificity:
Intercept (1.880)
Disease defined by 50% stenosis (-0.525)
Presumed absence of work-up bias (0.703)

Table A3. Values of the correction variables in the test population

Exercise protocol:                       Treadmill
Percent angiographic stenosis:           50%
Pre-exercise hyperventilation:           Not used
Equivocal results:                       Arbitrary decision made in all cases (not necessarily normal)
Patients excluded:
  With prior infarction:                 Yes
  Left bundle branch block:              Yes
  On digitalis:                          No
  On nitrates:                           No
  To avoid false positives:              No
ST measured:                             80 msec after J point
Upsloping ST 1 mm or greater:            Considered abnormal
Comparison with "better" test:           Not done
Study group well defined:                Yes
Work-up bias avoided:                    Yes
Blind reading of tests:                  Yes
Percent men in test group:               68%
Mean age of test group:                  52 years
Thallium tomography:                     Not used
Semiautomated reading of scintigram:     Used
Thallium dose:                           2 mCi