Comparison of Three Bayesian Methods to Estimate Posttest Probability in Patients Undergoing Exercise Stress Testing
Anthony P. Morise, MD, and Robert D. Duval, PhD
To determine whether recent refinements in Bayesian methods have led to improved diagnostic ability, 3 methods using Bayes' theorem and the independence assumption for estimating posttest probability after exercise stress testing were compared. Each method differed in the number of variables considered in the posttest probability estimate (method A = 5, method B = 6 and method C = 15). Method C is better known as CADENZA. There were 436 patients (250 men and 186 women) who underwent stress testing (135 had concurrent thallium scintigraphy) followed within 2 months by coronary arteriography. Coronary artery disease ([CAD], at least 1 vessel with ≥50% diameter narrowing) was seen in 169 (38%). Mean pretest probabilities using each method were not different. However, the mean posttest probabilities for CADENZA were significantly greater than those for method A or B (p <0.0001). Each decile of posttest probability was compared to the actual prevalence of CAD in that decile. At posttest probabilities <20%, there was underestimation of CAD. However, at posttest probabilities ≥60%, there was overestimation of CAD by all methods, especially CADENZA. Comparison of sensitivity and specificity at every fifth percentile of posttest probability revealed that CADENZA was significantly more sensitive and less specific than methods A and B. Therefore, at lower probability thresholds, CADENZA was a better screening method. However, methods A or B still had merit as a means to confirm higher probabilities generated by CADENZA (especially ≥60%). These results may have an impact on how these methods are used in clinical decision making. (Am J Cardiol 1989;64:1117-1122)
Conditional probability analysis has allowed for the development of methods to improve the interpretation of diagnostic testing.¹ Using the sequential application of Bayes' theorem, methods have been developed to convert test responses into estimates of probability of disease presence.²⁻⁴ Diamond and Forrester³ were the first to develop and apply this approach to noninvasive test interpretation of patients with suspected coronary artery disease (CAD). Several methods now exist for applying these techniques to the evaluation of this population.⁵⁻⁷ A refinement of this original approach has recently appeared⁷ and this method requires a desktop microprocessor to enter the data and perform the analysis. We evaluated whether and to what extent this refinement has led to improvement in the diagnostic ability and utility of this approach.
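As an illustration of the sequential application of Bayes' theorem described above, the following minimal Python sketch updates a pretest probability one test at a time while treating the tests as independent. The function names and the sensitivity and specificity values are hypothetical; they are chosen only for illustration and are not taken from any of the 3 methods compared in this report.

```python
def bayes_update(prior, sensitivity, specificity, positive):
    """Return the posttest probability of disease after one test result."""
    if positive:
        p_result_disease = sensitivity
        p_result_no_disease = 1.0 - specificity
    else:
        p_result_disease = 1.0 - sensitivity
        p_result_no_disease = specificity
    numerator = prior * p_result_disease
    return numerator / (numerator + (1.0 - prior) * p_result_no_disease)


def sequential_posttest(pretest, results):
    """Apply Bayes' theorem test by test, assuming the tests are independent."""
    probability = pretest
    for sensitivity, specificity, positive in results:
        probability = bayes_update(probability, sensitivity, specificity, positive)
    return probability


# Hypothetical inputs: (assumed sensitivity, assumed specificity, test positive?)
stress_ecg = (0.65, 0.85, True)
thallium = (0.85, 0.90, True)

after_ecg = sequential_posttest(0.40, [stress_ecg])
after_both = sequential_posttest(0.40, [stress_ecg, thallium])
print(round(after_ecg, 3), round(after_both, 3))  # prints 0.743 0.961
```

Under the independence assumption the order of the tests does not matter, because each result simply multiplies the odds of disease by its own likelihood ratio.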
METHODS
Patient population: Between June 1981 and October 1988, 1,681 patients were referred to our stress laboratory for the express purpose of evaluation for the presence or absence of CAD. None of the patients had a history of prior myocardial infarction or coronary arteriography. In addition, resting electrocardiograms revealed no findings diagnostic of previous myocardial infarction, such as significant Q waves. Within 2 months after stress testing, 436 of these patients underwent coronary arteriography. These 436 patients comprise the study population of this report.
Clinical information: Clinical data were collected during the initial preexercise test interview by a technician instructed in gathering the information, and later confirmed by 1 of us by reviewing the patient's medical record. Recorded data included patient's age, sex, symptoms and risk factors. Chest pain was classified according to the categories used by Diamond⁸: (1) typical angina (substernal, exertional, relieved by rest or nitroglycerin); (2) atypical angina (2 of these features); (3) nonanginal chest pain (1 or none of these features); and (4) asymptomatic (no pain). Risk factors recorded were systolic blood pressure and histories of high cholesterol, current cigarette smoking and diabetes mellitus.
Resting electrocardiogram: These were classified as normal, equivocal or abnormal. Abnormal electrocardiograms demonstrated ST-T wave changes consistent with left ventricular hypertrophy, ischemia or left bundle branch block. Equivocal tracings had nonspecific ST-T wave changes not consistent with any of these diagnoses.

From the Section of Cardiology, Department of Medicine, West Virginia University School of Medicine, Morgantown, West Virginia. This study was supported in part by a grant-in-aid from the American Heart Association, West Virginia Affiliate. Manuscript received May 22, 1989; revised manuscript received August 4, 1989, and accepted August 6.
Address for reprints: Anthony P. Morise, MD, Section of Cardiology, Health Sciences Center, West Virginia University, Morgantown, West Virginia 26506.
THE AMERICAN JOURNAL OF CARDIOLOGY NOVEMBER 15, 1989
COMPARISON OF BAYESIAN METHODS
Exercise tests: Most patients exercised using the standard Bruce treadmill protocol. The standard Naughton treadmill protocol was used by 38 patients and 18 used an arm ergometer protocol. The following data were collected from the stress studies: exercise time (minutes), resting and maximal heart rates, maximal systolic blood pressure (manual stethoscope with cuff sphygmomanometry), exercise-induced symptoms (anginal vs nonanginal), millimeters of ST-segment depression and R-wave amplitude changes (V5 rest minus V5 stress). ST-segment changes were recorded for all except those with abnormal resting electrocardiograms. Exercise-related ST-segment changes were measured 0.08 second after the J point and compared to the baseline between 2 PR segments. ST segments were also categorized as to whether they were upsloping, horizontal or downsloping in nature. All studies were read by 1 of us without knowledge of the historical, scintigraphic or angiographic data. Concurrent thallium scintigraphic studies were performed on 135 patients. The images were read by 1 of us (APM) and a radiologist and classified as normal or abnormal with fixed or reversible defects.
Coronary angiography: Angiograms were read by 2 cardiologists (APM one of them). Differences were resolved by consensus. CAD was defined as the presence of ≥1 vessel with >50% diameter narrowing.
Probability algorithms: Three methods were compared, all of which used both Bayes' theorem as defined within the original reports and the independence assumption.⁹ The first method (method A), initially reported by Diamond and Forrester in 1979,³,⁵ used a tabular format and considered 5 variables: age, sex, symptoms, millimeters of horizontal or downsloping ST depression and thallium results. The second method (method B) was a variation of method A reported by Patterson et al⁶ in 1982. It used a graphic format for the same variables as method A except that exercise-induced angina was also considered. In addition, exercise ST changes and thallium testing had fewer groups into which responses were categorized. This was considered to be a simplification of
[Figure 1 appears here. Panel titles: PRETEST PROBABILITY (A) and FINAL POSTTEST PROBABILITY (B); horizontal axis: DECILE (1 to 10); vertical axis: number of patients.]
FIGURE 1. A, B and C refer to the 3 different methods compared. The number of patients in each decile of pretest (A) and posttest (B) probability for each method.
THE AMERICAN JOURNAL OF CARDIOLOGY VOLUME 64
method A with the addition of a new variable. Method C (CADENZA), the third method, was a further variation. CADENZA used a computer format based on the sequential application of Bayes' theorem and was developed by Diamond et al.⁷ In addition to the already mentioned variables, it considered Framingham risk factors, exercise time, heart rate and blood pressure responses, and more detailed entries for ST-segment, QRS and thallium data. Other testing modalities, such as coronary fluoroscopy, were available for entry of results, but these were not considered in this study.
Method of comparison and analysis: Data from all 436 patients were used in this study with no exclusions for normality of resting electrocardiogram, medications or predetermined adequacy of stress testing. Each method was used to generate 4 probabilities that were then analyzed: pretest probability, stress electrocardiogram posttest probability, thallium posttest probability (only for those who underwent that study) and final posttest probability. The final posttest probability represented the probability after all available noninvasive studies were performed. For some, it was the same as the stress electrocardiogram posttest probability and for those who underwent thallium studies, it was the thallium posttest probability. Depending on the analysis undertaken, final posttest probabilities were divided into subgroups (e.g., deciles). Mean pretest and posttest probabilities for each method were compared by analysis of variance. Means were expressed as the mean ± 1 standard deviation with 95% confidence limits in parentheses. Final posttest probability as deciles was compared to the actual prevalence of CAD in that decile. Sensitivity and specificity for each method at each fifth percentile of final posttest probability were evaluated. Formulas for the pertinent fractions were as follows:
Sensitivity (true positive fraction) = true positives / (true positives + false negatives),
Specificity (true negative fraction) = true negatives / (true negatives + false positives).
Proportions were expressed as percentages with the raw proportion and the 95% confidence limits in parentheses. Comparison of proportions was performed using nonparametric comparison testing.
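The evaluation of sensitivity and specificity at each fifth percentile of posttest probability can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the rule that a probability at or above the threshold counts as a positive result, and the 8 example patients are all hypothetical and are not study data.

```python
def sensitivity_specificity(probabilities, has_cad, threshold):
    """Treat probability >= threshold as a positive result (an assumed rule)."""
    tp = fp = tn = fn = 0
    for probability, diseased in zip(probabilities, has_cad):
        predicted_positive = probability >= threshold
        if predicted_positive and diseased:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif diseased:
            fn += 1
        else:
            tn += 1
    # True positive fraction and true negative fraction, as defined in the text.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical final posttest probabilities and angiographic results (True = CAD).
posttest = [0.05, 0.12, 0.30, 0.45, 0.62, 0.71, 0.88, 0.95]
cad = [False, False, True, False, True, False, True, True]

# Thresholds at every fifth percentile of posttest probability: 5%, 10%, ..., 95%.
curve = {t: sensitivity_specificity(posttest, cad, t / 100) for t in range(5, 100, 5)}
print(curve[10], curve[60])  # prints (1.0, 0.25) (0.75, 0.75)
```

Raising the threshold trades sensitivity for specificity, which is why the methods are compared across the whole range of thresholds rather than at a single cutpoint.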
[Figure 2 appears here; caption not recovered. Vertical axis: 0 to 100; horizontal axis: DECILE.]
RESULTS
Population: Of the 436 patients who met the criteria for this study, 250 (57%) were men with an average age of 53 ± 12 years (95% confidence interval: 52 to 55) and 186 (43%) were women with an average age of 55 ± 10 years (95% confidence interval: 54 to 57). Symptom classes showed the following distribution: typical angina in 99 (23%), atypical angina in 182 (42%), nonanginal chest pain in 132 (30%) and asymptomatic in 23 (6%). Catheterization results revealed that 169 of the 436 (38%) patients had CAD; 107 men (43%) and 64 (34%) women had CAD. By respective symptom class, CAD was seen in 54% (typical), 37% (atypical), 27% (nonanginal) and 52% (asymptomatic).
Comparison of probabilities: Table I lists the mean pretest probability and stress electrocardiogram, thallium and final posttest probabilities for the 3 methods. While the 3 pretest probability results were not different, CADENZA consistently demonstrated significantly higher posttest probabilities compared to the other 2 methods.

TABLE I Comparison of Pre- and Posttest Probabilities
Values are mean ± SD (95% CL).

| Probability | Method A | A vs B | Method B | B vs C | Method C | A vs C |
|---|---|---|---|---|---|---|
| Pretest | 42 ± 30 (39-45) | NS | 43 ± 29 (40-46) | NS | 40 ± 31 (38-45) | NS |
| Post stress ECG | 34 ± 32 (31-37) | NS | 37 ± 34 ([illegible]) | * | 48 ± 36 (44-51) | * |
| Post thallium (n = 135) | 30 ± 34 (24-36) | NS | 30 ± 32 (25-36) | * | 47 ± 38 (41-54) | * |
| Final | 34 ± 33 (31-37) | NS | 36 ± 34 (33-39) | * | 48 ± 37 (45-52) | * |

* p ...; NS = difference not significant.

Comparison to actual incidence of coronary artery disease: Figure 1 shows the distribution of patients in each decile for pretest probability and final posttest probability. Figure 1A shows a similar distribution of pretest probability for each method and a larger number of patients in the lower deciles reflecting the large percentage of our sample with atypical and nonanginal
[Figure 3 appears here. Horizontal axis: POSTTEST PROBABILITY (0 to 100).]
FIGURE 3. A, B and C refer to the 3 different methods compared. The true positive fraction or sensitivity (A) and the true negative fraction or specificity (B) are plotted over the entire range of final posttest probabilities.
symptoms. Figure 1B shows that each method moved a large number of patients to either the lower or higher deciles with methods A and B distributing relatively more patients to the lower deciles. Concerning the actual incidence of CAD, Figure 2 shows how each of the methods compared to the others and to the expected results for each decile of final posttest probability. All methods underestimated the incidence of CAD in the first decile. By the third decile, there was accurate prediction by each method. Beyond the fourth decile, there was overestimation by each method. CADENZA was the worst in this respect and method A was slightly better than method B. The differences between methods A and B in the higher deciles were not statistically significant. However, when the highest 4 deciles (60 to 99%) were pooled and each method compared, a difference was noted between method A and CADENZA (83/119 = 70% vs 105/180 = 58%; p <0.05).
Comparison of sensitivity and specificity: Figure 3 shows sensitivity (true positive fraction) and specificity (true negative fraction) as a function of the final posttest probability for each method. Across the entire range of posttest probability, CADENZA was more sensitive and less specific than methods A and B (p
DISCUSSION
Comparison of three methods: For the 3 methods evaluated, there appear to be significant differences between the results obtained with methods A and B and those of CADENZA. The simpler methods A and B were, in fact, virtually the same technique, differing only by 1 extra variable for B, in the number of categories for test results, and in the need for visual estimate of probabilities. Therefore, the equivalence of their sensitivity and specificity over all thresholds of posttest probability should be no real surprise. As a result, because method A preceded method B, methods A and B will be referred to as method A for the remainder of this discussion. However, CADENZA differed dramatically from the other methods and these differences depended on the threshold used. Previous studies using method A to evaluate its utility have considered 10% as the cutpoint between those with and without CAD.⁴,¹⁰,¹¹ At this particular threshold, CADENZA was a better discriminator than method A. Unfortunately, the same cannot be said concerning CADENZA's ability at the opposite end of the diagnostic spectrum. Higher posttest probabilities with CADENZA were significantly less likely to be true positives when compared to those derived from method A. Therefore, whereas CADENZA may well be a better screening method due to its sensitivity, method A is a better confirmer of disease presence due to its specificity. Given the relative merits of each method, both could be used (i.e., 1 to first exclude or include disease presence [CADENZA] and then if included, 1 to confirm [method A] disease presence).
Figure 2 shows that as the deciles increased, the divergence between CADENZA and method A increased. This is likely an example of deterioration of diagnostic accuracy due to failure of the independence assumption. CADENZA used 3 times as many variables as method A. Fryback has demonstrated that as variable number increases, degradation of performance due to overweighting of redundant information is more likely to occur. Detrano et al¹² have, however, demonstrated independence of most of the variables used in CADENZA, but exceptions to this included the pairs age/coronary calcifications and sex/ST-segment changes. Only the latter of these would have influenced this study. Considering this, we inspected the CADENZA software at the point just before using Bayes' formula to calculate the probability based on the ST-segment result. Here the prior probability was modified by a factor that varied depending on sex. This was not noted for any of the other test variables we considered. This was possibly an attempt by the investigators who created the software to adjust for the apparent dependence of ST results on sex. Methods A and B do not have any such adjustments, and whether it made any difference to CADENZA's results is unclear. Other methods that do not depend on the independence assumption are being developed.¹³,¹⁴ These methods usually involve logistic regression or discriminant function analysis as the means to determine posttest probability. It is hoped that by using a pool of uniform data from a large number of patients, discriminant functions or a logistic regression equation can be generated that will have universal applicability.¹³
Accuracy of Bayesian methods: Many studies have evaluated the effectiveness of techniques using Bayes' theorem and the independence assumption for the purpose of separating those patients with from those without CAD.⁴,¹⁰⁻¹²,¹⁵⁻²⁰ Studies have compared posttest probabilities of CAD presence to the actual incidence of CAD and found positive relations. These were performed using populations from previously published series,³ the Coronary Artery Surgery Study population²¹ and smaller defined populations.⁴,⁷,¹⁵,²² Our own data also indicated this same positive relation (Figure 2) but, in addition, emphasized the point that Bayesian methods that use the independence assumption seem to underestimate CAD prevalence at lower posttest probabilities and overestimate CAD prevalence at higher posttest probabilities. Other studies have also noted this drawback.⁴,¹⁵ Therefore, our study yielded data consistent with that found elsewhere although our population differed in the overall prevalence of CAD. To our knowledge, our study population's CAD prevalence of 38% (population of both men and women) was lower than that found in all previously published studies that have evaluated Bayesian methods. This was so even though our population was also derived from those referred for catheterization. Prevalences in previous reports have ranged from 44 to 73% with most >55%.
One way that our population differed from most others was that all of our patients were sent by their referring physicians (internists, family practitioners and cardiologists) for stress testing before any decisions concerning catheterization. Most other study populations were derived from patients referred for catheterization with no prior indication of whether stress testing was done previously or whether it was used in the decision to refer for catheterization. The method or rationale in previous studies by which patients were selected or referred for catheterization was usually not discussed except that they were felt to be candidates by their referring physicians. Our study likewise did not look at the methods used to determine referral, but we do know that stress testing may have had a role because all patients received it. Whereas our study, as well as all others, suffered from what has been called pretest referral bias,²³ we do know that all of our patients were drawn from the population with suspected CAD. Those patients who were not felt to be appropriate for stress testing and who were referred directly for catheterization were not considered in our study. This group (not referred to us) may have a higher prevalence of CAD and might explain why our prevalence was lower. Nonetheless, concerning the accuracy of Bayesian methods, our population with a lower prevalence yielded similar conclusions to those studies using higher prevalence populations. Therefore, our data lend support to another assumption of Bayesian methods, referred to as the cross-institution assumption.¹⁵ This assumption refers to the applicability of test data at 1 institution to patients at another institution. Nevertheless, despite this support, the cross-institution assumption (which remains an assumption) as well as referral bias will remain confounding factors that need to be tolerated.
For ethical reasons, it will never be possible to perform accuracy studies on the real population that needs to be studied (i.e., all of those presenting with suspected CAD). Our own group of 436 patients was drawn from a consecutive series of 1,681 patients with suspected CAD. The actual prevalence in the entire group of 1,681 was unknown, but it was likely to be significantly <38%. We make this statement based on the fact that the mean pretest probability (method A) for the 1,245 patients who did not undergo catheterization was 24.7% and for the 436 patients in our study was 41.6%. The fact that the prevalence in our study group was lower than any other published report on this subject suggests that our sample population was closer to representing the original group than those for most studies. However, the pretest probabilities suggest that our study group was still very different from the entire group from which it was drawn.
Because this particular study dealt only with comparing the 3 Bayesian methods and their ability to detect either the presence or absence of CAD, we cannot comment on the assessment of CAD severity and its effects on our results. Further studies at our institution will consider this important variable.
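The pretest-probability argument made earlier in this discussion can be made concrete with a weighted mean. The short Python sketch below combines the 2 reported group means (24.7% for the 1,245 patients not catheterized and 41.6% for the 436 study patients) into a mean for all 1,681 patients. The function name is hypothetical, and reading the result as an indication of CAD prevalence assumes that method A pretest probabilities are reasonably calibrated, which this study did not establish for the patients who were not catheterized.

```python
def pooled_mean(groups):
    """Weighted mean of group means, weighting each group by its size."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n


# (number of patients, mean method A pretest probability in %), as reported in the text.
not_catheterized = (1245, 24.7)
catheterized = (436, 41.6)

overall = pooled_mean([not_catheterized, catheterized])
print(f"{overall:.1f}%")  # prints 29.1%
```

A mean pretest probability near 29% for the whole referred series is consistent with the statement that its prevalence of CAD was likely below the 38% observed in the catheterized group.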
Conclusions: Whereas CADENZA demonstrated improved sensitivity over methods A and B, its specificity was inferior to the other methods. Therefore, whereas CADENZA may be a better screening tool to exclude CAD, higher posttest probabilities generated by CADENZA cannot be relied on to confirm the presence of CAD. For this purpose, method A or B is more accurate.
REFERENCES
1. Sox HC. Probability theory in the use of diagnostic tests. An introduction to critical study of the literature. Ann Intern Med 1986;104:60-66.
2. Rifkin RD, Hood WB. Bayesian analysis of electrocardiographic exercise stress testing. N Engl J Med 1977;297:681-686.
3. Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of coronary artery disease. N Engl J Med 1979;300:1350-1358.
4. Diamond GA, Forrester JS, Hirsch M, Staniloff HM, Vas R, Berman DS, Swan HJC. Application of conditional probability analysis to the clinical diagnosis of coronary artery disease. J Clin Invest 1980;65:1210-1221.
5. Staniloff HM, Diamond GA, Freeman MR, Berman DS, Forrester JS. Simplified application of Bayesian analysis to multiple cardiologic tests. Clin Cardiol 1982;5:630-636.
6. Patterson RE, Eng C, Horowitz SF. Practical diagnosis of coronary artery disease: a Bayes' theorem nomogram to correlate clinical data with noninvasive exercise tests. Am J Cardiol 1984;53:252-256.
7. Diamond GA, Staniloff HM, Forrester JS, Pollack BH, Swan HJC. Computer-assisted diagnosis in the noninvasive evaluation of patients with suspected coronary artery disease. JACC 1983;1:444-455.
8. Diamond GA. A clinically relevant classification of chest discomfort. JACC 1983;1:574-575.
9. Fryback DG. Bayes' theorem and conditional nonindependence of data in medical diagnosis. Comput Biomed Res 1978;11:423-434.
10. Melin JA, Wijns W, Vanbutsele RJ, Robert A, DeCoster P, Brasseur LA, Beckers C, Detry JR. Alternative diagnostic strategies for coronary artery disease in women: demonstration of the usefulness and efficiency of probability analysis. Circulation 1985;71:535-542.
11. Santinga JT, Flora J, Maple R, Brymer JF, Pitt B. The determination of the posttest likelihood for coronary disease using Bayes' theorem. J Electrocardiol 1982;15:61-68.
12. Detrano R, Yiannikas J, Salcedo EE, Rincon G, Go RT, Williams G, Leatherman J. Bayesian probability analysis: a prospective demonstration of its clinical utility in diagnosing coronary disease. Circulation 1984;69:541-547.
13. Detrano R, Marcondos G, Froelicher VF. Application of probability analysis in the diagnosis of coronary artery disease. Chest 1988;94:380-385.
14. Hung J, Chaitman BR, Lam J, Lesperance J, Dupras G, Fines P, Cherkaoui O, Robert P, Bourassa M. A logistic regression analysis of multiple noninvasive tests for the prediction of the presence and extent of coronary artery disease in men. Am Heart J 1985;110:460-469.
15. Detrano R, Guppy KH, Abbassi N, Janosi A, Sandhu S, Froelicher V. Reliability of Bayesian probability analysis for predicting coronary artery disease in a Veterans Hospital. J Clin Epidemiol 1988;41:599-605.
16. Dans PE, Weiner JP, Melin JA, Becker LC. Conditional probability in the diagnosis of coronary artery disease: a future tool for eliminating unnecessary testing? South Med J 1983;76:1118-1121.
17. Patterson RE, Horowitz SF, Eng C, Rudin A, Meller J, Halgash DA, Pichard AD, Goldsmith SJ, Herman MV, Gorlin R. Can exercise electrocardiography and thallium-201 myocardial imaging exclude the diagnosis of coronary artery disease? Am J Cardiol 1982;49:1127-1135.
18. Melin JA, Piret LJ, Vanbutsele RJ, Rousseau MF, Cosyns J, Brasseur LA, Beckers C, Detry JR. Diagnostic value of exercise electrocardiography and thallium myocardial scintigraphy in patients without previous myocardial infarction: a Bayesian approach. Circulation 1981;63:1019-1024.
19. Hlatky M, Botvinick E, Brundage B. Diagnostic accuracy of cardiologists compared with probability calculations using Bayes' rule. Am J Cardiol 1982;49:1927-1931.
20. Weintraub WS, Madeira SW, Bodenheimer MM, Seelaus PA, Katz RI, Feldman MS, Agarwal JB, Banka VS, Helfant RH. Critical analysis of the application of Bayes' theorem to sequential testing in the noninvasive diagnosis of coronary artery disease. Am J Cardiol 1984;54:43-49.
21. Diamond GA, Forrester JS. Probability of CAD. Circulation 1982;65:641-642.
22. Greenberg PS, Ellestad MH, Clover RC. Comparison of the multivariate analysis and CADENZA systems for determination of the probability of coronary artery disease. Am J Cardiol 1984;53:493-496.
23. Rozanski A, Berman DS. Silent myocardial ischemia. I. Pathophysiology, frequency of occurrence, and approaches to detection. Am Heart J 1987;114:615-626.