
Journal of Electrocardiology

Vol. 25 Supplement

Assessment of Diagnostic ECG Results Using Information and Decision Theory: Results From the CSE Diagnostic Study

Jos L. Willems, MD, PhD

Abstract: Information and decision theory were applied to assess the interpretation results of 15 computer programs (9 electrocardiogram [ECG] and 6 vectorcardiogram) and 9 cardiologists, 8 of whom analyzed the electrocardiogram and 5 the vectorcardiogram, using a database of 1,220 clinically validated cases. The study demonstrates that information content and utility indices, by providing a comprehensive view of diagnostic performance, can enhance the insight given by the more classic statistical performance measures. Key words: electrocardiography, diagnostic classification, accuracy, computerized performance testing.

From the Division of Medical Informatics, University Hospital Gasthuisberg, Leuven, Belgium. Supported by Directorate General XII of the European Commission within the frame of its Third and Fourth Medical and Public Health Research Programmes, and by various funding agencies within the European Member States. Reprint requests: Jos L. Willems, MD, PhD, Division of Medical Informatics, University Hospital Gasthuisberg, 49, Herestraat, 3000 Leuven, Belgium.

Computer analysis of the electrocardiogram (ECG) is widely used.1 However, no systematic assessment of the various programs has been performed on the same database.2 The European project Common Standards for Quantitative Electrocardiography (CSE), a large international study, was undertaken to compare nine electrocardiographic and six vectorcardiographic (VCG) computer programs using 1,220 clinically validated cases. The same recordings were also read by nine cardiologists, eight of whom analyzed the ECG and five the VCG. All individual program and cardiologist results were compared with the "truth," based on ECG-independent evidence, while all programs were also compared with the combined interpretation of the cardiologists.2

Several methods and measures exist for the assessment of diagnostic ECG results with the true condition as a reference. Sensitivity, specificity, receiver operating characteristic (ROC) curves, positive and negative predictive values of test results, and total accuracy are most often used.3 Less well known are the comprehensive measures derived from information and decision theory.4,5 The application of information analysis to diagnostic testing derives from the recognition that all clinical tests are imperfect; this imperfection introduces uncertainty, or noise, into the interpretation. Decision theory and utility functions take into account the clinical value and relevance of the diagnostic test results. Both approaches have been applied in the CSE diagnostic study. Results derived by means of classical statistical analysis methods have recently been reported for ECG interpretation.2 This paper aims to summarize the main results obtained by means of information content and utility analysis.

Materials and Methods

Patient Material and ECG Processing

The CSE diagnostic study was launched in 1985 as a follow-up to the first CSE study, the objectives of which were the development of standards and the assessment of the performance of ECG measurement programs.6 A pilot database consisting of 250 validated cases and various statistical methods were developed first. Details have been previously described. The database was expanded to 500 cases in 1987-88, and in 1989-90 it was expanded to its final size of 1,220 cases. The data collection was entirely based on ECG-independent clinical information and was restricted to seven main diagnostic categories. Details on patient material and data analysis by classic evaluation methods have been recently published.2 The following diagnostic groups were included: normals (NL; n = 382); left (LVH; n = 183), right (RVH; n = 55), and biventricular (BVH; n = 53) hypertrophy; anterior (AMI; n = 170), inferior (IMI; n = 273), and combined (MIX; n = 73) myocardial infarction; and combined infarction and hypertrophy (n = 31). The case selection was verified by a panel of three cardiologists. All standard ECG leads and the orthogonal X, Y, and Z leads were simultaneously recorded at a sampling rate of 500 Hz over 10 seconds. Recordings with major conduction defects or poor technical quality were excluded; this was the only reason why the ECGs were screened by the panel. The ECGs were analyzed by eight cardiologists from seven European countries and by nine ECG computer programs. The VCGs were independently read by five cardiologists and by six VCG programs. The ECG programs were: Marquette (2), Hannover (4), Hewlett-Packard (5), Medis (7), Nagoya (8), Glasgow (11), Padova (13), Means (15), and Leuven (16). The VCG programs were: Louvain (3), Hannover (6), Lyon (9), AVA-Pipberger (10), Porto (12), and Means (14). The numbers refer to the program identifiers used in this paper. Details of most of these programs have been recently published in a special issue of Methods of Information in Medicine. Programs 4, 6, 10, and 16 use a statistical approach to classification; all others applied heuristic or deterministic methods.

Only age and sex were given as prior information to all programs and cardiologists, except for program 10, which was given a code indicating one of three conditions (suspicion of coronary disease, hypertensive or valvular disease, and other) to adjust prior probabilities, as required by the program developers.10 In addition to the individual results, combined results were calculated by means of weighted averaging, as previously described.11-13 This was done for the programs (ECG, VCG, and all combined) and also for the cardiologists.
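For illustration only, the short sketch below shows one way such a weighted combination could be carried out; it is not the published CSE combination procedure (which is described in references 11-13). It assumes that every interpreter supplies a score for each diagnostic category and that interpreter-specific weights are available; the category names, scores, and weights in the example are hypothetical.

```python
from collections import defaultdict

CATEGORIES = ["NL", "LVH", "RVH", "BVH", "AMI", "IMI", "MIX"]

def combine_interpretations(scores_per_reader, weights):
    """Weighted average of per-category scores from several interpreters.

    scores_per_reader: list of dicts mapping category -> score (eg, 0-1)
    weights:           one weight per interpreter
    Returns the category with the highest weighted-average score.
    """
    combined = defaultdict(float)
    total_weight = sum(weights)
    for reader_scores, w in zip(scores_per_reader, weights):
        for category, score in reader_scores.items():
            combined[category] += w * score / total_weight
    return max(CATEGORIES, key=lambda c: combined[c])

if __name__ == "__main__":
    # Two hypothetical programs, the second given a higher weight.
    program_a = {"NL": 0.1, "LVH": 0.7, "AMI": 0.2}
    program_b = {"NL": 0.2, "LVH": 0.5, "AMI": 0.3}
    print(combine_interpretations([program_a, program_b], weights=[1.0, 2.0]))
```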

Statistical Analysis

The program and cardiologist results were compared with the true disease category obtained from the independent clinical data. Program results were also compared with the results of the average individual cardiologist and with the combined cardiologist results, that is, the group opinion.2,8 Different classification matrices were calculated: first in a bigroup analysis, that is, normal versus abnormal, left ventricular hypertrophy versus no left ventricular hypertrophy, and so on; next in a 3 × 3 matrix, that is, normal versus infarction versus hypertrophy; then in a 5 × 5 matrix, in which hypertrophy was split into its three respective subgroups (left ventricular, right ventricular, and biventricular hypertrophy); and finally in a full 7 × 7 classification matrix. The information content (I) results were derived from the 2 × 2 classification matrices and plotted for various prevalence values in incremental steps of 5%, according to the formula described by Diamond et al.,5 whereby

I = Σj [ ap log2(ap) + bq log2(bq) − (ap + bq) log2(ap + bq) ] − Σi pi log2 pi

where

a = p(Tj|D+) = true positive rate
b = p(Tj|D−) = false positive rate
p = p(D+) = prevalence of disease
q = p(D−) = 1 − p = prevalence of nondisease

and p was varied in increments of 0.05.

The area under the resulting curve represents the average information content. An ideal test has an I value, expressed in bits, of 1 at a 50% disease prevalence and a maximal average of 0.72.
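As an illustration of how these quantities can be computed, the sketch below (not part of the CSE software) applies the formula above to a single 2 × 2 test result, sweeps the prevalence from 5% to 95% in steps of 5%, and averages the 19 values, as described for the combined results in Table 3; the sensitivity and specificity used in the example are hypothetical.

```python
import math

def information_content(sens, spec, p):
    """Information content (bits) of a binary test at disease prevalence p,
    following Diamond et al.: I = H(D) - H(D|T)."""
    q = 1.0 - p
    # Conditional probabilities of the two test outcomes (positive, negative)
    # given disease present (a_j) and disease absent (b_j).
    a = [sens, 1.0 - sens]   # p(T_j | D+)
    b = [1.0 - spec, spec]   # p(T_j | D-)

    def xlog2(x):
        return x * math.log2(x) if x > 0 else 0.0

    joint_term = sum(xlog2(a[j] * p) + xlog2(b[j] * q)
                     - xlog2(a[j] * p + b[j] * q) for j in range(2))
    prior_entropy = -(xlog2(p) + xlog2(q))   # -sum_i p_i log2 p_i
    return joint_term + prior_entropy

def average_information_content(sens, spec):
    """Average of 19 values computed at prevalences 5%, 10%, ..., 95%."""
    prevalences = [k / 100 for k in range(5, 100, 5)]
    return sum(information_content(sens, spec, p) for p in prevalences) / len(prevalences)

if __name__ == "__main__":
    # Hypothetical program with sensitivity 0.70 and specificity 0.90.
    print(round(information_content(0.70, 0.90, 0.50), 3))    # value at 50% prevalence
    print(round(average_information_content(0.70, 0.90), 3))  # average over 5-95%
```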

Table 1. Cost/Utility scores assigned to each combination of true diagnostic group (NL, LVH, RVH, BVH, AMI, IMI, MIX, VH + MI) and predicted category (range, +10 to -10); correct classifications carry positive scores and clinically important misclassifications carry negative scores.

NL = normals; LVH = left ventricular hypertrophy; RVH = right ventricular hypertrophy; BVH = biventricular hypertrophy; AMI = anterior myocardial infarction; IMI = inferior myocardial infarction; MIX = combined myocardial infarction; VH + MI = ventricular hypertrophy plus myocardial infarction.

Utilities have been derived by weighting the observed frequencies of the classification matrix according to the clinical relevance of the different correct and incorrect classifications, and by summing the results into one overall utility figure for a specific clinical environment, for example, hospital versus ambulatory care. The overall utility figures were derived from the 7 × 7 misclassification results, whereby the number of predicted diagnoses in each cell is multiplied by the "utility score" awarded to that cell, as listed in Table 1, and the products are summed according to the formula:

Utility Index = Σi Σj (uij × nij)

where uij = utility for each respective prediction and nij = number of cases in the respective cell.

The applied utility or cost factor for each prediction is arbitrary and depends on the user and the environment in which the program is to be applied. Thus, utilities for health screening will not be the same as for an acute hospital care environment. Results presented in this paper have been derived for the latter setting. During the CSE Workshop of December 1989, 13 participants (7 cardiologists and 6 noncardiologists) gave their personal utility or respective cost estimates for each of the cells of the 7 × 7 classification matrix, using scores from +10 to -10. They filled in the same form a second time at home. There was a high correlation between their first- and second-round estimates (r2 = 0.88; n = 72). Median estimates obtained from the first round have been used in this paper; these utility estimates are presented in Table 1.
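A minimal sketch of this summation is given below; it is not the original evaluation code, and the 3 × 3 classification counts and utility scores are invented purely to show the bookkeeping (the study itself used the full 7 × 7 matrix with the Table 1 scores).

```python
def utility_index(counts, utilities):
    """Sum of u_ij * n_ij over all cells of the classification matrix.

    counts[i][j]    = number of cases with true group i classified into group j
    utilities[i][j] = utility score awarded to that (true, predicted) combination
    """
    return sum(u * n for row_u, row_n in zip(utilities, counts)
                     for u, n in zip(row_u, row_n))

if __name__ == "__main__":
    # Hypothetical 3-group example (eg, normal, hypertrophy, infarction).
    counts = [
        [300,  40,  10],
        [ 50, 120,  20],
        [ 20,  30, 150],
    ]
    utilities = [
        [ 5.0, -2.0, -6.0],
        [-4.0,  4.0, -5.0],
        [-6.0, -5.0,  8.0],
    ]
    print(utility_index(counts, utilities))
```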

Results

Sensitivity, Specificity, and Total Accuracy

The median specificity of the ECG programs was 91.3% (range, 86.3-97.1%) against 80.9% (range, 71.4-86.6%) for the VCG programs (p < 0.001). Corresponding values for the cardiologists were 96.0% (range, 92.7-97.6%) for the ECG and 80.6% (range, 73.8-97.2%) for the VCG. The median sensitivities for left and right ventricular hypertrophy and for anterior and inferior myocardial infarction were 55.7%, 32.7%, 74.1%, and 65.1%, respectively, for the computer programs, versus 63.4%, 48.5%, 82.9%, and 73.3%, respectively, for the cardiologists (p < 0.02 for all four). The cardiologists showed equal or lower false positive rates than the programs.

Total accuracy varied between 62.0% and 77.3% (median, 69.7%) for the nine ECG programs and between 64.5% and 76.2% (median, 68.2%) for the six VCG programs. Corresponding results were 72.6-81.0% (median, 76.3%) for the eight ECG readers and 66.3-74.4% (median, 70.3%) for the five VCG readers. The programs with the best performance reached almost the same total accuracy as the best cardiologists (Table 2). Programs that used statistical methods for diagnostic classification (ECG programs 4 and 16; VCG programs 6 and 10) generally had a higher diagnostic accuracy than the deterministic programs when compared with the "truth," but the study design may have favored this outcome. Combined results of the programs and of the cardiologists were in almost all cases more accurate than those of the individual programs and individual cardiologists (Table 2).

Table 2. Total Accuracy of the Programs and Cardiologists

Cardiologists                          Programs
No.  ECG (%)     No.  VCG (%)          No.  ECG (%)     No.  VCG (%)
1    75.3        1    67.5             2    69.8        3    66.3
2    77.1        2    74.4             4*   75.8        6*   70.6
3    78.3        3    66.3             5    69.3        9    64.5
4    78.5        4    70.3             7    67.6        10*  76.2
5    75.5        5    73.1             8    65.6        12   64.3
6    73.8        C    74.8             11   69.7        14   70.2
7    72.6                              13   62.0        C    77.0
8    81.0                              15   69.8
C    79.2                              16*  77.3
                                       C    76.3

* Statistically based programs. C = respective combined results.

Fig. 1. The information content results obtained by the combined ECG and VCG results of the programs and cardiologists for the classification of the abnormal (ABN) versus normal (NSA) cases, plotted against disease prevalence (%). The curves shown are Comb-Ref-ECG, Comb-Ref-VCG, Comb-Prog-ECG, Comb-Prog-VCG, Overall-Comb-Ref, and Overall-Comb-Prog. Note that abnormal means classification into one of the six abnormal classes included in this study. The figure demonstrates that the classification rate of the VCG was significantly lower than that of the ECG.

Information Content Results

An information content curve, depicting the results for the classification of normal versus abnormal, is shown in Figure 1 for the combined results of the ECG, the VCG, and all programs, versus the cardiologists. From these curves it is apparent that ECG interpretation was superior to VCG interpretation, both for programs and for cardiologists. Figure 1 also shows that the combined program results were almost as good as those of the cardiologists.

Figure 2 depicts the individual average information content results for the classification of normal versus abnormal. Note that "abnormal" means classification into one of the six other disease categories. It is apparent that the results of the various programs differ widely. The performance of the four statistical programs was significantly better than that of the other programs. Their results were almost as good as those of the best cardiologists when tested against the clinical evidence. The results of the VCG readers, labeled V1 to V5, were significantly inferior to those of the ECG readers, labeled E1 to E8. This is mainly due to the lower specificity of VCG analysis. Median results and ranges of the information content data for the other groups are presented in Table 3. The highest discrimination was found for the infarction categories, followed by left ventricular hypertrophy. Results for right ventricular and biventricular hypertrophy were significantly lower.

Fig. 2. The average information content for the classification of abnormal (ABN) versus normal (NSA) cases for the various programs and cardiologists. The combined ECG program results are labeled with E, the combined VCG program results with V, and the overall combined program results with P. The combined ECG results of the cardiologists are labeled with EC and their combined VCG results with VC. The results of individual ECG readers are labeled E1-E8, and those of the VCG readers V1-V5.


Table 3. Average Information Content (in bits × 100) for the Main Diagnostic Categories

                       Combined ECG                  Combined VCG                  Individual     Individual
                       Programs     Cardiologists    Programs     Cardiologists    Programs       Cardiologists
NL versus abnormal     33 ± 3       38 ± 3*          29 ± 3       28 ± 3           25 (38-15)     32 (44-18)
LVH versus no-LVH      27 ± 3       27 ± 3           27 ± 3       24 ± 3†          20 (33-13)     24 (28-17)
RVH versus no-RVH      16 ± 4       20 ± 4           21 ± 4       21 ± 4           11 (28-5)      18 (22-16)
BVH versus no-BVH      20 ± 4       18 ± 4           18 ± 4       17 ± 4           12 (31-10)     15 (30-7)
AMI versus no-AMI      38 ± 3       46 ± 4†          36 ± 3       34 ± 3*          33 (35-19)     41 (46-26)
IMI versus no-IMI      30 ± 3       36 ± 3†          35 ± 3       31 ± 3‡          25 (39-14)     30 (41-22)
MIX versus no-MIX      30 ± 4       35 ± 5*          34 ± 5       36 ± 5*          24 (34-17)     30 (36-24)

Differences between programs and cardiologists: * p < 0.05; † p < 0.01; ‡ p < 0.001. Averages of 19 discrete values computed at prevalence increments of 5%, that is, from 5% to 95%, and standard deviations are given for the combined results. Medians and ranges are given for the individual program (n = 15) and cardiologist (n = 13) results. Abbreviations as in Table 1.

Utility Results

Overall utility results derived from the 7 × 7 misclassification matrices are depicted in Figure 3. Utility results correlated highly with total accuracy; the determination coefficient (r2) was 0.896 for the cardiologists and 0.865 for the programs. However, the slope of the curves was steeper for the programs than for the cardiologists. Utility scores for the combined program results and for the statistical program results were almost as high as those of the best cardiologists. Figure 3 also shows that some programs had a significantly lower performance. In general, the VCG programs and VCG readers showed lower utility scores, as well as lower total accuracies, than the ECG programs and ECG readers.

Fig. 3. The total utility index (in arbitrary units) of the cardiologists and of the different programs, plotted against total accuracy. The individual program results are labeled with the numbers 2-16, as previously explained. Labeling of the combined program results is the same as in Figure 2.

Discussion

A standardization and performance assessment task such as that undertaken by the CSE Working Party is difficult and can only be accomplished through international collaboration of clinicians and computer specialists. By including contributions and expertise from different centers with an international reputation, this study was not restricted to a "special ECG school," in the hope that its results would be widely accepted by the medical community as well as by industry.


The combined referee results presented in this study demonstrate that, on the current CSE database, a total accuracy of approximately 80% can at best be reached by routine ECG and VCG analysis. From this, one can conclude that the limits of standard electrocardiography may have been reached and that novel diagnostic methods need to be developed to improve accuracy further, as suggested by several investigators.14-16 In the CSE diagnostic study, we have demonstrated that several standard ECG computer programs perform at almost the same level as the cardiologists in the classification of seven main diagnostic categories. However, the results also demonstrate that all the programs can still be improved and that some require overall improvement. The four programs that used statistical methods had a higher diagnostic accuracy than the deterministic programs. However, the study design may have favored this outcome to some extent, as explained in previous publications.2,8 This study shows that information content and utility indices, by providing a comprehensive view of diagnostic performance, can enhance the insight given by the more classic statistical performance measures. It is beyond the scope of this paper to discuss the differences in the diagnostic results obtained from ECG versus VCG analysis, as these will be reported in detail elsewhere.17

The complexity of evaluating the performance of computer ECG programs leads to the conclusion that providing one single summary measure of quality is not sufficient. As stated in a previous paper,3 we feel that the diversity of the relevant characteristics of such an evaluation leads to a multi-parameter description of performance, which enables a potential program user to draw conclusions with respect to his/her particular needs. In principle, the field of application of an ECG computer program ranges from mass screening and use as a tool in epidemiological or clinical research studies to use within a particular clinical setting. These settings determine which aspects of the multi-parameter evaluation are of specific importance. Although the use of utilities is still quite controversial, decision theory and utility analysis can help, as demonstrated in this paper. Finally, this study shows that comparative testing of diagnostic computer programs is necessary for the improvement of computerized electrocardiography and for consumer protection, as advocated by Pipberger and Cornfield almost 20 years ago.18

References

1. Drazen E, Mann N, Borun R et al: Survey of computer-assisted electrocardiography in the United States. J Electrocardiol 21(suppl):98, 1988
2. Willems JL, Abreu-Lima C, Arnaud P et al: The diagnostic performance of computer programs for the interpretation of electrocardiograms. N Engl J Med 325:1767, 1991
3. Michaelis J, Wellek S, Willems JL: Reference standards for software evaluation. Meth Inf Med 29:289, 1990
4. Smets P, Willems JL, Talmon J et al: Methodology for the comparison of various diagnostic procedures. Biometrie-Praximetrie 15:3, 1975
5. Diamond GA, Hirsch M, Forrester JS et al: Application of information theory to clinical diagnostic testing: the electrocardiographic stress test. Circulation 63:915, 1981
6. Willems JL, Arnaud P, van Bemmel JH et al: Common standards for quantitative electrocardiography: goals and main results. Meth Inf Med 29:263, 1990
7. Willems JL, Campbell G, Bailey JJ: Progress on the CSE diagnostic study: application of McNemar's test revisited. J Electrocardiol 22:135, 1989
8. Willems JL: Common standards for quantitative electrocardiography: 10th and final CSE progress report. Acco, Leuven, 1990
9. van Bemmel JH, Willems JL: Standardization and validation of medical decision-support systems: the CSE project (editorial). Meth Inf Med 29:261, 1990
10. Pipberger HV, McCaughan D, Littmann D et al: Clinical application of a second generation electrocardiographic computer program. Am J Cardiol 35:597, 1975
11. Willems JL, Abreu-Lima C, Arnaud P et al: Effect of combining electrocardiographic interpretation results on diagnostic accuracy. Eur Heart J 9:1348, 1988
12. Kors JA, van Herpen G, Willems JL, van Bemmel JH: Improvement of automated electrocardiographic diagnosis by combination of computer interpretations of the electrocardiogram and vectorcardiogram. Am J Cardiol 70:96, 1992
13. Milliken JA, Burggraf G, McCans J et al: Evaluating ECG interpreters against expert majority and clinical data. p. 341. In Laks M (ed): Computerized interpretation of the electrocardiogram VII. Engineering Foundation, New York, 1982
14. Simonson E, Tuna N, Okamoto N et al: Diagnostic accuracy of the electrocardiogram and vectorcardiogram: a cooperative study. Am J Cardiol 17:829, 1966
15. Kornreich F, Block P, Bourgain R et al: Multigroup diagnosis classification with a new "maximal" lead system. p. 121. In Abel H (ed): Electrocardiology: advances in cardiology. S. Karger, Basel, 1976
16. Rautaharju PM, Blackburn HW, Wolf HK, Horacek M: Computers in clinical electrocardiology: is vectorcardiography becoming obsolete? p. 143. In Abel H (ed): Electrocardiology: advances in cardiology. S. Karger, Basel, 1976
17. Willems JL, Abreu-Lima C, Arnaud P et al: Comparison of VCG and ECG results obtained by computer and cardiologists. (in press)
18. Pipberger HV, Cornfield J: What ECG-computer program to choose for clinical application: the need for consumer protection. Circulation 47:918, 1973