Nonparametric Comparison of Entire ROC Curves for Computerized ECG Left Ventricular Hypertrophy Algorithms Using Data From the Framingham Heart Study
Gregory Campbell, PhD, Daniel Levy, MD, Alice Lausier, MA, Martha R. Horton, MSc, and James J. Bailey, MD
Abstract: A computer program may be capable of several different statements for left ventricular hypertrophy (eg, possible LVH, probable LVH, consistent with LVH), but such statements resulting from discretized levels of sensitivity/specificity would represent only isolated points on a receiver-operating characteristic (ROC) curve, which is a plot of all levels of sensitivity versus specificity. Even if two algorithms use the same discrete scales, their performances may not readily be compared. The authors present a comparison methodology for ROC curves using ROC area as a nonparametric measure of the ability of the algorithm to separate the two populations; the ROC area ranges from 0.5 (no ability) to 1.0 (perfect separation) and is unbiased if the normal versus abnormal populations have no common values for the measurement. The methodology compares the performance of ECG algorithms on the same population of cases by testing for significant differences of ROC areas and incorporating correlation of the algorithms in a nonparametric way. To illustrate this methodology, they use ECG and echocardiographic data from the Framingham Heart Study. Key words: ROC curves, electrocardiogram, echocardiogram, left ventricular hypertrophy.
From the Division of Computer Research and Technology, National Institutes of Health, and the Framingham Heart Study, National Heart, Lung, and Blood Institute, Bethesda, Maryland. Reprint requests: James J. Bailey, MD, Building 12A, Room 2041, National Institutes of Health, Bethesda, MD 20892.

Typically, an ECG diagnostic algorithm uses a threshold on a measurement or combination of measurements, such as QRS amplitudes, durations, and T-wave changes, to determine left ventricular hypertrophy (LVH). This threshold may establish a sensitivity for a fixed level of specificity or, alternatively, may achieve the maximal separation of two populations (ie, normal vs. abnormal) by classical discriminant or logistic discriminant analysis, thereby making assumptions about the data that may not be valid. Such thresholds represent only single values for sensitivity and specificity. Hence, the comparability of two algorithms for LVH on the basis of their different, single values for sensitivity and specificity is questionable, since any change in the levels of sensitivity and/or specificity might produce a different conclusion. An additional difficulty is introduced when an ECG algorithm renders qualified statements such as "possible LVH," "probable LVH," or "consistent with diagnosis LVH," resulting in several different levels of sensitivity and specificity. Such statements resulting from discretized levels of sensitivity/specificity would represent only isolated points on a receiver operating characteristic (ROC) curve, a plot of all levels of sensitivity versus specificity. Even if two algorithms use the same discrete scales, their performances may not readily be compared. We present a comparison methodology for ROC curves that uses ROC area as a nonparametric measure of the ability of an algorithm to separate the two populations; the ROC area ranges from 0.5 (no ability) to 1.0 (perfect separation) and is unbiased if there are no members of the two populations with the same value of the measurement. We illustrate this methodology by comparing LVH algorithms on the same patient population using the differences in areas under the ROC curves and incorporating the correlation of the two algorithms in a nonparametric way.
Methods and Materials

The collection of echocardiographic data in the Framingham study has been described,1,2 as has the collection of ECG data and the processing methods used by the IBM (V2 M0) program to produce measurements.3-5 To analyze the complete ROC curve, it is necessary to choose an LVH algorithm that could theoretically produce a continuous parameter. This is necessary to eliminate or minimize tied values between normal and abnormal cases, which would require a much more complicated statistical analysis. Two recently published algorithms that fulfill this requirement are the ECG left ventricular mass index (ECG-LVMI), developed by Rautaharju et al.,6 and the Cornell voltage criteria.7 Voltage measured by the Sokolow-Lyon method can also be used and is of some historical interest.8 The equations used for the statistical comparison of the areas under the ROC curves are given in the Appendix.9 Other statistical analyses were carried out using SAS implemented on NIH mainframe computers.10
Discrete Categorization of Cases Into Normals Versus Abnormals

Figure 1 is a scatterplot of ECG voltage versus echocardiographic left ventricular mass index (gm per height in meters). The difficulty with these data is that it is impossible to separate those cases at the upper normal limit (but truly normal) from mild cases of left ventricular hypertrophy. There tend to be many more such "grey" cases in the Framingham study than would be found in a hospital population. One reason is that many more hospital patients receive ECGs than receive echocardiograms; hence, most of the LVH cases may be first detected by ECG at a time when the LVH is rather more severe. Another reason is that hospital-based clinical investigators tend to form their normal populations from very healthy adults in whom the possibility of cardiac abnormality has been minimized by several different tests. To model this situation, we used parameters from the normal population already defined by Levy et al.1,2 For the well-defined "normals" we accepted those cases whose echo-left ventricular mass index (echo-LVMI) (in gm per meter of height) did not exceed the mean plus 1 standard deviation. For the well-defined "abnormals" we selected those whose echo-LVMI was equal to or greater than the mean plus 4 standard deviations. Figure 2 shows the normal and abnormal populations defined in this way, with men and women grouped together. (The 109 abnormals have echo-LVMI > 125 gm/m.)
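As an illustration only (the study's analyses used SAS), the following minimal Python sketch expresses this selection rule; the array name echo_lvmi and the reference mean and standard deviation are placeholders for the values of the normal population defined by Levy et al.1,2

```python
import numpy as np

def select_discrete_groups(echo_lvmi, ref_mean, ref_sd):
    """Split subjects into well-defined normals and abnormals by echo-LVMI.

    echo_lvmi : LV mass (gm) / height (m) for each subject (hypothetical array).
    ref_mean, ref_sd : mean and SD of echo-LVMI in the reference normal population.
    Returns boolean masks; subjects in the grey zone belong to neither group.
    """
    x = np.asarray(echo_lvmi, dtype=float)
    normals = x <= ref_mean + 1.0 * ref_sd    # at most mean + 1 SD
    abnormals = x >= ref_mean + 4.0 * ref_sd  # at least mean + 4 SD
    return normals, abnormals
```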
Results

Table 1 shows the results of applying the published thresholds for the ECG-LVMI,6 Cornell voltage criteria,7 and Sokolow-Lyon voltage criteria (S1 + R5 > 35 mm)8 to the ECGs of the two populations. The sensitivity/specificity values for Sokolow-Lyon are similar to those reported by Reichek and Devereux8 and Casale et al.7 The sensitivity/specificity values for the Cornell voltage criteria are similar to those reported by Casale et al.7 Rautaharju et al.6 estimated LVH prevalence at 36.6% by ECG-LVMI criteria, as compared with 12.2% by Cornell criteria. Indeed, the ECG-LVMI criteria in Table 1 appear to be more sensitive than the Cornell criteria; however, the comparison is not at the same specificity. If the ECG-LVMI criteria thresholds are adjusted to higher levels so that the specificities are the same, as shown in the last line of Table 1, then the sensitivity of the Cornell criteria appears to be somewhat higher. To test this difference in sensitivities for significance, one can apply McNemar's estimate for chi-square:11,12

χ² = (B − C)²/(B + C)
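Applied to the discordant counts in Table 2 (B = 16 abnormals detected by the Cornell criteria but missed by the adjusted ECG-LVMI criteria; C = 11 with the reverse pattern), the formula gives

\[
\chi^2 = \frac{(B - C)^2}{B + C} = \frac{(16 - 11)^2}{16 + 11} = \frac{25}{27} \approx 0.926,
\]

the value used in the comparison below.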
Fig. 1. Scatterplot of Cornell voltage versus echo-LVM index (LV mass [gm]/height [m]). Measurements of ECG voltages were obtained by the IBM program.

Table 2 illustrates a comparison of the sensitivities, given equal specificities (ie, 92.6%), of the ECG-LVMI (adjusted) criteria versus the Cornell voltage criteria. The result of McNemar's test (χ² = 0.926, p = 0.335) shows little evidence that the sensitivities of the two algorithms differ when the empirical specificity is held at 92.6%. However, what if the two algorithms were compared at some other level of specificity? Figure 3 shows the entire ROC curves for the Cornell voltage, ECG-LVM index, and Sokolow-Lyon voltage. From this graph one can immediately estimate differences in sensitivity for a given level of specificity, or vice versa. But the question of the significance of a difference would require application of McNemar's formula again and again for each level of specificity (or sensitivity) chosen. Suppose McNemar's test revealed
significant differences at some levels but not at others. How could such results be interpreted? An alternative is to regard the ROC as a single entity and to use the area under the curve as a global nonparametric performance measure for each algorithm. Table 3 shows the result of applying the statistical methodology in the Appendix to the differences in ROC areas shown in Figure 3. Different formulas for the ECG-LVMI were used for men and for women. (The ROC curves in Figure 3 and the results in Table 3 are for men only. The ROC curves for females show similar results.) The inference derived from the ROC curve analysis is that while both the Cornell voltage and ECG-LVM index differ significantly from the Sokolow-Lyon voltage, they do not differ significantly from each other.
Table 1. Sensitivities and Specificities for Four ECG Criteria

Criteria                        Sensitivity (N = 109)   Specificity (N = 378)
Sokolow-Lyon criteria           20.2%                   92.1%
ECG-LVMI criteria               67.9%                   75.1%
Cornell voltage criteria        39.4%                   92.6%
ECG-LVMI criteria, adjusted     34.9%                   92.6%

Table 2. Detection of 109 Abnormal Patients Using Two ECG Criteria With Fixed Specificity of 92.6%

                                (Adjusted) ECG-LVMI Criteria
Cornell Voltage Criteria        True +      False -     Total
True +                          A = 27      B = 16      43
False -                         C = 11      D = 55      66
Total                           38          71          109

Fig. 2. Scatterplot of cases selected to obtain discrete normal and abnormal populations, men and women together (see text).

Fig. 3. Receiver operating characteristic (ROC) curves for Cornell voltage, ECG-LVM index, and Sokolow-Lyon voltage (plotted as sensitivity versus 1 − specificity).

Table 3. ROC Statistics for 217 Men

                          Correlation vs. Echo-LVMI   Area Under ROC Curve
Cornell voltage           0.4960                      0.8378
ECG-LVM index             0.5543                      0.7976
Sokolow-Lyon voltage      0.2190                      0.5993

Unadjusted values for ROC area differences

                                    z Statistic   p Value
Cornell voltage - ECG-LVMI          0.850         0.39
Cornell voltage - Sokolow-Lyon      3.37          0.0008
ECG-LVMI - Sokolow-Lyon             3.30          0.001
Discussion

The distribution of LV masses in the Framingham study demonstrates that it is unrealistic to categorize LVH discretely; wherever the cutoff between normal and abnormal cases was drawn, it would be purely arbitrary, and any given case could end up on the wrong side of the border and be misclassified. Clearly, patients progress continuously from normal LV mass, to the upper normal limit, to mild LVH, and to moderate LVH before reaching severe LVH. Nevertheless, LVH has historically been treated as a discrete diagnostic category. Therefore, we modeled this situation by extracting the appropriate data from the Framingham Heart Study. The fact that the sensitivity/specificity values for the Cornell and Sokolow-Lyon criteria are similar to those previously published7,8 suggests that this was an appropriate model for the usual situation in hospital series. However, one must recognize that exclusion of the middle data must introduce a bias, just as excluding certain obese subjects because of technically unsatisfactory echocardiograms also introduces a bias. Normalizing echo-LV mass to body surface area would also introduce a bias, since obese subjects are more likely to have LVH but less likely to have LVH voltage. Normalizing echo-LV mass to height, as was done here, should reduce that problem.1,2

ROC curves could be derived from the total data in Figure 1 if a cutoff could be chosen that could effectively classify cases. Clearly, any ECG algorithm applied to the total data will show a poorer tradeoff between sensitivity and specificity than when it is applied to the data in this discretized model, which has the grey-zone data in the middle removed. The scatterplot of the total data suggests that ECG voltage does not correlate very highly with the echo-LVM index. One can derive similar scatterplots for other ECG algorithms, which have relatively poor correlations, as was shown in Table 3 and by Reichek and Devereux.8 Nevertheless, if a cutoff were selected, ROC curves for the data in Figure 1 could be derived, and the methodology described in the Appendix could still be applied. Clearly, however, the methodology is most easily applied to normal versus abnormal categories defined with few or no grey-zone cases. In that situation, the results of ROC curve analysis are easier to understand and interpret.
References

1. Levy D, Savage DD, Garrison RJ et al: Echocardiographic criteria for left ventricular hypertrophy: the Framingham Heart Study. Am J Cardiol 59:956, 1987
2. Levy D, Anderson KM, Savage DD et al: Echocardiographically detected left ventricular hypertrophy: prevalence and risk factors. Ann Intern Med 108:7, 1988
3. Levy D, Bailey JJ, Garrison RJ et al: Electrocardiographic changes with advancing age: a cross-sectional study of the association of age with QRS axis, duration, and voltage. J Electrocardiol 20(suppl):44, 1987
4. Levy D, Savage DD, Garrison RJ et al: Left ventricular hypertrophy: the Framingham study. p. 273. In Bailey JJ (ed): Computerized interpretation of the electrocardiogram: X. Engineering Foundation, New York, 1986
5. Garcia R, Breneman GM, Goldstein S: Electrocardiographic computer analysis: practical value of the IBM Bonner-2 (V2 M0) programs. J Electrocardiol 14:283, 1981
6. Rautaharju PM, LaCroix AZ, Savage DD et al: Electrocardiographic estimate of left ventricular mass versus radiographic cardiac size and the risk of cardiovascular disease mortality in the epidemiologic follow-up study of the first National Health and Nutrition Examination Survey. Am J Cardiol 62:59, 1988
7. Casale PN, Devereux RB, Alonso DR et al: Improved sex-specific criteria of left ventricular hypertrophy for clinical and computer interpretation of electrocardiograms: validation with autopsy findings. Circulation 75:565, 1987
8. Reichek N, Devereux RB: Left ventricular hypertrophy: relationship of anatomic, echocardiographic, and electrocardiographic findings. Circulation 63:1391, 1981
9. Campbell G, Douglas MA, Bailey JJ: Nonparametric comparison of two tests of cardiac function on the same patient population using the entire ROC curve. p. 267. In Ripley KL (ed): Computers in cardiology. IEEE Computer Society, Silver Spring, MD, 1989
10. SAS user's guide: statistics, version 5 ed. SAS Institute Inc., Cary, NC, 1985
11. Bailey JJ, Campbell G, Horton MR et al: How can statistically significant differences in the performance of ECG diagnostic algorithms be determined? p. 35. In Ripley KL (ed): Computers in cardiology. IEEE Computer Society, Silver Spring, MD, 1988
12. Bailey JJ, Campbell G, Horton MR et al: Determination of statistically significant differences in the performance of ECG diagnostic algorithms: an improved method. J Electrocardiol 21(suppl):188, 1988
Appendix

Consider two parameters or tests, 1 and 2, of cardiac function. To compare the performances of these two diagnostic tests, the areas under the ROC curves can be compared. Let \(X_i\) and \(Y_j\) denote the results of test 1 on patient \(i\) in the healthy group and patient \(j\) in the diseased group, and let \(U_i\) and \(V_j\) denote the results of test 2 for the same \(i, j\) patients (\(i = 1, \ldots, m\); \(j = 1, \ldots, n\)). Let \(\mathrm{AREA}_i\) denote the ROC area for test \(i\) (\(i = 1, 2\)):

\[
\mathrm{AREA}_1 = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I\{X_i < Y_j\},
\qquad
\mathrm{AREA}_2 = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I\{U_i < V_j\},
\]

where

\[
I\{x < y\} =
\begin{cases}
1, & \text{if } x < y\\
0, & \text{if } x \ge y.
\end{cases}
\]

In general, the variance of the difference is given by

\[
\sigma^2 = \operatorname{Var}(\mathrm{AREA}_1 - \mathrm{AREA}_2)
= \sigma_1^2 + \sigma_2^2 - 2\operatorname{Cov}(\mathrm{AREA}_1, \mathrm{AREA}_2),
\tag{1}
\]

where \(\sigma_i^2 = \operatorname{Var}(\mathrm{AREA}_i)\), \(i = 1, 2\). If the patients are different for the two tests, then the covariance is automatically zero. Now \(\sigma_1^2\) is given by

\[
\sigma_1^2 = \frac{1}{mn}\left\{\theta_1(1 - \theta_1) + (n - 1)\left[p_{12} - \theta_1^2\right] + (m - 1)\left[p_{21} - \theta_1^2\right]\right\}.
\tag{2}
\]

Since the true area \(\theta_1 = P(X_i < Y_j)\) is unknown, estimate it with \(\mathrm{AREA}_1\). Similarly, estimate \(p_{12} = P(X_i < Y_j,\ X_i < Y_{j'})\) with

\[
\hat{p}_{12} = \frac{1}{mn(n-1)}\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{j' \ne j} I\{X_i < Y_j\}\, I\{X_i < Y_{j'}\}
\]

and estimate \(p_{21} = P(X_i < Y_j,\ X_{i'} < Y_j)\) with

\[
\hat{p}_{21} = \frac{1}{m(m-1)n}\sum_{i=1}^{m}\sum_{i' \ne i}\sum_{j=1}^{n} I\{X_i < Y_j\}\, I\{X_{i'} < Y_j\}.
\]

For the variance \(\sigma_2^2\) of \(\mathrm{AREA}_2\), the estimation is similar, replacing the \(X\)s and \(Y\)s by \(U\)s and \(V\)s.

In the event that the same patients are in both tests, the exact expression for the covariance is

\[
\operatorname{Cov}(\mathrm{AREA}_1, \mathrm{AREA}_2)
= \frac{1}{mn}\left\{(P_{22} - \theta_1\theta_2) + (m - 1)(P_{21} - \theta_1\theta_2) + (n - 1)(P_{12} - \theta_1\theta_2)\right\},
\tag{3}
\]

where \(\theta_1 = P(X_i < Y_j)\) and \(\theta_2 = P(U_i < V_j)\). Estimate \(\theta_i\) by \(\mathrm{AREA}_i\), for \(i = 1, 2\). Estimate \(P_{12} = P(X_i < Y_j,\ U_i < V_{j'})\) with

\[
\hat{P}_{12} = \frac{1}{mn(n-1)}\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{j' \ne j} I\{X_i < Y_j\}\, I\{U_i < V_{j'}\},
\]

estimate \(P_{21} = P(X_i < Y_j,\ U_{i'} < V_j)\) with

\[
\hat{P}_{21} = \frac{1}{m(m-1)n}\sum_{i=1}^{m}\sum_{i' \ne i}\sum_{j=1}^{n} I\{X_i < Y_j\}\, I\{U_{i'} < V_j\},
\]

and estimate \(P_{22} = P(X_i < Y_j,\ U_i < V_j)\) with

\[
\hat{P}_{22} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I\{X_i < Y_j\}\, I\{U_i < V_j\}.
\]

The estimates of the variances and covariance in equations 1-3 are obtained by replacing the above theoretical quantities by their respective estimates, yielding the estimate \(\hat{\sigma}^2\) for \(\sigma^2\). The test statistic

\[
Z = \frac{\mathrm{AREA}_1 - \mathrm{AREA}_2}{\hat{\sigma}}
\]

is then compared with the standard normal distribution to compare the two ROC areas.
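For readers who wish to reproduce the calculation, the following is a minimal Python sketch of the Appendix procedure (the study's own analyses were carried out in SAS on NIH mainframe computers; all function and variable names here are illustrative, not part of the original software). It computes the two nonparametric areas, the plug-in estimates of equations 1-3, and the Z statistic for paired data.

```python
import numpy as np


def roc_area(x, y):
    """Nonparametric ROC area: (1/(m*n)) * sum_i sum_j I{x_i < y_j} (ties count as 0, as in the Appendix)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(x[:, None] < y[None, :]))


def area_variance(x, y):
    """Plug-in estimate of Var(AREA) from equation 2."""
    ind = (np.asarray(x, float)[:, None] < np.asarray(y, float)[None, :]).astype(float)
    m, n = ind.shape
    theta = ind.mean()                                                 # AREA, the estimate of theta_1
    s = ind.sum()
    p12 = (ind.sum(axis=1) @ ind.sum(axis=1) - s) / (m * n * (n - 1))  # same i, j != j'
    p21 = (ind.sum(axis=0) @ ind.sum(axis=0) - s) / (m * (m - 1) * n)  # i != i', same j
    return (theta * (1 - theta)
            + (n - 1) * (p12 - theta ** 2)
            + (m - 1) * (p21 - theta ** 2)) / (m * n)


def area_covariance(x, y, u, v):
    """Plug-in estimate of Cov(AREA1, AREA2) from equation 3 (same patients measured by both tests)."""
    a = (np.asarray(x, float)[:, None] < np.asarray(y, float)[None, :]).astype(float)
    b = (np.asarray(u, float)[:, None] < np.asarray(v, float)[None, :]).astype(float)
    m, n = a.shape
    t1, t2 = a.mean(), b.mean()                                        # theta_1, theta_2 (the two areas)
    s = (a * b).sum()
    p22 = s / (m * n)                                                  # same i, same j
    p12 = (a.sum(axis=1) @ b.sum(axis=1) - s) / (m * n * (n - 1))      # same i, j != j'
    p21 = (a.sum(axis=0) @ b.sum(axis=0) - s) / (m * (m - 1) * n)      # i != i', same j
    return ((p22 - t1 * t2)
            + (m - 1) * (p21 - t1 * t2)
            + (n - 1) * (p12 - t1 * t2)) / (m * n)


def compare_roc_areas(x, y, u, v):
    """Return (AREA1, AREA2, Z), where Z = (AREA1 - AREA2) / sigma_hat from equation 1."""
    a1, a2 = roc_area(x, y), roc_area(u, v)
    var_diff = area_variance(x, y) + area_variance(u, v) - 2.0 * area_covariance(x, y, u, v)
    return a1, a2, (a1 - a2) / np.sqrt(var_diff)


# Hypothetical example: x, u are two ECG measures on the normals; y, v are the same measures on the abnormals.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 378)
u = x + rng.normal(0.0, 0.7, 378)
y = rng.normal(1.5, 1.0, 109)
v = y + rng.normal(0.0, 0.7, 109)
print(compare_roc_areas(x, y, u, v))
```

The example at the end uses simulated data with the same group sizes as in this study (378 normals, 109 abnormals) purely to show the calling convention; it does not reproduce the Framingham results.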