Specificity, Sensitivity, and Prevalence in the Design of Randomized Trials: A Univariate Analysis
Alfred P. Hallstrom and Gene B. Trobaugh
Department of Biostatistics, University of Washington, Division of Cardiology, Harborview Medical Center, Seattle, Washington
ABSTRACT: The number of cases needed to conduct a randomized trial is related to the sensitivity and specificity of a measurement indicative of a condition, to the prevalence of the condition, to the expected benefit of therapy (or other basis for change), and to the statistical precision desired. Sample size calculations frequently ignore sensitivity and specificity (at least qualitatively), probably because no simple formula is provided in the literature. Such a formula is included here. As an example, the number of patients required for a randomized clinical trial was calculated for a clinical outcome (nonfatal myocardial infarction or coronary artery disease death) used to detect atherosclerotic heart disease and is compared to the sample sizes required for each of three noninvasive diagnostic studies (exercise ECG ST depression, exercise LVEF reduction, and thallium myocardial imaging) performed for the detection of atherosclerotic heart disease. We calculated that the sample size should be much smaller when these diagnostic studies are employed than when the clinical outcome is used, thereby offering the potential for reduced cost and complexity of a randomized clinical trial.
KEYWORDS: specificity, sensitivity, prevalence, power, randomized trials
INTRODUCTION
The number of subjects required in a randomized trial (n) is related to the specificity and sensitivity of the test used to detect a difference in disease prevalence between subject groups, to the expected prevalence of the disease in the control group, to the expected benefit due to therapy (or other basis for change in prevalence), and to the statistical precision desired. Although it is desirable in any trial to have a clear and unequivocal outcome measure, this is usually not possible. This situation exists because the presence of an endpoint is typically assessed by an outcome measure that is not perfectly sensitive or specific for that endpoint. For example, many people may have coronary atherosclerosis and yet be
Received February 27, 1984; revised September 17, 1984
Address reprint requests to: Alfred P. Hallstrom, Ph.D., Department of Biostatistics SC-32, University of Washington, Seattle, WA 98195.
Controlled Clinical Trials 6:128-135 (1985). © Elsevier Science Publishing Co., Inc. 1985, 52 Vanderbilt Ave., New York, New York 10017. 0197-2456/85/$3.30
completely asymptomatic. On the other hand, cardiac death or myocardial infarction indicates a very high (but not absolute) probability of coronary atherosclerosis [1,2]. Certain diagnostic tests for coronary atherosclerosis (exercise ECG ST depression, exercise LVEF reduction, and thallium myocardial imaging) would theoretically permit a reduction in sample size, principally because of a higher sensitivity [3-17]. Since the definition of n is critically dependent on the estimation of sensitivity and specificity for detection of the outcome, a clear basis for estimation of these values is essential.
DEFINITIONS
The "sensitivity" of a test is its ability to detect the condition among those who have it, and the "specificity" of a test is its ability to not indicate the condition among those who do not have it. The "prevalence" of a disease is the percentage of people in a given population who have the disease. The "α-level" of the test is the predetermined level of risk used to set the test for rejection of the null hypothesis, that is, the probability that the null hypothesis will be rejected when in fact the null hypothesis is true. The "power" of the test is the probability that the null hypothesis will be rejected assuming there is a real specified difference (alternate hypothesis).
DETERMINATION OF n: A UNIVARIATE ANALYSIS
Assuming that the number of subjects required will be more than 30 or so, n is usually determined from a normal approximation to a test on proportions [18]. An equation relating the number of subjects needed per group to the α level, power, prevalence, and specified alternate (treatment effect) is

$$n = \frac{\left[Z_\alpha\sqrt{2R(1-R)} + Z_\beta\sqrt{r(1-r) + rg(1-rg)}\right]^2}{(r - rg)^2}, \tag{1}$$
where $Z_\alpha$ = the normal deviate for an α-level test, $Z_\beta$ = the normal deviate for a power of 1 − β, r = the true prevalence of the disease in the target population, g = (100 minus the expected percentage reduction in disease due to treatment) divided by 100 (i.e., the percent reduction = 100(1 − g)), and R = (r + rg)/2.
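As a rough illustration of eq. (1), the following Python sketch computes the number of subjects per group. The function name `cases_per_group` and its argument names are ours and do not appear in the paper. Treating the α-level test as one-sided is an assumption (the text does not say), chosen because it appears to agree with the sample sizes reported in Table 1. The input values are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def cases_per_group(r, g, alpha=0.05, power=0.9):
    """Eq. (1): subjects per group for control prevalence r and treated prevalence r*g.

    Assumption of this sketch: Z_alpha is a one-sided normal deviate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided deviate (assumed)
    z_beta = NormalDist().inv_cdf(power)       # power = 1 - beta
    big_r = (r + r * g) / 2                    # R, the mean of the two prevalences
    numerator = (z_alpha * sqrt(2 * big_r * (1 - big_r))
                 + z_beta * sqrt(r * (1 - r) + r * g * (1 - r * g))) ** 2
    return numerator / (r - r * g) ** 2


# Illustrative values only (not taken from the paper).
n_large_effect = cases_per_group(r=0.30, g=0.50)  # 50% reduction
n_small_effect = cases_per_group(r=0.30, g=0.75)  # 25% reduction
n_low_prevalence = cases_per_group(r=0.10, g=0.50)
print(round(n_large_effect), round(n_small_effect), round(n_low_prevalence))
```

In this sketch, halving the treatment effect more than quadruples n, and lowering the prevalence from 0.30 to 0.10 also raises n, in line with the qualitative statements in the text.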
The number of cases required increases as the prevalence decreases and also increases in inverse proportion to the square of the treatment effect (i.e., there is a marked increase in the number of subjects needed as the percentage reduction in the disease becomes smaller). When the examination used to detect the disease (and hence to estimate true prevalence) does not have 100% sensitivity and specificity, the above relation is altered because the observed prevalence differs from the true prevalence owing to misclassification. In particular, the observed prevalence is given by

$$\text{observed prevalence} = \text{sensitivity} \times \text{true prevalence} + (1 - \text{specificity}) \times (1 - \text{true prevalence}), \tag{2}$$

or equivalently, solving eq. (2) for the true prevalence,

$$\text{true prevalence} = \frac{\text{observed prevalence} + \text{specificity} - 1}{\text{sensitivity} + \text{specificity} - 1}. \tag{3}$$
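Equations (2) and (3) are inverses of one another, which a short Python sketch can make concrete. The function names and the numerical values below are illustrative assumptions, not taken from the paper.

```python
def observed_prevalence(true_prev, sensitivity, specificity):
    """Eq. (2): prevalence as seen through an imperfect test."""
    return sensitivity * true_prev + (1 - specificity) * (1 - true_prev)


def true_prevalence(observed_prev, sensitivity, specificity):
    """Eq. (3): invert eq. (2); requires sensitivity + specificity != 1."""
    return (observed_prev + specificity - 1) / (sensitivity + specificity - 1)


# Illustrative values: a test that is 95% sensitive and 95% specific,
# applied to a population with a true prevalence of 0.20.
obs = observed_prevalence(0.20, sensitivity=0.95, specificity=0.95)
recovered = true_prevalence(obs, sensitivity=0.95, specificity=0.95)
print(round(obs, 3))        # 0.23
print(round(recovered, 3))  # 0.2
```

Note that eq. (3) is undefined when sensitivity + specificity = 1, that is, when the test carries no information about the condition.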
In this case the number of cases is given by

$$n = \frac{\left\{Z_\alpha\sqrt{2R(1-R)} + Z_\beta\sqrt{(br + a)[1 - (br + a)] + (bgr + a)[1 - (bgr + a)]}\right\}^2}{(br - bgr)^2}, \tag{4}$$

where b = sensitivity + specificity − 1, a = 1 − specificity, and R = (br + a + bgr + a)/2.
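A minimal Python sketch of eq. (4) follows. The function name `cases_per_group_misclassified` is ours, the one-sided treatment of α is an assumption, and the inputs (prevalence 0.20, 35% reduction, 95% sensitivity) are illustrative choices. With sensitivity and specificity both equal to 1, b = 1 and a = 0, so the expression reduces to eq. (1).

```python
from math import sqrt
from statistics import NormalDist


def cases_per_group_misclassified(r, g, sensitivity, specificity,
                                  alpha=0.05, power=0.9):
    """Eq. (4): subjects per group when the outcome measure is imperfect.

    Assumption of this sketch: Z_alpha is a one-sided normal deviate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    b = sensitivity + specificity - 1
    a = 1 - specificity
    p_control = b * r + a      # observed prevalence in the control group
    p_treated = b * g * r + a  # observed prevalence in the treatment group
    big_r = (p_control + p_treated) / 2
    numerator = (z_alpha * sqrt(2 * big_r * (1 - big_r))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treated * (1 - p_treated))) ** 2
    return numerator / (p_control - p_treated) ** 2


# Illustrative inputs: fixed 95% sensitivity, falling specificity.
sizes = [cases_per_group_misclassified(0.20, 0.65, 0.95, spec)
         for spec in (1.00, 0.90, 0.80)]
print([round(n) for n in sizes])
```

Under these assumptions the required n grows as specificity falls, because false positives both shrink the observed difference (br − bgr) and inflate the binomial variances.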
Figures 1 and 2 relate sample size to prevalence. In Figure 1 sensitivity is fixed at 95% and specificity varies. In Figure 2 specificity is fixed at 95% and sensitivity varies. For a moderately high sensitivity and specificity, sample size is much more affected by specificity than by sensitivity over a wide range of prevalence.

Figure 1 Sample size curves for fixed power and sensitivity. (95% sensitivity, 35% reduction, alpha level 0.05, power 0.9; one curve per specificity value, up to 100%; horizontal axis: prevalence; vertical axis scaled from 0 to 15,000.)

Figure 2 Sample size curves for fixed power and specificity. (95% specificity, 35% reduction, alpha level 0.05, power 0.9; curves for sensitivities of 25%, 50%, 70%, 80%, 85%, and 100%; horizontal axis: prevalence, 0.05 to 0.45; vertical axis scaled from 0 to 15,000.)

To illustrate the effects of these variables, we calculated the sample size per group needed for a randomized trial designed to test the lipid hypothesis, that is, that treatment that lowers serum cholesterol will prevent or forestall the progression of coronary atherosclerosis. These calculations were performed for the clinical outcome [myocardial infarction (MI) and/or death (DTH) due to coronary artery disease (CAD)], such as was used in the Coronary Primary Prevention Trial (CPPT) [19], and for three noninvasive diagnostic tests used clinically for the detection of CAD: (1) ≥1 mm ECG ST depression induced by maximal treadmill exercise, (2) the exercise-induced decrement (>5) in left ventricular ejection fraction (LVEF), and (3) exercise thallium myocardial imaging.

The CPPT provides a model for calculation of the prevalence of CAD using selected risk factors [3,19,20,21,22]. The incidence of nonfatal MI or death due to coronary artery disease (CAD DTH) was predicted to be 0.087 over a 7-year span. As was recognized in the design of the CPPT, small errors in estimation of a low incidence of CAD can lead to large errors in the calculated number of subjects needed (see Figs. 1 and 2). Formulas can be used to refine this estimated incidence based upon the Framingham Study [20] as modified by Diamond and Forrester [3]. The projected reduction in the incidence of MI/CAD DTH in the CPPT group assigned to receive the medication to lower serum cholesterol was 35.6%, resulting in an incidence of 0.056 in the treatment group. This projected reduction was largely based on data relating cholesterol levels to MI/CAD DTH rates. The lipid hypothesis, however, relates cholesterol levels to coronary atherosclerosis. Thus to test the lipid hypothesis using MI/CAD DTH as an outcome measure, it is necessary to determine the sensitivity and specificity of MI/CAD DTH for CAD.

Unfortunately, neither myocardial infarction nor sudden cardiac death is a perfect indication that coronary artery disease is present. In a prospective
study by Betriu et al., 3% (8/259) of myocardial infarctions were associated with normal coronary arteries [1]. The percentage appears to be age-dependent: 45% (5/11) of patients with a myocardial infarction who were under 45 years of age had normal coronary arteries, as did 4 of 4 patients under age 30. These findings were in agreement with those of Weaver, who found that 6% of patients (mean age 55 years) catheterized for out-of-hospital ventricular fibrillation had no significant coronary disease [2]. For the purposes of this analysis, we assumed that the specificity is 97.5%.

Absence of symptoms of CAD is recognized clinically to be compatible with the presence of CAD. In the Framingham study [23] a group of men aged 30 to 62 who had serum cholesterol levels in excess of 250 and who were free of CAD on the initial examination were found to have a 14-year incidence ratio of MI to CAD of 36/152 (23.7%). Assuming that there were an additional 20% of cardiac deaths not ascribed to an MI, the ratio of MI/CAD DTH to CAD should be no more than 28.4%. Over a period of 7 years, the sensitivity of MI/CAD DTH for CAD may be about 25%.

Assuming that MI/CAD DTH is being used as a marker for CAD, and that the estimated incidence of MI/CAD DTH is 0.087 in the control group and 0.056 in the treatment group, then the estimated sensitivity (0.25) and specificity (0.975) of MI/CAD DTH for CAD yield an estimated true incidence of CAD of 0.276 and 0.137 (eq. 3). Thus the CPPT may have observed a 50% reduction in the incidence of CAD, from 0.276 to 0.137. The calculated incidence of CAD in 7 years of 0.276 in the control group agrees with the Framingham study above, which found an incidence of CAD of 0.61 (152/250) in 14 years. The CPPT actually found an incidence of MI/CHD DTH of 0.098 in the control group and a reduction of 17.4% for the treatment group.
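The conversion of the projected CPPT incidences into implied CAD incidences can be checked with a few lines of Python. This is only a sketch of the arithmetic in eq. (3); the variable names are ours, and the sensitivity and specificity are the values assumed in the text.

```python
def true_prevalence(observed, sensitivity, specificity):
    """Eq. (3): true prevalence implied by an observed prevalence."""
    return (observed + specificity - 1) / (sensitivity + specificity - 1)


SENS, SPEC = 0.25, 0.975  # values assumed in the text for MI/CAD DTH as a marker of CAD

control = true_prevalence(0.087, SENS, SPEC)  # projected control-group incidence
treated = true_prevalence(0.056, SENS, SPEC)  # projected treatment-group incidence
observed_reduction = 1 - 0.056 / 0.087
true_reduction = 1 - treated / control

print(f"{control:.3f} {treated:.3f}")                    # 0.276 0.138
print(f"{observed_reduction:.1%} {true_reduction:.1%}")  # 35.6% 50.0%
```

The treatment-group value rounds to 0.138 here, against 0.137 in the text; the difference is in the third decimal place only and does not affect the 50% reduction.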
Again assuming that MI/CHD DTH is 97.5% specific and 25% sensitive for CAD, these results would correspond to an incidence of CAD of 0.326 in the control group and a reduction of 23.3% for the treatment group. In Table 1, the sensitivity and specificity for the clinical outcome and three noninvasive tests are compared to the number of cases needed per group based on the original CPPT projections (α = 0.05, β = 0.1, expected CAD prevalence after 7 years in control group of 0.276, and a 50% reduction in CAD in the treatment group) and based on the CPPT results (α = 0.05, β = 0.1, expected CAD prevalence after 7 years in control group of 0.325, and 23.3% reduction in CAD in the treatment group). The latter would apply if a repeat trial were undertaken to verify the results of the CPPT. If a small number of participants with CAD at enrollment are included in each of the control and treatment groups, the effect on the calculation of n is small, since the quantity in the denominator of eq. (4) would be the same [(0.025 + br) − (0.025 + bgr) = br − bgr] while the numerator would only be increased slightly.
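The Table 1 entries for the original CPPT projections can be approximately recomputed from eq. (4). The sketch below is not the authors' program: the function and dictionary names are ours, the one-sided treatment of α is an assumption, and the sensitivity/specificity pairs are our reading of Table 1.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_per_group(r, g, sensitivity, specificity, alpha=0.05, power=0.9):
    """Eq. (4), with a one-sided alpha (an assumption of this sketch)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    b = sensitivity + specificity - 1
    a = 1 - specificity
    p_control = b * r + a      # observed prevalence, control group
    p_treated = b * g * r + a  # observed prevalence, treatment group
    big_r = (p_control + p_treated) / 2
    numerator = (z_alpha * sqrt(2 * big_r * (1 - big_r))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treated * (1 - p_treated))) ** 2
    return numerator / (p_control - p_treated) ** 2


# (sensitivity, specificity) pairs as we read them from Table 1.
measures = {
    "ECG ST depression": (0.75, 0.90),
    "LVEF decrement": (0.90, 0.80),
    "Thallium imaging": (0.85, 0.95),
    "MI/CAD DTH": (0.25, 0.975),
}

# Original CPPT projections: CAD prevalence 0.276, 50% reduction (g = 0.5).
results = {name: cases_per_group(0.276, 0.5, se, sp)
           for name, (se, sp) in measures.items()}
for name, n in results.items():
    print(f"{name}: {ceil(n)}")
```

Under these assumptions the sketch gives values within about 1% of the first sample-size column of Table 1; the small differences presumably reflect rounding of the inputs.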
Table 1 Number of Cases/Group Needed to Test the Lipid Hypothesis for Four Outcome Measures

| Tests | Sensitivity | Specificity | Number of Cases/Group (a) | Number of Cases/Group (b) |
|---|---|---|---|---|
| Exercise-induced ECG ST depression (c) | 75% | 90% | 381 | 1447 |
| Exercise-induced decrement in LVEF (d) | 90% | 80% | 414 | 1466 |
| Planar thallium-201 myocardial imaging (e) | 85% | 95% | 236 | 941 |
| MI/CAD DTH | 25% | 97.5% | 1182 | 4283 |

(a) α = 0.05, β = 0.1, 50% reduction in CAD, prevalence = 0.276.
(b) α = 0.05, β = 0.1, 23.3% reduction in CAD, prevalence = 0.325.
(c) Refs. 3-6.
(d) Refs. 7-13.
(e) Refs. 3, 14-17.

DISCUSSION
We have reviewed the relationship between the number of cases needed to achieve desired significance and power levels in a randomized trial, with special attention to how the sensitivity and specificity of the outcome measure affect this relationship. Our sample calculations demonstrate that the number of cases required is very dependent upon the specificity, particularly when sensitivity is high.

The suitability of using an outcome measure other than death for a clinical trial depends upon the primary purpose of the trial. For example, the exact mechanism for initiation of out-of-hospital ventricular fibrillation (VF) is poorly understood [24,25]. One hypothesis is that ventricular tachycardia (VT) will degenerate into VF. In order to evaluate an antidysrhythmic or other possible therapeutic intervention, sudden cardiac death might be considered as a suitable endpoint, since it is generally assumed that sudden cardiac death (death within seconds) is due to VF. However, sudden cardiac death is only about 95% specific for VF [26] and has an unknown sensitivity for VF. Thus sudden cardiac death might not be the best outcome measure for testing whether an antiarrhythmic which prevents or decreases the frequency or duration of VT results in fewer episodes of VF. An alternative outcome might make use of event recorders.

Most clinical trials are based on a hypothesis concerning processes intermediate between the intervention and death. For example, the CPPT depends upon (1) treatment (e.g., diet or medication) leading to a reduction of cholesterol levels, which will (2) result in a slower progression or (hopefully) a regression in CAD, which will (3) result in fewer myocardial infarctions or cardiac deaths. If the primary purpose of the trial is to establish clinical efficacy of the treatment (e.g., prevention of myocardial infarction or death), then the appropriate outcomes are these endpoints.
However, it will still be necessary to determine if death is caused by a condition other than the one being studied; that is, a determination will have to be made, subject to some small error, if a death was due to factors other than cardiac (e.g., neurologic, which can also be present with sudden death). The issue of determination of when a suitable endpoint has occurred can be difficult, for example, separation of nonatherosclerotic arrhythmic deaths, such as occur in patients with congestive cardiomyopathies or who have idiopathic hypertrophic subaortic stenosis, from coronary-related arrhythmic deaths. A clinical trial with a negative
outcome (prevention of death by treatment could not be established) may have very little power to evaluate the intermediate processes. For example, it would be theoretically possible for a drug to be helpful in preventing atherosclerosis progression without a decrease in cardiac death rates, if the medication also potentiated other mechanisms that could cause an apparent coronary death (sudden, unexpected death), such as by increased frequency of stroke or by potentiation of arrhythmias.

The most direct way of assessing coronary artery constrictions is with coronary angiography, which permits direct visualization and potential for careful measurement of these constrictions. The risk of this invasive study is low as currently performed. Consequently, the role of direct angiographic visualization to assess atherosclerotic progression is also being investigated [27,28], which should substantially reduce the number of patients required for such a study below the numbers indicated here for the noninvasive studies. If it is possible to use an actively determined outcome measure as an endpoint for an investigation, the logistics may be simpler or may make a study feasible because of the reduced numbers of cases required. From the point of view of a funding agency, controlling the cost of a study by limiting the number of patients required is useful in terms of additional opportunities for research [29].
REFERENCES
1. Betriu A, Pare JC, Sanz GA, Casals F, Magriña J, Castañer A, Navarro-Lopez F: Myocardial infarction with normal coronary arteries: A prospective clinical angiographic study. Am J Cardiol 48:28-32, 1981
2. Weaver WD, Lorch GS, Alvarez HA, Cobb LA: Angiographic findings and prognostic indicators in patients resuscitated from sudden cardiac death. Circulation 54:895-900, 1976
3. Diamond GA, Forrester JS: Analysis of probability as an aid in the clinical diagnosis of coronary artery disease. N Engl J Med 300:1350-1358, 1979
4. Chaitman BR, Hanson JS: Comparative sensitivity and specificity of exercise electrocardiographic lead systems. Am J Cardiol 47:1335-1349, 1981
5. Simoons ML, Boom HBK, Smallenburg E: Online processing of orthogonal exercise electrocardiograms. Comput Biomed Res 8:105-117, 1975
6. Simoons ML, Hugenholtz PG, Ascoop CA, Distelbrink CA, de Land PA, Vinke RYM: Quantitation of exercise electrocardiography. Circulation 63:471-475, 1981
7. Okada RD, Pohost GM, Kirshenbaum HD, Kushner FG, Boucher LA, Block PC, Strauss HW: Radionuclide-determined change in pulmonary blood volume with exercise. Improved sensitivity of multigated blood pool scanning in detecting coronary artery disease. N Engl J Med 301:569-576, 1979
8. Caldwell JH, Hamilton GW, Sorensen SG, Ritchie JL, Williams DL, Kennedy JW: The detection of coronary artery disease with radionuclide techniques: A comparison of rest-exercise thallium imaging and ejection fraction responses. Circulation 61:610-619, 1980
9. Reduto LA, Wickemeyer WJ, Young JB, Del Ventura LA, Reid JW, Glaeser DH, Quinones MA, Miller RR: Left ventricular diastolic performance at rest and during exercise in patients with coronary artery disease. Assessment with first-pass radionuclide angiography. Circulation 63:1228-1237, 1981
10. Brady TJ, Thrall JH, Lo K, Pitt B: The importance of adequate exercise in the detection of coronary heart disease by radionuclide ventriculography. J Nucl Med 21:1125-1130, 1980
11. Berger HJ, Reduto LA, Johnstone DE, Borkowski HB, Sands JM, Cohen LS, Langou RA, Gottschalk A, Zaret BL: Global and regional left ventricular response to bicycle exercise in coronary artery disease. Assessment by quantitative radionuclide angiography. Am J Med 66:13-21, 1979
12. Borer JS, Kent KM, Bacharach SL, Green MV, Rosing DR, Seides SF, Epstein SE, Johnston GS: Sensitivity, specificity and predictive accuracy of radionuclide cineangiography during exercise in patients with coronary artery disease, comparison with exercise electrocardiography. Circulation 60:572-580, 1979
13. Rerych SK, Scholz PM, Newman GE, Sabiston DC Jr, Jones RH: Cardiac function at rest and during exercise in normals and in patients with coronary heart disease. Evaluation by radionuclide angiocardiography. Ann Surg 5:449-464, 1978
14. Bailey IK, Griffith LSC, Rouleau J, Strauss HW, Pitt B: Thallium-201 myocardial perfusion imaging at rest and during exercise: Comparative sensitivity to electrocardiography in coronary artery disease. Circulation 55:79-87, 1977
15. Ritchie JL, Trobaugh GB, Hamilton GW, Gould KL, Narahara KA, Murray JA, Williams DL: Myocardial imaging with 201-thallium at rest and during exercise: Comparison with coronary arteriography and resting and stress electrocardiography. Circulation 56:66-71, 1977
16. Trobaugh GB, Hamilton GW: Rest-exercise thallium myocardial imaging for detection and assessment of ischemic heart disease. Prog Nucl Med 6:56-67, 1980
17. Garcia E, Maddahi J, Berman D, Waxman A: Space/time quantitation of thallium-201 myocardial scintigraphy. J Nucl Med 22:309-317, 1981
18. Snedecor GW, Cochran WG: Statistical Methods. Ames, IA: Iowa State University Press, 1967
19. The Lipid Research Clinics Program. The Coronary Primary Prevention Trial: Design and implementation. J Chron Dis 32:609-631, 1979
20. Shurtleff D (ed): The Framingham Study: An Epidemiological Investigation of Cardiovascular Disease. Bethesda, MD: National Heart, Lung and Blood Institute, 1970, sect 26, append B
21. Lipid Research Clinics Program. The Lipid Research Clinics Coronary Primary Prevention Trial results. I. Reduction in incidence of coronary heart disease. JAMA 251:351-364, 1984
22. Lipid Research Clinics Program. The Lipid Research Clinics Coronary Primary Prevention Trial results. II. The relationship of reduction in incidence of coronary heart disease to cholesterol lowering. JAMA 251:365-374, 1984
23. Kannel WB, Castelli WP, Gordon T, McNamara PM: Serum cholesterol, lipoproteins, and the risk of coronary heart disease: The Framingham Study. Ann Intern Med 74:1-12, 1971
24. Cobb LA, Werner JA, Trobaugh GB: Sudden cardiac death: I. A decade's experience with out-of-hospital resuscitation. Mod Concepts Cardiovasc Dis 49:31-36, 1980
25. Cobb LA, Werner JA, Trobaugh GB: Sudden cardiac death: II. Outcome of resuscitation; management, and future directions. Mod Concepts Cardiovasc Dis 49:37-42, 1980
26. Hallstrom AP, Eisenberg MS, Bergner L: The persistence of VF and its implication for evaluation of Emergency Medical Services. Emerg Health Serv Q 4:41-49, 1982
27. Levy RI: The influence of cholestyramine-induced lipid changes on coronary artery disease progression: The NHLBI type II Coronary Intervention Study. Circulation 68:III-188, 1983
28. Nikkila EA, Viikinkoski P, Valle M: Effect of lipid lowering treatment on progression of coronary atherosclerosis. A 7 year prospective angiographic study. Circulation 68:III-188, 1983
29. Levy RI, Sondik EJ: Large scale clinical trials: Are they worth the cost? Ann NY Acad Sci 382:411-421, 1982