Hypothesis Testing in Noninferiority and Equivalence MRMC ROC Studies


Weijie Chen, PhD, Nicholas A. Petrick, PhD, Berkman Sahiner, PhD

Rationale and Objectives: Conventional multireader multicase receiver operating characteristic (MRMC ROC) methodologies use hypothesis testing to test differences in diagnostic accuracies among several imaging modalities. The general MRMC ROC analysis framework is designed to show that one modality is statistically different among a set of competing modalities (ie, the superiority setting). In practice, one may wish to show that the diagnostic accuracy of a modality is noninferior or equivalent, in a statistical sense, to that of another modality instead of showing its superiority (a higher bar). The purpose of this article is to investigate the appropriate adjustments to the conventional MRMC ROC hypothesis testing methodology for the design and analysis of noninferiority and equivalence hypothesis tests.

Materials and Methods: We present three methodological adjustments to the updated and unified Obuchowski-Rockette (OR)/Dorfman-Berbaum-Metz (DBM) MRMC ROC method for use in statistical noninferiority/equivalence testing: 1) the appropriate statement of the null and alternative hypotheses; 2) a method for analyzing the experimental data; and 3) a method for sizing MRMC noninferiority/equivalence studies. We provide a clinical example to further illustrate the analysis of and sizing/power calculation for noninferiority MRMC ROC studies and give some insights on the interplay of effect size, noninferiority margin parameter, and sample sizes.

Results: We provide detailed analysis and sizing computation procedures for a noninferiority MRMC ROC study using our method adjusted from the updated and unified OR/DBM MRMC method. Likewise, we show that an equivalence hypothesis test is identical to performing two simultaneous noninferiority tests (ie, each modality is noninferior to the other).
Conclusion: Conventional MRMC ROC methodology developed for superiority studies can and should be adjusted appropriately for the design and analysis of noninferiority/equivalence hypothesis tests. In addition, the confidence interval of the difference in diagnostic accuracies is important information and should generally accompany the statistical analysis and any conclusions drawn from the hypothesis testing.

Key Words: MRMC; ROC; noninferiority; hypothesis testing.

©AUR, 2012

Multireader multicase receiver operating characteristic (MRMC ROC) analysis is a popular approach to evaluating and comparing the diagnostic accuracy of medical imaging modalities (1). A commonly used statistical tool for comparing the diagnostic accuracy of two or more imaging modalities is hypothesis testing. Methods of hypothesis testing in MRMC ROC studies have been investigated extensively, for example, the DBM (Dorfman-Berbaum-Metz) method (2,3) and the OR (Obuchowski-Rockette) method (4,5). These methods have been further updated and compared (6,7) and, more recently, the DBM and OR methods have been unified for the analysis (8) and power estimation (9) of multireader ROC studies. In these methods, the null and alternative hypotheses are generally defined as follows: under the null hypothesis (denoted as H0), the diagnostic accuracies (eg, areas under the ROC curve, or AUC) of

Acad Radiol 2012; 19:1158–1165. From the Division of Imaging and Applied Mathematics, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993 (W.C., N.A.P., B.S.). Received November 14, 2011; accepted April 23, 2012. Address correspondence to: W.C. e-mail: weijie.[email protected]. ©AUR, 2012. doi:10.1016/j.acra.2012.04.011


all the modalities are equal; under the alternative hypothesis (denoted as H1), they are not all equal (ie, at least one is significantly different from the others). The goal of the study is to reject the null hypothesis and demonstrate a difference or superiority, the success of which can be claimed by obtaining a P value that is less than a prespecified significance level α (ie, the type I error rate; eg, 0.05). Although superiority of diagnostic accuracy is often the driving force for scientific innovation, meeting this high bar is not always necessary for accepting a new technology into clinical practice. For example, when a new imaging modality offers a lower or equivalent radiation dose to the patient than does the conventional modality and has similar diagnostic accuracy (ie, noninferior performance), it would be appropriate for use in the clinic. Another example could be demonstrating that a computer-aided diagnosis (CAD) system works equally well on images obtained from multiple image acquisition systems, where equivalent performance among the various imaging systems would provide evidence of the robustness of the CAD system across different image acquisition systems. Using hypothesis testing to show noninferiority or equivalence is not trivial, and our intuition on interpreting hypothesis testing results may be misleading. A common misconception is to interpret the failure to reject the null hypothesis as proof that the null hypothesis is correct


(10,11). For example, if a large P value is obtained in the superiority testing setting (ie, H0 cannot be rejected), one may interpret the nonsignificant result as proof that the null hypothesis is correct and then conclude equivalence among the modalities. This is incorrect because, along with differences in diagnostic accuracy, sample size plays an important role in statistical hypothesis testing. Having too small a sample size leads to a nonsignificant P value and failure to reject the null hypothesis even when a difference in diagnostic accuracy exists. The correct interpretation is that there is not enough statistical evidence to show a difference in accuracy among the modalities under study. This could be because there is no real difference among the modalities or because the sample size is not large enough to show the difference in a statistical sense. A key property of hypothesis testing is that the null hypothesis is never proved or established; it can only be disproved by using a test statistic computed from experimental data, thereby establishing the alternative hypothesis (12). Therefore, if the goal of a study is to establish equivalence or noninferiority, equivalence or noninferiority must be set not as the null hypothesis, but as the alternative hypothesis. Methodologies designed for superiority or nonequivalence hypothesis testing must be adjusted for the noninferiority or equivalence settings. Such methodological adjustments have been investigated in the field of pharmaceutical clinical trials, where the endpoint is often binomially distributed (10). ROC endpoints are often used in the evaluation of diagnostic devices in general and medical imaging devices in particular. Methods for noninferiority diagnostic tests on ROC endpoints have been investigated for the AUC (13), the partial AUC (14), a (sensitivity, specificity) pair (15), and other derivative metrics (16).
However, these methods generally do not consider reader variability (ie, they do not apply to MRMC studies). The purpose of this article is to investigate appropriate adjustments of the conventional MRMC ROC methodology, which was designed for the superiority setting, to the design and analysis of noninferiority studies. We base our adjustments on the updated and unified OR/DBM method (9). We describe three methodological adjustments for application to the noninferiority setting: 1) an appropriate statement of the null and alternative hypotheses; 2) a method for analyzing the experimental data; and 3) a method for sizing these studies. We focus on the AUC endpoint, although the conceptual framework is applicable to other endpoints as well. We finally discuss a simple extension of the noninferiority paradigm to show equivalence between two modalities.

MATERIALS AND METHODS

Statement of Hypotheses

Appropriate statement of the null and alternative hypotheses depends on what the investigators intend to show: the null hypothesis is the hypothesis that the study is designed to reject and the alternative hypothesis is the one that the investigators


actually wish to establish. Suppose we want to compare the diagnostic accuracy of two imaging modalities: qe is the diagnostic accuracy for the experimental modality under study and qc is the diagnostic accuracy for the conventional or control modality. Here qe and qc are population AUC parameters representing the expected diagnostic accuracy of a population of physicians on a population of patients for the experimental and conventional modalities, respectively. If we wish to show that the diagnostic accuracy values are different for the two modalities, then H0 is qe = qc and H1 is qe ≠ qc, which is the conventional two-sided hypothesis test setting for nonequivalence testing. If we wish to show that the experimental modality outperforms the conventional modality, then H0 is qe = qc and H1 is qe > qc for superiority testing. If, however, we wish to show that the diagnostic accuracy of the experimental modality is equivalent or noninferior to that of the conventional modality, we need to specify a margin parameter δ, which is constrained to be a positive number for convenience. In an equivalence study, we set H0 as |qe − qc| = δ, meaning that there is a difference of δ between qe and qc, and set H1 as |qe − qc| < δ, meaning that the difference between qe and qc is less than δ (see Fig 1, top, for a pictorial illustration). The margin parameter is needed because it is very difficult, if not impossible, to show statistically that two quantities are exactly equal. Therefore the goal has to be showing that the difference between the two quantities is within some tolerable range. The two modalities would be considered clinically equivalent if the difference between their accuracies is sufficiently small and clinically unimportant.
Similarly, for a noninferiority study, the null hypothesis is defined as qe − qc = −δ, meaning that qe is less than qc by an amount of δ, and the alternative hypothesis is qe − qc > −δ, meaning that qc exceeds qe by no more than δ (ie, the performance of the new modality is no worse than δ below that of the conventional modality; see Fig 1, bottom, for a pictorial illustration). The margin parameter must be prespecified in a clinical study design in order to determine the sample sizes. However, the determination of the margin parameter is beyond the scope of this article because it is decided on scientific or clinical grounds and is application specific. We will use some assumed values to illustrate our method. Appropriate statements of the hypotheses are summarized in Table 1 for the four types of hypothesis testing settings discussed previously.

Figure 1. Illustration of the null and alternative hypotheses in equivalence and noninferiority tests.

TABLE 1. Statements of the Hypotheses in Four Types of Hypothesis Testing Settings

Two-sided:
  Nonequivalence:  H0: qe − qc = 0      H1: qe − qc ≠ 0
  Equivalence:     H0: |qe − qc| = δ    H1: |qe − qc| < δ

One-sided:
  Superiority:     H0: qe − qc = 0      H1: qe − qc > 0
  Noninferiority:  H0: qe − qc = −δ     H1: qe − qc > −δ

Analysis of Data for Noninferiority Tests

We consider a fully crossed design of an MRMC ROC study in which J readers read images from both modalities for N0 nondiseased patients and N1 diseased patients. The AUC of reader j is denoted qej for the experimental modality and qcj for the conventional modality. Obuchowski and Rockette (4,5) proposed a two-way analysis of variance (ANOVA) model (the OR model) treating the diagnostic modality as a fixed effect, and readers, patients, and their interactions with the modality as random effects. Obuchowski and Rockette developed the hypothesis testing method for the nonequivalence setting (ie, H0: qe − qc = 0, H1: qe − qc ≠ 0), the details of which can be found elsewhere (4,5). In practice, this method is often used to demonstrate superiority by showing that there is a significant difference among the diagnostic accuracies and that the experimental modality has better diagnostic accuracy than that of the comparison modalities. The OR method was shown to yield identical results to the DBM method when the two methods agree with respect to the computation of the accuracy measure, the covariance, and the degrees of freedom (DOF) of the F statistic (6). Hillis (7) later investigated a new approach to compute the DOF of the denominator of the F statistic that was shown to be more accurate than the method used in the original OR method.

Here we give the adjustment necessary to adapt the updated and unified OR/DBM analysis methodology to the noninferiority setting. Because we only consider a two-modality problem, we will deal with the t statistic rather than the more general F statistic (the square root of which is the t statistic when there are only two modalities). For a noninferiority test with a prespecified margin parameter δ, the test statistic under the OR model is

$$ t = \frac{\hat{q}_{e\cdot} - \hat{q}_{c\cdot} + \delta}{\hat{s}}, \tag{1} $$

where $\hat{q}_{i\cdot}$ is the estimated average reader AUC for modality i (i = e, c), the dot symbol represents an average over readers, and $\hat{s}$ is the estimated standard deviation (SD) of $\hat{q}_{e\cdot} - \hat{q}_{c\cdot} + \delta$,

$$ \hat{s} = \sqrt{\frac{2}{J}\left[MS(T{*}R) + J\,H(\widehat{\mathrm{cov}}_2 - \widehat{\mathrm{cov}}_3)\right]}, \tag{2} $$

where the function H is defined as H(x) = x if x > 0 and H(x) = 0 otherwise, $\widehat{\mathrm{cov}}_2$ is the estimated covariance in the diagnostic accuracies of different readers in the same modality, $\widehat{\mathrm{cov}}_3$ is the estimated covariance in the diagnostic accuracies of different readers in different modalities, and MS(T*R) is the two-way ANOVA test-by-reader mean square in the OR model,

$$ MS(T{*}R) = \frac{1}{2(J-1)}\sum_{j=1}^{J}\left[(\hat{q}_{ej} - \hat{q}_{cj}) - (\hat{q}_{e\cdot} - \hat{q}_{c\cdot})\right]^2 . $$

These parameters can be estimated either by the OR method or by transforming the DBM outputs; see previous work (8) on how to transform the DBM outputs into OR parameters. Note that the SD of $\hat{q}_{e\cdot} - \hat{q}_{c\cdot} + \delta$ is identical to the SD of $\hat{q}_{e\cdot} - \hat{q}_{c\cdot}$, which was derived in the superiority setting (5,8,9). Under the null hypothesis in the noninferiority setting, the test statistic t follows a Student's t distribution with df0 degrees of freedom (density function denoted f(t; df0 | H0)), where df0 can be estimated, by applying Hillis's method (7), as

$$ \widehat{df}_0 = \frac{\left[MS(T{*}R) + H\!\left(J(\widehat{\mathrm{cov}}_2 - \widehat{\mathrm{cov}}_3)\right)\right]^2}{MS(T{*}R)^2/(J-1)}. \tag{3} $$

These mean squares and covariances can also be transformed from the DBM outputs (8). Given a significance level α, one can compute the cutoff value $t_c = F^{-1}(1-\alpha/2;\, \widehat{df}_0 \mid H_0)$, where F(t; df0 | H0) is the cumulative distribution function of the test statistic t under the null hypothesis H0, which is a Student's t distribution with df0 degrees of freedom. One can also compute the exact P value using $P = 2\left(1 - F(t;\, \widehat{df}_0 \mid H_0)\right)$. If the observed test statistic t is greater than tc, or equivalently if the P value is less than the significance level α, we reject the null hypothesis and conclude that the diagnostic accuracy of the experimental imaging modality is noninferior to that of the conventional modality at significance level α.

The confidence interval (CI) for the difference in diagnostic accuracies is very useful information to accompany the binary conclusion drawn from the hypothesis testing (11). The approximate 100(1−α)% CI of $\hat{q}_{e\cdot} - \hat{q}_{c\cdot}$ is

$$ \left(\hat{q}_{e\cdot} - \hat{q}_{c\cdot} - \hat{s}\,t_{\alpha/2,\,\widehat{df}_0},\;\; \hat{q}_{e\cdot} - \hat{q}_{c\cdot} + \hat{s}\,t_{\alpha/2,\,\widehat{df}_0}\right). \tag{4} $$

It is clear that the lower bound of the 100(1−α)% CI is larger than −δ if and only if the null hypothesis is rejected at significance level α. However, the CI conveys more information than the binary conclusion drawn from hypothesis testing, as it additionally quantifies the uncertainty concerning the true difference in a meaningful way (11).
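The analysis of Eqs. 1–4 can be sketched in a few dozen lines. The snippet below is an illustration, not the authors' software: Student's t CDF is hand-rolled via the regularized incomplete beta function and the quantile is found by bisection, so only the Python standard library is needed. The input values are the estimates from the Van Dyke example analyzed later in the Results section (the average-reader AUC difference of 0.04 is the rounded effect size reported there).

```python
import math

def _betacf(a, b, x, max_iter=300, eps=1e-13, tiny=1e-300):
    """Continued fraction for the regularized incomplete beta (Lentz's method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny: d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))        # even step
        d = 1.0 + aa * d
        if abs(d) < tiny: d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny: c = tiny
        d = 1.0 / d
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))  # odd step
        d = 1.0 + aa * d
        if abs(d) < tiny: d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny: c = tiny
        d = 1.0 / d
        dele = d * c
        h *= dele
        if abs(dele - 1.0) < eps:
            break
    return h

def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0: return 0.0
    if x >= 1.0: return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def t_cdf(t, df):
    """CDF of Student's t with (possibly fractional) df degrees of freedom."""
    p = 0.5 * _betainc(df / 2.0, 0.5, df / (df + t * t))  # P(|T| > |t|) / 2
    return 1.0 - p if t > 0 else p

def t_ppf(p, df):
    """Student's t quantile by bisection on the CDF."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if t_cdf(mid, df) < p else (lo, mid)
    return 0.5 * (lo + hi)

# ---- Noninferiority analysis (Eqs. 1-4), Van Dyke-example estimates ----
J        = 5             # readers
auc_diff = 0.0400        # estimated q_e. - q_c. (average-reader AUC difference)
ms_tr    = 0.000622731   # MS(T*R)
cov2     = 0.000346505
cov3     = 0.000221453
delta    = 0.01          # noninferiority margin
alpha    = 0.05

H = lambda x: max(x, 0.0)                                           # H(x)
s_hat  = math.sqrt((2.0 / J) * (ms_tr + J * H(cov2 - cov3)))        # Eq. 2
t_stat = (auc_diff + delta) / s_hat                                 # Eq. 1
df0    = (ms_tr + H(J * (cov2 - cov3))) ** 2 / (ms_tr ** 2 / (J - 1))  # Eq. 3
p_value = 2.0 * (1.0 - t_cdf(t_stat, df0))
t_crit  = t_ppf(1.0 - alpha / 2.0, df0)
ci = (auc_diff - s_hat * t_crit, auc_diff + s_hat * t_crit)         # Eq. 4
# Noninferiority is concluded when p_value < alpha, equivalently ci[0] > -delta.
```

Running this reproduces, up to rounding of the inputs, the values reported in the Results section for δ = 0.01: t ≈ 2.24, P ≈ .040, and a 95% CI lower bound of about −0.0073, which indeed exceeds −δ.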


Sizing/Power Calculation for Noninferiority Tests

With the adjusted test statistic defined in Eq. 1 for noninferiority tests, we present the adjustments of the updated and unified OR/DBM power computation procedure (9) for sizing a noninferiority MRMC ROC study. The steps of the procedure are the following:

1. Specify the noninferiority margin δ. As discussed earlier, this parameter represents the amount by which the performance of the experimental modality may fall below that of the conventional modality without significant clinical consequences (ie, an amount of little clinical significance). This parameter is generally subjective and is often determined in an ad hoc manner based on scientific or clinical experience.

2. Specify the effect size d. We define d = qe − qc, where qi (i = e, c) is the expected diagnostic accuracy for a randomly selected reader reading a random sample of cases. This parameter can be measured in a pilot study.

3. Obtain the variance and covariance parameter estimates of the OR model. These parameters can be estimated by one of two methods: 1) analyzing a pilot study with the OR method or 2) analyzing a pilot study with the DBM method and then transforming the DBM parameters into OR parameters. The methods for parameter estimation are identical to those in superiority tests, and the details can be found elsewhere (9).

4. Compute the noncentrality parameter and denominator degrees of freedom estimates for the specified case and reader sample sizes. As indicated earlier, the adjusted test statistic t in Eq. 1 follows a Student's t distribution with df0 degrees of freedom under the null hypothesis. Under the alternative hypothesis, t follows a noncentral t distribution with df1 degrees of freedom and noncentrality parameter λ (density function denoted f(t; df1, λ | H1)). Adjusted from the results elsewhere (9), the parameters λ and df1 can be estimated by

$$ \hat{\lambda} = \frac{d + \delta}{\sqrt{\dfrac{2}{J}\left\{\hat{\sigma}^2_{TR} + \dfrac{c^*}{c}\left[\hat{\sigma}^2_{\varepsilon} - \widehat{\mathrm{cov}}_1 + (J-1)\,H(\widehat{\mathrm{cov}}_2 - \widehat{\mathrm{cov}}_3)\right]\right\}}}, \tag{5} $$

$$ \widehat{df}_1 = \frac{\left\{\hat{\sigma}^2_{TR} + \dfrac{c^*}{c}\left[\hat{\sigma}^2_{\varepsilon} - \widehat{\mathrm{cov}}_1 + (J-1)\,H(\widehat{\mathrm{cov}}_2 - \widehat{\mathrm{cov}}_3)\right]\right\}^2}{\left\{\hat{\sigma}^2_{TR} + \dfrac{c^*}{c}\left[\hat{\sigma}^2_{\varepsilon} - \widehat{\mathrm{cov}}_1 - H(\widehat{\mathrm{cov}}_2 - \widehat{\mathrm{cov}}_3)\right]\right\}^2 \big/ (J-1)}, \tag{6} $$

where c* is the case sample size in the pilot study, c and J are the case sample size and the reader sample size, respectively, for the planned study, $\hat{\sigma}^2_{\varepsilon}$ is the sum of the case variability and the within-reader variability, and the test-by-reader interaction variance is estimated as

$$ \hat{\sigma}^2_{TR} = MS(T{*}R) - \hat{\sigma}^2_{\varepsilon} + \widehat{\mathrm{cov}}_1 + H(\widehat{\mathrm{cov}}_2 - \widehat{\mathrm{cov}}_3). \tag{7} $$

5. Compute the power. An illustration of the distributions of the adjusted test statistic under the null and alternative hypotheses is shown in Figure 2. By definition, the statistical power is one minus the type II error rate:

$$ \mathrm{Power} = 1 - F(t_c;\, df_1, \lambda \mid H_1), \tag{8} $$

where F(t; df1, λ | H1) is the distribution function of the test statistic t under the alternative hypothesis H1 and $t_c = F^{-1}(1-\alpha/2;\, df_0 \mid H_0)$.

The adjustments here include the test statistic (Eq. 1), the specification of a noninferiority margin parameter, and the noncentrality parameter λ. We note two further issues: (a) the approach assumes that the abnormal-to-normal case ratio is the same in the planned study and the pilot study, and (b) the estimated $\hat{\sigma}^2_{TR}$ can be negative. To size a new study that has a different abnormal-to-normal case ratio than the pilot study, Hillis et al (9) proposed an ad hoc approach, ie, resample one group (abnormal or normal) of the pilot study data with replacement to achieve the desired abnormal-to-normal case ratio and estimate the (co)variance parameters with this resampled dataset; then repeat the procedure multiple times and average the estimated parameters. For the second issue, Hillis et al (9) proposed replacing the negative estimate of $\hat{\sigma}^2_{TR}$ with zero or a positive number. The reader is referred to Hillis et al (9) for more discussion of these two issues.

Equivalence Tests

For an equivalence study, where the goal is to show that the diagnostic accuracies of two imaging modalities are equivalent, the problem is identical to performing two simultaneous noninferiority tests (13), ie, showing that each modality is noninferior to the other. Similar to noninferiority testing, the CI of the difference in diagnostic accuracies is very useful information to accompany the binary conclusion drawn from the hypothesis testing. The null hypothesis of an equivalence test is rejected (ie, equivalence is established) if and only if the 100(1−α)% CI of the difference in diagnostic accuracies is within the range [−δ, δ].

RESULTS

Example

Here we provide a clinical example to further illustrate the analysis and power computation for noninferiority MRMC ROC studies and give some insights on the interplay among effect size, margin parameter, and sample sizes.


In this clinical example (17), the clinical task is to compare the diagnostic accuracy of two modalities for the detection of thoracic aortic dissection: single spin-echo magnetic resonance imaging (MRI) versus cinematic presentation of MRI. There were 45 signal-present patients and 69 signal-absent patients, each with both spin-echo and cinematic MRI exams. Five radiologists (ie, readers) in this study independently read both MRI image sets for each patient. The readers used a 5-point ordinal scale to rate their confidence that the signal was present for each modality, as follows:

1. definitely no aortic dissection,
2. probably no aortic dissection,
3. unsure about aortic dissection,
4. probably aortic dissection,
5. definitely aortic dissection.

This dataset has been used by Hillis et al (9) to demonstrate the application of their unified and updated OR/DBM approach for power estimation in planning a new superiority study. To illustrate our adjusted methods for noninferiority studies using this clinical dataset, we suppose here that the purpose of the study is to show that the performance of spin-echo MRI is noninferior to that of cinematic MRI in terms of AUC. We first analyze this dataset in the noninferiority setting and then demonstrate how to use the information from the analysis to size a new noninferiority study. In order to make a direct comparison with the superiority results, we adopted the same methods as those elsewhere (9) for computing the single-reader AUC values and the variance-covariance matrix of the 10 AUCs (two modalities, five readers). Briefly, the proper binormal model (18,19) was used to compute the single-reader AUCs, and the jackknife resampling method was used to compute the variance-covariance matrix. All the intermediate results are identical to those elsewhere (9) and are omitted here. We summarize the relevant estimated OR parameters as follows; again, all are identical to those in Hillis et al (9):

- Mean squares in the OR ANOVA model: MS(T) = 0.004003382; MS(T*R) = 0.000622731;
- Variance and covariance estimates: σ̂²_ε = 0.001392652, cov̂1 = 0.000351859, cov̂2 = 0.000346505, cov̂3 = 0.000221453;
- Degrees of freedom of the t statistic: df̂0 = 16.0659.

Using these estimated parameters and a specified noninferiority margin, we can compute the t statistic, P value, and CI of qe − qc (Eqs. 1–4) as follows:

- If we set δ = 0.01, we have t = 2.2386, P = .0397, 95% CI = [−0.0073, 0.0874];
- If we set δ = 0.02, we have t = 2.6862, P = .0162, 95% CI = [−0.0073, 0.0874].

Figure 2. Illustration of the distributions of the test statistic under the null and alternative hypotheses.

We conclude that, for either of the noninferiority margins, the AUC of spin-echo MRI is noninferior to that of cinematic MRI (P < .05). Note that the 95% CI of qe − qc is (of course) the same as that estimated in the superiority setting and, given a noninferiority margin, one can obtain the same noninferiority conclusion by comparing the lower bound of the 95% CI of qe − qc with −δ (ie, −0.0073 > −δ). However, the adjusted t statistic and analysis are necessary to compute the exact P value in the noninferiority setting.

We next demonstrate the power computation using the analysis results above for different combinations of the number of readers and the number of cases. We treat the MRI study above as a pilot study and use the estimated effect size and variance information to size a pivotal noninferiority study. Note that this is for methodology demonstration purposes only. To size a noninferiority study, we first specify the noninferiority margin parameter. We set δ = 0.01 and δ = 0.02, with the former being a more restrictive criterion than the latter. Using the covariance estimates and mean square estimates shown above and Eq. 7, we find that σ̂²_TR < 0, and we set σ̂²_TR = 0 and also σ̂²_TR = 0.0001 as suggested by Hillis et al (9), with the latter being more conservative than the former. We then estimate λ̂ using Eq. 5, estimate df̂1 using Eq. 6, and finally compute the power using Eq. 8. For a given reader sample size J, we vary the number of cases c to obtain the lowest power equal to or larger than 80%. The results are shown in Table 2. The results for δ = 0.01 and d = 0.04 (the left half of the table) in this noninferiority power computation are identical to those for d′ = 0.05 in the superiority power computation (see Table 5 of ref. 9). This is expected because the distribution of the test statistic given in Eq. 1 under the null hypothesis for noninferiority (H0: qe − qc = −δ) is the same as the distribution of the standard test statistic (4) under the null hypothesis for superiority (H0: qe − qc = 0). Likewise, the distribution of the test statistic under H1 for noninferiority (H1: qe − qc = d) is the same as the distribution of the test statistic under H1 for superiority (H1: qe − qc = d′ = d + δ). From the formula of the adjusted test statistic (Eq. 1) and the adjusted noncentrality parameter


(Eq. 5), we see that the effect size d in the corresponding formula for superiority tests is replaced by d + δ in the formula for noninferiority tests, with all the other parameters remaining the same. Comparing the left half and the right half of the table, we see that a larger noninferiority margin requires a smaller sample size to achieve the same statistical power. For example, with 8 readers and σ̂²_TR = 0, 183 cases and 128 cases are needed, respectively, for δ = 0.01 and δ = 0.02 to achieve a power of 80%. Finally, similar to superiority studies (9), the increase in the number of cases needed based on σ̂²_TR = 0.0001 as compared to σ̂²_TR = 0 is mostly noticeable for small numbers of readers (eg, J ≤ 5).

DISCUSSION

We have presented the methodological adjustments that are necessary for the design and analysis of a noninferiority MRMC ROC study based on the OR statistical analysis model and its recent updates, which were designed to test for differences in diagnostic accuracy among diagnostic modalities (ie, nonequivalence or superiority studies). The adjustments for the noninferiority setting included changes to the statement of the null and alternative hypotheses, to the statistical analysis method, and to the method for sizing/power computation in planning a study. The equivalence test is identical to performing two simultaneous noninferiority tests (ie, each modality is noninferior to the other). We emphasize that the correct statement of the null and alternative hypotheses is important, as it serves as the basis of the subsequent analysis and the interpretation of the hypothesis testing results. It is in the statement of the hypotheses that the margin parameter should be specified (ie, the margin is specified and justified on scientific or clinical grounds before any analysis of the experimental data in the current study is done).
This is important because, if one first analyzes the data and obtains the CI of the performance difference, then one can always claim success by setting the margin parameter to be larger than the negative of the lower bound of the CI. Conclusions drawn by using a margin parameter based solely on the experimental data from the current study are not valid. With appropriately stated hypotheses, the adjustment of the updated and unified OR/DBM approach for the analysis of data is straightforward, namely incorporating the margin parameter into the numerator of the test statistic (Eq. 1). In fact, as we have shown, one can even use the OR/DBM approach without adjustment to decide noninferiority by comparing the lower bound of the CI of the performance difference with the margin parameter (20). However, the adjustment is necessary to compute the exact P value in the noninferiority setting. Sometimes, P values corresponding to the superiority setting are reported in a noninferiority study (20), which can be a potential source of confusion because those P values represent the significance of the difference and are not relevant to the hypotheses in the noninferiority setting.


The adjustment of the updated and unified OR/DBM approach for sizing a noninferiority study is also straightforward, namely replacing the effect size d with the sum of the effect size d and the margin parameter δ in the numerator of the noncentrality parameter (Eq. 5). We emphasize the distinction between the two parameters: the effect size can be estimated using a pilot study (although the variability from a small pilot study can be large), whereas the margin parameter is determined on scientific or clinical grounds outside of the pilot study. The straightforwardness of these methodological adjustments can be surprising at first sight, as the null hypothesis in the noninferiority test may appear to be similar to the alternative hypothesis in the superiority test. However, it is less surprising once one appreciates a connection between the superiority test and the noninferiority test. Rewriting H0 and H1 of the noninferiority test (Table 1) as

H0: qe − (qc − δ) = 0,
H1: qe − (qc − δ) > 0,

and comparing with the superiority test (Table 1), we see immediately that showing qe is noninferior to qc is the same as showing qe is superior to qc − δ. This connection between the superiority test and the noninferiority test suggests that one can use the available software packages designed for superiority tests for analyzing or sizing a noninferiority study. For example, as we have mentioned, to analyze a noninferiority study with the updated and unified OR/DBM approach, one can obtain the 100(1−α)% CI of the performance difference and compare the lower bound of the CI with the margin parameter (but, again, the adjusted approach is necessary to compute the exact P value in the noninferiority setting). For sizing a noninferiority study with the updated and unified OR/DBM approach for the superiority test, one can simply input d + δ in the place of d (or, equivalently, input qc − δ in the place of qc).
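The reframing above is a purely algebraic identity, as the following toy calculation (with made-up AUC values and SD, for illustration only) shows: the noninferiority t statistic for margin δ is exactly the superiority t statistic computed against the shifted reference qc − δ.

```python
# Toy numbers (hypothetical, for illustration only)
qe_hat, qc_hat = 0.85, 0.86   # average-reader AUC estimates
s_hat, delta = 0.02, 0.03     # estimated SD and noninferiority margin

t_noninferiority = (qe_hat - qc_hat + delta) / s_hat          # Eq. 1
t_superiority_shifted = (qe_hat - (qc_hat - delta)) / s_hat   # vs. qc - delta

# Both evaluate to the same statistic (here, approximately 1.0),
# so superiority software fed qc - delta performs the noninferiority test.
```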
However, such a convenience does not exist for the original OR approach (5). In that approach, qc and qe are used to estimate the case variance parameter using the formula of Hanley and McNeil (21). If we were to use qc − δ in place of qc in the original OR approach, then the case variance would be overestimated (the variance based on qc − δ is larger than that based on qc), which in turn would lead to an oversized study.

In this article, we focused on a fully crossed MRMC design and an AUC performance metric. Similar adjustments could be applied to other types of non-fully crossed MRMC designs, such as those discussed elsewhere (22,23), and to other metrics, such as sensitivity and specificity. We have focused on general statistical techniques for noninferiority MRMC ROC studies. The question of how to decide which type of hypothesis testing (of those outlined in Table 1) should be used in a particular study is beyond the scope of this article. The answer to this question depends on the specific application.


CHEN ET AL

Academic Radiology, Vol 19, No 9, September 2012

TABLE 2. Combinations of Number of Cases and Number of Readers for 80% Power in Establishing Noninferiority of Spin-echo MRI as Compared to Cine MRI, Based on the Van Dyke et al Data (17)

                 Noninferiority Margin δ = 0.01            Noninferiority Margin δ = 0.02
            σ̂²_TR = 0        σ̂²_TR = 0.0001          σ̂²_TR = 0        σ̂²_TR = 0.0001
Readers    Cases   Power     Cases   Power           Cases   Power     Cases   Power
3          559     0.8005    1898    0.8000          388     0.8003    710     0.8002
4          343     0.8004    491     0.8004          238     0.8001    300     0.8002
5          266     0.8014    330     0.8007          185     0.8020    213     0.8005
6          225     0.8004    263     0.8002          157     0.8023    174     0.8011
7          200     0.8002    227     0.8014          139     0.8005    151     0.8002
8          183     0.8001    203     0.8008          128     0.8029    137     0.8022
9          171     0.8008    187     0.8018          119     0.8016    126     0.8008
10         162     0.8017    174     0.8002          113     0.8035    118     0.8002
11         154     0.8003    165     0.8010          107     0.8005    112     0.8004
12         148     0.8002    158     0.8021          103     0.8011    108     0.8033
13         143     0.8001    152     0.8024          100     0.8028    104     0.8033
14         139     0.8005    146     0.8001          97      0.8024    100     0.8008
15         136     0.8021    142     0.8009          94      0.8003    97      0.8001

MRI, magnetic resonance imaging. The abnormal to normal case ratio in the planned study is assumed to be the same as that in the Van Dyke data (ie, 45/69 = 0.652). The effect size d is 0.04 as measured in the Van Dyke study.
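The reader/case combinations in Table 2 are produced by the power formula of the updated and unified approach (9). As a rough illustration of how such a table can be generated, the sketch below searches, at each reader count, for the smallest case number attaining 80% power. The power function used here is a deliberately simplified two-component normal-approximation stand-in (the variance components `var_tr` and `var_c` are hypothetical inputs), not the actual noncentral-F computation of Eq. 5, so it will not reproduce the numbers in Table 2.

```python
from statistics import NormalDist

_norm = NormalDist()

def power(readers, cases, d, delta, var_tr, var_c, alpha=0.05):
    # Simplified stand-in for the MRMC power formula: the variance of the
    # estimated difference is modeled with only a test-by-reader component
    # (var_tr) and a case component (var_c), purely for illustration.
    se = (2.0 * var_tr / readers + 2.0 * var_c / cases) ** 0.5
    z_crit = _norm.inv_cdf(1.0 - alpha / 2.0)
    # Noninferiority sizing: the effect size d is shifted by the margin delta.
    return _norm.cdf((d + delta) / se - z_crit)

def min_cases(readers, d, delta, var_tr, var_c, target=0.80, max_cases=10000):
    # Power increases monotonically in the number of cases, so a simple
    # incremental search finds the smallest case number reaching the target.
    for cases in range(2, max_cases):
        if power(readers, cases, d, delta, var_tr, var_c) >= target:
            return cases
    return None  # target power not attainable within max_cases
```

As in Table 2, a larger margin δ or a smaller test-by-reader variance reduces the required number of cases, and when `var_tr` is zero the required case number no longer depends on the number of readers.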

the specific application. For example, if the purpose is to demonstrate that a CAD system works equally well on two image acquisition systems, an equivalence study would generally be most appropriate. If, however, the purpose is to collect evidence for establishing the effectiveness of a novel imaging modality, both noninferiority and superiority may be viable options. In general, for a new modality, use of noninferiority in diagnostic accuracy as an acceptance criterion must be fully justified. Also, use of the noninferiority test has the disadvantage of having to select a noninferiority margin parameter, which is often subjective. However, if the noninferiority criterion is fully justified and the noninferiority margin is appropriately prespecified, the noninferiority study requires fewer cases or readers compared with a similar superiority study to achieve the same statistical power. In summary, conventional MRMC ROC methodology designed for superiority studies can and should be adjusted appropriately for the design and analysis of noninferiority/ equivalence studies. In addition, the CI of the difference in diagnostic accuracies is important information and should generally accompany the statistical analyses and any conclusions drawn from the hypothesis testing.
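The CI-based reporting recommended above also covers the equivalence setting directly: as shown earlier, an equivalence test is identical to performing two simultaneous noninferiority tests, one in each direction. A minimal large-sample sketch of that logic follows; again this is a normal approximation for illustration, not the OR/DBM t/F machinery.

```python
from statistics import NormalDist

_norm = NormalDist()

def equivalence_z_test(d_hat, se, delta, alpha=0.05):
    """Two one-sided tests: equivalence within the margin delta is claimed
    only if BOTH noninferiority null hypotheses are rejected, which is the
    same as the two-sided 100(1-alpha)% CI of the difference lying entirely
    inside (-delta, +delta)."""
    p_low = 1.0 - _norm.cdf((d_hat + delta) / se)   # H0: difference <= -delta
    p_high = _norm.cdf((d_hat - delta) / se)        # H0: difference >= +delta
    half = _norm.inv_cdf(1.0 - alpha / 2.0) * se
    ci = (d_hat - half, d_hat + half)
    equivalent = ci[0] > -delta and ci[1] < delta
    return equivalent, max(p_low, p_high), ci
```

Note that the CI-inclusion rule and the two one-sided P values agree exactly: the lower CI bound exceeds −δ precisely when the first one-sided test rejects at level α/2, and likewise for the upper bound.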

ACKNOWLEDGMENT We thank Carolyn Van Dyke, MD, for sharing her MRI dataset. We thank the anonymous reviewer for his/her constructive suggestions that substantially helped improve the quality of our article. We thank our colleagues Frank Samuelson and Brandon Gallas for helpful discussions. Commercial materials and equipment are identified in order to adequately specify experimental procedures. In no case does such identification imply recommendation or endorsement by the FDA, nor does it imply that the items identified are necessarily the best available for the purpose.

REFERENCES
1. Wagner RF, Metz CE, Campbell G. Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol 2007; 14:723–748.
2. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27:723–731.
3. Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Acad Radiol 2008; 15:647–661.
4. Obuchowski NA, Rockette HE. Hypothesis testing of the diagnostic accuracy for multiple diagnostic tests: an ANOVA approach with dependent observations. Commun Stat Simulation Comput 1995; 24:285–308.
5. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Acad Radiol 1995; 2(Suppl 1):S22–S29.
6. Hillis SL, Obuchowski NA, Schartz KM, et al. A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Stat Med 2005; 24:1579–1607.
7. Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Stat Med 2007; 26:596–619.
8. Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Acad Radiol 2008; 15:647–661.
9. Hillis SL, Obuchowski NA, Berbaum KS. Power estimation for multireader ROC methods: an updated and unified approach. Acad Radiol 2011; 18:129–142.
10. Blackwelder WC. "Proving the null hypothesis" in clinical trials. Controlled Clin Trials 1982; 3:345–353.
11. Metz CE. Quantification of failure to demonstrate statistical significance: the usefulness of confidence intervals. Invest Radiol 1993; 28:59–63.
12. Fisher RA. The Design of Experiments. London: Oliver and Boyd, 1935.
13. Liu J-P, Ma M-C, Wu C-Y, et al. Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves. Stat Med 2006; 25:1219–1238.
14. Li C-R, Liao C-T, Liu J-P. A non-inferiority test for diagnostic accuracy based on the paired partial areas under ROC curves. Stat Med 2008; 27:1762–1776.
15. Lu Y, Jin H, Genant HK. On the non-inferiority of a diagnostic test based on paired observations. Stat Med 2003; 22:3029–3044.
16. Lui KJ, Zhou XH. Testing non-inferiority (and equivalence) between two diagnostic procedures in paired-sample ordinal data. Stat Med 2004; 23:545–559.
17. Van Dyke CW, White RD, Obuchowski NA, et al. Cine MRI in the diagnosis of thoracic aortic dissection. 79th RSNA Meetings, Chicago, IL; 1993.
18. Pan XC, Metz CE. The "proper" binormal model: parametric receiver operating characteristic curve estimation with degenerate data. Acad Radiol 1997; 4:380–389.
19. Metz CE, Pan XC. "Proper" binormal ROC curves: theory and maximum likelihood estimation. J Math Psychol 1999; 43:1–33.
20. Gennaro G, Toledano A, di Maggio C, et al. Digital breast tomosynthesis versus digital mammography: a clinical performance study. Eur Radiol 2010; 20:1545–1553.
21. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.
22. Gallas BD, Brown DG. Reader studies for validation of CAD systems. Neural Networks 2008; 21:387–397.
23. Obuchowski NA. Reducing the number of reader interpretations in MRMC studies. Acad Radiol 2009; 16:209–217.
