Classifier variability: Accounting for training and testing




Pattern Recognition 45 (2012) 2661–2671, doi:10.1016/j.patcog.2011.12.024



Weijie Chen a,*, Brandon D. Gallas a, Waleed A. Yousef b

a Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993, United States
b Human Computer Interaction Lab., Faculty of Computers and Information, Helwan University, Egypt

* Corresponding author. Tel.: +1 301 796 2663; fax: +1 301 796 9925. E-mail addresses: [email protected], [email protected] (W. Chen).


Article history: Received 24 September 2011; Accepted 30 December 2011; Available online 11 January 2012.

Abstract

We categorize the statistical assessment of classifiers into three levels: assessing the classification performance and its testing variability conditional on a fixed training set, assessing the performance and its variability that accounts for both training and testing, and assessing the performance averaged over training sets and its variability that accounts for both training and testing. We derive analytical expressions for the variance of the estimated AUC and provide freely available software implementing an efficient computation algorithm. Our approach can be applied to assess any classifier that has ordinal (continuous or discrete) outputs. Applications to simulated and real datasets are presented to illustrate our methods.

Keywords: Classifier evaluation; Training variability; Classifier stability; U-statistics; AUC

1. Introduction

This paper concerns the statistical assessment of supervised learning classifiers in binary classification tasks. The assessment uses a training dataset and a test dataset to estimate the classification performance (in terms of some metric, e.g., the area under the receiver operating characteristic (ROC) curve, or AUC) and its variability. A point estimate of the classifier performance with a finite sample is meaningless unless it is accompanied by its uncertainty. Variability estimation is essential for making statistical inferences about the classification performance in the population from which the finite datasets are drawn. The traditional approaches [1–4] to assessing the uncertainty of the estimated AUC only account for the variability in sampling a test set. However, a finite training set is also a random sample from the population of training sets, and the AUC varies with the training set. Training variability is a characterization of classifier stability with respect to varying training sets. The main theme of this paper is to investigate a nonparametric approach to quantifying the classifier performance variability that accounts for both the randomness of the finite training set and that of the test set.


The traditional approach to the assessment of pattern classifiers is to use a test set to estimate the standard deviation of the classifier performance and ignore the variability caused by the random choice of the finite training set. This might be acceptable for some particular classifiers and particular feature distributions; for example, theoretical analyses of Bayes classifiers on Gaussian distributions by Fukunaga et al. [5] show that the variance comes predominantly from the finite test set. However, Beiden et al. [6] showed that training variability should be taken into account in general, and particularly when comparing competing classifiers. Ignoring training variability assumes that the training set is "frozen" and therefore that training is a fixed, rather than random, effect. However, training is rarely frozen in practice, and classifier developers usually keep updating their training. Without taking the training variability into account, the estimated confidence interval of the estimated classifier performance would not cover the performance observed when the training set changes. Accounting for both testing and training variability is particularly important during early technology development (e.g., in pilot studies), where only a limited dataset is available for training and testing the classifier. It is also important to account for both sources of variability when comparing competing classifiers and deciding which to support with resources and fully develop. Such a comparison should not be affected by the random choice of a finite training set.

Training variability is a characterization of the stability of the estimated classifier performance with respect to varying training sets. Such classifier stability has been a major concern in applications that involve a large number of features and a relatively small number of training samples. In multiple-biomarker applications such as gene-expression microarray studies, classifiers are typically developed with a few hundred patient samples and tens of thousands of gene-expression measurements [7]. Stability/fragility questions often arise in data-oriented pattern classifier findings. A well-known example is the exchange between Dave et al. [8] and Tibshirani [9]. Dave et al. [8] developed a model for predicting patient survival from gene expression data, and their model had a highly significant p-value as tested on an independent test set. Tibshirani [9] re-analyzed their data by swapping their training and test sets, found that the results were no longer significant, and concluded that Dave et al.'s result was extremely fragile. For another example, Michiels et al. [10] studied the stability of the molecular signature and the classification accuracies of seven published microarray studies, each of which is based on a fixed training set. Michiels et al. [10] re-analyzed the seven studies and found that "Five of the seven studies did not classify patients better than chance". Michiels et al. [10] assessed the classifier performance (error rate) by averaging over multiple training sets generated with random resampling. However, to the best of our knowledge, no methods have been formally introduced in the literature for evaluating the variability of the classifier performance obtained in this way. Other efforts to average out the variability caused by the finite datasets include the widely practiced cross-validation (CV) and repeated CV methods. However, the variability estimation methods currently available for CV are biased [11]. Variability accounting for both training and testing can be obtained easily in Monte Carlo simulation experiments where training sets and test sets are repeatedly and independently drawn from specified distributions (see, e.g., [12]). However, an analytical approach is desired in real-world applications where there is one training set and one test set. Yousef et al. [13] proposed a nonparametric approach to estimating the classifier performance variability that accounts for both testing and training. However, their approach assumes that the scores output by a classifier are continuous and yields erroneous results when there are ties between cases from the two classes. In addition, the computation algorithm provided in that paper is slow.

Our major contribution in this paper is to provide a unified framework and analytical derivations of variance expressions, as well as estimation methods, for the assessment of classifiers that account for both the finite size of the training set and the finite size of the test set as sources of variability. We extend the work in Yousef et al. [13] to provide assessment methods without assuming that the data are continuous. We also adopt a fast computation algorithm for these methods and develop software tools that are freely available. We focus here on the area under the curve (AUC) in receiver operating characteristic (ROC) analysis; in future work, our approach can be extended to many other performance metrics such as partial AUC, error rate, sensitivity, and specificity. AUC provides a meaningful summary measure of the separability of the distributions of the positive-class scores and the negative-class scores and is a widely used performance metric in classification problems [1,2,4,14–16]. We analyze AUC and its variance in Section 2. Simulation results are presented in Section 3, followed by the conclusion and discussions in Section 4.

2. AUC and its variance

A classifier generates decision scores, either discrete or continuous, for cases in the positive class and cases in the negative class, respectively. By applying a decision cutoff value to the decision score, one can call a case positive or negative. In evaluation, the classifier decision is compared with the truth status of the case obtained from a reference standard. In medicine, the truth for "positive" is often called diseased or responder (to a certain treatment), whereas the truth for "negative" is often called non-diseased or non-responder. A positive call for a diseased or responder case is a true positive (TP), whereas a positive call for a non-diseased or non-responder case is a false positive (FP); similarly one can define true negative (TN) and false negative (FN). Varying the cutoff value over its range, one can plot the true positive fraction (TPF, or sensitivity) versus the false positive fraction (FPF, or 1 - specificity), which is the ROC curve of the classifier [1,2,4,14-16]. The area under the ROC curve, AUC, is a summary measure of the separability of the pair of distributions that corresponds to the two classes and can be interpreted as the average sensitivity over all possible specificity values. AUC is also the probability that the decision score (t_1) of a randomly chosen positive case is larger than that of a randomly chosen negative case (t_0), plus half of the probability that two such cases have the same score (a tie), assuming without loss of generality that the score of the positive class is larger than that of the negative class on average. Using Pr to denote "probability" throughout this paper,

AUC = \Pr(t_0 < t_1) + \tfrac{1}{2}\Pr(t_0 = t_1).   (1)

2.1. Variance of AUC estimate

We first formally define the following quantities.

- Conditional AUC, denoted A_r, is the true performance of a classifier on the entire test population conditional on a training set r.
- Estimated AUC, denoted \hat{A}_r, is the estimate of A_r using a finite test set. We add a hat to a parameter to denote its estimate; this notation is used throughout the paper. Let t_{0r} \equiv \{t_{01r}, t_{02r}, \ldots, t_{0N_0 r}\} and t_{1r} \equiv \{t_{11r}, t_{12r}, \ldots, t_{1N_1 r}\} represent the test scores of the classifier trained with dataset r, as tested on N_0 test samples from the actually negative class and N_1 test samples from the actually positive class, respectively. Then

  \hat{A}_r = \frac{1}{N_0 N_1} \sum_{i=1}^{N_0} \sum_{j=1}^{N_1} s(t_{0ir}; t_{1jr}),   (2)

  where s(x; y) is the two-sample kernel function,

  s(x; y) = \begin{cases} 1, & x < y,\\ 0.5, & x = y,\\ 0, & x > y. \end{cases}   (3)

  (A small illustrative code sketch of (2) and (3) follows this list.)
- Mean AUC, denoted A, is the mean performance of a classifier, where "performance" is obtained by training on a particular training set and testing on the entire test population, and "mean" is taken over a population of training sets. A is the expectation of A_r over the population of training sets, which implies that the training set is treated as a random effect, i.e., A \equiv E_r(A_r). We denote the expectation over the population of training sets as E_r and similarly denote the expectation over the population of test sets as E_t.
- Estimated mean AUC, denoted \hat{\bar{A}}, is the estimate of A using a finite test set and a finite number (R) of training sets, as defined in (4). Note that (2) can also be viewed as an estimate of A, as it is a special case of (4) with R = 1:

  \hat{\bar{A}} = \frac{1}{R} \sum_{r=1}^{R} \hat{A}_r.   (4)

- Testing variability of \hat{A}_r, denoted \mathrm{Var}_{t|r}\,\hat{A}_r, is the variance of \hat{A}_r due to the finite sample size of the test set, conditional on a fixed training set r.
- Variability of \hat{A}_r, denoted \mathrm{Var}\,\hat{A}_r, is the variability of \hat{A}_r due to both the finite sample size of the test set and that of the training set, which is treated as a random sample from the population of training sets.
- Variability of \hat{\bar{A}}, denoted \mathrm{Var}\,\hat{\bar{A}}, is the variability of \hat{\bar{A}} due to both the finite sample size of the test set and that of the training set, which is treated as a random sample from the population of training sets.

We categorize the statistical assessment of classifiers into three levels. At level one, the classification performance and its variability are estimates of the range of performance expected if the experiment is repeated many times, drawing a test set independently each time without changing the training. In this sense, the performance is generalized to the test population but conditional on a particular training set. At level two, the classification performance and its variability are estimates of the range of performance expected if the experiment is repeated many times, drawing a training set and a test set independently each time. The performance is generalized to the test population as well as to the population of training sets. At level three, the classification performance and its variability are estimates of the range of performance expected if the experiment is repeated many times, drawing a test set and a number of training sets independently each time and averaging the performance over the training sets. Fig. 1 illustrates these three levels of assessment: for each level, the figure shows the training set and test set when the experiment is repeated, the observed performance estimate in each repetition, and their asymptotic distribution, which is a normal distribution represented by its mean and variance parameters. Asymptotic normality is an inherent property of U-statistics [17] that is independent of the distribution of the feature data or the classifier scores. The task of assessment in one real-world experiment is to estimate these (mean and variance) parameters for the performance of one classifier or for the performance difference between two classifiers. We next present, in a unified U-statistics framework, analytical expressions for the three variances in the three levels of assessment, with brief derivations. See Appendix A for an introduction to U-statistics and Refs. [17,18] for more details.


2.1.1. Level one assessment

At this level, the training set of the classifier is treated as fixed and the population parameter of interest is A_r. The estimate of A_r is \hat{A}_r as defined in (2). The randomness of \hat{A}_r that we consider here arises only from the finite size of the test set. The variance of \hat{A}_r over the population of test samples, conditional on a fixed training set r, is the testing variance, and is given by

\mathrm{Var}_{t|r}\,\hat{A}_r = c_1 \mu_{1r} + c_2 \mu_{2r} + c_3 \mu_{3r} - (1 - c_4)\mu_{4r},   (5)

where

c_1 = 1/(N_0 N_1), \quad c_2 = (N_0 - 1)/(N_0 N_1), \quad c_3 = (N_1 - 1)/(N_0 N_1), \quad c_4 = (N_0 - 1)(N_1 - 1)/(N_0 N_1),

\mu_{1r} \equiv E_t[s(t_{0ir}; t_{1jr})^2], \quad \mu_{2r} \equiv E_t[s(t_{0ir}; t_{1jr})\, s(t_{0i'r}; t_{1jr})], \quad \mu_{3r} \equiv E_t[s(t_{0ir}; t_{1jr})\, s(t_{0ir}; t_{1j'r})], \quad \mu_{4r} \equiv A_r^2.

The proof is straightforward using U-statistics theory. Since \hat{A}_r is an estimable U-statistic of degree (1, 1), it follows from the U-statistic theory (see Lemma 1 in Appendix A) that

\mathrm{Var}_{t|r}\,\hat{A}_r = \frac{1}{N_0 N_1}\left[(N_0 - 1)\xi_{0,1} + (N_1 - 1)\xi_{1,0} + \xi_{1,1}\right],   (6)

where

\xi_{0,1} = E_t[s(t_{0ir}; t_{1jr})\, s(t_{0i'r}; t_{1jr})] - E_t[s(t_{0ir}; t_{1jr})]^2 = \mu_{2r} - \mu_{4r},
\xi_{1,0} = E_t[s(t_{0ir}; t_{1jr})\, s(t_{0ir}; t_{1j'r})] - E_t[s(t_{0ir}; t_{1jr})]^2 = \mu_{3r} - \mu_{4r},
\xi_{1,1} = E_t[s(t_{0ir}; t_{1jr})^2] - E_t[s(t_{0ir}; t_{1jr})]^2 = \mu_{1r} - \mu_{4r}.

Substituting these back into (6), (5) follows.

Fig. 1. Three levels of classifier assessment: for each level, the figure shows the training set and test set when the experiment is repeated, the observed performance estimate in each repetition, and their asymptotic distribution, which is a normal distribution that is represented by its mean and variance parameters.


Level one assessment is currently the most prevailing practice, and many methods exist for estimating \mathrm{Var}_{t|r}\,\hat{A}_r. We note that the nonparametric expression in (5) is equivalent to a formula given by Bamber [1]. Estimators of the testing variance were also given by Hanley et al. [2] and by DeLong et al. [3]. Hanley et al. provided the variance as a function of the single performance value A_r; it is a parametric and conservatively biased estimator based on the bi-exponential distribution assumption. DeLong et al. provided a nonparametric and asymptotically unbiased estimator based on U-statistics and the jackknife. Yousef et al. [13] derived \mathrm{Var}_{t|r}\,\hat{A}_r and its nonparametric and unbiased estimator based on U-statistics. However, they assumed that the data are continuous and hence defined the kernel function as

s'(x; y) = \begin{cases} 1, & x < y,\\ 0, & x > y. \end{cases}   (7)

When that assumption is violated, use of (7) yields an erroneous estimate of A_r, and the error propagates to the variance estimates at all three levels. Our simple solution to this problem is to define the kernel function as in (3) and re-derive the variance expression in the U-statistics framework. The variance expression in (5) is similar to the expression in Yousef et al. [13], but the moment parameters (\mu_{1r} to \mu_{4r}) are based on a different kernel function. In addition, Yousef et al. [13] have \mu_{1r} = A_r, which is generally not true without assuming that the data are continuous. Moreover, the parameters \mu_{2r} and \mu_{3r} have a probability interpretation with three terms, as shown in (8) and (9), whereas in Yousef et al. [13] the right-hand side of these two equations has only the first term. Appendix B provides an alternative method to account for ties in deriving the variance expression, which sheds light on how, when there are actually ties in the scores for cases from the two classes, ignoring such ties causes errors.

\mu_{2r} = \Pr[\max(t_{0ir}, t_{0i'r}) < t_{1jr}] + \tfrac{1}{2}\Pr[\max(t_{0ir}, t_{0i'r}) = t_{1jr} \text{ and } t_{0ir} \ne t_{0i'r}] + \tfrac{1}{4}\Pr[t_{0ir} = t_{0i'r} = t_{1jr}],   (8)

\mu_{3r} = \Pr[t_{0ir} < \min(t_{1jr}, t_{1j'r})] + \tfrac{1}{2}\Pr[t_{0ir} = \min(t_{1jr}, t_{1j'r}) \text{ and } t_{1jr} \ne t_{1j'r}] + \tfrac{1}{4}\Pr[t_{0ir} = t_{1jr} = t_{1j'r}].   (9)

2.1.2. Level two assessment

At this level, the training set is treated as random and the population parameter of interest is A. The estimate of A that we consider here is \hat{A}_r, and the randomness of \hat{A}_r arises from both the random choice of the training set and the finite size of the test set. The variance of \hat{A}_r accounts for both training and testing, and is given by

\mathrm{Var}\,\hat{A}_r = c_1 \mu_1 + c_2 \mu_2 + c_3 \mu_3 + c_4 \mu_4 - \mu_8,   (10)

where \mu_i \equiv E_r[\mu_{ir}], i = 1, 2, 3, 4, and \mu_8 \equiv (E_r[A_r])^2. The derivation of this result uses the conditional variance identity [19],

\mathrm{Var}\,\hat{A}_r = E_r[\mathrm{Var}_{t|r}\,\hat{A}_r] + \mathrm{Var}_r\, E_t[\hat{A}_r].   (11)

Using (5), we have

E_r[\mathrm{Var}_{t|r}\,\hat{A}_r] = c_1 \mu_1 + c_2 \mu_2 + c_3 \mu_3 - (1 - c_4)\mu_4,   (12)

and, noticing that E_t[\hat{A}_r] = A_r, we have

\mathrm{Var}_r\, E_t[\hat{A}_r] = E_r[A_r^2] - (E_r[A_r])^2 = \mu_4 - \mu_8.   (13)

Substituting (12) and (13) back into (11), (10) follows. We note that expressions (12) and (13) represent natural components of \mathrm{Var}\,\hat{A}_r: (12) is the testing variability averaged over training sets, which we call the testing variability component, and (13) is the variance of the true performance with respect to training sets, which we call the training variability component.

2.1.3. Level three assessment

At this level, the training set is treated as random and the population parameter of interest is A. The estimate of A that we consider here is \hat{\bar{A}} as defined in (4), rather than the estimate \hat{A}_r considered in the level two assessment. Clearly \hat{A}_r is a special case of \hat{\bar{A}} with R = 1, and \hat{\bar{A}} is less variable than \hat{A}_r as an estimate of A. The randomness of \hat{\bar{A}} arises from both the random choice of the training sets and the finite size of the test set. In order to derive the variance of \hat{\bar{A}}, we need the covariance between two AUC estimates.

Let \hat{A}_r and \hat{A}_{r'} be the AUC estimates of a classifier trained with independent datasets r and r', respectively, and tested on a common dataset that is independent of both r and r'. The covariance of \hat{A}_r and \hat{A}_{r'} is given by

\mathrm{Cov}(\hat{A}_r, \hat{A}_{r'}) = c_1 \mu_5 + c_2 \mu_6 + c_3 \mu_7 - (1 - c_4)\mu_8,   (14)

where

\mu_5 \equiv E_r[E_t[s(t_{0ir}; t_{1jr})\, s(t_{0ir'}; t_{1jr'})]], \quad \mu_6 \equiv E_r[E_t[s(t_{0ir}; t_{1jr})\, s(t_{0i'r'}; t_{1jr'})]], \quad \mu_7 \equiv E_r[E_t[s(t_{0ir}; t_{1jr})\, s(t_{0ir'}; t_{1j'r'})]].

To derive this result, let \hat{h}_r and \hat{h}_{r'} be the two classifiers trained with r and r', respectively, and let \psi(x_{0i}; x_{1j}) \equiv s(\hat{h}_r(x_{0i}); \hat{h}_r(x_{1j})) = s(t_{0ir}; t_{1jr}) and \phi(x_{0i}; x_{1j}) \equiv s(\hat{h}_{r'}(x_{0i}); \hat{h}_{r'}(x_{1j})) = s(t_{0ir'}; t_{1jr'}). Then it follows from the U-statistic theory (see Lemma 2 in Appendix A) that

\mathrm{Cov}(\hat{A}_r, \hat{A}_{r'}) = \frac{1}{N_0 N_1}\left[(N_0 - 1)\zeta_{0,1} + (N_1 - 1)\zeta_{1,0} + \zeta_{1,1}\right],   (15)

where

\zeta_{0,1} = E_r[E_t[s(t_{0ir}; t_{1jr})\, s(t_{0i'r'}; t_{1jr'})]] - E_r[E_t[s(t_{0ir}; t_{1jr})]]\, E_{r'}[E_t[s(t_{0i'r'}; t_{1jr'})]] = \mu_6 - \mu_8,
\zeta_{1,0} = \mu_7 - \mu_8, \quad \zeta_{1,1} = \mu_5 - \mu_8.

By substituting these back into (15), (14) follows.

The variance of the estimated mean AUC, \hat{\bar{A}}, is given by

\mathrm{Var}\,\hat{\bar{A}} = \frac{1}{R}(c_1 \mu_1 + c_2 \mu_2 + c_3 \mu_3 + c_4 \mu_4) + \frac{R-1}{R}(c_1 \mu_5 + c_2 \mu_6 + c_3 \mu_7 + c_4 \mu_8) - \mu_8.   (16)

To derive this result, from (4) we have

\mathrm{Var}\,\hat{\bar{A}} = \frac{1}{R^2}\left(\sum_{r=1}^{R} \mathrm{Var}\,\hat{A}_r + \sum_{r=1}^{R}\sum_{r' \ne r} \mathrm{Cov}(\hat{A}_r, \hat{A}_{r'})\right).   (17)

Eq. (16) follows immediately by substituting (10) and (14) into (17). This result is analogous to the result in Gallas [20], where R readers read N_0 signal-absent patient images and N_1 signal-present patient images in a diagnostic classification task. An apparent parallel is thus seen between the assessment of human intelligence and that of artificial intelligence.

Analogous to the variance decomposition in Gallas et al. [21], this variance can be represented as a sum of three components: training, testing, and training-testing interaction, expressed in (18), (19), and (20), respectively. This decomposition is useful in characterizing classifiers.

\mathrm{Var}_r\,\hat{\bar{A}} = \frac{1}{R}(\mu_4 - \mu_8),   (18)

\mathrm{Var}_t\,\hat{\bar{A}} = c_1 \mu_5 + c_2 \mu_6 + c_3 \mu_7 - (1 - c_4)\mu_8,   (19)

\mathrm{Var}_{rt}\,\hat{\bar{A}} = \frac{1}{R}\left(c_1 \mu_1 + c_2 \mu_2 + c_3 \mu_3 - (1 - c_4)\mu_4 - c_1 \mu_5 - c_2 \mu_6 - c_3 \mu_7 + (1 - c_4)\mu_8\right).   (20)
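As a consistency check, added here for the reader and only implicit in the original derivation, the three components (18)-(20) sum back to the total variance (16):

```latex
\begin{align*}
\mathrm{Var}_r\,\hat{\bar{A}} + \mathrm{Var}_t\,\hat{\bar{A}} + \mathrm{Var}_{rt}\,\hat{\bar{A}}
 &= \tfrac{1}{R}\bigl(c_1\mu_1 + c_2\mu_2 + c_3\mu_3 + c_4\mu_4\bigr)
  + \tfrac{R-1}{R}\bigl(c_1\mu_5 + c_2\mu_6 + c_3\mu_7\bigr)
  - \tfrac{1}{R}\mu_8 - \tfrac{R-1}{R}(1-c_4)\mu_8 \\
 &= \tfrac{1}{R}\bigl(c_1\mu_1 + c_2\mu_2 + c_3\mu_3 + c_4\mu_4\bigr)
  + \tfrac{R-1}{R}\bigl(c_1\mu_5 + c_2\mu_6 + c_3\mu_7 + c_4\mu_8\bigr) - \mu_8.
\end{align*}
```

The last line is exactly (16).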

2.2. Variance estimation

In the previous section we derived expressions for the variance of the estimated AUC at all three levels. Every variance expression is a linear combination of population parameters (see (5), (10), (16)), i.e., the second-order moments of the kernel function s: \mu_{ir} (i = 1, \ldots, 4) and \mu_j (j = 1, \ldots, 8). Hence, the U-statistic estimator of each variance can be obtained by replacing these moment parameters with their corresponding U-statistic estimators. The U-statistic estimators of \mu_{ir} (i = 1, \ldots, 4) are

\hat{\mu}_{1r} \equiv \frac{1}{N_0 N_1} \sum_{i=1}^{N_0}\sum_{j=1}^{N_1} s(t_{0ir}; t_{1jr})^2,   (21)

\hat{\mu}_{2r} \equiv \frac{1}{N_0 N_1 (N_0 - 1)} \sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{i' \ne i} s(t_{0ir}; t_{1jr})\, s(t_{0i'r}; t_{1jr}),   (22)

\hat{\mu}_{3r} \equiv \frac{1}{N_0 N_1 (N_1 - 1)} \sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{j' \ne j} s(t_{0ir}; t_{1jr})\, s(t_{0ir}; t_{1j'r}),   (23)

\hat{\mu}_{4r} \equiv \frac{1}{N_0 N_1 (N_0 - 1)(N_1 - 1)} \sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{i' \ne i}\sum_{j' \ne j} s(t_{0ir}; t_{1jr})\, s(t_{0i'r}; t_{1j'r}).   (24)

Hence,

\widehat{\mathrm{Var}}_{t|r}\,\hat{A}_r = c_1 \hat{\mu}_{1r} + c_2 \hat{\mu}_{2r} + c_3 \hat{\mu}_{3r} - (1 - c_4)\hat{\mu}_{4r}.   (25)

Similarly, the expressions for \widehat{\mathrm{Var}}\,\hat{A}_r and \widehat{\mathrm{Var}}\,\hat{\bar{A}} can be obtained by replacing \mu_j (j = 1, \ldots, 8) in (10) and (16), respectively, with their U-statistic estimators \hat{\mu}_j (j = 1, \ldots, 8):

\hat{\mu}_1 \equiv \frac{1}{R N_0 N_1} \sum_{r=1}^{R}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} s(t_{0ir}; t_{1jr})^2,   (26)

\hat{\mu}_2 \equiv \frac{1}{R N_0 N_1 (N_0 - 1)} \sum_{r=1}^{R}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{i' \ne i} s(t_{0ir}; t_{1jr})\, s(t_{0i'r}; t_{1jr}),   (27)

\hat{\mu}_3 \equiv \frac{1}{R N_0 N_1 (N_1 - 1)} \sum_{r=1}^{R}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{j' \ne j} s(t_{0ir}; t_{1jr})\, s(t_{0ir}; t_{1j'r}),   (28)

\hat{\mu}_4 \equiv \frac{1}{R N_0 N_1 (N_0 - 1)(N_1 - 1)} \sum_{r=1}^{R}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{i' \ne i}\sum_{j' \ne j} s(t_{0ir}; t_{1jr})\, s(t_{0i'r}; t_{1j'r}),   (29)

\hat{\mu}_5 \equiv \frac{1}{R N_0 N_1 (R - 1)} \sum_{r=1}^{R}\sum_{r' \ne r}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} s(t_{0ir}; t_{1jr})\, s(t_{0ir'}; t_{1jr'}),   (30)

\hat{\mu}_6 \equiv \frac{1}{R N_0 N_1 (R - 1)(N_0 - 1)} \sum_{r=1}^{R}\sum_{r' \ne r}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{i' \ne i} s(t_{0ir}; t_{1jr})\, s(t_{0i'r'}; t_{1jr'}),   (31)

\hat{\mu}_7 \equiv \frac{1}{R N_0 N_1 (R - 1)(N_1 - 1)} \sum_{r=1}^{R}\sum_{r' \ne r}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{j' \ne j} s(t_{0ir}; t_{1jr})\, s(t_{0ir'}; t_{1j'r'}),   (32)

\hat{\mu}_8 \equiv \frac{1}{R N_0 N_1 (R - 1)(N_0 - 1)(N_1 - 1)} \sum_{r=1}^{R}\sum_{r' \ne r}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1}\sum_{i' \ne i}\sum_{j' \ne j} s(t_{0ir}; t_{1jr})\, s(t_{0i'r'}; t_{1j'r'}).   (33)
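The following Python/NumPy sketch transcribes the estimators (21)-(25) by direct summation. It is an illustration under our own naming, not the authors' released Matlab implementation, and it does not use the fast algorithm discussed next, so it is suitable only for small test sets.

```python
import numpy as np

def kernel_matrix(t0, t1):
    # K[i, j] = s(t_{0i}; t_{1j}) with the tie-aware kernel of Eq. (3)
    t0 = np.asarray(t0, float)[:, None]
    t1 = np.asarray(t1, float)[None, :]
    return (t0 < t1) + 0.5 * (t0 == t1)

def conditional_moments(t0, t1):
    """Direct-summation estimates of mu_1r..mu_4r, Eqs. (21)-(24)."""
    K = kernel_matrix(t0, t1)
    N0, N1 = K.shape
    m1 = (K ** 2).sum() / (N0 * N1)
    m2 = sum(K[i, j] * K[ip, j]
             for i in range(N0) for j in range(N1)
             for ip in range(N0) if ip != i) / (N0 * N1 * (N0 - 1))
    m3 = sum(K[i, j] * K[i, jp]
             for i in range(N0) for j in range(N1)
             for jp in range(N1) if jp != j) / (N0 * N1 * (N1 - 1))
    m4 = sum(K[i, j] * K[ip, jp]
             for i in range(N0) for j in range(N1)
             for ip in range(N0) if ip != i
             for jp in range(N1) if jp != j) / (N0 * N1 * (N0 - 1) * (N1 - 1))
    return m1, m2, m3, m4

def var_level_one(t0, t1):
    """Plug-in estimate of the level-one (testing) variance, Eq. (25)."""
    N0, N1 = len(t0), len(t1)
    c1 = 1.0 / (N0 * N1)
    c2 = (N0 - 1.0) / (N0 * N1)
    c3 = (N1 - 1.0) / (N0 * N1)
    c4 = (N0 - 1.0) * (N1 - 1.0) / (N0 * N1)
    m1, m2, m3, m4 = conditional_moments(t0, t1)
    return c1 * m1 + c2 * m2 + c3 * m3 - (1.0 - c4) * m4

print(var_level_one([1, 2, 3, 4], [3, 4, 5, 6]))
```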

The major computational cost in these variance estimators is the computation of \hat{\mu}_j (j = 1, \ldots, 8). For instance, the computational complexity of \hat{\mu}_8 is O(R^2 N_0^2 N_1^2) if it is computed directly using (33) [13]. We adopt a fast algorithm [21] that substantially reduces the computational complexity; for \hat{\mu}_8, the complexity is reduced to O(R N_0 N_1), a tremendous improvement. The reader is referred to [21] for the algorithmic details of computing \hat{\mu}_j (j = 1, \ldots, 8). Here we explain the key idea of the algorithm with a simple example. To calculate the sum

S = \sum_{i'=1}^{n} \sum_{i \ne i'} x_i x_{i'},

the complexity is O(n^2) in terms of the number of multiplications. However, noticing that

S = \left(\sum_{i=1}^{n} x_i\right)^2 - \sum_{i=1}^{n} x_i^2,

the complexity becomes O(n). The source code in Matlab for all three variance estimators is freely available on the World Wide Web (http://code.google.com/p/assessment-of-classifiers/).
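The identity can be checked numerically; the snippet below is our illustration (in Python rather than the authors' Matlab). The same idea, applied to the sums over i' \ne i, j' \ne j, and r' \ne r, underlies the fast O(R N_0 N_1) computation of (26)-(33).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
n = len(x)

# Naive O(n^2) evaluation of S = sum_{i'} sum_{i != i'} x_i * x_{i'}
S_naive = sum(x[i] * x[ip] for ip in range(n) for i in range(n) if i != ip)

# Fast O(n) evaluation: S = (sum_i x_i)^2 - sum_i x_i^2
S_fast = x.sum() ** 2 - (x ** 2).sum()

assert np.isclose(S_naive, S_fast)
```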

2.3. Multiple training sets and bootstrap approximation

In level-two and level-three assessments, the expressions for \hat{\bar{A}}, \widehat{\mathrm{Var}}\,\hat{A}_r and \widehat{\mathrm{Var}}\,\hat{\bar{A}} involve R trainings of the classifier using R training sets. All these estimators are unique minimum-variance unbiased estimators if R independent training sets are available. However, classifier developers often have one finite dataset for training the classifier and one finite dataset for testing it. In this situation, bootstrap resampling can be used to generate R training sets for the level-two assessment. Thus, for level-two assessment, one can use the two original datasets to calculate \hat{A}_r, which is an unbiased estimate of A, and then calculate \widehat{\mathrm{Var}}\,\hat{A}_r using R bootstrapped training sets and the test set. The bootstrap gives a good approximation of \widehat{\mathrm{Var}}\,\hat{A}_r (see Section 3).

One may attempt to perform a level-three assessment using the bootstrap approximation. However, some concerns suggest that the variance estimation at level three using the bootstrap may be less reliable than that at level two. First, given a training set r of size N, the performance of a classifier trained with a sample bootstrapped from r is, on average, between the performance of the classifier trained with an independent dataset of size 0.5N and that of the classifier trained with an independent dataset of size 0.632N [22,23]. Therefore, the estimate \hat{\bar{A}} obtained with R bootstrap training sets in (4) is biased low compared with the population quantity A. Second, the variance estimator is constructed, by theory, for the (biased) performance estimator that uses multiple independent training sets, not for the performance estimator that uses bootstrap training sets, which are correlated. Our simulation results (see Section 3) confirm that level-three assessment with the bootstrap approximation is not reliable.

Level-three assessment would be useful if multiple independent training sets are available. For example, it can be used to characterize or optimize classifiers using simulated data, for which the population distribution is known and independent training sets can be drawn from that population distribution. Although the Monte Carlo (MC) approach can be used to obtain the variance when the population distribution of the training data is known, the estimation method presented here needs a much smaller number of trainings than the MC approach and hence can substantially reduce the computational cost. In addition, it can be used to characterize the learning curve of a classifier. For a training dataset of size N, one can partition the dataset into R independent sets and assess the classifier performance at training size N/R. By varying R, one can obtain an estimate of the mean classifier performance and the associated error bar as functions of training sample size, which we call the "learning curve" of the classifier. If the curve still has a large slope, there is reason to hope that more training data will help.
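As a schematic of the level-two bootstrap procedure described at the start of this subsection, the sketch below generates R bootstrap training sets from a single training set and scores a fixed test set with each retrained classifier. The toy mean-difference scorer and all variable names are ours, standing in for any classifier with ordinal output; the collected score arrays are what the estimators (26)-(33) and Eq. (10) operate on.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_toy_classifier(x_neg, x_pos):
    # Placeholder "training": a mean-difference linear scorer standing in
    # for any classifier h_r that outputs an ordinal decision score.
    w = x_pos.mean(axis=0) - x_neg.mean(axis=0)
    return lambda x: x @ w

# One finite training set and one finite test set (toy Gaussian data).
d = 6
train_neg, train_pos = rng.normal(0, 1, (30, d)), rng.normal(0.4, 1, (30, d))
test_neg, test_pos = rng.normal(0, 1, (30, d)), rng.normal(0.4, 1, (30, d))

# Point estimate of A: train once on the original training set (R = 1).
h = train_toy_classifier(train_neg, train_pos)
t0, t1 = h(test_neg), h(test_pos)
auc_point = ((t0[:, None] < t1[None, :]) + 0.5 * (t0[:, None] == t1[None, :])).mean()

# Level-two variability: retrain on R bootstrap training sets, score the same test set.
R = 100
scores_neg, scores_pos = [], []
for _ in range(R):
    idx0 = rng.integers(0, len(train_neg), len(train_neg))
    idx1 = rng.integers(0, len(train_pos), len(train_pos))
    hb = train_toy_classifier(train_neg[idx0], train_pos[idx1])
    scores_neg.append(hb(test_neg))
    scores_pos.append(hb(test_pos))

# scores_neg[r], scores_pos[r] (r = 1..R) feed the moment estimators (26)-(33),
# which plug into Eq. (10) to give the level-two variance estimate for auc_point.
print(auc_point)
```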

2.4. Confidence intervals and model comparison

The confidence interval (CI) of classifier performance in terms of AUC can be approximated using the estimated variance and the asymptotic normality of U-statistics [24]. At level one, the 100(1-\alpha)\% CI of \hat{A}_r can be approximated by

[\hat{A}_r - z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}_{t|r}\,\hat{A}_r},\ \hat{A}_r + z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}_{t|r}\,\hat{A}_r}],

where z_{1-\alpha/2} is the upper 1-\alpha/2 critical value of the standard normal distribution. At level two, the 100(1-\alpha)\% CI of \hat{A}_r can be approximated by

[\hat{A}_r - z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}\,\hat{A}_r},\ \hat{A}_r + z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}\,\hat{A}_r}].

At level three, the 100(1-\alpha)\% CI of \hat{\bar{A}} can be approximated by

[\hat{\bar{A}} - z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}\,\hat{\bar{A}}},\ \hat{\bar{A}} + z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}\,\hat{\bar{A}}}].
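A minimal helper for these Wald-type intervals, assuming SciPy is available (our sketch; it applies at any of the three levels once the corresponding variance estimate is in hand):

```python
import numpy as np
from scipy.stats import norm

def auc_ci(auc_estimate, variance_estimate, alpha=0.05):
    """Normal-approximation CI for an AUC estimate at level one, two, or three."""
    z = norm.ppf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(variance_estimate)
    return auc_estimate - half_width, auc_estimate + half_width

print(auc_ci(0.85, 0.0016))  # roughly (0.7716, 0.9284) with z_0.975 ~= 1.96
```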

2.5. Comparing two classifiers

It is often of interest to calculate the variance of the difference of the AUC estimates (\hat{A}_r^{(1)} and \hat{A}_r^{(2)}) of two classifiers for model comparison purposes. For example, one can use the test statistic z = (\hat{A}_r^{(1)} - \hat{A}_r^{(2)})/\sqrt{\mathrm{Var}(\hat{A}_r^{(1)} - \hat{A}_r^{(2)})} to test the null hypothesis H_0: A_r^{(1)} = A_r^{(2)}. One can also show that the lower bound of the 95% CI of \hat{A}_r^{(1)} - \hat{A}_r^{(2)} is larger than a pre-specified critical value in order to demonstrate that classifier 1 is non-inferior to classifier 2. The variance of the difference of two AUC estimates can be calculated using the same expressions (5), (10), and (16), but replacing the kernel function s by s^{(1)} - s^{(2)} in the estimation of the second-order moment parameters \hat{\mu}_{ir} (i = 1, \ldots, 4) or \hat{\mu}_j (j = 1, \ldots, 8).
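For illustration (our sketch, with hypothetical variable names), the difference kernel s^{(1)} - s^{(2)} is simply the elementwise difference of the two classifiers' kernel matrices over the shared test cases; its mean is \hat{A}_r^{(1)} - \hat{A}_r^{(2)}, and feeding it to the moment estimators (21)-(24) or (26)-(33) in place of the single-classifier kernel gives the variance of the difference at the corresponding level.

```python
import numpy as np

def tie_kernel(t0, t1):
    # K[i, j] = s(t_{0i}; t_{1j}) with the tie-aware kernel of Eq. (3)
    t0 = np.asarray(t0, float)[:, None]
    t1 = np.asarray(t1, float)[None, :]
    return (t0 < t1) + 0.5 * (t0 == t1)

def diff_kernel(t0_1, t1_1, t0_2, t1_2):
    """s^(1) - s^(2) over the shared test cases.

    t0_1, t1_1: negative/positive-class test scores from classifier 1;
    t0_2, t1_2: the same test cases scored by classifier 2.
    """
    return tie_kernel(t0_1, t1_1) - tie_kernel(t0_2, t1_2)

D = diff_kernel([0.1, 0.4, 0.3], [0.5, 0.2, 0.9],
                [0.2, 0.1, 0.6], [0.3, 0.7, 0.8])
print(D.mean())  # AUC difference of the two classifiers on this test set
```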

3. Simulation results

In this section, we use simulated datasets to illustrate our variance estimation methods. We validate the variance estimators by comparing them with the true variances obtained by the Monte Carlo (MC) method and examine the effect of bootstrap resampling on the variance estimation.

In the first experiment, we focus on the level-two and level-three analyses and examine the effect of bootstrap resampling on the accuracy of the variance estimation. Level-one analysis is not considered in this experiment since no resampling is needed and it has been widely used in practice. We assumed that the population distributions of the two classes are a pair of multivariate Gaussian distributions with the same identity covariance matrix and different means, where the mean difference vector is set to (0.01, 0.1, 0.3, 0.4, 0.5, 0.6) to approximate an ideal AUC of 0.75. Linear discriminant analysis (LDA) is known to be the optimal classifier for this feature space design [25]. We examined the performance of both LDA and quadratic discriminant analysis (QDA) in a setting with a training dataset consisting of N observations per class and an independent test dataset consisting of N observations per class, and we set N = 30, 90. In each MC trial for a sample size 2N, we randomly sampled from the designed feature space a dataset r for training and another dataset t for testing to calculate an AUC. The MC trial was repeated 1000 times, and the sample mean and sample variance of the 1000 observed AUC values served, respectively, as the truth of the mean AUC (A) and the truth of the level-two variance (\mathrm{Var}\,\hat{A}_r). Also, in each trial, we randomly drew 100 independent training sets and one test set to calculate a mean AUC value. The sample variance of the 1000 observed mean AUC values served as the truth of the level-three variance (\mathrm{Var}\,\hat{\bar{A}}).

For level-two assessment, in each of the 1000 MC trials, we used one randomly drawn training set r and one test set t to calculate \hat{A}_r as an estimate of A. We used 100 training sets and one test set for variance estimation (\widehat{\mathrm{Var}}\,\hat{A}_r) and, to examine the effect of the bootstrap approximation, the 100 training sets were obtained either by bootstrapping the training set r or by random sampling from the population. For level-three assessment, in each of the 1000 MC trials, we used 100 training sets and one test set to estimate the mean performance \hat{\bar{A}} and the variance \widehat{\mathrm{Var}}\,\hat{\bar{A}}. Similarly, to examine the effect of the bootstrap approximation, the 100 training sets were obtained either by bootstrapping a training set r or by random sampling from the population. The mean and the standard deviation of the 1000 estimates of \widehat{\mathrm{Var}}\,\hat{A}_r and \widehat{\mathrm{Var}}\,\hat{\bar{A}} were calculated for comparison with their truth.

Fig. 2(a) plots the mean AUC results. The truth value of the mean AUC is, by definition, identical to the mean of the level-two estimate of the AUC, and they are both plotted as circles in Fig. 2(a). For the level-three estimate of the mean AUC, with both classifiers, the estimate of the mean AUC is unbiased when multiple independent training sets are available and is downwardly biased when only one dataset is available and the bootstrap is used to generate multiple training sets. The cause of this bias was explained in Section 2.3. While it is theoretically clear that both LDA and QDA would yield identical classification performance for this feature space when the training size is sufficiently large, the performance of QDA is lower than that of LDA at limited training sizes due to the increased complexity of QDA. Fig. 2(b) and (c) plot the mean of the 1000 estimated variance values in the MC trials at level two and level three, respectively. The error bar for the variance estimation is ±1 standard deviation of those 1000 estimates (estimation error). The results show that the variance estimates at both levels are unbiased when multiple independent training sets are available. When only one dataset is available and the bootstrap is used to generate multiple training sets, the level-three variance estimate is substantially biased, whereas the bias in the level-two variance estimate is within the estimation error.
The results imply that using the bootstrap to generate multiple training sets provides a good approximation for level-two variance estimation; however, level-three variance estimation using our approach requires independent training sets.
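A minimal sketch of the first experiment's Monte Carlo design is given below (our Python illustration with a plain Fisher LDA; the exact simulation code is not part of the paper).

```python
import numpy as np

rng = np.random.default_rng(2024)
mu_diff = np.array([0.01, 0.1, 0.3, 0.4, 0.5, 0.6])   # ideal AUC ~ 0.75
N, n_trials = 30, 1000
aucs = []

for _ in range(n_trials):
    # Independent training and test sets, N observations per class.
    xtr0 = rng.normal(0.0, 1.0, (N, 6)); xtr1 = rng.normal(0.0, 1.0, (N, 6)) + mu_diff
    xte0 = rng.normal(0.0, 1.0, (N, 6)); xte1 = rng.normal(0.0, 1.0, (N, 6)) + mu_diff
    # Fisher LDA trained on the training set (pooled sample covariance).
    pooled = 0.5 * (np.cov(xtr0.T) + np.cov(xtr1.T))
    w = np.linalg.solve(pooled, xtr1.mean(axis=0) - xtr0.mean(axis=0))
    t0, t1 = xte0 @ w, xte1 @ w
    aucs.append(((t0[:, None] < t1[None, :]) + 0.5 * (t0[:, None] == t1[None, :])).mean())

aucs = np.asarray(aucs)
# Sample mean approximates the mean AUC (A); sample variance approximates the
# level-two (training + testing) variance used as "truth" in this experiment.
print(aucs.mean(), aucs.var(ddof=1))
```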

Fig. 2. Estimation of classifier performance and its variability at level two and level three. (a) Mean AUC, (b) level-two variance, and (c) level-three variance. The error bar for the variance estimation is ±1 standard deviation of the 1000 estimated variance values in 1000 MC trials.

Fig. 3. Variance decomposition as a function of the number of independent training sets. (a) Level-two variance is decomposed into training and testing components and (b) level-three variance is decomposed into training, testing, and train–test interaction components.


Fig. 3 plots the variance components (averaged over 1000 MC trials) of the variance estimation with independent training sets for the sample size N = 30. Fig. 3(a) shows that the estimates of both the testing component and the training component of the level-two variance, using (12) and (13), respectively, are not sensitive to the number of training sets R when R > 10. The results indicate that the training variability is not negligible even under the well-posed conditions in these simulation experiments. Fig. 3(b) shows that the training component of the level-three variance decreases and becomes negligible with sufficiently large R. This is expected since the mean AUC estimate averages over R training sets and hence becomes less variable with respect to the training sets when R is larger.

In the second experiment, we used synthetic datasets to validate the approach presented in Section 2.4 for the estimation of the 95% CI of classifier performance at the three levels. At level one, we trained the classifier with a randomly sampled training set. With this fixed classifier, the following was repeated 10,000 times: the trained classifier was tested on a randomly sampled test set and the test performance and its 95% CI were estimated. Taking the mean of the 10,000 test AUC values as the true AUC value, the actual coverage of the estimated 95% CI is the percentage of the 10,000 trials in which the test AUC lies between the lower bound and the upper bound of the estimated CI. At level two and level three, the following was repeated 10,000 times: the classifier was trained with a randomly sampled training set and tested on a randomly sampled test set and the test AUC was saved; the classifier was further trained R = 100 times with multiple training sets (either independently drawn from the population or bootstrapped from that particular training set) and tested on that test set, and these test results were used to compute the approximate 95% CI. Taking the mean of the 10,000 test AUC values as the true parameter value (i.e., the mean AUC), the actual coverage of the estimated 95% CI is the percentage of the 10,000 trials in which the true AUC value lies between the lower bound and the upper bound of the estimated CI. Note that the truth values of the performance at level two and level three are identical, and the estimate at level three is less variable than that at level two.

In this experiment, we used four classifiers: LDA, logistic regression, k nearest neighbor (k NN), and QDA, where k NN gives discrete outputs and hence may have ties between cases from different classes, while the other three classifiers give continuous outputs. We simulated two feature spaces. The first feature space consists of 10 features that follow a pair of Gaussian distributions with different covariance matrices and different means. The second feature space consists of seven features that arise from a pair of mixtures of Gaussian distributions: each class is a mixture of three Gaussian distributions. The parameters of the Gaussian mixtures were estimated by fitting a real-world dataset to the mixture-of-Gaussians model using the expectation-maximization algorithm [26,27] implemented in the NETLAB toolbox [28]. The real-world dataset used was the "Breast Cancer Wisconsin (Diagnostic)" dataset in the UCI repository of machine learning databases [29]. The original dataset has a nearly perfect separation with 30 features (AUC > 0.99 assessed with leave-one-out cross validation using LDA), and we used the seven lowest performing features in this dataset to simulate a separation that is within the commonly encountered range of performance in medicine (AUC ≈ 0.89 with LDA). The purpose of this simulation is to generate datasets that are more realistic than Gaussian distributions.

The results of this experiment are presented in Table 1. Shown in the table is the actual coverage of the estimated 95% CI in 10,000 MC trials. There are two columns for the level-two and level-three results: "with independent" for generating multiple training sets by independently drawing from the population and "with bootstrap" for bootstrapping a particular training set multiple times. The results consistently show that our approach provides reliable estimation of the 95% CI of classifier performance except for the "with bootstrap" estimation at level three. The results show again that using the bootstrap to generate multiple training sets provides a good approximation for level-two CI estimation; however, level-three CI estimation using our approach requires independent training sets.

Table 1
Validation of the estimation of the 95% CI of classification performance at three levels. Shown is the actual coverage of the estimated 95% CI in 10,000 MC trials.

Classifier            Level one (%)   Level two,             Level two,           Level three,           Level three,
                                      with independent (%)   with bootstrap (%)   with independent (%)   with bootstrap (%)

Dataset 1: Gaussian distribution
LDA                   94.49           96.04                  95.69                94.99                  85.19
Logistic regression   94.49           96.06                  95.73                95.00                  85.14
k NN                  94.75           96.07                  96.03                94.52                  69.46
QDA                   94.39           96.02                  96.03                94.61                  57.30

Dataset 2: Mixture of Gaussian
LDA                   95.22           96.51                  96.33                95.64                  94.08
Logistic regression   95.12           96.29                  96.30                95.37                  93.62
k NN                  97.17           97.10                  96.98                96.55                  91.08
QDA                   94.53           96.14                  96.61                94.77                  83.22

4. Conclusion and discussions

In this paper, we presented a method for assessing classifiers at three levels: assessing the classification performance and its testing variability on a fixed training set (level one), assessing the classification performance and its variability that accounts for both training and testing (level two), and assessing the performance averaged over training sets and its variability that accounts for both training and testing (level three). In the traditional level-one assessment, the estimated classifier performance can be generalized to the population of test cases, conditional on a particular training set. In level-two and level-three assessments, the estimated classifier performance can be generalized to both the population of test sets and the population of training sets. Training variability is a characterization of classifier stability with respect to varying training sets. We derived analytical expressions for the variance of the estimated AUC as well as efficient computation algorithms for computing the AUC, its variance, and confidence intervals at the three levels of assessment. Our U-statistics-based approach is nonparametric and can be applied to assess any classifier that has ordinal (continuous or discrete) outputs. Our simulation results show that the bootstrap is a practical and useful tool for generating multiple training sets for the level-two assessment of classifiers. Although the level-three assessment gives a less variable estimate of the classifier performance, this gain is not free: it requires multiple independent training sets to get reliable estimates of the variability and the confidence interval of the classifier performance.

We note that the assessment of automated classifiers (artificial intelligence) is analogous to the assessment of human raters (human intelligence) in classification tasks. In medical imaging, a random sample of human readers (e.g., radiologists) rate the images of a random sample of patients, where the rating reflects the reader's level of suspicion that a certain disease is present in the image. For any conclusion regarding the classification performance of an imaging modality to be generalizable to a population of readers and a population of patients, the mean performance of the readers has to be estimated, and the variability of the mean performance has to account for both the effect of the finite number of readers and the finite sample size of cases [30,31]. This is analogous to the level-three assessment of automated classifiers, where classifiers trained with different training sets correspond to multiple human readers. Many methods have been investigated to assess the reader and case variability of AUC in human reader studies [30,32–34,20].


In particular, Gallas et al. [20,21] applied U-statistic estimators to the estimation of the reader and case variability of AUC in human reader studies and further developed a fast computation method, which we have adopted for classifier assessment in this paper. Despite the conceptual analogy, there is a practical difference between automated classifier studies and human reader studies: human reader studies often recruit independent readers and therefore the assessment is at level three, whereas there is usually only one training set in real-world classifier studies and therefore a level-two assessment is more feasible.

In our simulation experiments, we restricted the scope of "training" to classifier training, i.e., the process of determining the parameters of a classifier given a fixed classifier architecture and a fixed number of features. This restriction is only for simplicity, because the purpose of our simulations is to validate our variance expressions. Our computation involves training (with a dataset r) a classifier h that has a fixed architecture and input dimensionality (p) to obtain the classifier output \hat{h}_r(x_i) for test subject i with feature vector x_i \in R^p (from the test set t). In modern biomedical applications such as microarray analysis, one often starts from a feature space with a much larger dimensionality D, where D \gg p. Our simulations have been restricted to the situation where p features have been selected and the task is to assess the classification performance of these p features. This implies the existence of an extra independent dataset for feature selection if data-driven automatic feature selection is used for dimension reduction, which is often the case unless feature selection can be achieved solely by domain knowledge (such as biology or medicine in medical diagnostic classifiers). Although in principle one can always partition the available data into three sets for feature selection, classifier training, and classifier testing, respectively, this is often not the practical choice when only limited data are available. A more efficient use of the limited data is to partition the available data into two sets, one for model training and one for testing, where the scope of "training" is broadened to include not only classifier parameter estimation but also feature selection and possibly a procedure to determine some tunable model parameters (such as the number of neighbors k in k NN). Fortunately, the three-dataset paradigm is not a necessary requirement for our assessment approach to work. Our approach makes no assumption concerning the actual function \hat{h}_r, i.e., it can be either a classifier that maps the p-dimensional feature vector to a scalar decision variable or a combination of a feature selector and a classifier that maps the D-dimensional feature vector to a scalar decision variable. Therefore, our approach can be equally applied to quantify the training variability whether "training" refers only to classifier training or to both feature selection and classifier training. However, one must clearly state the actual scope of "training" that is assessed in a study to correctly interpret the results, i.e., what repeated experimental procedures are implied by the estimated variance and confidence interval, or to what conditions the estimated classifier performance can be generalized. For example, if the "training" includes both feature selection and classifier training, then the level-two or level-three variability reflects the range of classification performance when both feature selection and classifier training are repeated on a training set randomly drawn from the D-dimensional space.
The stability of feature selection in high-dimensional, low-sample-size settings such as genomic studies has attracted significant attention in recent years [35,36], and statistical methodology for stable feature selection in high-dimensional data is an active area of research [37,38]. The approach presented in this paper can be used to assess these methods and has the potential to quantify the stability of feature selection.

Appendix A. U-statistics

The U-statistic approach is nonparametric and distribution-free, i.e., a U-statistic estimator depends only on the observed data without any further assumption. In this section, we give the basic definition of the two-sample U-statistic and two lemmas on the (co)variance of U-statistics. For more details on the theory of U-statistics, the reader is referred to [24,17,18,13].

Let t_0 \equiv \{t_{01}, t_{02}, \ldots, t_{0N_0}\} and t_1 \equiv \{t_{11}, t_{12}, \ldots, t_{1N_1}\} be independent random samples from distributions with cumulative distribution functions F(t_0) and G(t_1), respectively. A parameter A is said to be estimable of degree (u, v) for distributions (F, G) in a family \mathcal{F} if u and v are the smallest sample sizes for which there exists an estimator of A that is unbiased for every (F, G) \in \mathcal{F}. That is, there is a function s(\cdot) such that E_{(F,G)}[s(t_0; t_1)] = A for every (F, G) \in \mathcal{F}. Without loss of generality, the two-sample kernel s(\cdot) can be assumed to be symmetric (i.e., symmetric in its t_{0i} components and separately symmetric in its t_{1j} components). The two-sample U-statistic for the estimable parameter A of degree (u, v), for N_0 \ge u and N_1 \ge v, is

U(t_0; t_1) \equiv \binom{N_0}{u}^{-1}\binom{N_1}{v}^{-1} \sum_{\alpha \in I}\sum_{\beta \in J} s(t_{0\alpha_1}, \ldots, t_{0\alpha_u}; t_{1\beta_1}, \ldots, t_{1\beta_v}),   (34)

where I (or J) is the collection of all subsets of u (or v) integers chosen without replacement from the integers \{1, \ldots, N_0\} (or \{1, \ldots, N_1\}).

The following lemma gives the variance of any two-sample U-statistic and will be used in analyzing the variance of \hat{A}_r. For a proof of this lemma, see [17,18,13].

Lemma 1. The variance of any two-sample U-statistic defined in (34) is

\mathrm{Var}\, U(t_0; t_1) = \binom{N_0}{u}^{-1}\binom{N_1}{v}^{-1} \sum_{k=0}^{v}\sum_{j=0}^{u} \binom{u}{j}\binom{N_0-u}{u-j}\binom{v}{k}\binom{N_1-v}{v-k}\, \xi_{j,k},   (35)

where

\xi_{j,k} = \mathrm{Cov}\bigl[\, s(t_{01}, \ldots, t_{0j}, t_{0(j+1)}, \ldots, t_{0u}; t_{11}, \ldots, t_{1k}, t_{1(k+1)}, \ldots, t_{1v}),\ s(t_{01}, \ldots, t_{0j}, t_{0(u+1)}, \ldots, t_{0(2u-j)}; t_{11}, \ldots, t_{1k}, t_{1(v+1)}, \ldots, t_{1(2v-k)})\, \bigr].   (36)

The following lemma on the covariance of two two-sample U-statistics is a straightforward extension of its one-sample version in [18]. The proof is similar to that of Lemma 1 and is omitted here. This lemma will be used in deriving the expression for the covariance of two classifiers' AUC estimates.

Lemma 2. Let U^{(1)} and U^{(2)} be two U-statistics of degree (u, v) constructed from the same two samples x_0 = \{x_{01}, x_{02}, \ldots, x_{0N_0}\} and x_1 = \{x_{11}, x_{12}, \ldots, x_{1N_1}\} using two kernel functions \psi and \phi, respectively. The covariance of U^{(1)} and U^{(2)} is

\mathrm{Cov}(U^{(1)}, U^{(2)}) = \binom{N_0}{u}^{-1}\binom{N_1}{v}^{-1} \sum_{k=0}^{v}\sum_{j=0}^{u} \binom{u}{j}\binom{N_0-u}{u-j}\binom{v}{k}\binom{N_1-v}{v-k}\, \zeta_{j,k},   (37)

where

\zeta_{j,k} = \mathrm{Cov}\bigl[\, \psi(x_{01}, \ldots, x_{0j}, x_{0(j+1)}, \ldots, x_{0u}; x_{11}, \ldots, x_{1k}, x_{1(k+1)}, \ldots, x_{1v}),\ \phi(x_{01}, \ldots, x_{0j}, x_{0(u+1)}, \ldots, x_{0(2u-j)}; x_{11}, \ldots, x_{1k}, x_{1(v+1)}, \ldots, x_{1(2v-k)})\, \bigr].   (38)

Appendix B. An alternative method to account for ties in the variance of AUC

In the main text, we provided a simple and efficient method to account for ties in the scores for cases from the two classes by redefining the kernel function of the AUC estimator. Here we provide a method to derive an alternative variance expression for \mathrm{Var}_{t|r}\,\hat{A}_r. This alternative expression sheds light on how, when there are actually ties in the scores for cases from the two classes, ignoring such ties may cause errors in the variance estimation. This is achieved by representing the variance expression explicitly as a summation of two terms: one that accounts for non-ties and one that accounts for ties. We define two kernel functions

s_C(x, y) = \begin{cases} 1, & x < y,\\ 0, & \text{otherwise}, \end{cases} \qquad s_D(x, y) = \begin{cases} 1, & x = y,\\ 0, & \text{otherwise}; \end{cases}

then the U-statistic estimator of AUC is

\hat{A}_r = \frac{1}{N_0 N_1}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} s_C(t_{0ir}; t_{1jr}) + \frac{1}{2 N_0 N_1}\sum_{i=1}^{N_0}\sum_{j=1}^{N_1} s_D(t_{0ir}; t_{1jr}).

Following the procedure in the main text, one can derive the alternative variance expression for \mathrm{Var}_{t|r}\,\hat{A}_r shown in (39), which is explicitly expressed as a summation of two terms: the first term is the contribution from the scores that do not tie across the two classes, and the second term is the contribution from those that are tied across the two classes. Yousef et al. [13] assume that the classifier scores are continuous and hence have only the first term in their expression for \mathrm{Var}_{t|r}\,\hat{A}_r. Similarly, alternative expressions for \mathrm{Var}\,\hat{A}_r and \mathrm{Var}\,\hat{\bar{A}} can be derived but are omitted here because these expressions are more complex and computationally less efficient than those in the main text.

\mathrm{Var}_{t|r}\,\hat{A}_r = \left(c_1 \mu'_{1r} + c_2 \mu'_{2r} + c_3 \mu'_{3r} - (1 - c_4)\mu'_{4r}\right) + \left(c_1 \tfrac{1}{4}\epsilon + c_2 \left(\tfrac{1}{2}\alpha_2 + \tfrac{1}{4}\beta_2\right) + c_3 \left(\tfrac{1}{2}\alpha_3 + \tfrac{1}{4}\beta_3\right) - (1 - c_4)\left(\mu'_{1r}\epsilon + \tfrac{1}{4}\epsilon^2\right)\right),   (39)

where

\mu'_{1r} \equiv E_t[s_C(t_{0ir}; t_{1jr})^2] = E_t[s_C(t_{0ir}; t_{1jr})] = \Pr(t_{0ir} < t_{1jr}),
\mu'_{2r} \equiv E_t[s_C(t_{0ir}; t_{1jr})\, s_C(t_{0i'r}; t_{1jr})] = \Pr[\max(t_{0ir}, t_{0i'r}) < t_{1jr}],
\mu'_{3r} \equiv E_t[s_C(t_{0ir}; t_{1jr})\, s_C(t_{0ir}; t_{1j'r})] = \Pr[t_{0ir} < \min(t_{1jr}, t_{1j'r})],
\mu'_{4r} \equiv (E_t[s_C(t_{0ir}; t_{1jr})])^2 = (\Pr(t_{0ir} < t_{1jr}))^2,
\alpha_2 \equiv E_t[s_D(t_{0ir}; t_{1jr})\, s_C(t_{0i'r}; t_{1jr}) + s_D(t_{0i'r}; t_{1jr})\, s_C(t_{0ir}; t_{1jr})] = \Pr[\max(t_{0ir}, t_{0i'r}) = t_{1jr} \text{ and } t_{0ir} \ne t_{0i'r}],
\beta_2 \equiv E_t[s_D(t_{0ir}; t_{1jr})\, s_D(t_{0i'r}; t_{1jr})] = \Pr[t_{0ir} = t_{0i'r} = t_{1jr}],
\alpha_3 \equiv E_t[s_D(t_{0ir}; t_{1jr})\, s_C(t_{0ir}; t_{1j'r}) + s_D(t_{0ir}; t_{1j'r})\, s_C(t_{0ir}; t_{1jr})] = \Pr[t_{0ir} = \min(t_{1jr}, t_{1j'r}) \text{ and } t_{1jr} \ne t_{1j'r}],
\beta_3 \equiv E_t[s_D(t_{0ir}; t_{1jr})\, s_D(t_{0ir}; t_{1j'r})] = \Pr[t_{0ir} = t_{1jr} = t_{1j'r}],
\epsilon \equiv E_t[s_D(t_{0ir}; t_{1jr})] = \Pr[t_{0ir} = t_{1jr}].
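As a quick check of the decomposition above (our illustration, not part of the original appendix), the s_C/s_D form of the estimator reproduces the tie-aware AUC of Eq. (2):

```python
import numpy as np

t0 = np.array([1.0, 2.0, 3.0])            # negative-class scores
t1 = np.array([2.0, 3.0, 4.0])            # positive-class scores (ties with t0 present)

sC = (t0[:, None] < t1[None, :]).astype(float)   # strict-inequality kernel
sD = (t0[:, None] == t1[None, :]).astype(float)  # tie-indicator kernel

auc_decomposed = sC.mean() + 0.5 * sD.mean()
auc_tie_kernel = (sC + 0.5 * sD).mean()
assert np.isclose(auc_decomposed, auc_tie_kernel)   # both equal 7/9
print(auc_decomposed)
```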

References

[1] D. Bamber, The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, Journal of Mathematical Psychology 12 (1975) 387–415.
[2] J.A. Hanley, B.J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1) (1982) 29–36.
[3] E.R. DeLong, D.M. DeLong, D.L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, Biometrics 44 (3) (1988) 837–845.
[4] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (7) (1997) 1145–1159.

[5] K. Fukunaga, R.R. Hayes, Estimation of classifier performance, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (10) (1989) 1087–1101.
[6] S.V. Beiden, M.A. Maloof, R.F. Wagner, A general model for finite-sample effects in training and testing of competing classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (12) (2003) 1561–1569.
[7] MAQC-II Consortium, L. Shi, et al., The MAQC-II project: a comprehensive study of common practices for the development and validation of microarray-based predictive models, Nature Biotechnology 28 (2010) 827–838.
[8] S.S. Dave, G. Wright, B. Tan, A. Rosenwald, R.D. Gascoyne, W.C. Chan, R.I. Fisher, R.M. Braziel, L.M. Rimsza, T.M. Grogan, T.P. Miller, M. LeBlanc, T.C. Greiner, D.D. Weisenburger, J.C. Lynch, J. Vose, J.O. Armitage, E.B. Smeland, S. Kvaloy, H. Holte, J. Delabie, J.M. Connors, P.M. Lansdorp, Q. Ouyang, T.A. Lister, A.J. Davies, A.J. Norton, H.K. Muller-Hermelink, G. Ott, E. Campo, E. Montserrat, W.H. Wilson, E.S. Jaffe, R. Simon, L. Yang, J. Powell, H. Zhao, N. Goldschmidt, M. Chiorazzi, L.M. Staudt, Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells, New England Journal of Medicine 351 (21) (2004) 2159–2169.
[9] R. Tibshirani, Immune signatures in follicular lymphoma, New England Journal of Medicine 352 (14) (2005) 1496–1497 (author reply 1496–1497).
[10] S. Michiels, D. Koscielny, C. Hill, Prediction of cancer outcome with microarrays: a multiple random validation strategy, The Lancet 365 (9458) (2005) 488–492.
[11] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation, Journal of Machine Learning Research 5 (2004) 1089–1105.
[12] B. Hanczar, J. Hua, C. Sima, J. Weinstein, M. Bittner, E.R. Dougherty, Small-sample precision of ROC-related estimates, Bioinformatics 26 (2010) 822–830.
[13] W.A. Yousef, R.F. Wagner, M.H. Loew, Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (11) (2006) 1809–1817.
[14] C.E. Metz, Basic principles of ROC analysis, Seminars in Nuclear Medicine 8 (1978) 283–298.
[15] M.S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, Oxford, United Kingdom, 2003.
[16] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
[17] R.H. Randles, D.A. Wolfe, Introduction to the Theory of Nonparametric Statistics, in: Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1979.
[18] A.J. Lee, U-Statistics: Theory and Practice, M. Dekker, New York, 1990.
[19] G. Casella, R.L. Berger, Statistical Inference, in: Duxbury Advanced Series, 2nd ed., Duxbury/Thomson Learning, Pacific Grove, CA, 2002.
[20] B.D. Gallas, One-shot estimate of MRMC variance: AUC, Academic Radiology 13 (3) (2006) 353.
[21] B.D. Gallas, A. Bandos, F.W. Samuelson, R.F. Wagner, A framework for random-effects ROC analysis: biases with the bootstrap and other variance estimators, Communications in Statistics—Theory and Methods 38 (2009) 2586–2603.
[22] B. Efron, Improvements on cross-validation: the .632+ bootstrap method, Journal of the American Statistical Association 92 (1997) 548–560.
[23] W.A. Yousef, R.F. Wagner, M.H. Loew, Comparison of non-parametric methods for assessing classifier performance in terms of ROC parameters, in: Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop, 2004, IEEE Computer Society, 2004, pp. 190–195.
[24] W. Hoeffding, A class of statistics with asymptotically normal distribution, The Annals of Mathematical Statistics 19 (3) (1948) 293–325.
[25] K. Fukunaga, Introduction to Statistical Pattern Recognition, in: Computer Science and Scientific Computing, 2nd ed., Academic Press, Boston, 1990.
[26] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.
[27] J. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, Technical Report TR-97-021, Department of Electrical Engineering and Computer Science, U.C. Berkeley, 1998.
[28] I.T. Nabney, Netlab: Algorithms for Pattern Recognition, Springer, London, UK, 2002.
[29] A. Asuncion, D. Newman, UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
[30] D.D. Dorfman, K.S. Berbaum, C.E. Metz, Receiver operating characteristic rating analysis—generalization to the population of readers and patients with the jackknife method, Investigative Radiology 27 (9) (1992) 723–731.
[31] R.F. Wagner, C.E. Metz, G. Campbell, Assessment of medical imaging systems and computer aids: a tutorial review, Academic Radiology 14 (2007) 723–748.
[32] A.N. Obuchowski, H.E. Rockette, Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations, Communications in Statistics—Simulation and Computation 24 (2) (1995) 285–308.
[33] S.V. Beiden, R.F. Wagner, G. Campbell, Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis, Academic Radiology 7 (5) (2000) 341–349.


[34] H.H. Barrett, M.A. Kupinski, E. Clarkson, Probabilistic foundations of the MRMC method, in: Medical Imaging 2005: Image Perception, Observer Performance, and Technology Assessment, vol. 5749, SPIE, San Diego, CA, USA, 2005, p. 21.
[35] I. Guyon, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.


[36] N. Meinshausen, P. Bühlmann, Stability selection, Journal of the Royal Statistical Society, Series B 72 (2010) 417–473.
[37] N. Meinshausen, P. Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics 34 (3) (2006) 1436–1462.
[38] L. Wasserman, K. Roeder, High-dimensional variable selection, The Annals of Statistics 35 (5A) (2009) 2178–2201.

Weijie Chen earned his PhD in Medical Physics from the University of Chicago and is currently in the Division of Imaging and Applied Mathematics, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, the US Food and Drug Administration. His research interests include methodologies for the objective evaluation of diagnostic devices in general and statistical pattern classifiers in particular.

Brandon D. Gallas earned his PhD in Applied Mathematics from the University of Arizona and is currently in the Division of Imaging and Applied Mathematics, OSEL, CDRH. His research is driven by his regulatory work reviewing imaging devices and the reader studies that support them, as well as computer aids for the detection and diagnosis of disease in imaging and other multi-dimensional data acquisition devices.

Waleed Yousef received his B.Sc. in Electrical Engineering from Ain Shams University, Egypt; his M.Sc. in Computer Science from Helwan University, Egypt; his PhD in Computer Engineering from The George Washington University, US; and his M.Sc. in Statistics from The George Washington University, US. He worked as a research fellow at the US Food and Drug Administration in the area of machine learning. Currently, he is an assistant professor at Helwan University, Faculty of Computers and Information, Computer Science Department, Egypt. He is an academic collaborator and research partner with MESC for research and development, an R&D company. His areas of research include machine learning and data analysis. Currently, he is the principal investigator of a funded project to design a computer-aided detection system for detecting abnormalities in mammograms.