Pattern Recognition 60 (2016) 770–777
ROC representation for the discriminability of multi-classification markers

Yun-Jhong Wu (a), Chin-Tsang Chiang (b,*)

(a) Department of Statistics, University of Michigan, United States
(b) Institute of Applied Mathematical Sciences, National Taiwan University, Taiwan
Article history: Received 1 December 2014; received in revised form 9 April 2016; accepted 26 June 2016; available online 1 July 2016.

Abstract
In this paper, the receiver operating characteristic (ROC) representation and its accuracy measures are shown to be well-defined and meaningful assessments for the discriminability of multi-classification markers. Given a set of classifiers $\mathcal{C}$, a parameterized system can be used to characterize the corresponding optimal ROC manifold. A connection with the decision set further leads to a better understanding of some geometric features of optimal ROC manifolds and preserves the simplicity in computing the hypervolume under the ROC manifold (HUM). In addition, it motivates us to address the necessary and sufficient conditions for the existence of the HUM. To sum up, this work provides working scientists with an extension of the two-class ROC analysis to multi-classification ROC analysis in a theoretically sound manner. © 2016 Elsevier Ltd. All rights reserved.
Keywords: Discriminability; Hypervolume; Manifold; Optimal classification; Receiver operating characteristic; Utility
1. Introduction

Receiver operating characteristic (ROC) analysis, which was originally developed for radar signal detection, is a technique initially created for assessing the performance of binary classification markers and has been extended to multi-classification (see [1-3]). In application, a marker generally refers to a traceable substance whose presence indicates the existence of some state, such as a particular disease condition. Like the ROC curve for binary classification, the ROC manifold is a natural extension to display the trade-off between the correct classification probabilities and the misclassification probabilities. However, the definition of the ROC manifold can be stated in a more mathematically rigorous manner. We address the concern about the existence of the hypervolume under the ROC manifold (HUM), which is an analog of the area under the ROC curve (AUC). A significant research finding by [4] also indicated that the existence of the HUM is still in doubt. To clarify the problem and demonstrate its importance in application, a theoretical unification of ROC manifolds will be established in this paper.
* Corresponding author. E-mail address: [email protected] (C.-T. Chiang).
http://dx.doi.org/10.1016/j.patcog.2016.06.024
Typically, a multi-classification task is mainly based on data of the type $(G, Y)$ and a classifier $\hat{G}$, where the multi-categorical response $G$ stands for the true class with $K$ possible values in $\mathcal{K} = \{1, \ldots, K\}$, $Y \in \mathcal{Y}$ denotes a univariate or multivariate marker value, and $\hat{G}$ is a random function from $\mathcal{Y}$ to $\mathcal{K}$. An extension of ROC analysis to multi-classification was initially developed for sequential classification procedures, which have excited interest for their practical and theoretical simplicity. These algorithms simplify multi-classification tasks to a series of binary classifications of the form $G = k$ versus $G \in \{k+1, \ldots, K\}$ in the order $k = 1, \ldots, K$. The first systematic study of a ternary classification problem can be traced back to the paper of [5]. For a univariate marker value $Y$, Scurfield constructed the ROC manifold for ternary classification to visualize the set generated by $\big(p_{1\sigma(1)}(\hat{G}, Y), p_{2\sigma(2)}(\hat{G}, Y), p_{3\sigma(3)}(\hat{G}, Y)\big) \in [0, 1]^3$, where $\sigma$ is a permutation function on $\{1, 2, 3\}$ and $p_{jk}(\hat{G}, Y)$ represents the conditional probability $P(\hat{G} = j \mid G = k)$, which we will call the performance probability hereinafter. To accommodate a multivariate marker, Mossman [1] also developed a classification rule by utilizing a mapping between each $G$ and $Y$. Although such a classification actually offers a perspective to extend traditional ROC analysis to multi-classification, this approach is generally not optimal in terms of performance probabilities and is of limited applicability. In practice, a monotone likelihood ratio (MLR) condition should be satisfied (cf. [6]) to ensure the optimality of commonly used sequential procedures. By applying a multinomial logistic regression model, Li and Fine [7] extended the foregoing approach to address a multi-categorical response. Based on the ROC manifold generated by the correct classification probabilities, Zhang and Li [8] further employed a general semi-parametric model of [9] to seek an optimal composite marker. As we shall indicate in this paper, an optimal ROC manifold enjoys some geometric characteristics such as regularity (see [10]) and smoothness.
Under some suitable conditions, the equality between the corresponding HUM and the correctness probability (CP) can also be found in [1-3,6,11]. However, non-optimality of ROC manifolds might lead to a lack of interpretation for such a particular summary assessment.

Indeed, ROC analysis gives an illuminating insight into the assessment of the discriminability of markers. It is rational to adopt the performance probabilities $p_{jk}(\hat{G}, Y) = P(\hat{G} = j \mid G = k)$, $j, k = 1, \ldots, K$, to assess the considered classification procedures. For any $\hat{G}$ in a family of classifiers $\mathcal{C}$, its performance function $p(\hat{G}, Y) = (p_{11}(\hat{G}, Y), \ldots, p_{1K}(\hat{G}, Y), \ldots, p_{K1}(\hat{G}, Y), \ldots, p_{KK}(\hat{G}, Y))^{\top}$ can be naturally plotted in a general ROC set:
$$\mathcal{R} = \Big\{ (\xi_{11}, \ldots, \xi_{1K}, \ldots, \xi_{K1}, \ldots, \xi_{KK})^{\top} \in [0, 1]^{K^2} : \sum_{j=1}^{K} \xi_{jk} = 1 \text{ for each } k \in \mathcal{K},\ 0 \le \xi_{jk} \le 1 \Big\}. \qquad (1.1)$$
Following from the above definition, $\mathcal{R}$ is a subset of the unit cube in $\mathbb{R}^{K^2}$ that lies in a $K(K-1)$-dimensional hyperplane and sufficiently represents all possible performance functions. In application, practitioners often focus on partial information of $\mathcal{R}$, such as a subset corresponding to the correct classification probabilities $\{p_{kk} : k \in \mathcal{K}\}$ or the misclassification probabilities $\{p_{jk} : j, k \in \mathcal{K} \text{ with } j \neq k\}$. For this purpose, the symbol $S$ is used to denote the performance probabilities of interest, which generates a smaller ROC set $\mathcal{R}_S$ in $\mathcal{R}$, and the considered operators or sets restricted to $\mathcal{R}_S$ are subscripted by $S$. Thus, a partial performance can be determined as a projection from $\mathcal{R}$ onto $\mathcal{R}_S$. As is well known in binary classification, the corresponding performance probabilities of a set of classifiers in $\mathcal{R}_S$ with $S = \{p_{11}, p_{12}\}$ might not necessarily form an ROC curve but can still be plotted as a representation of the discriminability of classification procedures. It is noted that such a representation is not straightforward for arbitrary $K$-classification tasks. Starting with the concept of a proper assessment for the discriminability of markers, the ROC representation is brought up in this work and is introduced in the next section.
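As a concrete illustration of a point in the ROC set, the following Python sketch (ours, not part of the paper; the toy data and the function name are hypothetical) estimates the performance probabilities $p_{jk}(\hat{G}) = P(\hat{G} = j \mid G = k)$ of a given classifier from labeled data, i.e., a confusion matrix normalized within each true class, and stacks them into a vector satisfying the constraints in (1.1).

```python
import numpy as np

def performance_function(g_true, g_hat, K):
    """Estimate p_jk = P(G_hat = j | G = k) and return the vector
    (p_11, ..., p_1K, ..., p_K1, ..., p_KK) as a point in the ROC set."""
    p = np.zeros((K, K))
    for k in range(1, K + 1):
        in_class_k = (g_true == k)
        for j in range(1, K + 1):
            p[j - 1, k - 1] = np.mean(g_hat[in_class_k] == j)
    # Each column sums to one, matching the constraint in (1.1).
    return p.flatten()

# Toy example: K = 3 classes, a deliberately noisy classifier.
rng = np.random.default_rng(0)
g_true = rng.integers(1, 4, size=3000)
g_hat = np.where(rng.random(3000) < 0.8, g_true, rng.integers(1, 4, size=3000))
print(performance_function(g_true, g_hat, K=3).round(3))
```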
2. ROC representation

The performance of a classification procedure can be represented by a function of performance probabilities $\varphi(p(\hat{G}, Y))$ for different choices of functions $\varphi$, such as the performance function and the expected utility defined in Sections 2.1 and 2.2. To assess the discriminability of $Y$ in the sense of "fairness", the probability-based performance assessment should be a function only of markers and invariant with respect to the chosen classifiers. With an argument slightly different from [12], this assessment is also shown to be equivalent to the ROC representation.

2.1. Performance sets
Fig. 1. A notional performance set $\varphi_{\{p_{11}, p_{22}\}}(\mathcal{C})$ (yellow region and black boundary curve) for a set of binary classifiers $\mathcal{C}$, with each red point representing the performance function of a particular classifier in $[0, 1]^2$. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
To simplify notation, the performance probability $p_{jk}(\hat{G}, Y)$ of a classifier given $Y$ is denoted by $p_{jk}(\hat{G})$, $j, k \in \mathcal{K}$. The performance function $\varphi(\cdot)$ is the vector of conditional probabilities acting on the classifier $\hat{G}$, denoted by

$$\varphi(\hat{G}) = \big(p_{11}(\hat{G}), \ldots, p_{1K}(\hat{G}), \ldots, p_{K1}(\hat{G}), \ldots, p_{KK}(\hat{G})\big)^{\top}, \qquad (2.1)$$

in which it still depends on $\hat{G}$. As for the construction of a proper accuracy measure for the discriminability of $Y$, it is reasonable to define the performance set of the collection to be

$$\varphi(\mathcal{C}) = \{\varphi(\hat{G}) : \hat{G} \in \mathcal{C}\}. \qquad (2.2)$$

Here, the set $\mathcal{C}$ consists of a collection of deterministic and random classifiers (see Fig. 1) defined on $\mathcal{Y}$ with outputs in $\mathcal{K}$. Let $\Pr$ denote a probability measure that generates the conditional probabilities $p_{jk}$'s that define $\varphi$, let $1(\cdot)$ be the indicator function, and let $W_{\lambda}$ follow a Bernoulli distribution with parameter $\lambda \in [0, 1]$. Throughout this paper, we consider the set $\mathcal{C}$ in a probability sense.

Definition 2.1. The set $\mathcal{C}$ is said to be convex with respect to $\Pr$ and $S$ if, for every $\hat{G}_1, \hat{G}_2 \in \mathcal{C}$ and each $\lambda \in [0, 1]$, $p_{jk}(\hat{G}_{\lambda}) \le \lambda p_{jk}(\hat{G}_1) + (1 - \lambda) p_{jk}(\hat{G}_2)$ for all $p_{jk} \in S$ implies $\hat{G}_{\lambda} \in \mathcal{C}$, where $\hat{G}_{\lambda} = \sum_{\ell=1}^{2} 1(W_{\lambda} = 2 - \ell)\, \hat{G}_{\ell}$.

This definition characterizes the convexity of a set of classifiers through the convexity of the corresponding set of performance vectors. Intrinsically, the set $\varphi_S(\mathcal{C})$ contains the performance vectors of all existing classifiers in $\mathcal{C}$ with respect to a specified marker value $Y$ and conveys information about the classification capacity of $Y$ with respect to $\mathcal{C}$. As a representation of the discrimination ability, $\varphi(\mathcal{C})$ should depend only on $Y$. A remarkable characteristic of the performance set in the following theorem further enables us to identify optimal classification procedures.

Theorem 2.1. Suppose that $\mathcal{C}$ is a convex set with respect to $\Pr$ and $S$. Then, the performance set $\varphi(\mathcal{C})$ of $Y$ is convex and compact, and so is $\varphi_S(\mathcal{C})$ for any subset of performance probabilities $S$.

Proof. See Appendix A.
To characterize the convexity and compactness of $\varphi(\mathcal{C})$ (or $\varphi_S(\mathcal{C})$), it suffices to portray the boundary set $\partial\varphi(\mathcal{C})$ (or $\partial\varphi_S(\mathcal{C})$), which will be shown to be related to the optimality of classifiers. In the next subsection, a parameterized system from decision theory is employed to analyze and compute $\partial\varphi(\mathcal{C})$. The properties in Theorem 2.1 further assure the existence of utility classifiers (see Definition 2.3), whose performances fall in $\partial\varphi(\mathcal{C})$, and elucidate the optimality of $\partial\varphi(\mathcal{C})$. Thus, one can consider just $\partial\varphi(\mathcal{C})$ rather than the whole set $\varphi(\mathcal{C})$ or an arbitrary subset of $\varphi(\mathcal{C})$.
For an illustration, let us consider a binary classification problem with $Y$ given $G = k$ following a univariate normal distribution with mean $\mu_{0k}$ and variance $\sigma_0^2$, $k = 1, 2$, where $\mu_{01} < \mu_{02}$. The performance function $\varphi_{\{p_{11}, p_{22}\}}(\hat{G}_y)$ of a classifier $\hat{G}_y = 1 \cdot 1(Y \le y) + 2 \cdot 1(Y > y)$ can be easily computed to be $\big(\Phi((y - \mu_{01})/\sigma_0),\, 1 - \Phi((y - \mu_{02})/\sigma_0)\big)^{\top}$, where $\Phi(\cdot)$ represents the standard normal distribution function. Since the probability density functions of $Y$ given $G = k$, $k = 1, 2$, satisfy an MLR condition with respect to $k$, the performance functions $\{\varphi_{\{p_{11}, p_{22}\}}(\hat{G}_y) : y \in \mathbb{R}\}$ can be shown to form the upper boundary of the performance set $\varphi_{\{p_{11}, p_{22}\}}(\mathcal{C})$ (see Fig. 1), which is called the optimal ROC curve with respect to $\{p_{11}, p_{22}\}$.
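The normal-location example above admits a direct numerical check. The following Python sketch (ours, not part of the paper; the parameter values are arbitrary) traces the optimal ROC curve $\big(p_{11}(\hat{G}_y), p_{22}(\hat{G}_y)\big) = \big(\Phi((y-\mu_{01})/\sigma_0),\, 1-\Phi((y-\mu_{02})/\sigma_0)\big)$ over a grid of thresholds $y$ and approximates the area under it, which for this model equals $\Phi\big((\mu_{02}-\mu_{01})/(\sqrt{2}\,\sigma_0)\big)$.

```python
import numpy as np
from scipy.stats import norm

mu01, mu02, sigma0 = 0.0, 1.5, 1.0    # arbitrary illustrative parameters
y_grid = np.linspace(-6, 8, 2001)      # thresholds for G_hat_y = 1(Y <= y) + 2*1(Y > y)

p11 = norm.cdf((y_grid - mu01) / sigma0)        # P(G_hat = 1 | G = 1)
p22 = 1.0 - norm.cdf((y_grid - mu02) / sigma0)  # P(G_hat = 2 | G = 2)

# Trapezoidal area under the optimal ROC curve (p22 against p11).
auc_numeric = np.sum(np.diff(p11) * (p22[1:] + p22[:-1]) / 2.0)
auc_closed_form = norm.cdf((mu02 - mu01) / (np.sqrt(2) * sigma0))
print(round(auc_numeric, 4), round(auc_closed_form, 4))
```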
2.2. Parameterized optimal ROC manifolds

A parameterized system, which is designed for the performance set $\varphi(\mathcal{C})$ of a collection of classifiers $\mathcal{C}$, should be helpful for analyzing and computing the boundary $\partial\varphi(\mathcal{C})$ of $\varphi(\mathcal{C})$ in theory and practice. Intuitively, a classifier $\hat{G}$ is considered better than another in $\mathcal{C}$ if it is associated with higher correct classification probabilities $\{p_{kk} : k \in \mathcal{K}\}$ or lower misclassification probabilities $\{p_{jk} : j, k \in \mathcal{K} \text{ with } j \neq k\}$. Thus, classifiers partially ranked with respect to an ordering by a dominance relation are naturally introduced.

Definition 2.2. A classifier $\hat{G}_1$ dominates another classifier $\hat{G}_2$ in $\mathcal{C}$ with respect to the set of performances $S$, denoted by $\hat{G}_1 \succeq_S \hat{G}_2$, if $p_{kk}(\hat{G}_1) \ge p_{kk}(\hat{G}_2)$ and $p_{jk}(\hat{G}_1) \le p_{jk}(\hat{G}_2)$ for all $p_{kk}$ and $p_{jk} \in S$ with $j \neq k$. A classifier $\hat{G}_1$ strictly dominates another classifier $\hat{G}_2$ in $\mathcal{C}$ with respect to the set of performances $S$, denoted by $\hat{G}_1 \succ_S \hat{G}_2$, if at least one of the above inequalities is strict.

To simplify the presentation, we retain only one representative of classifiers with the same $S$ performance vector in $\mathcal{C}$. It follows that the resulting set also has the same performance set $\varphi_S(\mathcal{C})$ of $Y$. In light of the concept of partially ordered sets, a classifier $\hat{G}$ in $\mathcal{C}$ is said to be maximal in $\mathcal{C}$ if no other classifier in $\mathcal{C}$ exists that dominates $\hat{G}$. With the dominance relation as a partial ordering on $\varphi_S(\mathcal{C})$, the compactness of $\varphi_S(\mathcal{C})$ in Theorem 2.1 assures that it is upper bounded. Thus, each chain $\{\hat{G}_j : \hat{G}_{\ell_1} \preceq_S \hat{G}_{\ell_2}\ \forall\ \ell_1 < \ell_2 \in \Gamma\}$, where $\Gamma$ is a countable or uncountable index set, is bounded above and all classifiers in the chain are dominated by a maximal classifier. The so-called chain is a collection in which any pair of elements is comparable in the sense of dominance. An application of Zorn's lemma (see e.g. [13]) further shows that the performance function of a maximal classifier belongs to $\varphi_S(\mathcal{C})$. Since $\varphi_S(\mathcal{C})$ is convex and compact, the performance function of a maximal classifier belongs to $\partial\varphi_S(\mathcal{C})$ relative to the ROC set $\mathcal{R}_S$.

In theoretical development, a parameterized representation of $\partial\varphi_S(\mathcal{C})$ is convenient for exploring the properties of the set of performance functions of maximal classifiers. Meanwhile, practitioners should be attracted by the form of these maximal classifiers. It can be found that maximization of the expected utility, which is an optimization criterion, is helpful for achieving both theoretical and practical interests. As in decision theory, the utility of $\hat{G}$ can be defined as

$$U(\hat{G}) = \sum_{j,k} u_{jk}\, 1(\hat{G} = j, G = k), \qquad (2.3)$$

with the expected utility value

$$E[U(\hat{G})] = \sum_{j,k} u_{jk}\, P(\hat{G} = j, G = k), \qquad (2.4)$$

where the utilities $u_{jk}$'s are defined in the following definition:

Definition 2.3. The collection $\mathcal{U}$ of standardized utilities is the set of $u \in \mathbb{R}^{K^2}$ that satisfies (a) $u_{kk} \ge 0\ \forall\ k \in \mathcal{K}$, (b) $u_{jk} \le 0\ \forall\ j, k \in \mathcal{K}$ with $j \neq k$, and (c) $\sum_{j=1}^{K} u_{jk} = K^{-1}\ \forall\ k \in \mathcal{K}$.

By the assumption that $G$ follows a multinomial distribution with parameters $p_k = P(G = k)$, $k \in \mathcal{K}$, and since the $p_k$'s can be absorbed by the $u_{jk}$'s, the expected utility of a classifier in (2.4) is further simplified to

$$E[U(\hat{G})] = u^{\top}\varphi(\hat{G}). \qquad (2.5)$$

So a given $u \in \mathcal{U}$ defines the utility functional $U$ acting on $\hat{G}$. Since maximizing $u^{\top}\varphi(\hat{G})$ is equivalent to maximizing $(cu)^{\top}\varphi(\hat{G})$ for any fixed $c > 0$, the condition $(\sum_{j,k} u_{jk}^2)^{1/2} = 1/c$ is imposed for scale invariance. The third condition in Definition 2.3 is imposed to locate $u$ in a subspace containing $\varphi(\mathcal{C}) - K^{-1}\mathbf{1}_{K^2}$, where $\mathbf{1}_{K^2}$ denotes the vector $(1, \ldots, 1)^{\top}$ in $\mathbb{R}^{K^2}$. It ensures that each point in $\partial\varphi(\mathcal{C})$ corresponds to a unique utility $u$ in the following discussion about optimality. Thus, the standardized $u$ will include $K^2 - K - 1$ free utility values. Particularly, in $\mathcal{R}_S$, the $u_{jk}$'s are naturally set to zero for all $p_{jk} \notin S$ and the number of free utility values reduces to

$$\#S - \#\{k : \{p_{1k}, \ldots, p_{Kk}\} \subset S\} - 1, \qquad (2.6)$$

where $\#$ denotes the cardinality of a set. Interestingly, the utility is the same as the negative Bayes risk, with positive and negative utility values being treated as gain and loss, respectively, in classification. For any given $0 \neq u \in \mathcal{U}$ with norm equal to one, the utility classifier $\hat{G}_u = \operatorname{arg\,sup}_{\hat{G} \in \mathcal{C}} u^{\top}\varphi(\hat{G})$ can be found with the expected utility

$$E[U(\hat{G}_u)] = \sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi(\hat{G}). \qquad (2.7)$$

As expected, the convexity and compactness of $\varphi(\mathcal{C})$ in Theorem 2.1 imply that $u^{\top}\varphi(\mathcal{C})$ is a closed interval and, thus, the supremum is also the maximum, which leads to $\varphi(\hat{G}_u) \in \partial\varphi(\mathcal{C})$ and the existence of $\hat{G}_u$ in $\mathcal{C}$. To justify the characterization of $\varphi(\hat{G}_u)$ as that of the performance functions of maximal classifiers, it remains to establish a connection with maximization of the expected utility criterion. The following theorem states a well-known result in statistical decision theory (cf. [14]).

Theorem 2.2. A classifier $\hat{G}$ is maximal in a convex set $\mathcal{C}$ with respect to the set of performances $S$ if and only if it is a utility classifier in $\mathcal{R}_S$.

Proof. See Appendix A.

It follows from Theorems 2.1 and 2.2 that the optimal ROC manifold with respect to $S$ is defined to be

$$M_S := \{\varphi_S(\hat{G}) \in \mathcal{R}_S : \hat{G} \text{ is maximal in } \mathcal{C}\}, \qquad (2.8)$$

in which there might be several maximal classifiers in $\mathcal{C}$, and it fully represents the discriminability of a multi-classification marker (see Fig. 2). We should stress that the results achieved in this paper are mainly based on a formulation in terms of utility. This groundwork should lead to a better understanding of the geometric characteristics of optimal ROC manifolds and of the conditions on $S$ for a positive HUM. As an alternative approach, Schubert et al. [15] utilized a different functional to determine the optimal ROC manifold. Before giving a more precise characterization of parameterized optimal ROC manifolds, the interpretability and computability are illustrated in terms of the decision set.
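To make the utility-maximization criterion in (2.7) concrete, the sketch below (our illustration, not from the paper; the candidate family and utility values are hypothetical) searches a finite family of threshold classifiers for the one maximizing $u^{\top}\varphi(\hat{G})$ under a utility $u$ with diagonal gains and off-diagonal losses; by Theorem 2.2, the selected classifier is maximal and its performance vector lies on the optimal ROC manifold.

```python
import numpy as np

rng = np.random.default_rng(1)
# Binary example: Y | G = k ~ N(mu_k, 1), k = 1, 2.
mu = {1: 0.0, 2: 1.5}
g = rng.integers(1, 3, size=5000)
y = rng.normal([mu[k] for k in g], 1.0)

def perf_vector(threshold):
    """Performance vector (p11, p12, p21, p22) of G_hat = 1(Y <= t) + 2*1(Y > t)."""
    g_hat = np.where(y <= threshold, 1, 2)
    return np.array([np.mean(g_hat[g == k] == j) for j in (1, 2) for k in (1, 2)])

# Hypothetical utility u = (u11, u12, u21, u22); each column sums to 1/2 = 1/K,
# and rescaling u would not change the maximizer (scale invariance).
u = np.array([0.6, -0.1, -0.1, 0.6])

thresholds = np.linspace(-3, 4, 141)           # finite candidate family
utilities = [u @ perf_vector(t) for t in thresholds]
best_t = thresholds[int(np.argmax(utilities))]
print("utility-maximizing threshold:", round(best_t, 2))
print("its performance vector:", perf_vector(best_t).round(3))
```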
2.3. Connection with decision set

Let $f_k(y)$, $L_{jk}(y)$, and $L(y)$ denote the density function of $Y$ given $G = k$, the likelihood ratio function $f_j(y)/f_k(y)$ for $j, k \in \mathcal{K}$ with $j \neq k$, and $(L_{1K}(y), \ldots, L_{(K-1)K}(y))^{\top}$, respectively. Further, the sets $\{y \in \mathcal{Y} : L_{jK}(y) = c\}$, $j = 1, \ldots, K - 1$, are assumed to have measure zero for any positive constant $c$.
Fig. 2. Notional examples of the optimal ROC manifold $M_S$ for ternary classification in $\mathcal{R}_S$ with $S$ representing (a) a set of correct classification probabilities $\{p_{11}, p_{22}, p_{33}\}$ and (b) a set of misclassification probabilities $\{p_{12}, p_{23}, p_{31}\}$.
The decision set, which is spanned by likelihood ratio values, has been utilized in some applied fields (e.g. [16]) and simplifies the computation of ROC manifolds. From the expected utility in (2.4), we can also derive an explicit form of optimal classifiers with an argument slightly different from [12]. By using (2.3), the equality $U(\hat{G}) = \sum_{k=1}^{K} U(\hat{G})\, 1(\hat{G} = k)$, and the iterated expectation $E[X_1] = E[E[X_1 \mid X_2]]$ for any generic random quantities $X_1$ and $X_2$, one has the following expected utility of $\hat{G}$:

$$E[U(\hat{G})] = E\Big[\sum_{k=1}^{K} E\big[U(\hat{G})\, 1(\hat{G} = k) \mid Y\big]\Big]. \qquad (2.9)$$

It is noted that the inner expectation is with respect to the probability distribution of $(\hat{G}, G)$ given $Y$ and the outer expectation is with respect to the distribution of $Y$. This decomposition enables us to construct a utility classifier by maximizing the conditional expected utility pointwise over $\mathcal{Y}$. When the inequality

$$E\big[U(\hat{G})\, 1(\hat{G} = k) \mid Y = y\big] \ge \max_{1 \le j \le K} E\big[U(\hat{G})\, 1(\hat{G} = j) \mid Y = y\big] \qquad (2.10)$$

is satisfied for each $y \in \mathcal{Y}$, we set $P(\hat{G} = k \mid Y = y) = 1$. By absorbing the $p_i$'s into the $u_{ki}$'s in (2.4), (2.10) can be rewritten as $\min_{j \in \mathcal{K}} \sum_{i=1}^{K} (u_{ki} - u_{ji})\, L_{iK}(y) \ge 0$. Thus, the utility classifier $\hat{G}_u$ satisfies

$$P(\hat{G}_u = k \mid Y = y) = 1 \quad \text{if} \quad L(y) \in D_k(u) := \bigcap_{j \neq k} \Big\{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji})\, L_{iK}(y) \ge 0,\ y \in \mathcal{Y}\Big\}, \quad k \in \mathcal{K}. \qquad (2.11)$$

It is clear that $\{D_k(u) : k \in \mathcal{K}\}$ is a partition of the space spanned by the likelihood ratio scores $L(Y)$ and that the intersection $\bigcap_{j \neq k} \{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji})\, p_i\, L_{iK}(y) = 0,\ y \in \mathcal{Y}\}$ is a critical point $c(u)$. Such a decision set $\mathcal{D} = \{D_k(u) : k \in \mathcal{K}\}$ has been proposed by [12,16,17], among others, to describe classifiers. When each maximal classifier can be manifested as a convex combination of classifiers in the decision set spanned by the likelihood ratios, the likelihood $L(\cdot)$ is assured to be the measurement of an optimal marker for $K$-classification. On the contrary, it would be impractical to formulate optimal classifiers through the original marker values or their combinations. Usually, linear classifiers suffer from a serious loss of information; although such commonly used procedures are easy to explain, one can only achieve sub-optimality in classification.

At first sight, it seems that using the decision set to express classifiers might involve some complexities in the overlapping of the $D_k(u)$'s, the dimensionality of the decision set, and the domains of the $f_k(y)$'s. However, these doubts can be fully clarified as follows. First, since the $D_k$'s can overlap only at their boundaries, $\bigcap_{k=1}^{K} D_k(u)$ is a subset of $\partial D_k(u)$ with $P(L(Y) \in \bigcap_{k=1}^{K} D_k(u)) = 0$, which implies that optimal classifiers are still well-defined. Second, although the transformation from $Y$ to $L(Y)$ may reduce the dimensionality of markers, there is no information loss since the minimal sufficiency of the statistic $L(Y)$ for $(G, Y)$ assures the invariance of performance functions of classifiers, which is evidenced by

$$p_{jk}(\hat{G}) = E\big[E[P(\hat{G} = j \mid Y) \mid L(Y)] \mid G = k\big] = E\big[P(\hat{G} = j \mid L(Y)) \mid G = k\big]. \qquad (2.12)$$

It follows from (2.12) that there exists a classifier $\hat{G}^{*}$ from $\mathcal{Y}$ to $\mathcal{K}$ with the same performance function value as $\hat{G}$.

From Theorem 2.2, a classifier with maximum correct classification probabilities $\{p_{kk} : k \in \mathcal{K}\}$ can be shown to be a utility classifier with $u_{jk} = 0$ for $j, k \in \mathcal{K}$ with $j \neq k$, and vice versa. For a non-degenerate case with $u_{kk} > 0$, one can further simplify $D_k(u)$ in (2.11) to

$$D_k(u) = \bigcap_{j \neq k} \Big\{L(y) : L_{jk}(y) \ge \frac{u_{kk}}{u_{jj}},\ y \in \mathcal{Y}\Big\}, \quad k \in \mathcal{K}, \qquad (2.13)$$

with an explicit critical point $c(u) = u_{KK} \cdot (u_{11}^{-1}, \ldots, u_{(K-1)(K-1)}^{-1}, 1)^{\top}$. In practice, it is easier to use $c(u)$ as $K - 1$ threshold values in $\mathcal{D}$ to represent an optimal classifier when $S = \{p_{k\sigma(k)} : k \in \mathcal{K}\}$, which assures that $c(u)$ is a bijective function of $u$.
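A small numerical sketch may help fix ideas about the decision set. The Python code below (our illustration; the three normal densities and the utility matrix are hypothetical) implements the utility classifier in (2.11) on the likelihood-ratio scale: a point $y$ is assigned to class $k$ whenever $\sum_{i=1}^{K}(u_{ki} - u_{ji})\, L_{iK}(y) \ge 0$ for all $j$, which amounts to maximizing $\sum_{i=1}^{K} u_{ki}\, L_{iK}(y)$ over $k$ (with $L_{KK} \equiv 1$).

```python
import numpy as np
from scipy.stats import norm

K = 3
# Hypothetical class-conditional densities: Y | G = k ~ N(mu_k, 1).
mus = [0.0, 1.0, 2.5]
def f(k, y):
    return norm.pdf(y, loc=mus[k - 1], scale=1.0)

# Utility matrix u[j-1, k-1] = u_jk: diagonal gains, off-diagonal losses (hypothetical values).
u = np.array([[0.5, -0.1, -0.1],
              [-0.1, 0.5, -0.1],
              [-0.1, -0.1, 0.5]])

def classify(y):
    """Utility classifier (2.11): argmax_k sum_i u_ki * L_iK(y), with L_KK = 1."""
    L = np.array([f(i, y) / f(K, y) for i in range(1, K + 1)])  # (L_1K, L_2K, L_3K = 1)
    scores = u @ L          # scores[k-1] = sum_i u_ki * L_iK(y)
    return int(np.argmax(scores)) + 1

print([classify(y) for y in (-1.0, 0.5, 1.2, 3.0)])
```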
Example 2.1. With a realization y of univariate marker value Y , a common approach is to classify subjects sequentially by the
classifier $\hat{G}$ defined by

$$\hat{G} = k \quad \text{if} \quad Y \in \bigcap_{j \neq k} \Big\{Y : L_{jk}(Y) < \frac{u_{jj}}{u_{kk}}\Big\}, \quad k \in \mathcal{K}. \qquad (2.14)$$

It achieves optimality only when $\{f_k : k \in \mathcal{K}\}$ satisfies the MLR condition with respect to $k$, which means that $L_{jk}(y)$ is a monotone function of $y$. Without loss of generality, $L_{jk}(y)$ is assumed to be strictly increasing for each $j > k$. The optimal classifier is directly derived from (2.14) as

$$\hat{G}(Y) = \begin{cases} 1 & \text{if } Y < \min_{j > 1} L_{j1}^{-1}(u_{jj}/u_{11}), \\ k\ (1 < k < K) & \text{if } \max_{j < k} L_{jk}^{-1}(u_{jj}/u_{kk}) < Y < \min_{j > k} L_{jk}^{-1}(u_{jj}/u_{kk}), \\ K & \text{if } Y > \max_{j < K} L_{jK}^{-1}(u_{jj}/u_{KK}). \end{cases} \qquad (2.15)$$
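For a concrete instance of (2.15), the Python sketch below (ours; the normal means, common variance, and diagonal utilities are hypothetical, and for $K = 3$ it assumes the resulting thresholds are ordered so that the three intervals tile the real line) computes the cutoffs $L_{jk}^{-1}(u_{jj}/u_{kk})$ in closed form for equal-variance normal densities and classifies a few marker values sequentially.

```python
import numpy as np

means = [0.0, 1.0, 2.5]          # increasing class means, so L_jk is increasing for j > k
sigma2 = 1.0
u_diag = [0.5, 0.4, 0.5]         # hypothetical diagonal utilities u_11, u_22, u_33
K = 3

def Ljk_inverse(j, k, c):
    """Solve f_j(y)/f_k(y) = c for normal densities with common variance sigma2."""
    mj, mk = means[j - 1], means[k - 1]
    return (sigma2 * np.log(c) + (mj**2 - mk**2) / 2.0) / (mj - mk)

def classify(y):
    # Classifier (2.15): intervals determined by the thresholds L_jk^{-1}(u_jj / u_kk).
    if y < min(Ljk_inverse(j, 1, u_diag[j - 1] / u_diag[0]) for j in range(2, K + 1)):
        return 1
    if y > max(Ljk_inverse(j, K, u_diag[j - 1] / u_diag[K - 1]) for j in range(1, K)):
        return K
    return 2   # the remaining (middle) interval for K = 3

print([classify(y) for y in (-0.5, 0.8, 1.6, 3.0)])
```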
3. Characterization of optimal ROC manifolds

To make a comparison based on a specific $p_{jk} \in S$, a maximal classifier $\hat{G}$ in $\mathcal{C}$ with respect to $S$ has the highest correct classification probability $p_{jk}(\hat{G})$ for $j = k$ or the lowest misclassification probability $p_{jk}(\hat{G})$ for $j \neq k$ among all classifiers with fixed values of the other performance probabilities in $S$. From the geometric perspective, the performance function value $\varphi_S(\hat{G})$ is the highest point on the set generated by $S \setminus \{p_{jk}\}$. As one can see, the optimality of classifiers has both theoretical and practical importance. Without optimality of classifiers, the ROC manifold could be an arbitrary subset of $\varphi_S(\mathcal{C})$ rather than a manifold in the geometric sense. Since few features could be identified for (non-optimal) ROC manifolds, estimation of ROC manifolds and related summary measures might lead to an ambiguous and complicated situation. With this motivation, we introduce optimal ROC manifolds for multi-classification as an extension of optimal ROC curves for binary classification.

Fig. 3. A parametric system for $M_S$ based on the supporting function/utility.

For this type of optimization problem, our first strategy is to show that the set of performance functions of maximal classifiers is a manifold. Roughly speaking, the structure of such a set is locally similar to a Euclidean space when the set $\mathcal{C}$ is an uncountably infinite set. The developed mechanism for the optimal ROC manifold $M_S$ defined in (2.8) is mainly based on the expected utility or, in the terminology of convex analysis, the supporting function. Let us consider the hyperplane

$$H_S(r, u) = \{p \in \mathcal{R}_S : u^{\top} p = r\} \quad \text{for given } u \in \mathcal{U} \text{ and } r \in \mathbb{R}. \qquad (3.1)$$

It follows that the performance set $\varphi_S(\mathcal{C})$ can be re-expressed as $\bigcup_{r \in \mathbb{R}} (H_S(r, u) \cap \varphi_S(\mathcal{C}))$ and the real value $r$ can be treated as the expected utility of the classifiers $\hat{G}$ with $\varphi_S(\hat{G}) \in H_S(r, u) \cap \varphi_S(\mathcal{C})$ (see Fig. 3). Thus, a parametric version of $M_S$ is naturally established as a function $T_S(u)$ for each $u \in \mathcal{U}$ with output the subset

$$T_S(u) = H_S\Big(\sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi_S(\hat{G}),\ u\Big) \cap \varphi_S(\mathcal{C}), \qquad (3.2)$$

which reduces to an $s$-tuple in $\mathcal{R}_S$ with $s$ being the number of free utility values. Of course, this depends on how many classifiers attain the supremum; there might be only one and, hence, $s = 1$. With the parametric system in (3.2) and the convexity of $\mathcal{C}$ with respect to $\Pr$ and $S$, the optimal ROC manifold $M_S$ is indeed an at most $s$-dimensional manifold in the geometric sense, since the set can be parameterized as a convex function on an $s$-dimensional Euclidean space.

The above parameterization supplies some intrinsic characterization of $M_S$ through the decision set and the expression $p_{jk}(\hat{G}_u) = \int_{L(y) \in D_j(u)} f_k(y)\, dy$. Let $f_{Lk}$ and $F_{Lk}$ denote the respective density and distribution functions of $L(Y)$ given $G = k$, $k \in \mathcal{K}$. It follows that the optimal manifold $M_S$ should be smooth if the $f_{Lk}$, $k \in \mathcal{K}$, are smooth. For binary classification tasks, the optimal ROC curve $M_S$ for $S = \{p_{11}, p_{22}\}$ can be expressed as a function $T_S(p_{11}) = F_{L2}(F_{L1}^{-1}(1 - p_{11}))$ of $p_{11}$, since $F_{L1}$ is assumed to be invertible. This particularly simple form greatly facilitates practitioners in modeling ROC curves. Even for markers with regularly used distributions, closed-form expressions of ROC manifolds as functions of some $p_{jk}$'s seem to be unattainable. Furthermore, modeling $M_S$ would become intricate for $K \ge 3$. Admittedly, the manifold $M_S$ can be represented by a continuous function $T_S(p_{S \setminus \{p_{jk}\}})$ with $p_{jk} \in S$ on the domain given by the projection of $\varphi(\mathcal{C})$ onto $\mathcal{R}_{S \setminus \{p_{jk}\}}$. For each fixed $p \in \mathcal{R}_{S \setminus \{p_{jk}\}}$, let us consider the corresponding classifiers with performances located in $\varphi_S(\mathcal{C})$. All of them can be shown to be dominated by a unique maximal classifier $\hat{G}_0$ with its performance in the same set. Thus, it is straightforward to have $T_S(p_{S \setminus \{p_{jk}\}}) = p_{jk}(\hat{G}_0)$. In practice, researchers might be interested in exploring a trade-off among the $p_{jk}$'s. With the constructed parameterization, the supporting hyperplane $H_S(\sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi_S(\hat{G}), u)$ is a tangent hyperplane at the point $T_S(u)$ on the manifold, and a parameterized curve along $T_S(u)$ has a tangent vector lying in the tangent space, hence normal to $u \in \mathcal{U}$. Evidenced by this fact, $\partial p_{jk}(\hat{G}_u)/\partial p_{j'k'}(\hat{G}_u) = -u_{j'k'}/u_{jk}$ can be treated as the trade-off between $p_{jk}$ and $p_{j'k'}$ at $T_S(u)$.
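For the binary case, the likelihood-ratio representation $T_S(p_{11}) = F_{L2}(F_{L1}^{-1}(1 - p_{11}))$ can be checked numerically. The Python sketch below (ours; the normal-location model and sample size are illustrative assumptions) approximates $F_{L1}$ and $F_{L2}$ by the empirical distribution functions of $L(Y) = f_1(Y)/f_2(Y)$ within each class and compares the resulting curve with the closed-form curve of the threshold classifier from Section 2.1; the two should agree up to Monte Carlo error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu01, mu02, sigma0, n = 0.0, 1.5, 1.0, 200_000

y1 = rng.normal(mu01, sigma0, n)              # Y | G = 1
y2 = rng.normal(mu02, sigma0, n)              # Y | G = 2
L = lambda y: norm.pdf(y, mu01, sigma0) / norm.pdf(y, mu02, sigma0)  # L_12 = f1 / f2
L1, L2 = np.sort(L(y1)), np.sort(L(y2))

p11_grid = np.linspace(0.01, 0.99, 99)
# F_{L1}^{-1}(1 - p11) via the empirical quantile of L(Y) | G = 1,
# then T_S(p11) = F_{L2}(threshold) via the empirical CDF of L(Y) | G = 2.
thresholds = np.quantile(L1, 1.0 - p11_grid)
p22_lr = np.searchsorted(L2, thresholds) / n

# Closed form from the threshold classifier G_hat_y = 1(Y <= y) + 2*1(Y > y).
y_thr = norm.ppf(p11_grid) * sigma0 + mu01     # y with Phi((y - mu01)/sigma0) = p11
p22_direct = 1.0 - norm.cdf((y_thr - mu02) / sigma0)

print(np.max(np.abs(p22_lr - p22_direct)).round(3))   # small discrepancy expected
```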
4. Existence of the hypervolume under the manifold

In an ROC set $\mathcal{R}_S$, the corresponding optimal ROC manifold might be complicated to visualize when the numbers of both the considered classes and the performance probabilities $p_{jk}$ of interest are greater than three. As pointed out by [6], a summary index of optimal ROC manifolds should facilitate comparisons among markers and provide a reasonable ordering of their performances. As an analog of the AUC, the hypervolume under the manifold (HUM) has been proposed for multi-classification in the foregoing literature (e.g. [2,4,5]). To help practitioners in computing HUMs and plotting 2D- and 3D-ROC manifolds, Novoselova et al. [18] further developed computationally efficient software tools. Recently, Li et al. [19] also developed a practical approach to identify the relative order of marker values with the largest HUM. However, there is still no clear progress in answering a
fundamental question about the existence of the HUM. Without a proper specification of $S$, the considered HUM might be zero even with optimality. For binary classification tasks, the performance set $\varphi_{\{p_{11}, p_{22}\}}(\mathcal{C})$ usually separates the ROC set $\mathcal{R}_{\{p_{11}, p_{22}\}}$ and assures the existence of the optimal AUC. With the continuity of the optimal ROC manifold $M_S$, it is possible to construct a separation of the ROC set $\mathcal{R}_S$ by $M_S$. In practice, this separation is necessary for the set under $M_S$ to have a volume, which is denoted by $V_S$. Let $\mathrm{vec}[\cdot]$ be the vectorization operation acting on a matrix. To clearly characterize such an accuracy assessment, a series of results is established as follows.

Theorem 4.1. For $K \ge 3$, suppose that $\mathcal{R}_S$ contains two coordinates $p_{ij}$ and $p_{ik}$ with $j \neq k$, and that $f_{Li}$, $f_{Lj}$, and $f_{Lk}$ have a common domain. Then, there exists a continuous mapping $p_S : [0, 1] \mapsto \mathcal{R}_S$ with $p_S(0) = \mathrm{vec}[1(j' = k')]$ and $p_S(1) = \mathrm{vec}[1((j', k') = (j', \sigma(j')))]$ for an arbitrary permutation function $\sigma$ with $\sigma(k) \neq k$, such that $\{p_S(t) : t \in [0, 1]\} \cap \varphi_S(\mathcal{C}) = \emptyset$.

Proof. See Appendix A.

Due to the closedness of $\{(t, p_S(t)) : t \in [0, 1]\}$ and $\varphi_S(\mathcal{C})$, the distance between $\varphi_S(\mathcal{C})$ and $\partial\mathcal{R}_S$ can be shown to be positive by Theorem 4.1. Generally, an optimal ROC manifold might not enclose a set with positive hypervolume. It follows from Theorem 4.1 that both optimal and non-optimal HUMs can be well-defined only if $S = \{p_{k\sigma(k)} : k \in \mathcal{K}\}$ for some permutation function $\sigma$.

Our other focus in this section is on the case of a degenerate $M_S$, i.e., $M_S$ can be parameterized as a smooth function defined on an at most $(K - 2)$-dimensional Euclidean space. When $K \ge 3$ and $S = \{p_{k\sigma(k)} : k \in \mathcal{K}\}$ contains both correct classification and misclassification probabilities, a maximal classifier in $\mathcal{C}$ with respect to $S$ must be of the type with $p_{jk}(\hat{G}) = 0$ for $j \neq k$. More precisely, the dimension of $T_S$ can be shown to be less than $K - 1$ from the property $\{\varphi_S(\hat{G}) \in M_S\} \subset \{\varphi_{\tilde{S}}(\hat{G}) \in M_{\tilde{S}}\}$ for $\tilde{S} = \{p_{kk} : k \in \mathcal{K} \text{ with } p_{kk} \in S\}$. Therefore, $M_S$ is unable to create a separation in $\mathcal{R}_S$. The condition in the following theorem further gives the essential ingredients for a well-behaved $V_S$.

Theorem 4.2. For $K \ge 3$, suppose that $S = \{p_{kk} : k \in \mathcal{K}\}$ or $S = \{p_{k\sigma(k)} : k \in \mathcal{K} \text{ with } \sigma(k) \neq k\}$ for some permutation function $\sigma$ acting on $\mathcal{K}$. Then, $M_S$ separates $\mathcal{R}_S$ into two disjoint sets in which the interior of each set is open and connected.

Proof. See Appendix A.

One might think that the HUM can be defined as the hypervolume under $M_S$ only on the domain of $M_S$, in a sense similar to the partial AUC (cf. [20]). Unfortunately, the induced accuracy measure is still problematic in practice, although this view would circumvent the question of whether $M_S$ can actually separate $\mathcal{R}_S$. Specifically, in the $\mathcal{R}_S$ generated by all misclassification probabilities, Edwards et al. [4] provided an argument to elucidate that the HUMs of both perfect and useless markers might be zero. For an arbitrary $S$, we characterize the condition for the occurrence of this untestable phenomenon and relate it to the condition in Theorem 4.2.

Theorem 4.3. For $K \ge 3$, the HUM $V_S$ under $T_S(p_{S \setminus \{p_{jk}\}})$ with $p_{jk} \in S$ has the following properties: (i) (Near perfect marker) $V_S \to 0$ as $p_{jk} \to \delta_{jk}\ \forall\ p_{jk} \in S$; (ii) (Non-informative marker) $V_S = 0$ when $p_{jk_1} = p_{jk_2}\ \forall\ p_{jk_1}, p_{jk_2} \in S$; if and only if neither $S = \{p_{kk} : k \in \mathcal{K}\}$ nor $S = \{p_{k\sigma(k)} : k \in \mathcal{K} \text{ with } \sigma(k) \neq k\}$ for some permutation function $\sigma$.
Proof. See Appendix A.

As a consequence, the HUM is a rational summary index for the discriminability of a marker if and only if the performance probabilities of interest satisfy the condition in the above theorem. To illustrate the hypotheses in Theorem 4.3, let us consider the following two binary classification examples: (a) $G = \sum_{\ell=1}^{2} \ell \cdot 1(Y \in \mathcal{Y}_{\ell})$ with $\mathcal{Y}_1 \cap \mathcal{Y}_2 = \emptyset$ and $\mathcal{Y}_1 \cup \mathcal{Y}_2 = \mathcal{Y}$, and (b) $G$ is independent of $Y$. For the case of a perfect marker, $\hat{G} = \sum_{\ell=1}^{2} \ell \cdot 1(Y \in \mathcal{Y}_{\ell})$ is a perfect classifier and $\{W_{\lambda}\hat{G} + (1 - W_{\lambda})\hat{G}_{\ell} : \hat{G}_{\ell} = \ell,\ \ell = 1, 2,\ \lambda \in [0, 1]\}$ is the collection of maximal classifiers. As for the case of a non-informative $Y$, the corresponding collection of maximal classifiers is $\{W_{\lambda}\hat{G}_1 + (1 - W_{\lambda})\hat{G}_2 : \hat{G}_{\ell} = \ell,\ \ell = 1, 2,\ \lambda \in [0, 1]\}$. When $S$ is neither $\{p_{11}, p_{22}\}$ nor $\{p_{12}, p_{21}\}$, the corresponding HUMs can easily be shown to be zero.

Given any specific permutation $\sigma_0$ on $\mathcal{K}$, the HUM corresponding to $S = \{p_{k\sigma_0(k)} : k \in \mathcal{K}\}$ in Theorem 4.3 has been shown to be equal to the correctness probability (CP) (see [6] for general $K$ and [16] for $K = 3$):
$$P\Big(\prod_{k=1}^{K} f_k(Y_{\sigma_0(k)}) \ge \prod_{k=1}^{K} f_k(Y_{\sigma(k)})\ \ \forall\ \sigma\ \Big|\ G_{\sigma_0(1)}, \ldots, G_{\sigma_0(K)}\Big). \qquad (4.1)$$
For explanatory simplicity, $S = \{p_{kk} : k \in \mathcal{K}\}$ is considered in the following discussion. Three parametric models, which were illustrated by [6], are further used to compute the well-behaved HUM $V_S$ under the optimal ROC manifold $M_S$ with $S$ satisfying the form in Theorem 4.3.
Example 4.1. Let $\theta_0 = (\theta_{01}^{\top}, \ldots, \theta_{0(K-1)}^{\top})^{\top}$ and $\theta_{0K} = 0$. Under the validity of a multinomial logistic regression model, one has $L_{kK}(y) = \exp(\theta_{0k}^{\top} y)\, P(G = K)/P(G = k)$. It has been derived by [6] that $V_S = P\big(\sum_{k=1}^{K} (\theta_{0k} - \theta_{0\sigma(k)})^{\top} Y_k \ge 0\ \forall\ \sigma \mid G_1 = 1, \ldots, G_K = K\big)$.
Example 4.2. Suppose that $Y$ given $G = k$ follows a multivariate normal distribution with mean $\mu_{0k}$ and covariance matrix $\Sigma_{0k}$, $k \in \mathcal{K}$. Then, $V_S = P\big(\sum_{k=1}^{K} \big[(Y_k - \mu_{0\sigma(k)})^{\top}\Sigma_{0\sigma(k)}^{-1}(Y_k - \mu_{0\sigma(k)}) - (Y_k - \mu_{0k})^{\top}\Sigma_{0k}^{-1}(Y_k - \mu_{0k})\big] \ge 0\ \forall\ \sigma \mid G_1 = 1, \ldots, G_K = K\big)$.
Example 4.3. Suppose that $Y$ is univariate with the corresponding family of distributions $f_k(y)$, $k \in \mathcal{K}$, satisfying the MLR condition with respect to $\theta_0$. The likelihood ratio $L_{k_1 K}(y; \theta_{0k_1})/L_{k_2 K}(y; \theta_{0k_2})$ for $k_1, k_2 \in \mathcal{K}$ with $k_1 \neq k_2$ can be shown to be strictly increasing in $y$ whenever $\theta_{0k_1} > \theta_{0k_2}$. In light of this fact and (4.1), $V_S = P\big(Y_{k_1} > Y_{k_2} \text{ whenever } \theta_{0k_1} > \theta_{0k_2},\ \forall\ k_1 \neq k_2 \mid G_1 = 1, \ldots, G_K = K\big)$ can be derived.
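The correctness-probability form of the HUM lends itself to Monte Carlo approximation. The sketch below (ours; the bivariate normal model, identity covariances, and simulation size are illustrative assumptions in the spirit of Example 4.2) estimates $V_S$ as the probability that independent draws $Y_k$ from classes $k = 1, \ldots, K$ are jointly most likely under the identity assignment, i.e., $\prod_k f_k(Y_k) \ge \prod_k f_k(Y_{\sigma(k)})$ for all permutations $\sigma$, in line with (4.1) with $\sigma_0$ the identity.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
K, n_mc = 3, 5_000
means = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]  # hypothetical
dens = [multivariate_normal(mean=m, cov=np.eye(2)) for m in means]

def log_joint(ys, perm):
    """log prod_k f_k(Y_{perm(k)}) for one draw (Y_1, ..., Y_K)."""
    return sum(dens[k].logpdf(ys[perm[k]]) for k in range(K))

perms = list(itertools.permutations(range(K)))
identity = tuple(range(K))
hits = 0
for _ in range(n_mc):
    ys = [dens[k].rvs(random_state=rng) for k in range(K)]
    best = max(perms, key=lambda p: log_joint(ys, p))
    hits += (log_joint(ys, identity) >= log_joint(ys, best) - 1e-12)
print("Monte Carlo estimate of V_S (CP):", hits / n_mc)
```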
5. Conclusion

For the discriminability of multi-classification markers, this paper provides a theoretical framework showing that a proper assessment based on the performance probabilities is exactly the corresponding optimal ROC manifold. Through a parameterization based on the utility-maximization criterion, the optimal ROC manifolds are demonstrated to be manifolds. This assures some practical and desirable features and directly supports work on modeling ROC manifolds. In addition, we give the necessary and sufficient conditions for the existence of the HUM. When researchers are especially interested in some performance probabilities with respect to a suitable ROC subset, the usefulness of the corresponding HUM can be justified. In conclusion, this paper extends the scientific groundwork toward a more general multi-class ROC analysis.
Acknowledgments

The research of the corresponding author was partially supported by the National Science Council grants 97-2118-M-002-020-MY2 and 102-2118-M-002-003 (Taiwan). The authors would like to thank two reviewers for their constructive comments on this paper.
Appendix A. Proofs of Theorems

A.1. Proof of Theorem 2.1

Proof. For two arbitrary classifiers $\hat{G}_1, \hat{G}_2 \in \mathcal{C}$ and $\hat{G}_{\lambda}$ with $W_{\lambda}$ assumed to be independent of $(G, \hat{G}_1, \hat{G}_2)$, one can derive that for every $j, k \in \mathcal{K}$

$$p_{jk}(\hat{G}_{\lambda}) = E[1(\hat{G}_{\lambda} = j) \mid G = k] = E\Big[\sum_{\ell=1}^{2} 1(W_{\lambda} = 2 - \ell)\, 1(\hat{G}_{\ell} = j) \,\Big|\, G = k\Big] = \lambda p_{jk}(\hat{G}_1) + (1 - \lambda) p_{jk}(\hat{G}_2). \qquad (A.1)$$

It follows that for any $S$

$$\varphi_S(\hat{G}_{\lambda}) = \lambda \varphi_S(\hat{G}_1) + (1 - \lambda) \varphi_S(\hat{G}_2), \qquad (A.2)$$

which implies that $\hat{G}_{\lambda}$ is also a classifier in $\mathcal{C}$ since $\mathcal{C}$ is convex, and hence that $\varphi_S(\mathcal{C})$ is convex. Let $\{\hat{G}_n\}$ be a sequence of classifiers in $\mathcal{C}$ such that $\varphi_S(\hat{G}_n)$ converges to some point $p_0$. Since the $\hat{G}_n$'s have values in $\mathcal{K}$, given $\varepsilon > 0$, there exists a positive constant $M_{\varepsilon}$ such that

$$P\big(\sup_n \|(\hat{G}_n, Y)\| > M_{\varepsilon}\big) < \varepsilon, \qquad (A.3)$$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. By Prokhorov's theorem [21] and Lebesgue's dominated convergence theorem, there exists a subsequence $\{(\hat{G}_{n_i}, Y)\}$ converging in distribution to $(\hat{G}_0, Y)$, in which $\hat{G}_0$ is a classifier in $\mathcal{C}$ with $\varphi_S(\hat{G}_0) = p_0$; this corresponds to the fact that the limit of a sequence of classifiers, if it exists as the limit of a sequence of functions, is again a classifier. This further implies the closedness of $\varphi_S(\mathcal{C})$. Together with the boundedness of $\mathcal{R}_S$, the compactness of $\varphi_S(\mathcal{C})$ is immediately obtained. The proof of the convexity and compactness of $\varphi(\mathcal{C})$ goes along the same lines and is omitted here. $\square$

A.2. Proof of Theorem 2.2

Proof. It follows from Theorem 2.1 that $\varphi_S(\mathcal{C})$ is a convex set. Together with $\varphi_S(\hat{G}_u) \in \partial\varphi_S(\mathcal{C})$, there exists a hyperplane containing $\varphi_S(\hat{G}_u)$ but no interior point of $\varphi_S(\mathcal{C})$. By standardizing the normal vector of this hyperplane as a utility $u$, $\hat{G}_u$ can be represented as a utility classifier. Conversely, suppose not; that is, some $\hat{G}_u^{*} \succ_S \hat{G}_u$. Since $u_{jk}\, p_{jk}(\hat{G}_u) < u_{jk}\, p_{jk}(\hat{G}_u^{*})$ for some $(j, k)$, we have $u^{\top}\varphi_S(\hat{G}_u) < u^{\top}\varphi_S(\hat{G}_u^{*})$, which contradicts that $\hat{G}_u$ is a utility classifier. $\square$

A.3. Proof of Theorem 4.1

Proof. For any $\tilde{S} \subset S$ with $\{p_{\tilde{S}}(t) : t \in [0, 1]\} \cap \varphi_{\tilde{S}}(\mathcal{C}) = \emptyset$, any $p_S(t)$ with projection $p_{\tilde{S}}(t)$ onto $\mathcal{R}_{\tilde{S}}$ has no intersection with $\varphi_S(\mathcal{C})$. Basically, the $K$-dimensional set $\mathcal{R}_S$ can be separated by $M_S$ only when its dimension is at least $K - 1$. It is not necessary to verify the condition in Theorem 4.1 by the path-construction argument for a degenerate $M_S$ with dimension less than $K - 1$. Thus, we only need to investigate the case of $\#S = K + 1$ with $\{p_{k\sigma(k)} : k \in \mathcal{K}\} \subset S$. For some $p_{ij}$ and $p_{i\sigma(i)} \in S$ with $\sigma(i) \neq j$, we define

$$p_S(t) = \sum_{\ell=0}^{1} (-1)^{\ell}\big[(1 - 2t)\, p_S(0.5) - 2(\ell - t)\, p_S(\ell)\big]\, 1_{[0.5\ell,\, 0.5(1 + \ell))}(t),$$

where $p_S(0.5) = \mathrm{vec}\big[1 - 1(i' = j') + (2 \cdot 1(i' = j') - 1)\, 1_{\{(i, j), (i, \sigma(i))\}}((i', j'))\big]$. Since $f_{Lj}$ and $f_{L\sigma(i)}$ have a common domain, no classifier satisfies $p_{i\sigma(i)}(\hat{G}) \neq p_{ij}(\hat{G})$ and $p_{i'j}(\hat{G}) = 1(i' = j)$ for $i' \neq i$. Thus, $\{p_S(t) : t \in [0, 0.5]\} \cap \varphi_S(\mathcal{C}) = \emptyset$. Similarly, no classifier can satisfy $p_{ij}(\hat{G}) = \delta_{ij}$ and $p_{i\sigma(i)}(\hat{G}) = 1 - 1(i = \sigma(i))$. As a consequence, one has $\{p_S(t) : t \in [0.5, 1]\} \cap \varphi_S(\mathcal{C}) = \emptyset$. $\square$

A.4. Proof of Theorem 4.2

Proof. For a given $Y$, we can construct a trivial classifier $\hat{G}_{\ell}$ with the corresponding performance probabilities $p_{k\sigma(k)}(\hat{G}_{\ell}) = 1(\ell = \sigma(k))$ for every $k, \ell \in \mathcal{K}$. As shown in Theorem 2.1, $\hat{G}_{\lambda} = W_{\lambda}\hat{G}_{\ell_1} + (1 - W_{\lambda})\hat{G}_{\ell_2}$ with $\sum_{k=1}^{K} p_{k\sigma(k)}(\hat{G}_{\lambda}) = 1$ is a classifier in $\mathcal{C}$ for every $\ell_1, \ell_2 \in \mathcal{K}$ and $\lambda \in [0, 1]$. Thus, in $\mathcal{R}_S$, $\varphi_S(\mathcal{C})$ always contains the $(K - 1)$-simplex

$$\Big\{\varphi_S(\hat{G}_{\lambda}) : \hat{G}_{\lambda} = W_{\lambda}\hat{G}_{\ell_1} + (1 - W_{\lambda})\hat{G}_{\ell_2} \text{ with } \sum_{k=1}^{K} p_{k\sigma(k)}(\hat{G}_{\lambda}) = 1,\ \ell_1, \ell_2 \in \mathcal{K},\ \lambda \in [0, 1]\Big\}.$$

Since every continuous path $h(t) = (h_1(t), \ldots, h_K(t))^{\top}$ from $(1, \ldots, 1)^{\top}$ to $(0, \ldots, 0)^{\top}$ in $\mathcal{R}_S$ has a point satisfying $\sum_{k=1}^{K} p_{k\sigma(k)} = 1$, this simplex can separate $\mathcal{R}_S$. The proof is completed. $\square$

A.5. Proof of Theorem 4.3

Proof. In this proof, a conditional probability vector is said to be dominated by another conditional probability vector of the same length if the dominance condition in Definition 2.2 is satisfied. Similar to the argument in the proof of Theorem 4.1, we only need to consider a non-degenerate $M_S$. Suppose that $S$ is not one of the performance sets stated in the theorem. It follows that there exists $\{p_{jk_1}, p_{jk_2}\}$ in $S$. Since

$$\{p_S : p_S \text{ is dominated by some } \varphi_S(\hat{G}) \in M_S\} \subset \{p_S : p_{\tilde{S}} \text{ is dominated by some } \varphi_{\tilde{S}}(\hat{G}) \in M_{\tilde{S}} \text{ and } p_{S \setminus \tilde{S}} \in [0, 1]^{\#(S \setminus \tilde{S})}\} \quad \text{for } \tilde{S} \subset S,$$

an upper bound of $V_S$ is easily obtained from the inequality $V_S \le V_{\tilde{S}}$. Further, the inequality $V_S \le \min_{k_1 \neq k_2,\ p_{jk_1}, p_{jk_2} \in S} V_{\{p_{jk_1}, p_{jk_2}\}}$ holds, and $V_{\{p_{jk_1}, p_{jk_2}\}}$ of a near perfect marker approaches zero. As for a non-informative marker, $V_{\{p_{jk_1}, p_{jk_2}\}} = 0$ is a direct consequence of $p_{jk_1}(\hat{G}) = p_{jk_2}(\hat{G})$. Coupled with the inequality $V_S \le V_{\{p_{jk_1}, p_{jk_2}\}}$, $V_S = 0$ is thus obtained. Conversely, given a perfect marker and $S = \{p_{kk} : k \in \mathcal{K}\}$, the corresponding $V_S\ (= 1)$ is the hypervolume of a unit $K$-cube, and $V_S = 0$ for $S = \{p_{k\sigma(k)} : k \in \mathcal{K} \text{ with } \sigma(k) \neq k\}$. For a non-informative marker, $V_S$ can be calculated to be $1/K!$, which is the hypervolume under the hyperplane $\{p \in \mathcal{R}_S : \sum_{k=1}^{K} p_{k\sigma(k)} = 1\}$ for any $S$ satisfying the condition. This is precisely the assertion of the theorem. $\square$

References
[1] D. Mossman, Three-way ROCs, Med. Decis. Mak. 19 (January (1)) (1999) 78–89. [2] S. Dreiseitl, L. Ohno-Machado, M. Binder, Comparing three-class diagnostic tests by three-way ROC analysis, Med. Decis. Mak. 20 (September (3)) (2000) 323–331. [3] X. He, C. Metz, B. Tsui, J. Links, E. Frey, Three-class ROC analysis—a decision theoretic approach under the ideal observer framework, IEEE Trans. Med. Imaging 25 (May (5)) (2006) 571–581.
[4] D.C. Edwards, C.E. Metz, R.M. Nishikawa, The hypervolume under the ROC hypersurface of near-guessing and near-perfect observers in n-class classification tasks, IEEE Trans. Med. Imaging 24 (March (3)) (2005) 293–299. [5] B.K. Scurfield, Multiple-event forced-choice tasks in the theory of signal detectability, J. Math. Psychol. 40 (September (3)) (1996) 253–269. [6] Y.J. Wu, C.T. Chiang, Optimal receiver operating characteristic manifolds, J. Math. Psychol. 57 (October (5)) (2013) 237–248. [7] J. Li, J.P. Fine, ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies, Biostat 9 (July (3)) (2008) 566–576. [8] Y. Zhang, J. Li, Combining multiple markers for multi-category classification: an ROC surface approach, Aust. N. Z. J. Stat. 53 (1) (2011) 63–78. [9] A.K. Han, Nonparametric analysis of a generalized regression model: the maximum rank correlation estimator, J. Econom. 35 (2–3) (1987) 303–316. [10] J. Jost, Riemannian Geometry and Geometric Analysis, 5th ed., Springer, New York, April 2008. [11] P.S. Heckerling, Parametric three-way receiver operating characteristic surface analysis using mathematica, Med. Decis. Mak. 21 (September (5)) (2001) 409–417. [12] D.C. Edwards, C.E. Metz, M.A. Kupinski, Ideal observers and optimal ROC hypersurfaces in n-class classification, IEEE Trans. Med. Imaging 23 (July (7)) (2004) 891–895.
[13] P.R. Halmos, Naive Set Theory, Springer, New York, 1998. [14] J.O. Berger, Statistical Decision Theory and Bayes Analysis, Springer, New York, 1985. [15] C.M. Schubert, S.N. Thorsen, M.E. Oxley, The ROC manifold for classification systems, Pattern Recognit. 44 (February (2)) (2011) 350–362. [16] B.K. Scurfield, Generalization of the theory of signal detectability to n-event m-dimensional forced-choice tasks, J. Math. Psychol. 42 (March (1)) (1998) 5–31. [17] X. He, E.C. Frey, Three-class ROC analysis—the equal error utility assumption and the optimality of three-class ROC surface using the ideal observer, IEEE Trans. Med. Imaging 25 (August (8)) (2006) 979–986. [18] N. Novoselova, C.D. Beffa, J. Wang, J. Li, F. Pessler, F. Klawonn, HUM calculator and HUM package for R: easy-to-use software tools for multicategory receiver operating characteristic analysis, BMC Bioinform. 30 (June (11)) (2014) 1635–1636. [19] J. Li, Y. Chow, W.K. Wong, T.Y. Wong, Sorting multiple classes in multi-dimensional ROC analysis: parametric and nonparametric approaches, Biomarkers 19 (February (1)) (2014) 1–8. [20] D.K. McClish, Analyzing a portion of the ROC curve, Med. Decis. Mak. 9 (July (3)) (1989) 190–195. [21] A.W.v.d. Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge, 1998.
Yun-Jhong Wu is a Ph.D. candidate in statistics at the University of Michigan. He received an M.S. in mathematics in 2011 and a B.A. in sociology in 2008 from National Taiwan University. His current research interests include statistical methodology and machine learning algorithm design for network data analysis and matrix/tensor decompositions.
Chin-Tsang Chiang is a professor in the Institute of Applied Mathematical Sciences at National Taiwan University. He received a Ph.D. degree in mathematical science from the Johns Hopkins University in 1998. His current research interests include statistical methods for nonparametric and semiparametric models, and the ROC curve analysis.