Pattern Recognition 60 (2016) 770–777
ROC representation for the discriminability of multi-classification markers

Yun-Jhong Wu (a), Chin-Tsang Chiang (b,*)

(a) Department of Statistics, University of Michigan, United States
(b) Institute of Applied Mathematical Sciences, National Taiwan University, Taiwan
Article history: Received 1 December 2014; received in revised form 9 April 2016; accepted 26 June 2016; available online 1 July 2016.

Abstract
In this paper, the receiver operating characteristic (ROC) representation and its accuracy measures are shown to be well-defined and meaningful assessments for the discriminability of multi-classification markers. Given a set of classifiers $\mathcal{C}$, a parameterized system can be used to characterize the corresponding optimal ROC manifold. A connection with the decision set further leads to a better understanding of some geometric features of optimal ROC manifolds and preserves the simplicity in computing the hypervolume under the ROC manifold (HUM). In addition, it motivates us to address the necessary and sufficient conditions for the existence of the HUM. To sum up, this work provides working scientists with an extension of the two-class ROC analysis to multi-classification ROC analysis in a theoretically sound manner. © 2016 Elsevier Ltd. All rights reserved.
Keywords: Discriminability; Hypervolume; Manifold; Optimal classification; Receiver operating characteristic; Utility
1. Introduction

Receiver operating characteristic (ROC) analysis, which was originally developed for radar signal detection, is a technique initially created for assessing the performance of binary classification markers and has been extended to multi-classification (see [1-3]). In application, a marker generally refers to a traceable substance whose presence indicates the existence of some state, such as a particular disease condition. Like the ROC curve for binary classification, the ROC manifold is a natural extension to display the trade-off between the correct classification probabilities and the misclassification probabilities. However, the definition of the ROC manifold can be stated in a more mathematically rigorous manner. We address the concern about the existence of the hypervolume under the ROC manifold (HUM), which is an analog of the area under the ROC curve (AUC). A significant research finding by [4] also indicated that the existence of the HUM is still in doubt. To clarify the problem and demonstrate its importance in application, a theoretical unification of ROC manifolds will be established in this paper.
* Corresponding author. E-mail address: [email protected] (C.-T. Chiang).
http://dx.doi.org/10.1016/j.patcog.2016.06.024
Typically, a multi-classification task is mainly based on data of the type $(G, Y)$ and a classifier $\hat{G}$, where the multi-categorical response $G$ stands for the true class with $K$ possible values in $\mathcal{K} = \{1, \ldots, K\}$, $Y \in \mathcal{Y}$ denotes a univariate or multivariate marker value, and $\hat{G}$ is a random function from $\mathcal{Y}$ to $\mathcal{K}$. An extension of ROC analysis to multi-classification was initially developed for sequential classification procedures, which have excited interest for their practical and theoretical simplicity. These algorithms simplify multi-classification tasks to a series of binary classifications of the form $G = k$ versus $G \in \{k+1, \ldots, K\}$ in the order $k = 1, \ldots, K$. The first systematic study of a ternary classification problem can be traced back to the paper of [5]. For a univariate marker value $Y$, Scurfield constructed the ROC manifold for ternary classification to visualize the set generated by $\big(p_{1\sigma(1)}(\hat{G}, Y), p_{2\sigma(2)}(\hat{G}, Y), p_{3\sigma(3)}(\hat{G}, Y)\big) \in [0, 1]^3$, where $\sigma$ is a permutation function on $\{1, 2, 3\}$ and $p_{jk}(\hat{G}, Y)$ represents the conditional probability $P(\hat{G} = j \mid G = k)$, which we will call the performance probability hereinafter. To accommodate a multivariate marker, Mossman [1] also developed a classification rule by utilizing a mapping between each $G$ and $Y$. Although such a classification actually offers a perspective to extend traditional ROC analysis to multi-classification, this approach is generally not optimal in terms of performance probabilities and is of limited applicability. In practice, a monotone likelihood ratio (MLR) condition should be satisfied (cf. [6]) to ensure the optimality of commonly used sequential procedures. By applying a multinomial logistic regression model, Li and Fine [7] extended the foregoing approach to address a multi-categorical response. Based on the ROC manifold generated by the correct classification probabilities, Zhang and Li [8] further employed a general semi-parametric model of [9] to seek an optimal composite marker. As we shall indicate in this paper, an optimal ROC manifold enjoys some geometric characteristics such as regularity (see [10]) and smoothness.
Under some suitable conditions, the equality between the corresponding HUM and the correctness probability (CP) can also be found in [1-3,6,11]. However, non-optimality of ROC manifolds might lead to a lack of interpretation for such a particular summary assessment.

Indeed, ROC analysis gives an illuminating insight into the assessment of the discriminability of markers. It is rational to adopt the performance probabilities $p_{jk}(\hat{G}, Y) = P(\hat{G} = j \mid G = k)$, $j, k = 1, \ldots, K$, to assess the considered classification procedures. For any $\hat{G}$ in a family of classifiers $\mathcal{C}$, its performance function $p(\hat{G}, Y) = (p_{11}(\hat{G}, Y), \ldots, p_{1K}(\hat{G}, Y), \ldots, p_{K1}(\hat{G}, Y), \ldots, p_{KK}(\hat{G}, Y))^{\top}$ can be naturally plotted in a general ROC set:
$$\mathcal{R} = \Big\{ (\xi_{11}, \ldots, \xi_{1K}, \ldots, \xi_{K1}, \ldots, \xi_{KK})^{\top} \in [0, 1]^{K^2} : \sum_{j=1}^{K} \xi_{jk} = 1 \text{ for each } k \in \mathcal{K},\ 0 \le \xi_{jk} \le 1 \Big\}. \qquad (1.1)$$
Following from the above definition, $\mathcal{R}$ is a subset of the unit cube in $\mathbb{R}^{K^2}$ that lies in a $K(K-1)$-dimensional hyperplane and sufficiently represents all possible performance functions. In application, practitioners often focus on partial information of $\mathcal{R}$, such as a subset corresponding to the correct classification probabilities $\{p_{kk} : k \in \mathcal{K}\}$ or the misclassification probabilities $\{p_{jk} : j, k \in \mathcal{K} \text{ with } j \neq k\}$. For this purpose, the symbol $S$ is used to denote the performance probabilities of interest, which generates a smaller ROC set $\mathcal{R}_S$ in $\mathcal{R}$, and the considered operators or sets restricted to $\mathcal{R}_S$ are subscripted by $S$. Thus, a partial performance can be determined as a projection from $\mathcal{R}$ onto $\mathcal{R}_S$. As is well known in binary classification, the corresponding performance probabilities of a set of classifiers in $\mathcal{R}_S$ with $S = \{p_{11}, p_{12}\}$ might not necessarily form an ROC curve but can still be plotted as a representation of the discriminability of classification procedures. It is noted that such a representation is not straightforward for arbitrary $K$-classification tasks. Starting with the concept of a proper assessment for the discriminability of markers, the ROC representation is brought up in this work and is introduced in the next section.
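As a concrete illustration of a point in the ROC set, the following Python sketch (ours, not part of the paper; the toy data and the function name are hypothetical) estimates the performance probabilities $p_{jk}(\hat{G}) = P(\hat{G} = j \mid G = k)$ of a given classifier from labeled data, i.e., a confusion matrix normalized within each true class, and stacks them into a vector satisfying the constraints in (1.1).

```python
import numpy as np

def performance_function(g_true, g_hat, K):
    """Estimate p_jk = P(G_hat = j | G = k) and return the vector
    (p_11, ..., p_1K, ..., p_K1, ..., p_KK) as a point in the ROC set."""
    p = np.zeros((K, K))
    for k in range(1, K + 1):
        in_class_k = (g_true == k)
        for j in range(1, K + 1):
            p[j - 1, k - 1] = np.mean(g_hat[in_class_k] == j)
    # Each column sums to one, matching the constraint in (1.1).
    return p.flatten()

# Toy example: K = 3 classes, a deliberately noisy classifier.
rng = np.random.default_rng(0)
g_true = rng.integers(1, 4, size=3000)
g_hat = np.where(rng.random(3000) < 0.8, g_true, rng.integers(1, 4, size=3000))
print(performance_function(g_true, g_hat, K=3).round(3))
```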
2. ROC representation

The performance of a classification procedure can be represented by a function of performance probabilities $\varphi(p(\hat{G}, Y))$ for different choices of functions $\varphi$, such as the performance function and the expected utility defined in Sections 2.1 and 2.2. To assess the discriminability of $Y$ in the sense of "fairness", the probability-based performance assessment should be a function only of markers and invariant with respect to the chosen classifiers. With an argument slightly different from [12], this assessment is also shown to be equivalent to the ROC representation.

2.1. Performance sets
Fig. 1. A notional performance set $\varphi_{\{p_{11}, p_{22}\}}(\mathcal{C})$ (yellow region and black boundary curve) for a set of binary classifiers $\mathcal{C}$, with each red point representing the performance function of a particular classifier in $[0, 1]^2$. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
To simplify notation, the performance probability $p_{jk}(\hat{G}, Y)$ of a classifier given $Y$ is denoted by $p_{jk}(\hat{G})$, $j, k \in \mathcal{K}$. The performance function $\varphi(\cdot)$ is the vector of conditional probabilities acting on the classifier $\hat{G}$, denoted by

$$\varphi(\hat{G}) = \big(p_{11}(\hat{G}), \ldots, p_{1K}(\hat{G}), \ldots, p_{K1}(\hat{G}), \ldots, p_{KK}(\hat{G})\big)^{\top}, \qquad (2.1)$$

in which it still depends on $\hat{G}$. As for the construction of a proper accuracy measure for the discriminability of $Y$, it is reasonable to define the performance set of the collection to be

$$\varphi(\mathcal{C}) = \{\varphi(\hat{G}) : \hat{G} \in \mathcal{C}\}. \qquad (2.2)$$

Here, the set $\mathcal{C}$ consists of a collection of deterministic and random classifiers (see Fig. 1) defined on $\mathcal{Y}$ with outputs in $\mathcal{K}$. Let $\Pr$ denote a probability measure that generates the conditional probabilities $p_{jk}$'s that define $\varphi$, let $1(\cdot)$ be the indicator function, and let $W_{\lambda}$ follow a Bernoulli distribution with parameter $\lambda \in [0, 1]$. Throughout this paper, we consider the set $\mathcal{C}$ in a probability sense.

Definition 2.1. The set $\mathcal{C}$ is said to be convex with respect to $\Pr$ and $S$ if, for every $\hat{G}_1, \hat{G}_2 \in \mathcal{C}$ and each $\lambda \in [0, 1]$, $p_{jk}(\hat{G}_{\lambda}) \le \lambda p_{jk}(\hat{G}_1) + (1 - \lambda) p_{jk}(\hat{G}_2)$ for all $p_{jk} \in S$ implies $\hat{G}_{\lambda} \in \mathcal{C}$, where $\hat{G}_{\lambda} = \sum_{\ell=1}^{2} 1(W_{\lambda} = 2 - \ell)\, \hat{G}_{\ell}$.

This definition characterizes the convexity of a set of classifiers through the convexity of the corresponding set of performance vectors. Intrinsically, the set $\varphi_S(\mathcal{C})$ contains the performance vectors of all existing classifiers in $\mathcal{C}$ with respect to a specified marker value $Y$ and conveys information about the classification capacity of $Y$ with respect to $\mathcal{C}$. As a representation of the discrimination ability, $\varphi(\mathcal{C})$ should depend only on $Y$. A remarkable characteristic of the performance set in the following theorem further enables us to identify optimal classification procedures.

Theorem 2.1. Suppose that $\mathcal{C}$ is a convex set with respect to $\Pr$ and $S$. Then, the performance set $\varphi(\mathcal{C})$ of $Y$ is convex and compact, and so is $\varphi_S(\mathcal{C})$ for any subset of performance probabilities $S$.

Proof. See Appendix A.
To characterize the convexity and compactness of $\varphi(\mathcal{C})$ (or $\varphi_S(\mathcal{C})$), it suffices to portray the boundary set $\partial\varphi(\mathcal{C})$ (or $\partial\varphi_S(\mathcal{C})$), which will be shown to be related to the optimality of classifiers. In the next subsection, a parameterized system from decision theory is employed to analyze and compute $\partial\varphi(\mathcal{C})$. The properties in Theorem 2.1 further assure the existence of utility classifiers (see Definition 2.3), whose performances fall in $\partial\varphi(\mathcal{C})$, and elucidate the optimality of $\partial\varphi(\mathcal{C})$. Thus, one can consider just $\partial\varphi(\mathcal{C})$ rather than the whole set $\varphi(\mathcal{C})$ or an arbitrary subset of $\varphi(\mathcal{C})$.
For an illustration, let us consider a binary classification problem with $Y$ given $G = k$ following a univariate normal distribution with mean $\mu_{0k}$ and variance $\sigma_0^2$, $k = 1, 2$, where $\mu_{01} < \mu_{02}$. The performance function $\varphi_{\{p_{11}, p_{22}\}}(\hat{G}_y)$ of a classifier $\hat{G}_y = 1 \cdot 1(Y \le y) + 2 \cdot 1(Y > y)$ can be easily computed to be $\big(\Phi((y - \mu_{01})/\sigma_0),\, 1 - \Phi((y - \mu_{02})/\sigma_0)\big)^{\top}$, where $\Phi(\cdot)$ represents the standard normal distribution function. Since the probability density functions of $Y$ given $G = k$, $k = 1, 2$, satisfy an MLR condition with respect to $k$, the performance functions $\{\varphi_{\{p_{11}, p_{22}\}}(\hat{G}_y) : y \in \mathbb{R}\}$ can be shown to form the upper boundary of the performance set $\varphi_{\{p_{11}, p_{22}\}}(\mathcal{C})$ (see Fig. 1), which is called the optimal ROC curve with respect to $\{p_{11}, p_{22}\}$.
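The normal-location example above admits a direct numerical check. The following Python sketch (ours, not part of the paper; the parameter values are arbitrary) traces the optimal ROC curve $\big(p_{11}(\hat{G}_y), p_{22}(\hat{G}_y)\big) = \big(\Phi((y-\mu_{01})/\sigma_0),\, 1-\Phi((y-\mu_{02})/\sigma_0)\big)$ over a grid of thresholds $y$ and approximates the area under it, which for this model equals $\Phi\big((\mu_{02}-\mu_{01})/(\sqrt{2}\,\sigma_0)\big)$.

```python
import numpy as np
from scipy.stats import norm

mu01, mu02, sigma0 = 0.0, 1.5, 1.0    # arbitrary illustrative parameters
y_grid = np.linspace(-6, 8, 2001)      # thresholds for G_hat_y = 1(Y <= y) + 2*1(Y > y)

p11 = norm.cdf((y_grid - mu01) / sigma0)        # P(G_hat = 1 | G = 1)
p22 = 1.0 - norm.cdf((y_grid - mu02) / sigma0)  # P(G_hat = 2 | G = 2)

# Trapezoidal area under the optimal ROC curve (p22 against p11).
auc_numeric = np.sum(np.diff(p11) * (p22[1:] + p22[:-1]) / 2.0)
auc_closed_form = norm.cdf((mu02 - mu01) / (np.sqrt(2) * sigma0))
print(round(auc_numeric, 4), round(auc_closed_form, 4))
```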
2.2. Parameterized optimal ROC manifolds

A parameterized system, which is designed for the performance set $\varphi(\mathcal{C})$ of a collection of classifiers $\mathcal{C}$, should be helpful for analyzing and computing the boundary $\partial\varphi(\mathcal{C})$ of $\varphi(\mathcal{C})$ in theory and practice. Intuitively, a classifier $\hat{G}$ is considered better than another in $\mathcal{C}$ if it is associated with higher correct classification probabilities $\{p_{kk} : k \in \mathcal{K}\}$ or lower misclassification probabilities $\{p_{jk} : j, k \in \mathcal{K} \text{ with } j \neq k\}$. Thus, classifiers partially ranked with respect to an ordering by a dominance relation are naturally introduced.

Definition 2.2. A classifier $\hat{G}_1$ dominates another classifier $\hat{G}_2$ in $\mathcal{C}$ with respect to the set of performances $S$, denoted by $\hat{G}_1 \succeq_S \hat{G}_2$, if $p_{kk}(\hat{G}_1) \ge p_{kk}(\hat{G}_2)$ and $p_{jk}(\hat{G}_1) \le p_{jk}(\hat{G}_2)$ for all $p_{kk}$ and $p_{jk} \in S$ with $j \neq k$. A classifier $\hat{G}_1$ strictly dominates another classifier $\hat{G}_2$ in $\mathcal{C}$ with respect to the set of performances $S$, denoted by $\hat{G}_1 \succ_S \hat{G}_2$, if at least one of the above inequalities is strict.

To simplify the presentation, we retain only one representative of classifiers with the same $S$ performance vector in $\mathcal{C}$. It follows that the resulting set also has the same performance set $\varphi_S(\mathcal{C})$ of $Y$. In light of the concept of partially ordered sets, a classifier $\hat{G}$ in $\mathcal{C}$ is said to be maximal in $\mathcal{C}$ if no other classifier in $\mathcal{C}$ exists that dominates $\hat{G}$. With the dominance relation as a partial ordering on $\varphi_S(\mathcal{C})$, the compactness of $\varphi_S(\mathcal{C})$ in Theorem 2.1 assures that it is upper bounded. Thus, each chain $\{\hat{G}_j : \hat{G}_{\ell_1} \preceq_S \hat{G}_{\ell_2}\ \forall\ \ell_1 < \ell_2 \in \Gamma\}$, where $\Gamma$ is a countable or uncountable index set, is bounded above and all classifiers in the chain are dominated by a maximal classifier. The so-called chain is a collection in which any pair of elements is comparable in the sense of dominance. An application of Zorn's lemma (see e.g. [13]) further shows that the performance function of a maximal classifier belongs to $\varphi_S(\mathcal{C})$. Since $\varphi_S(\mathcal{C})$ is convex and compact, the performance function of a maximal classifier belongs to $\partial\varphi_S(\mathcal{C})$ relative to the ROC set $\mathcal{R}_S$.

In theoretical development, a parameterized representation of $\partial\varphi_S(\mathcal{C})$ is convenient for exploring the properties of the set of performance functions of maximal classifiers. Meanwhile, practitioners should be attracted by the form of these maximal classifiers. It can be found that maximization of the expected utility, which is an optimization criterion, is helpful for achieving both theoretical and practical interests. As in decision theory, the utility of $\hat{G}$ can be defined as

$$U(\hat{G}) = \sum_{j,k} u_{jk}\, 1(\hat{G} = j, G = k), \qquad (2.3)$$

with the expected utility value

$$E[U(\hat{G})] = \sum_{j,k} u_{jk}\, P(\hat{G} = j, G = k), \qquad (2.4)$$

where the utilities $u_{jk}$'s are defined in the following definition:

Definition 2.3. The collection $\mathcal{U}$ of standardized utilities is the set of $u \in \mathbb{R}^{K^2}$ that satisfies (a) $u_{kk} \ge 0\ \forall\ k \in \mathcal{K}$, (b) $u_{jk} \le 0\ \forall\ j, k \in \mathcal{K}$ with $j \neq k$, and (c) $\sum_{j=1}^{K} u_{jk} = K^{-1}\ \forall\ k \in \mathcal{K}$.

By the assumption that $G$ follows a multinomial distribution with parameters $p_k = P(G = k)$, $k \in \mathcal{K}$, and since the $p_k$'s can be absorbed by the $u_{jk}$'s, the expected utility of a classifier in (2.4) is further simplified to

$$E[U(\hat{G})] = u^{\top}\varphi(\hat{G}). \qquad (2.5)$$

So a given $u \in \mathcal{U}$ defines the utility functional $U$ acting on $\hat{G}$. Since maximizing $u^{\top}\varphi(\hat{G})$ is equivalent to maximizing $(cu)^{\top}\varphi(\hat{G})$ for any fixed $c > 0$, the condition $(\sum_{j,k} u_{jk}^2)^{1/2} = 1/c$ is imposed for scale invariance. The third condition in Definition 2.3 is imposed to locate $u$ in a subspace containing $\varphi(\mathcal{C}) - K^{-1}\mathbf{1}_{K^2}$, where $\mathbf{1}_{K^2}$ denotes the vector $(1, \ldots, 1)^{\top}$ in $\mathbb{R}^{K^2}$. It ensures that each point in $\partial\varphi(\mathcal{C})$ corresponds to a unique utility $u$ in the following discussion about optimality. Thus, the standardized $u$ will include $K^2 - K - 1$ free utility values. Particularly, in $\mathcal{R}_S$, the $u_{jk}$'s are naturally set to zero for all $p_{jk} \notin S$ and the number of free utility values reduces to

$$\#S - \#\{k : \{p_{1k}, \ldots, p_{Kk}\} \subset S\} - 1, \qquad (2.6)$$

where $\#$ denotes the cardinality of a set. Interestingly, the utility is the same as the negative Bayes risk, with positive and negative utility values being treated as gain and loss, respectively, in classification. For any given $0 \neq u \in \mathcal{U}$ with norm equal to one, the utility classifier $\hat{G}_u = \operatorname{arg\,sup}_{\hat{G} \in \mathcal{C}} u^{\top}\varphi(\hat{G})$ can be found with the expected utility

$$E[U(\hat{G}_u)] = \sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi(\hat{G}). \qquad (2.7)$$

As expected, the convexity and compactness of $\varphi(\mathcal{C})$ in Theorem 2.1 imply that $u^{\top}\varphi(\mathcal{C})$ is a closed interval and, thus, the supremum is also the maximum, which leads to $\varphi(\hat{G}_u) \in \partial\varphi(\mathcal{C})$ and the existence of $\hat{G}_u$ in $\mathcal{C}$. To justify the characterization of $\varphi(\hat{G}_u)$ as that of the performance functions of maximal classifiers, it remains to establish a connection with maximization of the expected utility criterion. The following theorem states a well-known result in statistical decision theory (cf. [14]).

Theorem 2.2. A classifier $\hat{G}$ is maximal in a convex set $\mathcal{C}$ with respect to the set of performances $S$ if and only if it is a utility classifier in $\mathcal{R}_S$.

Proof. See Appendix A.

It follows from Theorems 2.1 and 2.2 that the optimal ROC manifold with respect to $S$ is defined to be

$$M_S := \{\varphi_S(\hat{G}) \in \mathcal{R}_S : \hat{G} \text{ is maximal in } \mathcal{C}\}, \qquad (2.8)$$

in which there might be several maximal classifiers in $\mathcal{C}$, and it fully represents the discriminability of a multi-classification marker (see Fig. 2). We should stress that the results achieved in this paper are mainly based on a formulation in terms of utility. This groundwork should lead to a better understanding of the geometric characteristics of optimal ROC manifolds and of the conditions on $S$ for a positive HUM. As an alternative approach, Schubert et al. [15] utilized a different functional to determine the optimal ROC manifold. Before giving a more precise characterization of parameterized optimal ROC manifolds, the interpretability and computability are illustrated in terms of the decision set.
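To make the utility-maximization criterion in (2.7) concrete, the sketch below (our illustration, not from the paper; the candidate family and utility values are hypothetical) searches a finite family of threshold classifiers for the one maximizing $u^{\top}\varphi(\hat{G})$ under a utility $u$ with diagonal gains and off-diagonal losses; by Theorem 2.2, the selected classifier is maximal and its performance vector lies on the optimal ROC manifold.

```python
import numpy as np

rng = np.random.default_rng(1)
# Binary example: Y | G = k ~ N(mu_k, 1), k = 1, 2.
mu = {1: 0.0, 2: 1.5}
g = rng.integers(1, 3, size=5000)
y = rng.normal([mu[k] for k in g], 1.0)

def perf_vector(threshold):
    """Performance vector (p11, p12, p21, p22) of G_hat = 1(Y <= t) + 2*1(Y > t)."""
    g_hat = np.where(y <= threshold, 1, 2)
    return np.array([np.mean(g_hat[g == k] == j) for j in (1, 2) for k in (1, 2)])

# Hypothetical utility u = (u11, u12, u21, u22); each column sums to 1/2 = 1/K,
# and rescaling u would not change the maximizer (scale invariance).
u = np.array([0.6, -0.1, -0.1, 0.6])

thresholds = np.linspace(-3, 4, 141)           # finite candidate family
utilities = [u @ perf_vector(t) for t in thresholds]
best_t = thresholds[int(np.argmax(utilities))]
print("utility-maximizing threshold:", round(best_t, 2))
print("its performance vector:", perf_vector(best_t).round(3))
```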
2.3. Connection with decision set

Let $f_k(y)$, $L_{jk}(y)$, and $L(y)$ denote the density function of $Y$ given $G = k$, the likelihood ratio function $f_j(y)/f_k(y)$ for $j, k \in \mathcal{K}$ with $j \neq k$, and $(L_{1K}(y), \ldots, L_{(K-1)K}(y))^{\top}$, respectively. Further, the sets $\{y \in \mathcal{Y} : L_{jK}(y) = c\}$, $j = 1, \ldots, K - 1$, are assumed to have measure zero for any positive constant $c$.
Fig. 2. Notional examples of the optimal ROC manifold $M_S$ for ternary classification in $\mathcal{R}_S$ with $S$ representing (a) a set of correct classification probabilities $\{p_{11}, p_{22}, p_{33}\}$ and (b) a set of misclassification probabilities $\{p_{12}, p_{23}, p_{31}\}$.
The decision set, which is spanned by likelihood ratio values, has been utilized in some applied fields (e.g. [16]) and simplifies the computation of ROC manifolds. From the expected utility in (2.4), we can also derive an explicit form of optimal classifiers with an argument slightly different from [12]. By using (2.3), the equality $U(\hat{G}) = \sum_{k=1}^{K} U(\hat{G})\, 1(\hat{G} = k)$, and the iterated expectation $E[X_1] = E[E[X_1 \mid X_2]]$ for any generic random quantities $X_1$ and $X_2$, one has the following expected utility of $\hat{G}$:

$$E[U(\hat{G})] = E\Big[\sum_{k=1}^{K} E\big[U(\hat{G})\, 1(\hat{G} = k) \mid Y\big]\Big]. \qquad (2.9)$$

It is noted that the inner expectation is with respect to the probability distribution of $(\hat{G}, G)$ given $Y$ and the outer expectation is with respect to the distribution of $Y$. This decomposition enables us to construct a utility classifier by maximizing the conditional expected utility pointwise over $\mathcal{Y}$. When the inequality

$$E\big[U(\hat{G})\, 1(\hat{G} = k) \mid Y = y\big] \ge \max_{1 \le j \le K} E\big[U(\hat{G})\, 1(\hat{G} = j) \mid Y = y\big] \qquad (2.10)$$

is satisfied for each $y \in \mathcal{Y}$, we set $P(\hat{G} = k \mid Y = y) = 1$. By absorbing the $p_i$'s into the $u_{ki}$'s in (2.4), (2.10) can be rewritten as $\min_{j \in \mathcal{K}} \sum_{i=1}^{K} (u_{ki} - u_{ji})\, L_{iK}(y) \ge 0$. Thus, the utility classifier $\hat{G}_u$ satisfies

$$P(\hat{G}_u = k \mid Y = y) = 1 \quad \text{if} \quad L(y) \in D_k(u) := \bigcap_{j \neq k} \Big\{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji})\, L_{iK}(y) \ge 0,\ y \in \mathcal{Y}\Big\}, \quad k \in \mathcal{K}. \qquad (2.11)$$

It is clear that $\{D_k(u) : k \in \mathcal{K}\}$ is a partition of the space spanned by the likelihood ratio scores $L(Y)$ and that the intersection $\bigcap_{j \neq k} \{L(y) : \sum_{i=1}^{K} (u_{ki} - u_{ji})\, p_i\, L_{iK}(y) = 0,\ y \in \mathcal{Y}\}$ is a critical point $c(u)$. Such a decision set $\mathcal{D} = \{D_k(u) : k \in \mathcal{K}\}$ has been proposed by [12,16,17], among others, to describe classifiers. When each maximal classifier can be manifested as a convex combination of classifiers in the decision set spanned by the likelihood ratios, the likelihood $L(\cdot)$ is assured to be the measurement of an optimal marker for $K$-classification. On the contrary, it would be impractical to formulate optimal classifiers through the original marker values or their combinations. Usually, linear classifiers suffer from a serious loss of information; although such commonly used procedures are easy to explain, one can only achieve sub-optimality in classification.

At first sight, it seems that using the decision set to express classifiers might involve some complexities in the overlapping of the $D_k(u)$'s, the dimensionality of the decision set, and the domains of the $f_k(y)$'s. However, these doubts can be fully clarified as follows. First, since the $D_k$'s can overlap only at their boundaries, $\bigcap_{k=1}^{K} D_k(u)$ is a subset of $\partial D_k(u)$ with $P(L(Y) \in \bigcap_{k=1}^{K} D_k(u)) = 0$, which implies that optimal classifiers are still well-defined. Second, although the transformation from $Y$ to $L(Y)$ may reduce the dimensionality of markers, there is no information loss since the minimal sufficiency of the statistic $L(Y)$ for $(G, Y)$ assures the invariance of performance functions of classifiers, which is evidenced by

$$p_{jk}(\hat{G}) = E\big[E[P(\hat{G} = j \mid Y) \mid L(Y)] \mid G = k\big] = E\big[P(\hat{G} = j \mid L(Y)) \mid G = k\big]. \qquad (2.12)$$

It follows from (2.12) that there exists a classifier $\hat{G}^{*}$ from $\mathcal{Y}$ to $\mathcal{K}$ with the same performance function value as $\hat{G}$.

From Theorem 2.2, a classifier with maximum correct classification probabilities $\{p_{kk} : k \in \mathcal{K}\}$ can be shown to be a utility classifier with $u_{jk} = 0$ for $j, k \in \mathcal{K}$ with $j \neq k$, and vice versa. For a non-degenerate case with $u_{kk} > 0$, one can further simplify $D_k(u)$ in (2.11) to

$$D_k(u) = \bigcap_{j \neq k} \Big\{L(y) : L_{jk}(y) \ge \frac{u_{kk}}{u_{jj}},\ y \in \mathcal{Y}\Big\}, \quad k \in \mathcal{K}, \qquad (2.13)$$

with an explicit critical point $c(u) = u_{KK} \cdot (u_{11}^{-1}, \ldots, u_{(K-1)(K-1)}^{-1}, 1)^{\top}$. In practice, it is easier to use $c(u)$ as $K - 1$ threshold values in $\mathcal{D}$ to represent an optimal classifier when $S = \{p_{k\sigma(k)} : k \in \mathcal{K}\}$, which assures that $c(u)$ is a bijective function of $u$.
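A small numerical sketch may help fix ideas about the decision set. The Python code below (our illustration; the three normal densities and the utility matrix are hypothetical) implements the utility classifier in (2.11) on the likelihood-ratio scale: a point $y$ is assigned to class $k$ whenever $\sum_{i=1}^{K}(u_{ki} - u_{ji})\, L_{iK}(y) \ge 0$ for all $j$, which amounts to maximizing $\sum_{i=1}^{K} u_{ki}\, L_{iK}(y)$ over $k$ (with $L_{KK} \equiv 1$).

```python
import numpy as np
from scipy.stats import norm

K = 3
# Hypothetical class-conditional densities: Y | G = k ~ N(mu_k, 1).
mus = [0.0, 1.0, 2.5]
def f(k, y):
    return norm.pdf(y, loc=mus[k - 1], scale=1.0)

# Utility matrix u[j-1, k-1] = u_jk: diagonal gains, off-diagonal losses (hypothetical values).
u = np.array([[0.5, -0.1, -0.1],
              [-0.1, 0.5, -0.1],
              [-0.1, -0.1, 0.5]])

def classify(y):
    """Utility classifier (2.11): argmax_k sum_i u_ki * L_iK(y), with L_KK = 1."""
    L = np.array([f(i, y) / f(K, y) for i in range(1, K + 1)])  # (L_1K, L_2K, L_3K = 1)
    scores = u @ L          # scores[k-1] = sum_i u_ki * L_iK(y)
    return int(np.argmax(scores)) + 1

print([classify(y) for y in (-1.0, 0.5, 1.2, 3.0)])
```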
Example 2.1. With a realization y of univariate marker value Y , a common approach is to classify subjects sequentially by the
classifier $\hat{G}$ defined by

$$\hat{G} = k \quad \text{if} \quad Y \in \bigcap_{j \neq k} \Big\{Y : L_{jk}(Y) < \frac{u_{jj}}{u_{kk}}\Big\}, \quad k \in \mathcal{K}. \qquad (2.14)$$

It achieves optimality only when $\{f_k : k \in \mathcal{K}\}$ satisfies the MLR condition with respect to $k$, which means that $L_{jk}(y)$ is a monotone function of $y$. Without loss of generality, $L_{jk}(y)$ is assumed to be strictly increasing for each $j > k$. The optimal classifier is directly derived from (2.14) as

$$\hat{G}(Y) = \begin{cases} 1 & \text{if } Y < \min_{j > 1} L_{j1}^{-1}(u_{jj}/u_{11}), \\ k\ (1 < k < K) & \text{if } \max_{j < k} L_{jk}^{-1}(u_{jj}/u_{kk}) < Y < \min_{j > k} L_{jk}^{-1}(u_{jj}/u_{kk}), \\ K & \text{if } Y > \max_{j < K} L_{jK}^{-1}(u_{jj}/u_{KK}). \end{cases} \qquad (2.15)$$
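For a concrete instance of (2.15), the Python sketch below (ours; the normal means, common variance, and diagonal utilities are hypothetical, and for $K = 3$ it assumes the resulting thresholds are ordered so that the three intervals tile the real line) computes the cutoffs $L_{jk}^{-1}(u_{jj}/u_{kk})$ in closed form for equal-variance normal densities and classifies a few marker values sequentially.

```python
import numpy as np

means = [0.0, 1.0, 2.5]          # increasing class means, so L_jk is increasing for j > k
sigma2 = 1.0
u_diag = [0.5, 0.4, 0.5]         # hypothetical diagonal utilities u_11, u_22, u_33
K = 3

def Ljk_inverse(j, k, c):
    """Solve f_j(y)/f_k(y) = c for normal densities with common variance sigma2."""
    mj, mk = means[j - 1], means[k - 1]
    return (sigma2 * np.log(c) + (mj**2 - mk**2) / 2.0) / (mj - mk)

def classify(y):
    # Classifier (2.15): intervals determined by the thresholds L_jk^{-1}(u_jj / u_kk).
    if y < min(Ljk_inverse(j, 1, u_diag[j - 1] / u_diag[0]) for j in range(2, K + 1)):
        return 1
    if y > max(Ljk_inverse(j, K, u_diag[j - 1] / u_diag[K - 1]) for j in range(1, K)):
        return K
    return 2   # the remaining (middle) interval for K = 3

print([classify(y) for y in (-0.5, 0.8, 1.6, 3.0)])
```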
3. Characterization of optimal ROC manifolds

To make a comparison based on a specific $p_{jk} \in S$, a maximal classifier $\hat{G}$ in $\mathcal{C}$ with respect to $S$ has the highest correct classification probability $p_{jk}(\hat{G})$ for $j = k$ or the lowest misclassification probability $p_{jk}(\hat{G})$ for $j \neq k$ among all classifiers with fixed values of the other performance probabilities in $S$. From the geometric perspective, the performance function value $\varphi_S(\hat{G})$ is the highest point on the set generated by $S \setminus \{p_{jk}\}$. As one can see, the optimality of classifiers has both theoretical and practical importance. Without optimality of classifiers, the ROC manifold could be an arbitrary subset of $\varphi_S(\mathcal{C})$ rather than a manifold in the geometric sense. Since few features could be identified for (non-optimal) ROC manifolds, estimation of ROC manifolds and related summary measures might lead to an ambiguous and complicated situation. With this motivation, we introduce optimal ROC manifolds for multi-classification as an extension of optimal ROC curves for binary classification.

Fig. 3. A parametric system for $M_S$ based on the supporting function/utility.

For this type of optimization problem, our first strategy is to show that the set of performance functions of maximal classifiers is a manifold. Roughly speaking, the structure of such a set is locally similar to a Euclidean space when the set $\mathcal{C}$ is an uncountably infinite set. The developed mechanism for the optimal ROC manifold $M_S$ defined in (2.8) is mainly based on the expected utility or, in the terminology of convex analysis, the supporting function. Let us consider the hyperplane

$$H_S(r, u) = \{p \in \mathcal{R}_S : u^{\top} p = r\} \quad \text{for given } u \in \mathcal{U} \text{ and } r \in \mathbb{R}. \qquad (3.1)$$

It follows that the performance set $\varphi_S(\mathcal{C})$ can be re-expressed as $\bigcup_{r \in \mathbb{R}} (H_S(r, u) \cap \varphi_S(\mathcal{C}))$ and the real value $r$ can be treated as the expected utility of the classifiers $\hat{G}$ with $\varphi_S(\hat{G}) \in H_S(r, u) \cap \varphi_S(\mathcal{C})$ (see Fig. 3). Thus, a parametric version of $M_S$ is naturally established as a function $T_S(u)$ for each $u \in \mathcal{U}$ with output the subset

$$T_S(u) = H_S\Big(\sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi_S(\hat{G}),\ u\Big) \cap \varphi_S(\mathcal{C}), \qquad (3.2)$$

which reduces to an $s$-tuple in $\mathcal{R}_S$ with $s$ being the number of free utility values. Of course, this depends on how many classifiers attain the supremum; there might be only one and, hence, $s = 1$. With the parametric system in (3.2) and the convexity of $\mathcal{C}$ with respect to $\Pr$ and $S$, the optimal ROC manifold $M_S$ is indeed an at most $s$-dimensional manifold in the geometric sense, since the set can be parameterized as a convex function on an $s$-dimensional Euclidean space.

The above parameterization supplies some intrinsic characterization of $M_S$ through the decision set and the expression $p_{jk}(\hat{G}_u) = \int_{L(y) \in D_j(u)} f_k(y)\, dy$. Let $f_{Lk}$ and $F_{Lk}$ denote the respective density and distribution functions of $L(Y)$ given $G = k$, $k \in \mathcal{K}$. It follows that the optimal manifold $M_S$ should be smooth if the $f_{Lk}$, $k \in \mathcal{K}$, are smooth. For binary classification tasks, the optimal ROC curve $M_S$ for $S = \{p_{11}, p_{22}\}$ can be expressed as a function $T_S(p_{11}) = F_{L2}(F_{L1}^{-1}(1 - p_{11}))$ of $p_{11}$, since $F_{L1}$ is assumed to be invertible. This particularly simple form greatly facilitates practitioners in modeling ROC curves. Even for markers with regularly used distributions, closed-form expressions of ROC manifolds as functions of some $p_{jk}$'s seem to be unattainable. Furthermore, modeling $M_S$ would become intricate for $K \ge 3$. Admittedly, the manifold $M_S$ can be represented by a continuous function $T_S(p_{S \setminus \{p_{jk}\}})$ with $p_{jk} \in S$ on the domain given by the projection of $\varphi(\mathcal{C})$ onto $\mathcal{R}_{S \setminus \{p_{jk}\}}$. For each fixed $p \in \mathcal{R}_{S \setminus \{p_{jk}\}}$, let us consider the corresponding classifiers with performances located in $\varphi_S(\mathcal{C})$. All of them can be shown to be dominated by a unique maximal classifier $\hat{G}_0$ with its performance in the same set. Thus, it is straightforward to have $T_S(p_{S \setminus \{p_{jk}\}}) = p_{jk}(\hat{G}_0)$. In practice, researchers might be interested in exploring a trade-off among the $p_{jk}$'s. With the constructed parameterization, the supporting hyperplane $H_S(\sup_{\hat{G} \in \mathcal{C}} u^{\top}\varphi_S(\hat{G}), u)$ is a tangent hyperplane at the point $T_S(u)$ on the manifold, and a parameterized curve along $T_S(u)$ has a tangent vector lying in the tangent space, hence normal to $u \in \mathcal{U}$. Evidenced by this fact, $\partial p_{jk}(\hat{G}_u)/\partial p_{j'k'}(\hat{G}_u) = -u_{j'k'}/u_{jk}$ can be treated as the trade-off between $p_{jk}$ and $p_{j'k'}$ at $T_S(u)$.
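For the binary case, the likelihood-ratio representation $T_S(p_{11}) = F_{L2}(F_{L1}^{-1}(1 - p_{11}))$ can be checked numerically. The Python sketch below (ours; the normal-location model and sample size are illustrative assumptions) approximates $F_{L1}$ and $F_{L2}$ by the empirical distribution functions of $L(Y) = f_1(Y)/f_2(Y)$ within each class and compares the resulting curve with the closed-form curve of the threshold classifier from Section 2.1; the two should agree up to Monte Carlo error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu01, mu02, sigma0, n = 0.0, 1.5, 1.0, 200_000

y1 = rng.normal(mu01, sigma0, n)              # Y | G = 1
y2 = rng.normal(mu02, sigma0, n)              # Y | G = 2
L = lambda y: norm.pdf(y, mu01, sigma0) / norm.pdf(y, mu02, sigma0)  # L_12 = f1 / f2
L1, L2 = np.sort(L(y1)), np.sort(L(y2))

p11_grid = np.linspace(0.01, 0.99, 99)
# F_{L1}^{-1}(1 - p11) via the empirical quantile of L(Y) | G = 1,
# then T_S(p11) = F_{L2}(threshold) via the empirical CDF of L(Y) | G = 2.
thresholds = np.quantile(L1, 1.0 - p11_grid)
p22_lr = np.searchsorted(L2, thresholds) / n

# Closed form from the threshold classifier G_hat_y = 1(Y <= y) + 2*1(Y > y).
y_thr = norm.ppf(p11_grid) * sigma0 + mu01     # y with Phi((y - mu01)/sigma0) = p11
p22_direct = 1.0 - norm.cdf((y_thr - mu02) / sigma0)

print(np.max(np.abs(p22_lr - p22_direct)).round(3))   # small discrepancy expected
```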
4. Existence of the hypervolume under the manifold

In an ROC set $\mathcal{R}_S$, the corresponding optimal ROC manifold might be complicated to visualize when the numbers of both the considered classes and the performance probabilities $p_{jk}$ of interest are greater than three. As pointed out by [6], a summary index of optimal ROC manifolds should facilitate comparisons among markers and provide a reasonable ordering of their performances. As an analog of the AUC, the hypervolume under the manifold (HUM) has been proposed for multi-classification in the foregoing literature (e.g. [2,4,5]). To help practitioners in computing HUMs and plotting 2D- and 3D-ROC manifolds, Novoselova et al. [18] further developed computationally efficient software tools. Recently, Li et al. [19] also developed a practical approach to identify the relative order of marker values with the largest HUM. However, there is still no clear progress in answering a
fundamental question about the existence of the HUM. Without a proper specification of $S$, the considered HUM might be zero even with optimality. For binary classification tasks, the performance set $\varphi_{\{p_{11}, p_{22}\}}(\mathcal{C})$ usually separates the ROC set $\mathcal{R}_{\{p_{11}, p_{22}\}}$ and assures the existence of the optimal AUC. With the continuity of the optimal ROC manifold $M_S$, it is possible to construct a separation of the ROC set $\mathcal{R}_S$ by $M_S$. In practice, this separation is necessary for the set under $M_S$ to have a volume, which is denoted by $V_S$. Let $\mathrm{vec}[\cdot]$ be the vectorization operation acting on a matrix. To clearly characterize such an accuracy assessment, a series of results is established as follows.

Theorem 4.1. For $K \ge 3$, suppose that $\mathcal{R}_S$ contains two coordinates $p_{ij}$ and $p_{ik}$ with $j \neq k$, and that $f_{Li}$, $f_{Lj}$, and $f_{Lk}$ have a common domain. Then, there exists a continuous mapping $p_S : [0, 1] \mapsto \mathcal{R}_S$ with $p_S(0) = \mathrm{vec}[1(j' = k')]$ and $p_S(1) = \mathrm{vec}[1((j', k') = (j', \sigma(j')))]$ for an arbitrary permutation function $\sigma$ with $\sigma(k) \neq k$, such that $\{p_S(t) : t \in [0, 1]\} \cap \varphi_S(\mathcal{C}) = \emptyset$.

Proof. See Appendix A.

Due to the closedness of $\{(t, p_S(t)) : t \in [0, 1]\}$ and $\varphi_S(\mathcal{C})$, the distance between $\varphi_S(\mathcal{C})$ and $\partial\mathcal{R}_S$ can be shown to be positive by Theorem 4.1. Generally, an optimal ROC manifold might not enclose a set with positive hypervolume. It follows from Theorem 4.1 that both optimal and non-optimal HUMs can be well-defined only if $S = \{p_{k\sigma(k)} : k \in \mathcal{K}\}$ for some permutation function $\sigma$.

Our other focus in this section is on the case of a degenerate $M_S$, i.e., $M_S$ can be parameterized as a smooth function defined on an at most $(K - 2)$-dimensional Euclidean space. When $K \ge 3$ and $S = \{p_{k\sigma(k)} : k \in \mathcal{K}\}$ contains both correct classification and misclassification probabilities, a maximal classifier in $\mathcal{C}$ with respect to $S$ must be of the type with $p_{jk}(\hat{G}) = 0$ for $j \neq k$. More precisely, the dimension of $T_S$ can be shown to be less than $K - 1$ from the property $\{\varphi_S(\hat{G}) \in M_S\} \subset \{\varphi_{\tilde{S}}(\hat{G}) \in M_{\tilde{S}}\}$ for $\tilde{S} = \{p_{kk} : k \in \mathcal{K} \text{ with } p_{kk} \in S\}$. Therefore, $M_S$ is unable to create a separation in $\mathcal{R}_S$. The condition in the following theorem further gives the essential ingredients for a well-behaved $V_S$.

Theorem 4.2. For $K \ge 3$, suppose that $S = \{p_{kk} : k \in \mathcal{K}\}$ or $S = \{p_{k\sigma(k)} : k \in \mathcal{K} \text{ with } \sigma(k) \neq k\}$ for some permutation function $\sigma$ acting on $\mathcal{K}$. Then, $M_S$ separates $\mathcal{R}_S$ into two disjoint sets in which the interior of each set is open and connected.

Proof. See Appendix A.

One might think that the HUM can be defined as the hypervolume under $M_S$ only on the domain of $M_S$, in a sense similar to the partial AUC (cf. [20]). Unfortunately, the induced accuracy measure is still problematic in practice, although this view would circumvent the question of whether $M_S$ can actually separate $\mathcal{R}_S$. Specifically, in the $\mathcal{R}_S$ generated by all misclassification probabilities, Edwards et al. [4] provided an argument to elucidate that the HUMs of both perfect and useless markers might be zero. For an arbitrary $S$, we characterize the condition for the occurrence of this untestable phenomenon and relate it to the condition in Theorem 4.2.

Theorem 4.3. For $K \ge 3$, the HUM $V_S$ under $T_S(p_{S \setminus \{p_{jk}\}})$ with $p_{jk} \in S$ has the following properties: (i) (Near perfect marker) $V_S \to 0$ as $p_{jk} \to \delta_{jk}\ \forall\ p_{jk} \in S$; (ii) (Non-informative marker) $V_S = 0$ when $p_{jk_1} = p_{jk_2}\ \forall\ p_{jk_1}, p_{jk_2} \in S$; if and only if neither $S = \{p_{kk} : k \in \mathcal{K}\}$ nor $S = \{p_{k\sigma(k)} : k \in \mathcal{K} \text{ with } \sigma(k) \neq k\}$ for some permutation function $\sigma$.
Proof. See Appendix A.

As a consequence, the HUM is a rational summary index for the discriminability of a marker if and only if the performance probabilities of interest satisfy the condition in the above theorem. To illustrate the hypotheses in Theorem 4.3, let us consider the following two binary classification examples: (a) $G = \sum_{\ell=1}^{2} \ell \cdot 1(Y \in \mathcal{Y}_{\ell})$ with $\mathcal{Y}_1 \cap \mathcal{Y}_2 = \emptyset$ and $\mathcal{Y}_1 \cup \mathcal{Y}_2 = \mathcal{Y}$, and (b) $G$ is independent of $Y$. For the case of a perfect marker, $\hat{G} = \sum_{\ell=1}^{2} \ell \cdot 1(Y \in \mathcal{Y}_{\ell})$ is a perfect classifier and $\{W_{\lambda}\hat{G} + (1 - W_{\lambda})\hat{G}_{\ell} : \hat{G}_{\ell} = \ell,\ \ell = 1, 2,\ \lambda \in [0, 1]\}$ is the collection of maximal classifiers. As for the case of a non-informative $Y$, the corresponding collection of maximal classifiers is $\{W_{\lambda}\hat{G}_1 + (1 - W_{\lambda})\hat{G}_2 : \hat{G}_{\ell} = \ell,\ \ell = 1, 2,\ \lambda \in [0, 1]\}$. When $S$ is neither $\{p_{11}, p_{22}\}$ nor $\{p_{12}, p_{21}\}$, the corresponding HUMs can easily be shown to be zero.

Given any specific permutation $\sigma_0$ on $\mathcal{K}$, the HUM corresponding to $S = \{p_{k\sigma_0(k)} : k \in \mathcal{K}\}$ in Theorem 4.3 has been shown to be equal to the correctness probability (CP) (see [6] for general $K$ and [16] for $K = 3$):
$$P\Big(\prod_{k=1}^{K} f_k(Y_{\sigma_0(k)}) \ge \prod_{k=1}^{K} f_k(Y_{\sigma(k)})\ \ \forall\ \sigma\ \Big|\ G_{\sigma_0(1)}, \ldots, G_{\sigma_0(K)}\Big). \qquad (4.1)$$
For explanatory simplicity, $S = \{p_{kk} : k \in \mathcal{K}\}$ is considered in the following discussion. Three parametric models, which were illustrated by [6], are further used to compute the well-behaved HUM $V_S$ under the optimal ROC manifold $M_S$ with $S$ satisfying the form in Theorem 4.3.
Example 4.1. Let $\theta_0 = (\theta_{01}^{\top}, \ldots, \theta_{0(K-1)}^{\top})^{\top}$ and $\theta_{0K} = 0$. Under the validity of a multinomial logistic regression model, one has $L_{kK}(y) = \exp(\theta_{0k}^{\top} y)\, P(G = K)/P(G = k)$. It has been derived by [6] that $V_S = P\big(\sum_{k=1}^{K} (\theta_{0k} - \theta_{0\sigma(k)})^{\top} Y_k \ge 0\ \forall\ \sigma \mid G_1 = 1, \ldots, G_K = K\big)$.
Example 4.2. Suppose that $Y$ given $G = k$ follows a multivariate normal distribution with mean $\mu_{0k}$ and covariance matrix $\Sigma_{0k}$, $k \in \mathcal{K}$. Then, $V_S = P\big(\sum_{k=1}^{K} \big[(Y_k - \mu_{0\sigma(k)})^{\top}\Sigma_{0\sigma(k)}^{-1}(Y_k - \mu_{0\sigma(k)}) - (Y_k - \mu_{0k})^{\top}\Sigma_{0k}^{-1}(Y_k - \mu_{0k})\big] \ge 0\ \forall\ \sigma \mid G_1 = 1, \ldots, G_K = K\big)$.
Example 4.3. Suppose that $Y$ is univariate with the corresponding family of distributions $f_k(y)$, $k \in \mathcal{K}$, satisfying the MLR condition with respect to $\theta_0$. The likelihood ratio $L_{k_1 K}(y; \theta_{0k_1})/L_{k_2 K}(y; \theta_{0k_2})$ for $k_1, k_2 \in \mathcal{K}$ with $k_1 \neq k_2$ can be shown to be strictly increasing in $y$ whenever $\theta_{0k_1} > \theta_{0k_2}$. In light of this fact and (4.1), $V_S = P\big(Y_{k_1} > Y_{k_2} \text{ whenever } \theta_{0k_1} > \theta_{0k_2},\ \forall\ k_1 \neq k_2 \mid G_1 = 1, \ldots, G_K = K\big)$ can be derived.
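The correctness-probability form of the HUM lends itself to Monte Carlo approximation. The sketch below (ours; the bivariate normal model, identity covariances, and simulation size are illustrative assumptions in the spirit of Example 4.2) estimates $V_S$ as the probability that independent draws $Y_k$ from classes $k = 1, \ldots, K$ are jointly most likely under the identity assignment, i.e., $\prod_k f_k(Y_k) \ge \prod_k f_k(Y_{\sigma(k)})$ for all permutations $\sigma$, in line with (4.1) with $\sigma_0$ the identity.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
K, n_mc = 3, 5_000
means = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]  # hypothetical
dens = [multivariate_normal(mean=m, cov=np.eye(2)) for m in means]

def log_joint(ys, perm):
    """log prod_k f_k(Y_{perm(k)}) for one draw (Y_1, ..., Y_K)."""
    return sum(dens[k].logpdf(ys[perm[k]]) for k in range(K))

perms = list(itertools.permutations(range(K)))
identity = tuple(range(K))
hits = 0
for _ in range(n_mc):
    ys = [dens[k].rvs(random_state=rng) for k in range(K)]
    best = max(perms, key=lambda p: log_joint(ys, p))
    hits += (log_joint(ys, identity) >= log_joint(ys, best) - 1e-12)
print("Monte Carlo estimate of V_S (CP):", hits / n_mc)
```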
5. Conclusion

For the discriminability of multi-classification markers, this paper provides a theoretical framework showing that a proper assessment based on the performance probabilities is exactly the corresponding optimal ROC manifold. Through a parameterization based on the utility-maximization criterion, the optimal ROC manifolds are demonstrated to be manifolds. This assures some practical and desirable features and directly supports work on modeling ROC manifolds. In addition, we give the necessary and sufficient conditions for the existence of the HUM. When researchers are especially interested in some performance probabilities with respect to a suitable ROC subset, the usefulness of the corresponding HUM can be justified. In conclusion, this paper extends the scientific groundwork toward a more general multi-class ROC analysis.
Acknowledgments

The research of the corresponding author was partially supported by the National Science Council grants 97-2118-M-002-020-MY2 and 102-2118-M-002-003 (Taiwan). The authors would like to thank two reviewers for their constructive comments on this paper.
Appendix A. Proofs of Theorems

A.1. Proof of Theorem 2.1

Proof. For two arbitrary classifiers $\hat{G}_1, \hat{G}_2 \in \mathcal{C}$ and $\hat{G}_{\lambda}$ with $W_{\lambda}$ assumed to be independent of $(G, \hat{G}_1, \hat{G}_2)$, one can derive that for every $j, k \in \mathcal{K}$

$$p_{jk}(\hat{G}_{\lambda}) = E[1(\hat{G}_{\lambda} = j) \mid G = k] = E\Big[\sum_{\ell=1}^{2} 1(W_{\lambda} = 2 - \ell)\, 1(\hat{G}_{\ell} = j) \,\Big|\, G = k\Big] = \lambda p_{jk}(\hat{G}_1) + (1 - \lambda) p_{jk}(\hat{G}_2). \qquad (A.1)$$

It follows that for any $S$

$$\varphi_S(\hat{G}_{\lambda}) = \lambda \varphi_S(\hat{G}_1) + (1 - \lambda) \varphi_S(\hat{G}_2), \qquad (A.2)$$

which implies that $\hat{G}_{\lambda}$ is also a classifier in $\mathcal{C}$ since $\mathcal{C}$ is convex, and hence that $\varphi_S(\mathcal{C})$ is convex. Let $\{\hat{G}_n\}$ be a sequence of classifiers in $\mathcal{C}$ such that $\varphi_S(\hat{G}_n)$ converges to some point $p_0$. Since the $\hat{G}_n$'s have values in $\mathcal{K}$, given $\varepsilon > 0$, there exists a positive constant $M_{\varepsilon}$ such that

$$P\big(\sup_n \|(\hat{G}_n, Y)\| > M_{\varepsilon}\big) < \varepsilon, \qquad (A.3)$$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. By Prokhorov's theorem [21] and Lebesgue's dominated convergence theorem, there exists a subsequence $\{(\hat{G}_{n_i}, Y)\}$ converging in distribution to $(\hat{G}_0, Y)$, in which $\hat{G}_0$ is a classifier in $\mathcal{C}$ with $\varphi_S(\hat{G}_0) = p_0$; this corresponds to the fact that the limit of a sequence of classifiers, if it exists as the limit of a sequence of functions, is again a classifier. This further implies the closedness of $\varphi_S(\mathcal{C})$. Together with the boundedness of $\mathcal{R}_S$, the compactness of $\varphi_S(\mathcal{C})$ is immediately obtained. The proof of the convexity and compactness of $\varphi(\mathcal{C})$ goes along the same lines and is omitted here. $\square$

A.2. Proof of Theorem 2.2

Proof. It follows from Theorem 2.1 that $\varphi_S(\mathcal{C})$ is a convex set. Together with $\varphi_S(\hat{G}_u) \in \partial\varphi_S(\mathcal{C})$, there exists a hyperplane containing $\varphi_S(\hat{G}_u)$ but no interior point of $\varphi_S(\mathcal{C})$. By standardizing the normal vector of this hyperplane as a utility $u$, $\hat{G}_u$ can be represented as a utility classifier. Conversely, suppose not; that is, some $\hat{G}_u^{*} \succ_S \hat{G}_u$. Since $u_{jk}\, p_{jk}(\hat{G}_u) < u_{jk}\, p_{jk}(\hat{G}_u^{*})$ for some $(j, k)$, we have $u^{\top}\varphi_S(\hat{G}_u) < u^{\top}\varphi_S(\hat{G}_u^{*})$, which contradicts that $\hat{G}_u$ is a utility classifier. $\square$

A.3. Proof of Theorem 4.1

Proof. For any $\tilde{S} \subset S$ with $\{p_{\tilde{S}}(t) : t \in [0, 1]\} \cap \varphi_{\tilde{S}}(\mathcal{C}) = \emptyset$, any $p_S(t)$ with projection $p_{\tilde{S}}(t)$ onto $\mathcal{R}_{\tilde{S}}$ has no intersection with $\varphi_S(\mathcal{C})$. Basically, the $K$-dimensional set $\mathcal{R}_S$ can be separated by $M_S$ only when its dimension is at least $K - 1$. It is not necessary to verify the condition in Theorem 4.1 by the path-construction argument for a degenerate $M_S$ with dimension less than $K - 1$. Thus, we only need to investigate the case of $\#S = K + 1$ with $\{p_{k\sigma(k)} : k \in \mathcal{K}\} \subset S$. For some $p_{ij}$ and $p_{i\sigma(i)} \in S$ with $\sigma(i) \neq j$, we define

$$p_S(t) = \sum_{\ell=0}^{1} (-1)^{\ell}\big[(1 - 2t)\, p_S(0.5) - 2(\ell - t)\, p_S(\ell)\big]\, 1_{[0.5\ell,\, 0.5(1 + \ell))}(t),$$

where $p_S(0.5) = \mathrm{vec}\big[1 - 1(i' = j') + (2 \cdot 1(i' = j') - 1)\, 1_{\{(i, j), (i, \sigma(i))\}}((i', j'))\big]$. Since $f_{Lj}$ and $f_{L\sigma(i)}$ have a common domain, no classifier satisfies $p_{i\sigma(i)}(\hat{G}) \neq p_{ij}(\hat{G})$ and $p_{i'j}(\hat{G}) = 1(i' = j)$ for $i' \neq i$. Thus, $\{p_S(t) : t \in [0, 0.5]\} \cap \varphi_S(\mathcal{C}) = \emptyset$. Similarly, no classifier can satisfy $p_{ij}(\hat{G}) = \delta_{ij}$ and $p_{i\sigma(i)}(\hat{G}) = 1 - 1(i = \sigma(i))$. As a consequence, one has $\{p_S(t) : t \in [0.5, 1]\} \cap \varphi_S(\mathcal{C}) = \emptyset$. $\square$

A.4. Proof of Theorem 4.2

Proof. For a given $Y$, we can construct a trivial classifier $\hat{G}_{\ell}$ with the corresponding performance probabilities $p_{k\sigma(k)}(\hat{G}_{\ell}) = 1(\ell = \sigma(k))$ for every $k, \ell \in \mathcal{K}$. As shown in Theorem 2.1, $\hat{G}_{\lambda} = W_{\lambda}\hat{G}_{\ell_1} + (1 - W_{\lambda})\hat{G}_{\ell_2}$ with $\sum_{k=1}^{K} p_{k\sigma(k)}(\hat{G}_{\lambda}) = 1$ is a classifier in $\mathcal{C}$ for every $\ell_1, \ell_2 \in \mathcal{K}$ and $\lambda \in [0, 1]$. Thus, in $\mathcal{R}_S$, $\varphi_S(\mathcal{C})$ always contains the $(K - 1)$-simplex

$$\Big\{\varphi_S(\hat{G}_{\lambda}) : \hat{G}_{\lambda} = W_{\lambda}\hat{G}_{\ell_1} + (1 - W_{\lambda})\hat{G}_{\ell_2} \text{ with } \sum_{k=1}^{K} p_{k\sigma(k)}(\hat{G}_{\lambda}) = 1,\ \ell_1, \ell_2 \in \mathcal{K},\ \lambda \in [0, 1]\Big\}.$$

Since every continuous path $h(t) = (h_1(t), \ldots, h_K(t))^{\top}$ from $(1, \ldots, 1)^{\top}$ to $(0, \ldots, 0)^{\top}$ in $\mathcal{R}_S$ has a point satisfying $\sum_{k=1}^{K} p_{k\sigma(k)} = 1$, this simplex can separate $\mathcal{R}_S$. The proof is completed. $\square$

A.5. Proof of Theorem 4.3

Proof. In this proof, a conditional probability vector is said to be dominated by another conditional probability vector of the same length if the dominance condition in Definition 2.2 is satisfied. Similar to the argument in the proof of Theorem 4.1, we only need to consider a non-degenerate $M_S$. Suppose that $S$ is not one of the performance sets stated in the theorem. It follows that there exists $\{p_{jk_1}, p_{jk_2}\}$ in $S$. Since

$$\{p_S : p_S \text{ is dominated by some } \varphi_S(\hat{G}) \in M_S\} \subset \{p_S : p_{\tilde{S}} \text{ is dominated by some } \varphi_{\tilde{S}}(\hat{G}) \in M_{\tilde{S}} \text{ and } p_{S \setminus \tilde{S}} \in [0, 1]^{\#(S \setminus \tilde{S})}\} \quad \text{for } \tilde{S} \subset S,$$

an upper bound of $V_S$ is easily obtained from the inequality $V_S \le V_{\tilde{S}}$. Further, the inequality $V_S \le \min_{k_1 \neq k_2,\ p_{jk_1}, p_{jk_2} \in S} V_{\{p_{jk_1}, p_{jk_2}\}}$ holds, and $V_{\{p_{jk_1}, p_{jk_2}\}}$ of a near perfect marker approaches zero. As for a non-informative marker, $V_{\{p_{jk_1}, p_{jk_2}\}} = 0$ is a direct consequence of $p_{jk_1}(\hat{G}) = p_{jk_2}(\hat{G})$. Coupled with the inequality $V_S \le V_{\{p_{jk_1}, p_{jk_2}\}}$, $V_S = 0$ is thus obtained. Conversely, given a perfect marker and $S = \{p_{kk} : k \in \mathcal{K}\}$, the corresponding $V_S\ (= 1)$ is the hypervolume of a unit $K$-cube, and $V_S = 0$ for $S = \{p_{k\sigma(k)} : k \in \mathcal{K} \text{ with } \sigma(k) \neq k\}$. For a non-informative marker, $V_S$ can be calculated to be $1/K!$, which is the hypervolume under the hyperplane $\{p \in \mathcal{R}_S : \sum_{k=1}^{K} p_{k\sigma(k)} = 1\}$ for any $S$ satisfying the condition. This is precisely the assertion of the theorem. $\square$

References
[1] D. Mossman, Three-way ROCs, Med. Decis. Mak. 19 (January (1)) (1999) 78–89. [2] S. Dreiseitl, L. Ohno-Machado, M. Binder, Comparing three-class diagnostic tests by three-way ROC analysis, Med. Decis. Mak. 20 (September (3)) (2000) 323–331. [3] X. He, C. Metz, B. Tsui, J. Links, E. Frey, Three-class ROC analysis—a decision theoretic approach under the ideal observer framework, IEEE Trans. Med. Imaging 25 (May (5)) (2006) 571–581.
[4] D.C. Edwards, C.E. Metz, R.M. Nishikawa, The hypervolume under the ROC hypersurface of near-guessing and near-perfect observers in n-class classification tasks, IEEE Trans. Med. Imaging 24 (March (3)) (2005) 293–299. [5] B.K. Scurfield, Multiple-event forced-choice tasks in the theory of signal detectability, J. Math. Psychol. 40 (September (3)) (1996) 253–269. [6] Y.J. Wu, C.T. Chiang, Optimal receiver operating characteristic manifolds, J. Math. Psychol. 57 (October (5)) (2013) 237–248. [7] J. Li, J.P. Fine, ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies, Biostat 9 (July (3)) (2008) 566–576. [8] Y. Zhang, J. Li, Combining multiple markers for multi-category classification: an ROC surface approach, Aust. N. Z. J. Stat. 53 (1) (2011) 63–78. [9] A.K. Han, Nonparametric analysis of a generalized regression model: the maximum rank correlation estimator, J. Econom. 35 (2–3) (1987) 303–316. [10] J. Jost, Riemannian Geometry and Geometric Analysis, 5th ed., Springer, New York, April 2008. [11] P.S. Heckerling, Parametric three-way receiver operating characteristic surface analysis using mathematica, Med. Decis. Mak. 21 (September (5)) (2001) 409–417. [12] D.C. Edwards, C.E. Metz, M.A. Kupinski, Ideal observers and optimal ROC hypersurfaces in n-class classification, IEEE Trans. Med. Imaging 23 (July (7)) (2004) 891–895.
[13] P.R. Halmos, Naive Set Theory, Springer, New York, 1998. [14] J.O. Berger, Statistical Decision Theory and Bayes Analysis, Springer, New York, 1985. [15] C.M. Schubert, S.N. Thorsen, M.E. Oxley, The ROC manifold for classification systems, Pattern Recognit. 44 (February (2)) (2011) 350–362. [16] B.K. Scurfield, Generalization of the theory of signal detectability to n-event m-dimensional forced-choice tasks, J. Math. Psychol. 42 (March (1)) (1998) 5–31. [17] X. He, E.C. Frey, Three-class ROC analysis—the equal error utility assumption and the optimality of three-class ROC surface using the ideal observer, IEEE Trans. Med. Imaging 25 (August (8)) (2006) 979–986. [18] N. Novoselova, C.D. Beffa, J. Wang, J. Li, F. Pessler, F. Klawonn, HUM calculator and HUM package for R: easy-to-use software tools for multicategory receiver operating characteristic analysis, BMC Bioinform. 30 (June (11)) (2014) 1635–1636. [19] J. Li, Y. Chow, W.K. Wong, T.Y. Wong, Sorting multiple classes in multi-dimensional ROC analysis: parametric and nonparametric approaches, Biomarkers 19 (February (1)) (2014) 1–8. [20] D.K. McClish, Analyzing a portion of the ROC curve, Med. Decis. Mak. 9 (July (3)) (1989) 190–195. [21] A.W.v.d. Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge, 1998.
Yun-Jhong Wu is a Ph.D. candidate in statistics at the University of Michigan. He received an M.S. in mathematics in 2011 and a B.A. in sociology in 2008 from National Taiwan University. His current research interests include statistical methodology and machine learning algorithm design for network data analysis and matrix/tensor decompositions.
Chin-Tsang Chiang is a professor in the Institute of Applied Mathematical Sciences at National Taiwan University. He received a Ph.D. degree in mathematical science from the Johns Hopkins University in 1998. His current research interests include statistical methods for nonparametric and semiparametric models, and the ROC curve analysis.