A two dimensional accuracy-based measure for classification performance

A two dimensional accuracy-based measure for classification performance

Information Sciences 382–383 (2017) 60–80 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate/i...

1MB Sizes 1 Downloads 55 Views

Information Sciences 382–383 (2017) 60–80

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

A two dimensional accuracy-based measure for classification performance Mariano Carbonero-Ruz, Francisco Jose Martínez-Estudillo, Francisco Fernández-Navarro∗, David Becerra-Alonso, Alfonso Carlos Martínez-Estudillo Department of Quantitative Methods, Universidad Loyola Andalucia, c/ Escritor Aguayo, 4, Córdoba, Spain

a r t i c l e

i n f o

Article history: Received 27 June 2016 Revised 30 November 2016 Accepted 4 December 2016 Available online 7 December 2016 Keywords: Classification metrics Imbalanced classification Accuracy

a b s t r a c t Accuracy has been used traditionally to evaluate the performance of classifiers. However, it is well known that accuracy is not able to capture all the different factors that characterize the performance of a multiclass classifier. In this manuscript, accuracy is studied and analyzed as a weighted average of the classification rate of each class. This perspective allows us to propose the dispersion of the classification rate of each class as its complementary measure. In this sense, a graphical performance metric, which is defined in a two dimensional space composed by accuracy and dispersion, is proposed to evaluate the performance of classifiers. We show that the combined values of accuracy and dispersion must fall within a clearly bounded two dimensional region, different for each problem. The nature of this region depends only on the a priori probability of each class, and not on the classifier used. Thus, the performance of multiclassifiers is represented in a two dimensional space where the models can be compared in a more fair manner, providing greater awareness of the strategies that are more accurate when trying to improve the performance of a classifier. Furthermore we experimentally analyze the behavior of seven different performance metrics based on the computation of the confusion matrix values in several scenarios, identifying clusters and relationships between measures. As shown in the experimentation, the graphical metric proposed is specially suitable in challenging, highly imbalanced and with a high number of classes datasets. The approach proposed is a novel point of view to address the evaluation of multiclassifiers and it is an alternative to other evaluation measures used in machine learning. © 2016 Elsevier Inc. All rights reserved.

1. Introduction Comparing learning/classification models is a complex and still open challenge. The first issue is the selection of the property of the model’s performance that we want to measure. For example, we could be interested in accuracy, speed, cost or even maybe in readability of the model. Once a performance measure is chosen, the next concern is to estimate it in as unbiased a manner as possible [37]. After that we should test whether the differences in the performance obtained by the



Corresponding author. E-mail addresses: [email protected], [email protected] (F. Fernández-Navarro).

http://dx.doi.org/10.1016/j.ins.2016.12.005 0020-0255/© 2016 Elsevier Inc. All rights reserved.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

61

model alone or in relation to others are statistically different [12]. The first component of this puzzle and related issues are the focus of this manuscript. Performance measures have undoubtedly received a great amount attention from the machine learning community. Special mention should be made to the Ferri et al.’s work [21] which proposed a taxonomy that divides the performance classification metrics in three families. Furthermore, Ferri et al.’s work [21] empirically tested 34 binary classification metrics and 18 multi-class metrics with 30 datasets coming from the UCI Machine Learning Repository and performed a sensitivity analysis in terms of several traits. Another significant study about performance measures was carried out in Sokolava and Lapalme [42] where the authors studied the type of changes made to a confusion matrix that do not affect a measure, thus preserving the classifier’s evaluation (measure invariance). The measures considered for binary classification were also generalized to multiclass classification. From a different point of view, in Carauna and Nicolescu [7] the authors conducted an empirical study using a several learning models and classification performance metrics. They showed that the metrics span a low dimensional manifold and derived a new scalar measure based on the combination of state-of-the-art metrics (combining linearly the AUC, RMSE and error rate metrics). Finally, it is also important to mentioning that most of the metrics reported in the literature to evaluate a classifier were inspired in metrics coming from areas such as medicine, statistics or information retrieval [6,24,29]. For example, the F-measure was originally introduced in the field of information retrieval, but nowadays it is routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction [43]. As previously mentioned, Ferri et al. classified the existing performance metrics in three groups [21]: • Metrics based on a threshold and a qualitative understanding of the error: accuracy, geometric accuracy, macro-averaged accuracy, mean F-measure, Kappa statistic and minimum sensitivity. They are used when we want to minimize the number of classification errors. • Metrics based on a probabilistic understanding of error, i.e., measuring the deviation from the true probability: mean absolute error, mean squared error or the LogLoss error (cross-entropy). These performance metrics are specially useful when we want to test the reliability of the model. • Metrics based on how well the model ranks the examples: Area Under the Curve (AUC) for binary problems [5] and the extension for multiclass problems [26]. Performance classification measures could also be categorized according to the type of problems they can handle [21]. For example, they can be designed for either binary or multiclass classification, or proposed to mutually exclusive (one pattern belongs to only one class) or overlapping (one pattern can belong to several classes) class problems. Some measures assume that the model provides only discrete scores (boolean classifiers can be evaluated using solely metrics based on a threshold) whereas other suppose that the model can also give a real-valued score (probabilistic or fuzzy classifier could be evaluated with confusion-only-based metrics as well as with metrics based on a probabilistic understanding of error or metrics based on how well the model ranks the examples. 
Finally, problems could also be divided according to the hierarchy of the classes: (i) flat problems (all classes on the same level) and (ii) hierarchical problems. Despite the numerous number of performance classification metrics existing in literature, accuracy, C, has been by far the most commonly used metric to evaluate classification learning models. However, accuracy, C, is not able to capture the different factors that characterize the performance of a classifier [17]. For example, C is specially misleading in imbalanced classification problems [32,36]. Besides, authors such as Ferri et al. already proved through experimentation that the vast majority of existing metrics defined using confusion matrices are highly (linearly) correlated among each other in most of the classification scenarios [21]. Motivated by these two facts, we propose a two-dimensional framework for the comparison of learning models composed by two metrics: the accuracy and the dispersion of the success rate among the different classes. The starting point of this work is the fact that accuracy can be seen as a weighted average of the classification rate of each class in a multiclass problem. In this sense, the natural measure of C’s representability is its associated dispersion measure D and it is therefore a complementary measure to evaluate a classifier. C and D are both included under the umbrella of metrics based on a threshold and a qualitative understanding error. The proposed framework for comparing classification models is designed for models outputting discrete scores (or outputting real-valued scores but considering solely the confusion matrix to evaluate them) which are tested in flat mutually-exclusive classification problems. The second objective of this work is to provide the mathematical relation between the two metrics considered in the framework. We show that the combined values of C and D must fall within a clearly bounded two dimensional region, different for each problem. The nature of this region depends only on the a priori probability of each class, and not on the classifier used. That allows us to visualize the performance of a multiclassifier and the comparison between classifiers, providing greater awareness of the strategies that are most suitable when trying to improve a classifier. A detailed explanation of how to obtain the boundaries of this region along with a number of examples of its usefulness are also provided throughout the manuscript. The statistical validity of this research work is shown by adopting an empirical approach [7,21,35]. Thus, during the experimentation a highly competitive classification model was applied to a selection of real world datasets and its performance was evaluated through various performance measures. These metric results are then compared by using correlation matrices. Hence, we analyzed how seven confusion-matrix-only-based metrics correlate to each other in a similar effort to the one shown in Ferri’s work [21] to ascertain to what extent and in what situations the results obtained with one metric are extensible to the other metrics. The results show that most of the metrics are linearly correlated to C, where D is not. These

62

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

differences become even larger in multi-class problems with a high number of classes, in highly imbalanced problems and in challenging problems (problems where the level of overall accuracy is relatively low). As aforementioned, Carauna and Nicolescu already considered aggregate classification metrics [7]. Specifically, they proposed the linear combination of the AUC, RMSE and accuracy metrics, thus obtaining a new scalar metric. In our opinion, although the metrics are in the same scale, they correspond to three different aspects of a classifier (see the taxonomy in [21]) and the linear combination in a scalar metric could be inadequate. In our proposal, the two metrics that were combined belong to the same family and they were expressed in the same scale by defining the upper bounds of D. Furthermore, as it is shown in the experimentation, the two metrics considered were not linearly correlated in the datasets considered suggesting that these two metrics could be non-cooperative and could therefore be used to guide multi-objective algorithms. In the next section, a literature review including the most representative confusion-only-based classification metrics coupled with a description of existing metrics that provides a visualization tool to evaluate models, is presented. We then describe the evaluation two-dimensional framework in Section 3. Section 4 studies the range for the measure proposed given different values of accuracy, and for each dataset in particular. Finally, the computational experiments of this study are provided in Section 5, and the paper closes with conclusions and remarks (Section 6). 2. Related work Two closely related aspects to our proposal are reviewed in this section: (i) first, the most relevant measures based on the computation of the confusion matrix values are fully described; (ii) after that, existing performance measures providing also a visualization perspective are described in the next subsection. 2.1. Performance measures computed using the confusion matrix This section focuses on performance measures that rely solely on the information obtained from the confusion matrix. Consequently, performance measures that either incorporate information in addition to that conveyed by the confusion matrix or account for classifiers that are not discrete are not considered here. Specifically, six performance measures were considered being all derived from the contingency or confusion matrix M which is defined as:



M=

ni j ;



Q 

ni j = N

(1)

i, j=1

where Q is the number of classes, N is the number of training or testing patterns and nij represents the number of times the patterns are predicted to be in class j when they really belong to class i. The metric proposed was compared to: • C: The Correct Classification Rate (CCR or C as denoted in [17]) is the most common measure used to assess the performance of a classifier. It is defined as the percentage of correctly classified patterns (or conversely the percentage of misclassification errors):

C=

Q 1 n j j. N

(2)

j=1

• MS: The minimum sensitivity (MS) of a classifier is the minimum value of the sensitivities for each class. This measure has been recently used in Machine Learning to evaluate the performance of a classifier in imbalanced classification environments. For example, accuracy C and MS are optimized through a two stage evolutionary process in [25], while, the optimization is carried out by a Pareto-based multiobjective optimization methodology based on a memetic evolutionary algorithm in [17]. Finally, MS is defined as:

MS = min





S j ; j = 1, . . . , Q ,

(3)

where S j = n j j /n j is the number of patterns correctly predicted to be in class j with respect to the total number of patterns in class j (sensitivity for class i) and nj is the number of patterns in the jth class. • AC: The Average Accuracy (AC) is the arithmetic average per-class effectiveness of a classifier. It is usually referred as macro-average [38] and is defined as:

AC =

Q 1  S j. Q

(4)

j=1

• GM: The Geometric Mean (GM) corresponds to the geometric average of the partial accuracies of each class:

 GM =

Q  j=1

Q1 Sj

.

(5)

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

63

• WKappa: The Kappa statistic is a measure of agreement between two classifiers [8], although it has been extensively employed as a classifier performance [44]. The Weighted Kappa (WKappa) is a modified version of the Kappa statistic that allows to assign different weights to different levels of aggregation between two variables [9]:

po ( w ) − pc ( w ) 1 − pc ( w )

W Kappa =

(6)

where

po ( w ) =

Q Q 1  wi j ni j , N

(7)

Q Q 1  wi j ni. n. j , N

(8)

i=1 j=1

and

pc ( w ) =

i=1 j=1



where ni. = Qj=1 ni j and n. j = Q n for i, j ∈ {1, . . . , Q } and the weight wij quantifies the degree of discrepancy bei=1 i j tween the true and the predicted class. • F-score: The F-score measures the relations between data’s positive labels and those given by a classifier based on a perclass average. This performance metric has been widely employed in information retrieval [4]. The F-score for multi-class classification problems is defined as:

Fβ =

(β 2 + 1 )(P · R ) β 2 (P + R )

(9)

where P is the precision defined as:

Q P=

j=1

Pj

Q

, Pj =

njj n. j

(10)

and R is the recall and it is defined as:

Q

R=

j=1

Q

Rj

, Rj =

njj n j.

(11)

All the metrics considered are sensitive to changes in the classification threshold as they are defined using values coming from the confusion matrix. However only two of the metrics considered are sensitive to class frequency changes (C and WKappa). F-score is also partially influenced by changes in class frequencies as described in [21]. Meanwhile, AC, GM and MS are not influenced by changes in class frequencies, but better characterize the performance of a classifier in unbalanced classification problems [33]. Our proposal is also included in this subgroup of performance measures. 2.2. Graphical performance measures Graphical analysis methods and their associated metrics have proven to be very useful tools in studying both the behavior and the performance classifiers. Traditionally, graphical analysis methods have solely been developed to score classifiers.1 Among these, the Receiver Operating Characteristic (ROC) analysis [15,16] is the most popular graphical metric within in the machine learning community. ROC plots allow a classifier to be evaluated and optimized over all possible operating points. The ROC Area Under the Curve (AUC) has become a standard performance evaluation criterion in two class instance recognition problems, used to compare different classifiers independently of operating points, priors, and cost [5,22]. Many works has been made to generalize the AUC metric to multiclass classification problems. For example, the definition of the ROC AUC to the case of more than two classes by averaging pairwise comparisons is extended in [26]. A one versus all approach (considering the AUC between each class with respect to all other) was implemented to estimate a simplified version of the Volume under the ROC Surface (VUS) in [41]. The multi-dimensional operating characteristic was approximated using a pairwise approach and discounting some interactions aiming to reduce the computational burden of the VUS estimation in [34]. A multiobjetive optimization approach where the objective is to simultaneously minimize Q (Q − 1 ) fitness functions corresponding to the misclassification rates given by the off-diagonal elements of the confusion matrix was proposed in [14]. Despite the significant efforts made by machine learning researchers to extend the ROC analysis to multiclass problems, state-of-the-art proposals have many practical important issues, such as the computational complexity and representational comprehensibility,2 that precludes their use in practice. 1

A scoring classifier provides a real-valued score on each instance and class. Probabilistic classifiers are a special case of scoring classifiers. The original advantage provided by the graphical two dimensional representation of bi-class ROC curves is lost among the increased number of dimensions to plot. 2

64

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

From a different perspective, Lift charts are a graphical metrics that plot the true positive patterns against the dataset size. They are closely related to the ROC curves, and, therefore, inherit all their advantages and shortcomings. They are traditionally used in business problems [2]. Precision-Recall (PR) curves are similar to ROC curves and Lift charts as they study the trade-off between the well classified positive and the number of misclassified negative patterns plotting the precision of the model as a function of its recall [11]. Cost curves are a graphical technique and an alternative to ROC curves for visualizing the performance of binary classifiers [13]. Brier curves are a different way of visualizing the classifier performance in the cost space [27,28]. Another visualization technique in two dimensions is presented in [1] where the authors use the metrics defined in [7] and apply the multidimensional scaling technique to analyze the results of the classifier with different metrics and domains. Basically, research papers analyzed as an alternative to ROC curves have almost the same problems than those presented in the ROC analysis section: they are very effective in binary problems but they have no natural or straightforward extension to multi-class classification problems. Finally, the minimum of the accuracies obtained for each class (minimum sensitivity MS) is considered as a complementary measure for the C metric in [17]. These two measures were represented in the unit-square and simultaneously optimized in different ways [17,19,25]. This measurement, however, has clear limitations of its own: (i) MS only carries the information about one of the classes, focusing the attention only on the worst classified class and (ii) MS is an excessively conservative measurement. 3. The proposed framework for evaluating classifiers 3.1. Defining accuracy and dispersion for multiclass classification problems Accuracy C is the most commonly used measure for the evaluation of the success of a classifier. In a problem with Q classes and N instances, where Nq of those instances belong to the qth class (q = 1, . . . , Q), the accuracy of a classifier is given by:

Q

q=1 Cq

C=

N

,

(12)

where Cq is the number of correctly classified instances that belong to class q. C can therefore be rewritten as:

C=

Q Q   Cq Nq = cq pq , Nq N q=1

where cq =

Cq Nq

(13)

q=1

is again the accuracy for class q and pq =

Nq N

is the probability (based on the available dataset) an instance

has to belong to that same class. Thus, C can be interpreted as the weighted average of the accuracies per class, where the weights are those class probabilities just mentioned. Since C is a weighted measure, it seems natural for it to be presented along with its corresponding variance (or typical deviation), also weighted. Being an average for C, the associated variance informs of its validity. The pair defined by accuracy and this extra measure allows for a more detailed evaluation of the quality of a classifier, while serving as a profile for the comparison of different classifiers. It must remain clear, however, that this interpretation of C as an average only complements the usual interpretation. It’s an alternative way of looking at the same value: • When understood in the usual manner, C is the average success of all the instances in a classification process. The deviation does not add value to this particular definition. • When understood as the average success of classes, variability does indeed provide useful information for the decider. Using the terms presented in the previous section, the variance D2 and the deviation D of a classifier are defined as:

D2 =

Q 

(cq − C )2 pq .

D=



D2 .

(14)

q=1

A close to zero value of D2 (or D) corresponds to a greater representativity of C. The optimal outcome takes place when D2 (or D) is exactly zero. This only happens when the success rate is equal in all classes. On the other hand, D2 (or D) increases with the differences between success rates on each class. It must therefore be very useful to have such a measure for problems where a homogeneous success rate per class is important. On the question of whether to use D or D2 , the advantage D presents is that it has the same scale as C, while D2 has squared units. Since they both have values within [0, 1], values of D2 will always be smaller than those of D. This makes the differentiations made by D much more apparent. D is therefore chosen for the analysis and visualization of results in this article. However, from algebraic point of view and because of its very definition, D2 is easier to use. Therefore, D2 is chosen for the theoretical study of the measure. The final value provided one way or another, will be presented as D. Along with C, this value allows for the representation of any classifier in the two dimensional space (C, D). This representation will allow us to rank classifiers with similar values

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

65

Table 1 Analysis of differences among the most promising metrics for imbalanced classification. CM(A ) = CM(B ) = 0.40 GMM(A ) = GMM(B ) = MSM(A ) = MSM(B ) = 0.00 ACM(A) ≈ ACM(B) (≈0.16) D M(A) = D M(B)



⎜ ⎜ ⎜ M (A ) = ⎜ ⎜ ⎝

80 5 5 5 5 100

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0





0 19 ⎜ 5 0⎟ ⎟ ⎜ 0⎟ ⎜ 5 ⎟; M (B ) = ⎜ 5 0⎟ ⎜ ⎝ 4 0⎠ 0 0

20 0 0 0 0 0

41 0 0 0 0 1

0 0 0 0 0 38

0 0 0 0 1 1

0 0 0 0 0 60

⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

Fig. 1. Illustrative example of how a set different classifiers could be ordered in the (C, D) space.

for C, where those with lower values for D will be preferred. Although the (C, D) plane is not a completely ordered space, the numerical and graphical information they provide can still be useful, bearing in mind that, in the worst case scenario for a decision to be made, C alone can always prevail as the final decider.

3.2. A preliminary comparison of the merits of existing metrics As previously mentioned, the main goal of this paper is to propose a two dimensional framework for evaluating classifiers, especially when practitioners have to address an imbalanced classification problem [23]. Therefore, the first step in the proposal should be to show that the existing measures are not able to capture existing differences in classification performance for the type of problem selected in certain scenarios. With this purpose in mind, we have created a synthetic scenario. Two classifiers (A and B) produce the following Ms (Q = 6, N = 200) in a certain task. Table 1 presents the results of C and those metrics that are not sensitive to class frequency changes (AC, MS and GM). Note that C, AC, MS and GM were unable to detect any performance difference between classifiers A and B. D is the only metric able to discriminate the best classification model for the situation exposed (even AC fails at detecting differences in classification). One would expect that a valid measure of performance would output a better performance of classifier B compared to A, as the classifier B is able to better discriminate a greater amount of classes, however this is not possible using the state-of-the-art metrics. This synthetic situation is more common than expected due to the fact that, when classifying imbalanced datasets, classifiers are not specifically trained to optimize MS, and GM metrics tend to report a performance of zero on those two metrics. Thus, D will usually provide a different information on the classifier. Hence, D promotes the improvement of the minority class accuracies as a whole, generating a more uniformed distribution of the accuracies per class.

3.3. Ordering classifiers in the (C, D) space In order to illustrate the advantages (and uncertainties) that arise from the definition of the dispersion as an additional measure to accuracy, let us consider the scenario represented in Fig. 1, where five classifiers are represented. From the accuracy point of view, these classifiers would be ordered according to the sequence Q, P, (R,S), T. Although a statistical quantification of the significance of the differences would also be necessary, it can be preliminarily said that classifiers P and Q performed more poorly than the other three. Not much else could be said when taking only C into consideration. The additional information provided by D allows a refinement of the interpretation: S is better than R, since it has a lower dispersion. The same can be said about T with respect to R, since it is better both in accuracy and dispersion. However, the comparison between S and T is not possible, since neither is better than the other under both criteria. To solve this issue, it is necessary to estimate the upper boundary of D for each dataset to compare the results with respect to C using the same scale.

66

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

4. On the upper boundaries for D One of the aims of this work is to present an upper boundary for D, for each value of C. This section intends to prove that such a boundary is determined by both accuracy and the probabilities pq . A global boundary for all datasets will be presented first. It will then be used as a canvas for the definitive one. Result 1. For each value 0 ≤ C ≤ 1 is:

D2  ∗ (C ),

(15)

where ∗ (C ) = C − C 2 . The equality is met if and only if each and every accuracy per class is 0 or 1. Proof. Since

D2 =

Q

c p q=1 q q

Q 

= C and

(cq − C )2 pq =

q=1

=

Q 

Q

q=1

Q 

pq = 1:

cq2 pq − 2C

q=1

cq2 pq − C 2 

q=1

Q 

Q 

cq pq + C 2

q=1

Q 

pq

q=1

cq pq − C 2 = C − C 2 = ∗ (C ).

(16)

q=1

The inequality is derived from 0 ≤ cq ≤ 1, and the equality between cq and cq2 happens only when they are 0 or 1. Once this first boundary is established, π ∗ (C ) = the problem to be solved can now be made.



∗ (C ) is defined as the equivalent boundary for D. The definition of

Definition. For every 0 ≤ C ≤ 1 let us define:

2 (C ) = max D2 (c1 , . . . , cQ ),

(17)

c∈FC

where FC = {c : cq pq = C, 0  cq  1}. The existence of this maximum, that allows for a good definition of 2 , is ensured 2 by D being continuous in its arguments as a function of the weights per class, and by FC being a compact set. Result 2. For each 0 ≤ C ≤ 1, the value for 2 (C) is met at the frontier of FC .





Proof. It is immediate to verify that D2 c1 , . . . , cQ is a strictly convex function in FC , with only one critical point c1 = . . . = cQ = C, that is a minimum of the function. Therefore the maximum will be found at the frontier of the feasible region. Such frontier is characterized in the next section, along with the points that are potentially optimal in it. 4.1. Vertices for FC

Given the equations for FC , it can be seen that it is the intersection of the unit Q-cube with the hyperplane cq pq = C. It is therefore a polyhedric set with a frontier determined by linear segments. On each one of these segments, the convexity of the optimized function will lead to finding the maximum in one of its two extremes, that are nothing but vertices of the feasible region. Thus, it will suffice to characterize such vertices and and then limit the optimization to them. 4.1.1. Characterization

Let V be a vertex of FC , obtained as the intersection of an edge of the Q-cube with the hyperplane cq pq = C. All the coordinates of V except one will be 0 or 1 (boundary condition), and the remaining one will be determined by the intersection with the hyperplane as long as a feasible value (0 ≤ cq ≤ 1) is met. For each one of these vertices, the pair (I, i) is defined. I represents the set of indexes for cq = 1, while i is the index where the value for cq is not necessarily extreme (0 by default if ci = 1). The pair (I, i) will be used from now on as the representation of the vertex it refers to. Let us consider Fig. 2, made for a 3-class problem. The vertex V1 is obtained as the intersection of the edge c1 = c3 = 1 and the plane c1 p1 + c2 p2 + c3 p3 = C, which means that c2 = p1 (C − c1 p1 − c1 p1 ). Therefore, in this example, V1 would be 2 identified by the pair ({1, 3}, 2). If C increased to the point of placing the vertex in the plane c2 = 1, then V1 = (1, 1, 1 ) would be identified as I = ({1, 2, 3}, 0 ). 4.1.2. Redefinition of the problem The elements presented on the previous subsection allow for the problem to be redefined. The search space can now be reduced to the finite set of vertices of FC . The pair (I, i) that defines these vertices is therefore assigned by (C, I, i) to the variance calculated in the represented vertex.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

67

Fig. 2. Characterization of the vertices of FC .

Definition. The set of extremes of the problem is the set:



S=

s (I ) =





p q : I ⊆ {1 , . . . , Q }

(18)

I

Result 3. It is verified that:



(C, I, i ) =

s(I ) + (C−spi(I )) − C 2 ∗ (C )

i > 0, s ( I )  C  s ( I ) + pi i=0

(19)

Proof. Let V be the vertex associated to the pair (I, i), i.e., (C, I, i ) = D2 (V ). Suppose i > 0. Therefore, using the definitions of I and i:

(C, I, i ) =



cq2 pq − C 2 =





pq +

I

C−

I

pi

pq

2

pi − C 2 = s ( I ) +

(C − s(I ) )2 pi

− C2

(20)

The condition over the value for C is an immediate consequence of the feasibility condition 0 ≤ ci ≤ 1. For i = 0, all the components in V take one of its extreme values, and the outcome can be obtained by using Result 1. If we combined all these results, the following is obtained. Result 4. For each 0 < C < 1 we have:



max

 (C ) = (I, i ) i = 0 ∗ (C ) 2

{(C, I, i ) : s(I ) < C < s(I ) + pi } C ∈/ S

(21)

C∈S

Function 2 (and from it, , the boundary for D) will be obtained, given this result, by calculating for each one of the feasible pairs (I, i), the family of curves (C, I, i) and from them the maximum at each point. 4.2. Building the curve (C) 4.2.1. Introduction and preliminary results This subsection analyzes the maximization problem proposed in Result 4. Let us begin by describing the aspect of each one of the elements that make this family and show a first approximation for the calculation of the boundary being searched for. Property. If i = 0, (C, I, i) restricted to the feasible values for C is a decreasing parabolic arc when C < s(I) and increasing when C > s(I ) + pi . Proof. It is immediate, by its own expression, that it is indeed a parabola. As far as the growth of the parabola is concerned, it is easy to verify that:



 d (C, I, i ) dC

C=s(I )

= −2s(I ) < 0

68

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

Fig. 3. An example of the relative position of curves.

Fig. 4. Representation of the (C, D) space for a binary classification problem with probabilities p 

1 2

and 1 − p.



 d (C, I, i ) dC

= 2 ( 1 − ( s ( I ) + pi ) ) > 0

(22)

C=s(I )+ pi

Then, because  is a parabola, the sign of its derivate is negative to the left of s(I ) and positive to the right of s (I ) + pi and the result holds. Fig. 3 shows the appearance of one of these curves, π (C, I, i ) = (C, I, i ) being the boundary of the deviation. The properties of such paraboles are described in Appendix A. In order to obtain the curve  it will suffice with the representation of each one of the arcs, choosing the highest one for a given value of C, as it is shown in the following example. Example 1. Suppose a problem with two classes with probabilities p  and pi are: I

i

s (I )

pi

O O {1} {2}

1 2 2 1

0 0 p 1− p

p 1− p 1− p p

1 2

and 1 − p (Fig. 4). The possible values for I, i, s(I)

Bearing in mind that the expression of each curve 2 is:

2 (C ) =

⎧ 2 C − C2 ⎪ p ⎪ ⎪ ⎨ (C−p)2 p+

1−p 2

−C ⎪ ⎪ ⎪ ⎩1 − p + C2 1−p

0C  p − C2

(C−(1−p)) p

pC  2

1 2

− C2

1 2

(23)

C 1− p

1− pC 1

Although the plot procedure for all curves could be applied to the resolution of any problem, its complexity quickly increases with Q. Let us bear in mind that the feasible set has in general Q2Q−1 possible vertices. This makes the calculation almost impossible for a too large number of classes. This leads us to the search of an alternative procedure. An example with three classes will present it and help, along with the previous example, the later proof and properties the method is based on. Example 2. Let us consider a 3-class problem with p1 = 0.2, p2 = 0.35 and p3 = 0.45 as probabilities per class. Applying the previous procedure (without going again over it step by step), the following curve  is obtained (Fig. 5). Figs. 4 and 5 share the following common features: (i) symmetry and (ii) between two consecutive extremes, curve  is, either one parabola (as it happens on the first part of the curve), or the combination of two parabolas that meet at an intermediate point. Once both results are proven, a procedure that is based on them and makes a feasible upper boundary plot is proposed. Result 5. Function (C) is symmetric with respect to C =

1 2.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

69

Fig. 5. Representation of the (C, D) space for a three class classification problem with probabilities with p1 = 0.2, p2 = 0.35 and p3 = 0.45.

Proof. The result is immediate in the set of extreme points, since in them the curve coincides with π ∗ , that is symmetric. For the rest of the interval [0, 1], (C) is defined by:

(C ) =

max

(I, i ) i = 0

{π (C, I, i ) : s(I ) < C < s(I ) + pi }

(24)

Therefore, in order to prove the symmetry, it would suffice to verify that for every feasible election of I, i and C it is  possible to find indexes J and j such that π (C, I, i ) = π (1 − C, J, j ) in the interval s(J ), s(J ) + p j . Let J = {1, . . . , Q } − {I, i} and j = i. It is immediate to verify that s(J ) = 1 − (s(I ) + pi ) and therefore s(J ) + p j = 1 − s(I ), hence deducing the restriction on the interval. The coincidence of the value of the function is only a matter of arithmetics. In order to enunciate and prove the following result, as well as the later algorithm for the calculation of curve , the following elements related to the set of extrema are defined. Definition. Given s ∈ S, let us define:

Is = {I ⊂ {1, . . . , Q } : s(I ) = s} i(s ) = min {i ∈ I : I ∈ / Is } j (s ) = min {i ∈ I : I ∈ Is }

(25)

where Is is the set of all possible index combinations with a sum of probabilities s, i(s) is the index of the lowest probability that can be omitted in order to obtain the sum s and j(s) is the minimum needed for the calculation. Since these definitions will later on be used, let us clarify their meaning with an example. Example 3. Suppose a five class problem with probabilities p1 = p2 = p3 = 0.1; p4 = 0.3 and p5 = 0.4 and let s = 0.5. Thus,

I0.5 = {{1, 2, 4}, {1, 3, 4}, {2, 3, 4}, {1, 5}, {2, 5}, {3, 5}}, are all the combinations with a 0.5 sum of probabilities. Therefore, i(0.5 ) = 1, since the lowest index can be omitted in order to obtain the desired result (for instance, using combination {2, 3, 4}) and j (0.5 ) = 1 since it is the lowest index that can be used to obtain this value by taking for example {1, 2, 4}). Notice also that since the per-class probabilities are ordered, the lowest index coincides with the least probability included in one or another case. Let us suppose from this point, that this is the general case, p1  . . .  pQ which can be so by simply relabeling classes. Result 6. Let us consider the increasing ordered set S of extremes of the problem, and let s and t be consecutive elements of it. For each C ∈ [s, t] it can be verified that: 1. If s + pi(s ) = t, (C ) = π (C, I, i(s ) ) where I is any element of Is verifying i(s) ∈ I. 2. If s + pi(s ) > t,

 C  C∗ π (C, I, i(s ) ) (C ) = π (C, J − { j (t )}, j (t ) ) C  C ∗

(26)

where I is defined as in 1 and J is any element of It where the minimum j(t) is reached, and C∗ is the only intersection point of both curves in the considered interval. Proof. In order to prove each one these statements, let us indicate some necessary aspects of them. Since s is an extreme and  is continuous (maximum in a finite set of continuous functions), its expression in a certain range to the right of s will be:

(C ) = max{π (C, I, i ) : s(I ) = s}

(27)

i∈ /I

C−s 2

Since the expression of the curves is given by (C, I, i ) = s + ( p ) − C 2 , the maximum will be reached when pi is minii mum. Since the set of probabilities is ordered increasingly, it will be i = i(s ) by its very definition. With a similar reasoning, expression  in a range to the left of t will be given by:

(C ) = π (C, J − { j (t )}, j (t ) ),

(28)

70

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

with J under the conditions indicated in the statement. Finally, given any two curves of the family analyzed, say π (C, A, a) and π (C, B, b), the discriminant of the equation (C, A, a ) − (C, B, b) = 0 is:

4 ( s ( A ) − s ( B ) ) ( s ( A ) + pa − ( s ( B ) + pb ) ) pa pb

(29)

with a sign that depends only on the relative position of the intervals that define both curves: [s(A ), s(A ) + pa ] and [s(B ), s(B ) + pb ]. Let us now analyze each one of the two cases proposed. 1. Let us consider any other element π (C, J, j) defined in the interval [s, t]. Since there are no other extremes between s and t, it will have to be s(J) ≤ s and s(J ) + p j  t. Under these terms, the discriminant will be zero or negative depending on whether the previous inequality is at the equality or not, with no intersection point between both curves except for, at most, s or t. Since π (C, I, i(s)) dominates in s, it will do so during the entire [s, t] interval, hence proving the result. 2. Let us consider J under the terms in Result 2, which implies s(J ) = t < s(I ) + pi(s ) and s(J − { j (t )} ) < s (otherwise s(J − { j (t )} ) = s and j(t) < i(s) would imply 2 (C ) = π (C, J − { j (t )}, j (t ) ) for the right side values near s). Under these terms, the equation:

(C, J − { j (t )}, j (t ) ) − (C, I, i(s ) ) = 0

(30)

has a positive discriminant, and there are two intersection points. Let us verify that one and only one of them is in the [s, t] interval. The fact that each one of the curves dominates the other guarantees, because of the continuity of both, that at least one of the solutions is found between both extremes, because the difference between both functions is continuous and changes its sign in the extremes of the interval. Let C∗ be its corresponding accuracy. The derivative of the difference (C, J − { j (t )}, j (t ) ) − (C, I, i(s ) ) is given by:

 

d ((C, J − { j (t )}, j (t ) ) − (C, I, i(s ) )) = −2 C dC

1 1 − pi ( s ) p j (t )



+

s s (J ) − p j (t ) pi ( s )



(31)

that evaluated in both extremes takes the values:



 d (π (C, J − { j (t )}, j (t ) ) − π (C, I, i(s ) )) dC

=−

2 (s (J ) − s ) > 0 p j (t )

 d (π (C, J − { j (t )}, j (t ) ) − π (C, I, i(s ) )) dC

=−

   2 t − s + pi ( s ) > 0 p j (t ) pi(s )

C=s C=t

(32)

Since the derived function is a line in C and positive at both extremes of the interval, it will be positive in the entire interval. Therefore the difference between both functions is increasing, proving that C∗ is the accuracy for the only intersection point of curves s and t. From the definition of the curves, the domination in each one of the subintervals that this point defines can be inferred. This is not valid if pi(s ) = p j (t ) , since in that case the difference would not be quadratic, but linear. It is immediate to verify that under those circumstances, the only intersection point is exactly the mid-point on interval [s, t]. These results justify the validity of the following procedure for making curve . 4.2.2. Algorithm for the construction of curve  Using the symmetry, and knowing the characteristics of curve  between each two consecutive extremes, its construction will be doing going along the C axes, extreme by extreme, from 0 to 21 , obtaining the remaining half of the curve by symmetry. The step by step procedure is as follows: 1. Determine and order the subset of extremes with values lower than 2. For each extreme calculate:

i(t ) = i(st )

1 2.

j (t ) = j (st )

Let 0 = s0 < s1 < . . . < sT be such extremes.

(33)

bearing in mind that it will not make sense to calculate the second amount if t = 0, since s0 = 0 has no preceding extreme. Take t = 0. 3. Repeat, for each t: If st+1 = st + pi(t ) ,

2 (C ) = st + Otherwise:

 (C ) = 2

until t = T − 1.



(C − st )2 pi(t )

− C 2 st  C  st+1

2 t) st + (C−s − C2 pi t

()

st+1 − p j (t+1 ) +

(34)

st  C  C ∗

(C−(st+1 −p j(t+1) )) − C 2 C ∗  C  s t+1 p j (t+1 ) 2

(35)

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

4. If sT <

71

1 2:

(C ) = sT +

(C − sT )2 pi ( T )

− C 2 sT  C 

1 2

(36)

5. Finally, (C) is defined as:

1 C 1 2

(C ) = (1 − C )

(37)

Example 4. Applying it to Example 2, remember that p1 = 0.2, p2 = 0.35 and p3 = 0.45. Step 1. In this case T = 3 and:

0 = s0 < 0.2 = s1 < 0.35 = s2 < 0.45 = s3 <

1 2

Step 2. It is immediate that: t

i (t )

j (t )

0 1 2 3

1 2 1 1

1 2 3

Step 3. • Since s1 = s0 + pi(0 ) :

2 (C ) =

C2 − C 2 0  C  0.2 0.2

• Since s2 = s1 + pi(1 ) :

 2 (C ) = 0C.22 + 0.35

(C−0,2 )2

−C

0.35 2

− C2

0.2  C  0.275 0.275  C  0.35

C ∗ = 0, 275 is the intersection point between both curves. • Since s3 = s2 + pi(2 ) :



+  (C ) = 0C.35 2 2

0.45

(C−0.35 )2

− C2

0.2

− C2

0.35  C  0.4055 0.4055  C  0.45

C ∗ = 0.4055 is in this case the intersection point. Step 4. Since s3 <

1 2:

(C ) = 0.45 +

(C − 0.45 )2 0.2

− C 2 0.45  C 

1 2

Step 5. The rest of the curve is defined by symmetry. 4.3. Ordering classifiers in the (C, D) space Once C and D are calculated and represented for different classifiers, they can be represented along with the  curve associated to the dataset that the classifiers have tried to tackle. This representation is a tool to visualize the basic elements of the problem as well as the main characteristics and their possible solutions. Considering the example in Fig. 6, the following conclusions could be obtained: • Algorithms P and Q are equivalent in terms of accuracy, but not with respect to dispersion. The same happens with R and S, that are more precise than P and Q. • In terms of dispersion, Q is preferred over P and S preferred over R. • S is therefore the best of all four algorithms, while P is the worst. The comparison of the remaining two algorithms is more complicated, since although R slightly improves Q on accuracy, the dispersion is the same for both. R is almost on the limit of the maximum dispersion possible (very close to the  curve, while Q is further from this limit).

72

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

Fig. 6. Illustrative example of how a set different classifiers are in the (C, D) space using the D upper bound.

Fig. 7. Example of representation of c2 and DN in space against c1 .

4.4. Normalized dispersion Definition. Given the pair (C, D) (0 < C < 1) the normalized dispersion DN is defined as:

DN =

D

(38)

(C )

By definition, the normalized dispersion verifies 0 ≤ DN ≤ 1. Except for optimal classifiers (C = 1) or the opposite (C = 0), both uninteresting for their straightforwardness (all instances either correctly or incorrectly classified), the spatial representation of the pair (C, DN ) is the unit square. The best classifier is found for the highest accuracy and the lowest dispersion. No longer the comparison is subject to the boundaries of the (C, D) plane, but according to the upper limits of . Example 5. Let us consider the 2-class problem with probability p  12 for the first of the two classes. Let there be a classifier returning an accuracy C; a value from which the condition C  1 − pcan be assumed without the loss of generality, since such a result is obtained by simply assigning all instances to the second class. Let us consider all possible combinations of feasible accuracies per class c1 and c2 , that are bound by the condition pc1 + (1 − p)c2 = C, which allows for c2 to be a function of c1 and C. Bearing in mind that 0 ≤ c1 , c2 ≤ 1 the following is obtained:

c2 =

1 −C C − pc1 1−  c1  1 1− p p

(39)

and in order to calculate the DN associated to each combination, D2 and2 are needed:

D2 (C ) = p(c1 − C )2 + (1 − p)(c2 − C )2 = p(c1 − C )2 + 2 (C ) = 1 − p +

p2 p (c1 − C )2 = (c1 − C )2 1− p 1− p

(C − (1 − p))2 p

− C2 =

1− p (1 − C )2 p

(40)

(41)

hence:

DN (C ) =

p |c1 − C | 1 − p 1 −C

(42)

Fig. 7 shows the simultaneous representation of c2 and DN in space against c1 . It can be appreciated how the value for the normalized dispersion reaches its maximum value when class 2, being the majority class, is perfectly classified and the classification success is minimum for the minority class. The reduction of accuracy in the second class implies an increase in the first and a reduction in the dispersion until a zero is reached where both coincide (as well as C). From that point, dispersion increases again.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

73

Fig. 8. Graphical representation of the correspondence between classifiers in the (C, D) space and the (C, DN ) space.

On the other hand, Fig. 8 shows how the normalization for D (in Fig. 6) clarifies the comparisons between algorithms in terms of their dispersion: R is slightly worse than P according to this criterion, despite having a lower absolute dispersion, and very inferior to Q, with which it had the same value. However, the normalization also has as a consequence the loss of information about the problem, since sight is lost as to the real possibilities of how to optimize D. In the initial representation, the low value of dispersion for the accuracy achieved by C D indicates that the possibilities of improving this criterion are reduced. On the other hand, considering P and Q, the representation shows that there is room for improvement. This matter, relevant as it is in order to establish goals for the improvement of classifiers, is omitted when DN is analyzed. The joint representation of both the regular, and the normalized plot, can be the solution. 5. Computational experiments This section presents the design of the experimental study followed in this paper (Section 5.1), the results obtained for classification datasets (Section 5.2), and its corresponding discussion (Section 5.3). 5.1. Experimental study In this subsection, the experiments are clearly described, defining the datasets and algorithms considered to validate the proposal and the parameters to be optimized. The performance measures used for evaluating the proposal are those described in Section 2.1. 5.1.1. Datasets selected Table 2 shows the characteristics of the 29 datasets, including the number of patterns, attributes and classes, and also the class distribution (number of patterns per class). The publicly available classification datasets were obtained from benchmark repositories (UCI [3] and mldata.org [39]). The synthetic toy dataset was generated as proposed in [10] with 300 patterns. Saureus4 (S4) corresponds to a real predictive microbiology problem of discriminating the growth/no growth of Staphylococcus Aureus and it was obtained from [20]. The selected datasets include four binary problems and twenty five multi-class problems and present different numbers of instances, features and classes (see Table 2). One of the main goals when choosing the datasets to consider was ensuring that they were imbalanced. To measure imbalance in multi-class datasets, we use the “coefficient of variation” (C.V.) as recommended by Wu et al. [45]. Specifically, C.V. is the proportion of the deviation in the observed number of examples for each class versus the expected number of examples in each class. For our purposes, datasets with a C.V. above 0.7071 (a class ratio of 3:1 on a binary dataset) are considered highly imbalanced datasets. This evenly divides our pool of available datasets into 14 highly imbalanced datasets and 15 datasets with a class ratio below 3:1. Finally, it is also important to mention that all nominal attributes were transformed into as many binary attributes as the number of categories and all the datasets were property standardized, considering only the training set to obtain the mean and standard deviation for each variable. 5.1.2. Algorithm selected to validate the proposal and model selection Recently, 179 classifiers arising from 17 families with 121 data sets (the whole UCI data base excluding the large-scale problems and other known real problems) were evaluated to determine the most competitive classifiers of the current literature in [18]. 
The Extreme Learning Machine (KELM) [30] method achieved the highest value of Probability of Achieving the Maximum Accuracies (PAMA, in %) with a total probability of 13.2%. Furthermore, the KELM method also showed a competitive performance in the Percentage of the Maximum Accuracy (PMA) for each dataset (averaged over all the datasets) and in the probability of achieving 95% (P95) of the maximum accuracy over all the datasets. For all that and for its extreme simplicity, the KELM method was selected to validate our proposal. Extreme Learning Machine (ELM) is an efficient algorithm that determines the output weights of a Single Layer Feedforward Neural Network (SLFNN) using an analytical solution instead of the standard gradient descent algorithm [31]. During

74

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80 Table 2 Characteristics of the twenty nine datasets used for the experiments: number of instances (#Pat.), total number of inputs (#Attr.), number of classes (#Classes), per-class distribution of the instances and C.V. Dataset

#Pat.

#Attr.

#Classes

Class distribution

C.V.

Hepatitis (HP)a Breast-cancer (BC) Haberman (HB) Card (CR)

155 286 306 690

19 15 3 51

2 2 2 2

(32, 123) (201, 85) (225, 81) (307, 383)

0.8303 0.5736 0.6655 0.1558

Contact-lenses (CL)a Pasture (PA) Squash-stored (SS) Squash-unstored (SU) Tae (TA) Newthyroid (NT)a Balance-scale (BS)

24 36 52 52 151 215 625

6 25 51 52 54 5 4

3 3 3 3 3 3 3

(15, 5, 4) (12, 12, 12) (23, 21, 8) (24, 24, 4) (49, 50, 52) (30, 150, 35) (288, 49, 288)

0.7603 0.0 0 0 0 0.4699 0.6662 0.1329 0.9472 0.6623

Lymph (LY)a Saureus4 (S4)a Vehicle (VH) SWD (SW) Car (CA)a

148 287 946 10 0 0 1728

38 3 18 10 21

4 4 4 4 4

(2, 81, 61, 4) (117, 45, 12, 113) (199, 212, 218, 218) (32, 352, 399, 217) (1210, 384, 69, 65)

1.0840 0.7213 0.0423 0.6582 1.2495

Bondrate (BO)a Toy (TO) Eucalyptus (EU) Anneal (AN)a LEV (LE)a

57 300 736 898 10 0 0

37 2 91 59 4

5 5 5 5 5

(6, 33, 12, 5, 1) (35, 87, 79, 68, 31) (180, 107, 130, 214, 105) (8, 99, 684, 67, 40) (93, 280, 403, 197, 27)

1.1141 0.4265 0.3263 1.5811 0.7458

Automobile (AU) Glass (GL)a Winequality-red (WR)a

205 214 1599

71 9 11

6 6 6

(3, 22, 67, 54, 32, 27) (70, 76, 17, 13, 9, 29) (10, 53, 681, 638, 199, 18)

0.6734 0.8339 1.1717

Zoo (ZO)a Segmentation (SG)

101 2310

16 19

7 7

(41, 20, 5, 13, 4, 8, 10) (330, 330, 330, 330, 330, 330, 330)

0.8937 0.0 0 0 0

Ecoli (EC)a

336

7

8

(143, 77, 52, 35, 20, 5, 2, 2)

1.1604

ESL (ES)a

488

4

9

0.9457

ERA (ER)

10 0 0

4

9

(2, 12, 38, 100, 116, 135, 62, 19, 4) (92, 142, 181, 172, 158, 118, 88, 31, 18)

a

0.5303

Symbol used to denote highly imbalanced datasets (C.V. ≥ 0.7071).

the training process, ELM determines its training parameters, β, by minimizing a Least Squared Error function. The output function of the kernelized ELM (KELM) [30] for the pattern x is defined as:

 f ( x ) = K ( x )T

I

γ

−1 + ELM

Y,

(43)

where Y ∈ RN × RQ is the target output matrix (N is the number of patterns and Q the number of classes), K(x ) : RK → RN is the vector of kernel functions K(x )T = [K (x, x1 ), . . . , K (x, xN )] (K is the number of attributes in the original dataset) and γ is a regularization parameter. The Gaussian kernel function here considered is

K (x, xi ) = exp(−k||x − xi ||2 ),

i = 1, . . . , N

(44)

where k ∈ R is the kernel parameter. Similarly the kernel matrix ELM = [i, j ]i, j=1,...,N is defined element by element as

i, j = K (xi , x j ).

(45)

The experimental design was conducted using 30 random stratified splits of 75% and 25% of the patterns for the training and test sets respectively (as suggested in [40]). The KELM classifier was run using the implementation available in the Extreme Learning Machines webpage.3 The optimal two hyperparameter values for the KELM model were selected using a nested five fold cross-validation over the training set and the criteria for selecting the best configuration was the C metric (k ∈ {10−3 , 10−2 , . . . , 103 } and γ ∈ {10−3 , 10−2 , . . . , 103 }). Finally, each pair of metrics are compared by means of the Correlation test using a level of significance of α = 0.1.

3

http://www.ntu.edu.sg/home/egbhuang/

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

75

Fig. 9. Dendograms for the number of classes analysis.

5.2. Results Six different correlation matrices are analyzed aiming to study the dependencies or relationships among the different metrics considered. It is important to mention that we compute the correlation for each dataset, i.e., we analyze the results provided by the KELM method for one dataset and the corresponding 30 combinations of the cross-validation procedure. Not merging results from different datasets is crucial, since measure values are influenced in very different ways depending on the dataset, e.g. number of classes, imbalance, problem difficulty, etc [21]. Consequently, we construct one correlation matrix per dataset and then, we average the 29 correlation matrices (considering the absolute values of the correlation coefficients). Since there are seven averaged correlation matrices4 , and they are difficult to understand at a glance, we will use dendrograms (as it was done in [21]); where the linkage distance is defined as (1 - correlation). A dendrogram shows clusters of performance measures according to how strongly correlated the metrics are. This kind of diagrams has several advantages: we can easily visualize the clusters formed by the measures, as well as the linkage distance among clusters. The first scenario is constructed separating the datasets in two subsets: the first one where datasets with less than five classes were included and the second one where only datasets with five or more classes were included. The second one considers the C performance of the KELM as the test variable and the last one is built by splitting the datasets according to their imbalanced ratio. The analysis was done for each of the three cases separately in order to take into account the effect of the variable in the correlation results. According to Fig. 9, the pairs of measures that are less correlated (they are not significantly correlated) are C − DN and W Kappa − DN when datasets with less than five classes are considered, and C − DN , AC − DN , W kappa − DN and Fβ =1 − DN when datasets with five or more classes are considered. Fig. 9 shows how DN is grouped with the metrics designed for imbalanced classification problems (GM, AC or GM) when Q < 5 (Fig. 9a) while it is considered an independent metric to all the others when Q ≥ 5 (Fig. 9b). On the other hand, if the accuracy provided by the model is taken as a variable test (Fig. 10), C is not significantly correlated to MS, GM, WKappa and DN on datasets where the classifier reported a low accuracy are considered. C is not significantly correlated to GM and DN when datasets with a high accuracy reported are taken into account. These facts are also shown in Fig. 10 where DN is grouped with the metrics designed for imbalanced classification problems when the datasets are not specially challenging (C ≥ 0.7700) (Fig. 10b). DN is considered an independent metric (along with the WKappe metric) to all the others when challenging datasets are being analyzed (Fig. 10a). Finally, Fig. 11 shows the correlation dendograms when the C.V. is considered as the variable test. The pairs of metrics that are less correlated are DN − C, DN − AC, DN − W Kappa, DN − Fβ =1 when datasets with a C.V. value above 0.7071 are considered (Fig. 11b) and DN − C, DN − AC, DN − Fβ =1 , GM − W Kappa, MS − W Kappa when datasets with a C.V. below 0.7071 are tested (Fig. 11a).

4 Corresponding to the two correlation matrices per scenario and the general correlation matrix considering all the datasets.
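As an illustration of how such dendrograms can be produced, the following sketch (not the authors' code) assumes that the 30 metric values for one dataset are stored as rows of an array of shape (n_metrics, 30); the metric labels and the average-linkage choice are assumptions, since the linkage method is not stated in the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

METRICS = ["C", "MS", "GM", "AC", "WKappa", "F1", "DN"]   # illustrative labels


def averaged_abs_correlation(results_per_dataset):
    """Each element is an (n_metrics, 30) array of results for one dataset.
    One correlation matrix is computed per dataset and their absolute
    values are averaged, as described in the text."""
    return np.mean([np.abs(np.corrcoef(r)) for r in results_per_dataset], axis=0)


def metric_dendrogram(results_per_dataset, title):
    corr = averaged_abs_correlation(results_per_dataset)
    dist = 1.0 - corr                     # linkage distance = 1 - correlation
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    dendrogram(Z, labels=METRICS)
    plt.title(title)
    plt.ylabel("1 - correlation")
    plt.show()

# e.g., one dendrogram per subset when splitting by the number of classes Q:
# metric_dendrogram([r for r, q in zip(results, n_classes) if q < 5],  "Q < 5")
# metric_dendrogram([r for r, q in zip(results, n_classes) if q >= 5], "Q >= 5")
```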


Fig. 10. Dendrograms for the accuracy analysis.

Fig. 11. Dendrograms for the C.V. analysis.

5.3. Discussion

As can be seen in Fig. 9, the DN metric provides significantly different information from that provided by the remaining measures, especially when datasets with a high number of classes are considered: DN is not significantly correlated to any of the measures considered, whereas GM and MS are both significantly correlated to C. These correlations are expected because datasets with a high number of classes are also highly imbalanced, and DN, GM and MS are all measures for imbalanced datasets. DN shows its superiority over GM and MS because it is the only measure that is not significantly correlated to C (the most common measure used in standard classification).

On the other hand, Fig. 10 did not provide particularly interesting information, as DN was correlated to all metrics except WKappa when datasets with a high accuracy are considered, whereas it was correlated only to WKappa when datasets with a low accuracy are considered. According to this information, the proposed metric is not particularly useful when straightforward datasets are evaluated, but it is a competitive measure when more challenging datasets are tested.

Fig. 11 shows once more the importance of the proposed metric for imbalanced datasets, as DN was not significantly correlated to any of the metrics considered except GM and MS, which again were correlated to C in both cases. We hypothesize that these correlations of GM and MS with C are due to the fact that, in imbalanced datasets, classifiers that are not specifically trained to optimize these two metrics tend to return a zero performance value for them. Note that if the accuracy of one class is zero, then both MS and GM are zero.
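The collapse of GM and MS described above can be illustrated with a small numerical sketch (the numbers and helper function below are purely illustrative and do not come from the experiments); C and D are computed from the per-class accuracies cq and the a priori probabilities pq, while GM and MS follow their usual definitions as the geometric mean and the minimum of the per-class accuracies.

```python
import numpy as np

def class_level_metrics(c, p):
    """c: per-class accuracies c_q; p: a priori class probabilities p_q."""
    c, p = np.asarray(c, float), np.asarray(p, float)
    C = np.sum(c * p)                      # accuracy as a weighted average
    D = np.sqrt(np.sum(c**2 * p) - C**2)   # dispersion of the per-class accuracies
    GM = np.prod(c) ** (1.0 / len(c))      # geometric mean of the sensitivities
    MS = np.min(c)                         # minimum sensitivity
    return C, D, GM, MS

# Two hypothetical classifiers on a 3-class problem with priors (0.6, 0.3, 0.1),
# both of which misclassify every pattern of the minority class:
p = [0.6, 0.3, 0.1]
print(class_level_metrics([0.95, 0.80, 0.00], p))   # GM = MS = 0, although C = 0.81
print(class_level_metrics([0.90, 0.85, 0.00], p))   # GM = MS = 0 again: GM and MS cannot
                                                    # distinguish the two models, C and D can
```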


Fig. 12. Dendrogram of standard correlations between the metrics for all datasets.

On the other hand, DN usually provides additional information about the classifier. The scenario shown in Table 1, where DN is the only metric capable of discriminating the best classification model, has already been discussed. DN promotes the improvement of the accuracies of the minority classes as a whole, generating a more uniform distribution of the accuracies per class. MS and GM are grouped in the dendrogram because they both share a high degree of restrictiveness.

Finally, the overall correlation dendrogram considering all the datasets and metrics is included in Fig. 12, again reinforcing the importance of DN as the only measure that is not significantly correlated to C. According to Fig. 12, the pair of measures that is less correlated is C − DN. In our opinion, this justifies the proposal of the new metric as a complement to the C metric, especially for challenging, highly imbalanced datasets with a high number of classes. Furthermore, the selected pair (C − DN) would be ideal to guide, for example, the evolution of a multi-objective algorithm, since the two measures have a low linear correlation and may implicitly be non-cooperative objectives.

6. Conclusions

This paper presents a new approach for evaluating the performance of multiclass classifiers based on a 2-D performance measure. The proposed composite metric is made up of the overall accuracy and the deviation of the accuracies per class. Thus, accuracy C is analyzed here as a weighted average of the classification rate of each class, while the dispersion D is the corresponding deviation of the individual accuracies per class. To our knowledge, none of the performance measures for evaluating classifiers proposed in the machine learning literature so far tackles the problem of a global optimization of accuracy C in which the differences among class accuracies are minimized. Please note that other approaches with two values to measure the performance of a classifier, like precision-recall or sensitivity-specificity, are designed only for two-class problems.

The mathematical relationship between the two measures is studied in depth and determined for each problem. Specifically, the optimum upper bound of D for each C and for each classification problem is estimated. This boundary allows for the definition of the normalized measure DN, with values between 0 and 1, which analyzes the dispersion of the results per class obtained by a classifier in a more intuitive way. Moreover, a heuristic procedure to build the (C, DN) region for each classification problem is proposed in this manuscript. The relationship between C and DN defines a region in the (C, DN) space where each classifier can be depicted. Thus, in our opinion, one of the main advantages of the proposal is that multiclass classifiers are represented in this region in a natural, intuitive and straightforward way, since the performance of the classifiers is visualized in a two dimensional space, independent of the number of classes. This contrasts with what happens in ROC analysis, which is specifically designed for binary classification problems.

To verify the validity of the proposal, we have tested a competitive classifier (the Kernel Extreme Learning Machine method) on 29 benchmark datasets, and the performances reported by the state-of-the-art metrics relying only on values coming from the confusion matrix were extracted and analyzed in depth.
We have studied the existing correlations among the metrics considered in this research work and obtained the following findings: (i) DN appears to be the metric least correlated with C when all the datasets are considered; (ii) DN seems to be a competitive metric (providing additional information to the remaining metrics) when challenging, highly imbalanced datasets or datasets with a high number of classes are considered.


Therefore, the (C, DN) pair could very well be considered for parameter selection when models need to be tested in such classification environments.

Finally, other applications of this novel two-dimensional measure include:

• The measure enables an assessment of classifier performance over the full operating range (of possible scores) when dealing with a scoring classifier, allowing us to visualize the behavior of a classifier across its operating range.
• The approach presented in this paper could be used to determine the margin of improvement that exists in the result obtained by a classifier on a particular problem. This allows us to estimate the distance from each classifier to the optimal solution.

Acknowledgments

This work was partially supported by the TIN2014-54583-C2-1-R project of the Spanish Ministry of Economy and Competitiveness (MINECO), FEDER funds and the P2011-TIC-7508 project of the “Junta de Andalucía” (Spain).

Appendix A. Properties to specify the nature of the parabolas

In order to specify the nature of these parabolas, we will need some of their properties.

Property 1. Δ(C) ≤ Δ*(C) = C − C². Equality is reached when and only when C = s(I) for a certain I. From now on, these values will be called extremes.

Proof. It suffices to consider that:

$$
D^2 = \sum_{q=1}^{Q} c_q^2\, p_q - C^2 \;\le\; \sum_{q=1}^{Q} c_q\, p_q - C^2 \;=\; C - C^2 \;=\; \Delta^*(C)
$$

since 0 ≤ cq ≤ 1. The equality is reached only if cq = cq² for each q, which means that each value must be either 0 or 1. We call I the set of indexes q for which cq = 1.

Property 2. For each I and each i we have Δ(s(I)) = Δ(s(I), I, i) = Δ*(s(I)).

Proof. The proof is an immediate consequence of the previous result: those points are precisely the only ones where the upper boundary is reached. In terms of the classification problem, the interpretation is that the maximum variance, and therefore the worst case scenario for a given C, is reached when each class is either perfectly classified (cq = 1) or always incorrectly classified (cq = 0).

Property 3. Δ(C, I, i) is, as a function of C, a decreasing parabola to the left of s(I) and an increasing one to the right of s(I) + pi, where it intersects Δ*(C). As a consequence, its value is strictly lower than that of the boundary within the interval defined by those extremes, and strictly greater than it outside of that interval (see Fig. A.13).

Fig. A.13. Illustration of Property 3.


Proof. An immediate consequence of Property 2 is that these are the only two intersection points, while the decreasing and increasing behavior follows from the convexity of the function and the sign of its derivative with respect to C:

$$
\frac{d\Delta}{dC}\bigl(s(I), I, i\bigr) = -2\, s(I) < 0, \qquad
\frac{d\Delta}{dC}\bigl(s(I) + p_i, I, i\bigr) = 2\bigl(1 - (s(I) + p_i)\bigr) > 0
$$
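A quick numerical sanity check of Property 1 (not part of the original appendix) can be obtained by sampling random per-class accuracy vectors and verifying that D² never exceeds C − C², with equality only at the extreme configurations in which every cq is 0 or 1:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 4
p = rng.dirichlet(np.ones(Q))                 # a priori probabilities, summing to 1

for _ in range(100_000):
    c = rng.random(Q)                         # per-class accuracies in [0, 1]
    C = np.sum(c * p)
    D2 = np.sum(c**2 * p) - C**2
    assert D2 <= C - C**2 + 1e-12             # Property 1: D^2 <= Delta*(C) = C - C^2

c = rng.integers(0, 2, size=Q).astype(float)  # an extreme point: every c_q is 0 or 1
C = np.sum(c * p)
assert abs((np.sum(c**2 * p) - C**2) - (C - C**2)) < 1e-12   # equality is attained
```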

Property 4. Given a second set of indexes J with an additional index j ∈ J, the relative position of the curves Δ(C, I, i) and Δ(C, J, j) is exclusively determined by s(I), pi, s(J) and pj.

Proof. It is immediate: only these four values define the parabolas.

References

[1] R. Alaiz-Rodríguez, N. Japkowicz, P. Tischer, Visualizing classifier performance on different domains, in: Tools with Artificial Intelligence, 2008. ICTAI'08. 20th IEEE International Conference on, vol. 2, 2008, pp. 3–10.
[2] C. Apte, E. Bibelnieks, R. Natarajan, E. Pednault, F. Tipu, D. Campbell, B. Nelson, Segmentation-based modeling for advanced targeted marketing, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 408–413.
[3] A. Asuncion, D. Newman, UCI machine learning repository, 2007. URL http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[4] R. Baeza-Yates, B. Ribeiro-Neto, et al., Modern Information Retrieval, vol. 463, ACM Press, New York, 1999.
[5] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (7) (1997) 1145–1159.
[6] T. Byrt, J. Bishop, J.B. Carlin, Bias, prevalence and kappa, J. Clin. Epidemiol. 46 (5) (1993) 423–429.
[7] R. Caruana, A. Niculescu-Mizil, Data mining in metric space: an empirical analysis of supervised learning performance criteria, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 69–78.
[8] J. Cohen, A coefficient of agreement for nominal scales, Educ. Psychol. Meas. 20 (1) (1960) 37–46.
[9] M. Cruz-Ramirez, C. Hervas-Martinez, J. Sanchez-Monedero, P. Gutierrez, A preliminary study of ordinal metrics to guide a multi-objective evolutionary algorithm, in: Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on, 2011, pp. 1176–1181.
[10] J.F.P. da Costa, H. Alonso, J.S. Cardoso, The unimodal model for the classification of ordinal data, Neural Netw. 21 (2008) 78–91.
[11] J. Davis, M. Goadrich, The relationship between precision-recall and ROC curves, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 233–240.
[12] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[13] C. Drummond, R. Holte, Cost curves: an improved method for visualizing classifier performance, Mach. Learn. 65 (1) (2006) 95–130.
[14] R.M. Everson, J.E. Fieldsend, Multi-class ROC analysis from a multi-objective optimisation perspective, Pattern Recognit. Lett. 27 (8) (2006) 918–927.
[15] T. Fawcett, ROC graphs: notes and practical considerations for researchers, Mach. Learn. 31 (2004) 1–38.
[16] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.
[17] J.C. Fernández Caballero, F.J. Martínez, C. Hervás, P.A. Gutiérrez, Sensitivity versus accuracy in multiclass problems using memetic Pareto evolutionary neural networks, IEEE Trans. Neural Netw. 21 (5) (2010) 750–770.
[18] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (1) (2014) 3133–3181.
[19] F. Fernández-Navarro, C. Hervás-Martínez, C. García-Alonso, M. Torres-Jimenez, Determination of relative agrarian technical efficiency by a dynamic over-sampling procedure guided by minimum sensitivity, Expert Syst. Appl. 38 (10) (2011) 12483–12490.
[20] F. Fernández-Navarro, C. Hervás-Martínez, P.A. Gutiérrez, A dynamic over-sampling procedure based on sensitivity for multi-class problems, Pattern Recognit. 44 (8) (2011) 1821–1833.
[21] C. Ferri, J. Hernández-Orallo, R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognit. Lett. 30 (1) (2009) 27–38.
[22] P. Flach, H. Blockeel, C. Ferri, J. Hernández-Orallo, J. Struyf, Decision support for data mining, in: Data Mining and Decision Support, The Springer International Series in Engineering and Computer Science, vol. 745, 2003, pp. 81–90.
[23] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets, Inf. Sci. 354 (2016) 178–196.
[24] L.A. Goodman, W.H. Kruskal, Measures of association for cross classifications, J. Am. Stat. Assoc. 49 (268) (1954) 732–764.
[25] P.A. Gutiérrez, C. Hervás-Martínez, F.J. Martínez-Estudillo, M. Carbonero, A two-stage evolutionary algorithm based on sensitivity and accuracy for multi-class problems, Inf. Sci. 197 (2012) 20–37.
[26] D.J. Hand, R.J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Mach. Learn. 45 (2) (2001) 171–186.
[27] J. Hernández-Orallo, P. Flach, C. Ferri, Brier curves: a new cost-based visualisation of classifier performance, in: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011, pp. 585–592.
[28] J. Hernández-Orallo, P. Flach, C. Ferri, A unified view of performance metrics: translating threshold choice into expected classification loss, J. Mach. Learn. Res. 13 (1) (2012) 2813–2869.
[29] G. Hripcsak, A.S. Rothschild, Agreement, the F-measure, and reliability in information retrieval, J. Am. Med. Inform. Assoc. 12 (3) (2005) 296–298.
[30] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B 42 (2) (2012) 513–529.
[31] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: IEEE International Conference on Neural Networks - Conference Proceedings, vol. 2, 2004, pp. 985–990.
[32] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011.
[33] L.A. Jeni, J.F. Cohn, F. De La Torre, Facing imbalanced data–recommendations for the use of performance metrics, in: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, IEEE, 2013, pp. 245–251.
[34] T.C.W. Landgrebe, R.P.W. Duin, Approximating the multiclass ROC by pairwise analysis, Pattern Recognit. Lett. 28 (13) (2007) 1747–1758.
[35] C. Liu, P. Frazier, L. Kumar, Comparative assessment of the measures of thematic classification accuracy, Remote Sens. Environ. 107 (4) (2007) 606–616.
[36] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
[37] M. Markatou, H. Tian, S. Biswas, G.M. Hripcsak, Analysis of variance of cross-validation estimators of the generalization error, J. Mach. Learn. Res. 6 (2005) 1127–1168.
[38] T.M. Mitchell, Machine Learning, WCB McGraw-Hill, 1997.
[39] PASCAL, Pascal (Pattern Analysis, Statistical Modelling and Computational Learning) machine learning benchmarks repository, 2011. http://mldata.org/.
[40] L. Prechelt, PROBEN1: A Set of Neural Network Benchmark Problems and Benchmarking Rules, Technical Report, Fakultät für Informatik, Universität Karlsruhe, 1994.


[41] F. Provost, T. Fawcett, Robust classification for imprecise environments, Mach. Learn. 42 (3) (2001) 203–231.
[42] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45 (4) (2009) 427–437.
[43] W. Waegeman, K. Dembczyński, A. Jachnik, W. Cheng, E. Hüllermeier, On the Bayes-optimality of F-measure maximizers, J. Mach. Learn. Res. 15 (2014) 3333–3388.
[44] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[45] J. Wu, H. Xiong, J. Chen, COG: local decomposition for rare class analysis, Data Min. Knowl. Discov. 20 (2) (2010) 191–220.