A two dimensional accuracy-based measure for classification performance

A two dimensional accuracy-based measure for classification performance

Information Sciences 382–383 (2017) 60–80 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate/i...

1MB Sizes 1 Downloads 55 Views

Information Sciences 382–383 (2017) 60–80

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

A two dimensional accuracy-based measure for classification performance Mariano Carbonero-Ruz, Francisco Jose Martínez-Estudillo, Francisco Fernández-Navarro∗, David Becerra-Alonso, Alfonso Carlos Martínez-Estudillo Department of Quantitative Methods, Universidad Loyola Andalucia, c/ Escritor Aguayo, 4, Córdoba, Spain

a r t i c l e

i n f o

Article history: Received 27 June 2016 Revised 30 November 2016 Accepted 4 December 2016 Available online 7 December 2016 Keywords: Classification metrics Imbalanced classification Accuracy

a b s t r a c t Accuracy has been used traditionally to evaluate the performance of classifiers. However, it is well known that accuracy is not able to capture all the different factors that characterize the performance of a multiclass classifier. In this manuscript, accuracy is studied and analyzed as a weighted average of the classification rate of each class. This perspective allows us to propose the dispersion of the classification rate of each class as its complementary measure. In this sense, a graphical performance metric, which is defined in a two dimensional space composed by accuracy and dispersion, is proposed to evaluate the performance of classifiers. We show that the combined values of accuracy and dispersion must fall within a clearly bounded two dimensional region, different for each problem. The nature of this region depends only on the a priori probability of each class, and not on the classifier used. Thus, the performance of multiclassifiers is represented in a two dimensional space where the models can be compared in a more fair manner, providing greater awareness of the strategies that are more accurate when trying to improve the performance of a classifier. Furthermore we experimentally analyze the behavior of seven different performance metrics based on the computation of the confusion matrix values in several scenarios, identifying clusters and relationships between measures. As shown in the experimentation, the graphical metric proposed is specially suitable in challenging, highly imbalanced and with a high number of classes datasets. The approach proposed is a novel point of view to address the evaluation of multiclassifiers and it is an alternative to other evaluation measures used in machine learning. © 2016 Elsevier Inc. All rights reserved.

1. Introduction Comparing learning/classification models is a complex and still open challenge. The first issue is the selection of the property of the model’s performance that we want to measure. For example, we could be interested in accuracy, speed, cost or even maybe in readability of the model. Once a performance measure is chosen, the next concern is to estimate it in as unbiased a manner as possible [37]. After that we should test whether the differences in the performance obtained by the



Corresponding author. E-mail addresses: [email protected], [email protected] (F. Fernández-Navarro).

http://dx.doi.org/10.1016/j.ins.2016.12.005 0020-0255/© 2016 Elsevier Inc. All rights reserved.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

61

model alone or in relation to others are statistically different [12]. The first component of this puzzle and related issues are the focus of this manuscript. Performance measures have undoubtedly received a great amount attention from the machine learning community. Special mention should be made to the Ferri et al.’s work [21] which proposed a taxonomy that divides the performance classification metrics in three families. Furthermore, Ferri et al.’s work [21] empirically tested 34 binary classification metrics and 18 multi-class metrics with 30 datasets coming from the UCI Machine Learning Repository and performed a sensitivity analysis in terms of several traits. Another significant study about performance measures was carried out in Sokolava and Lapalme [42] where the authors studied the type of changes made to a confusion matrix that do not affect a measure, thus preserving the classifier’s evaluation (measure invariance). The measures considered for binary classification were also generalized to multiclass classification. From a different point of view, in Carauna and Nicolescu [7] the authors conducted an empirical study using a several learning models and classification performance metrics. They showed that the metrics span a low dimensional manifold and derived a new scalar measure based on the combination of state-of-the-art metrics (combining linearly the AUC, RMSE and error rate metrics). Finally, it is also important to mentioning that most of the metrics reported in the literature to evaluate a classifier were inspired in metrics coming from areas such as medicine, statistics or information retrieval [6,24,29]. For example, the F-measure was originally introduced in the field of information retrieval, but nowadays it is routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction [43]. As previously mentioned, Ferri et al. classified the existing performance metrics in three groups [21]: • Metrics based on a threshold and a qualitative understanding of the error: accuracy, geometric accuracy, macro-averaged accuracy, mean F-measure, Kappa statistic and minimum sensitivity. They are used when we want to minimize the number of classification errors. • Metrics based on a probabilistic understanding of error, i.e., measuring the deviation from the true probability: mean absolute error, mean squared error or the LogLoss error (cross-entropy). These performance metrics are specially useful when we want to test the reliability of the model. • Metrics based on how well the model ranks the examples: Area Under the Curve (AUC) for binary problems [5] and the extension for multiclass problems [26]. Performance classification measures could also be categorized according to the type of problems they can handle [21]. For example, they can be designed for either binary or multiclass classification, or proposed to mutually exclusive (one pattern belongs to only one class) or overlapping (one pattern can belong to several classes) class problems. Some measures assume that the model provides only discrete scores (boolean classifiers can be evaluated using solely metrics based on a threshold) whereas other suppose that the model can also give a real-valued score (probabilistic or fuzzy classifier could be evaluated with confusion-only-based metrics as well as with metrics based on a probabilistic understanding of error or metrics based on how well the model ranks the examples. 
Finally, problems could also be divided according to the hierarchy of the classes: (i) flat problems (all classes on the same level) and (ii) hierarchical problems. Despite the numerous number of performance classification metrics existing in literature, accuracy, C, has been by far the most commonly used metric to evaluate classification learning models. However, accuracy, C, is not able to capture the different factors that characterize the performance of a classifier [17]. For example, C is specially misleading in imbalanced classification problems [32,36]. Besides, authors such as Ferri et al. already proved through experimentation that the vast majority of existing metrics defined using confusion matrices are highly (linearly) correlated among each other in most of the classification scenarios [21]. Motivated by these two facts, we propose a two-dimensional framework for the comparison of learning models composed by two metrics: the accuracy and the dispersion of the success rate among the different classes. The starting point of this work is the fact that accuracy can be seen as a weighted average of the classification rate of each class in a multiclass problem. In this sense, the natural measure of C’s representability is its associated dispersion measure D and it is therefore a complementary measure to evaluate a classifier. C and D are both included under the umbrella of metrics based on a threshold and a qualitative understanding error. The proposed framework for comparing classification models is designed for models outputting discrete scores (or outputting real-valued scores but considering solely the confusion matrix to evaluate them) which are tested in flat mutually-exclusive classification problems. The second objective of this work is to provide the mathematical relation between the two metrics considered in the framework. We show that the combined values of C and D must fall within a clearly bounded two dimensional region, different for each problem. The nature of this region depends only on the a priori probability of each class, and not on the classifier used. That allows us to visualize the performance of a multiclassifier and the comparison between classifiers, providing greater awareness of the strategies that are most suitable when trying to improve a classifier. A detailed explanation of how to obtain the boundaries of this region along with a number of examples of its usefulness are also provided throughout the manuscript. The statistical validity of this research work is shown by adopting an empirical approach [7,21,35]. Thus, during the experimentation a highly competitive classification model was applied to a selection of real world datasets and its performance was evaluated through various performance measures. These metric results are then compared by using correlation matrices. Hence, we analyzed how seven confusion-matrix-only-based metrics correlate to each other in a similar effort to the one shown in Ferri’s work [21] to ascertain to what extent and in what situations the results obtained with one metric are extensible to the other metrics. The results show that most of the metrics are linearly correlated to C, where D is not. These

62

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

differences become even larger in multi-class problems with a high number of classes, in highly imbalanced problems and in challenging problems (problems where the level of overall accuracy is relatively low). As aforementioned, Carauna and Nicolescu already considered aggregate classification metrics [7]. Specifically, they proposed the linear combination of the AUC, RMSE and accuracy metrics, thus obtaining a new scalar metric. In our opinion, although the metrics are in the same scale, they correspond to three different aspects of a classifier (see the taxonomy in [21]) and the linear combination in a scalar metric could be inadequate. In our proposal, the two metrics that were combined belong to the same family and they were expressed in the same scale by defining the upper bounds of D. Furthermore, as it is shown in the experimentation, the two metrics considered were not linearly correlated in the datasets considered suggesting that these two metrics could be non-cooperative and could therefore be used to guide multi-objective algorithms. In the next section, a literature review including the most representative confusion-only-based classification metrics coupled with a description of existing metrics that provides a visualization tool to evaluate models, is presented. We then describe the evaluation two-dimensional framework in Section 3. Section 4 studies the range for the measure proposed given different values of accuracy, and for each dataset in particular. Finally, the computational experiments of this study are provided in Section 5, and the paper closes with conclusions and remarks (Section 6). 2. Related work Two closely related aspects to our proposal are reviewed in this section: (i) first, the most relevant measures based on the computation of the confusion matrix values are fully described; (ii) after that, existing performance measures providing also a visualization perspective are described in the next subsection. 2.1. Performance measures computed using the confusion matrix This section focuses on performance measures that rely solely on the information obtained from the confusion matrix. Consequently, performance measures that either incorporate information in addition to that conveyed by the confusion matrix or account for classifiers that are not discrete are not considered here. Specifically, six performance measures were considered being all derived from the contingency or confusion matrix M which is defined as:



M=

ni j ;



Q 

ni j = N

(1)

i, j=1

where Q is the number of classes, N is the number of training or testing patterns and nij represents the number of times the patterns are predicted to be in class j when they really belong to class i. The metric proposed was compared to: • C: The Correct Classification Rate (CCR or C as denoted in [17]) is the most common measure used to assess the performance of a classifier. It is defined as the percentage of correctly classified patterns (or conversely the percentage of misclassification errors):

C=

Q 1 n j j. N

(2)

j=1

• MS: The minimum sensitivity (MS) of a classifier is the minimum value of the sensitivities for each class. This measure has been recently used in Machine Learning to evaluate the performance of a classifier in imbalanced classification environments. For example, accuracy C and MS are optimized through a two stage evolutionary process in [25], while, the optimization is carried out by a Pareto-based multiobjective optimization methodology based on a memetic evolutionary algorithm in [17]. Finally, MS is defined as:

MS = min





S j ; j = 1, . . . , Q ,

(3)

where S j = n j j /n j is the number of patterns correctly predicted to be in class j with respect to the total number of patterns in class j (sensitivity for class i) and nj is the number of patterns in the jth class. • AC: The Average Accuracy (AC) is the arithmetic average per-class effectiveness of a classifier. It is usually referred as macro-average [38] and is defined as:

AC =

Q 1  S j. Q

(4)

j=1

• GM: The Geometric Mean (GM) corresponds to the geometric average of the partial accuracies of each class:

 GM =

Q  j=1

Q1 Sj

.

(5)

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

63

• WKappa: The Kappa statistic is a measure of agreement between two classifiers [8], although it has been extensively employed as a classifier performance [44]. The Weighted Kappa (WKappa) is a modified version of the Kappa statistic that allows to assign different weights to different levels of aggregation between two variables [9]:

po ( w ) − pc ( w ) 1 − pc ( w )

W Kappa =

(6)

where

po ( w ) =

Q Q 1  wi j ni j , N

(7)

Q Q 1  wi j ni. n. j , N

(8)

i=1 j=1

and

pc ( w ) =

i=1 j=1



where ni. = Qj=1 ni j and n. j = Q n for i, j ∈ {1, . . . , Q } and the weight wij quantifies the degree of discrepancy bei=1 i j tween the true and the predicted class. • F-score: The F-score measures the relations between data’s positive labels and those given by a classifier based on a perclass average. This performance metric has been widely employed in information retrieval [4]. The F-score for multi-class classification problems is defined as:

Fβ =

(β 2 + 1 )(P · R ) β 2 (P + R )

(9)

where P is the precision defined as:

Q P=

j=1

Pj

Q

, Pj =

njj n. j

(10)

and R is the recall and it is defined as:

Q

R=

j=1

Q

Rj

, Rj =

njj n j.

(11)

All the metrics considered are sensitive to changes in the classification threshold as they are defined using values coming from the confusion matrix. However only two of the metrics considered are sensitive to class frequency changes (C and WKappa). F-score is also partially influenced by changes in class frequencies as described in [21]. Meanwhile, AC, GM and MS are not influenced by changes in class frequencies, but better characterize the performance of a classifier in unbalanced classification problems [33]. Our proposal is also included in this subgroup of performance measures. 2.2. Graphical performance measures Graphical analysis methods and their associated metrics have proven to be very useful tools in studying both the behavior and the performance classifiers. Traditionally, graphical analysis methods have solely been developed to score classifiers.1 Among these, the Receiver Operating Characteristic (ROC) analysis [15,16] is the most popular graphical metric within in the machine learning community. ROC plots allow a classifier to be evaluated and optimized over all possible operating points. The ROC Area Under the Curve (AUC) has become a standard performance evaluation criterion in two class instance recognition problems, used to compare different classifiers independently of operating points, priors, and cost [5,22]. Many works has been made to generalize the AUC metric to multiclass classification problems. For example, the definition of the ROC AUC to the case of more than two classes by averaging pairwise comparisons is extended in [26]. A one versus all approach (considering the AUC between each class with respect to all other) was implemented to estimate a simplified version of the Volume under the ROC Surface (VUS) in [41]. The multi-dimensional operating characteristic was approximated using a pairwise approach and discounting some interactions aiming to reduce the computational burden of the VUS estimation in [34]. A multiobjetive optimization approach where the objective is to simultaneously minimize Q (Q − 1 ) fitness functions corresponding to the misclassification rates given by the off-diagonal elements of the confusion matrix was proposed in [14]. Despite the significant efforts made by machine learning researchers to extend the ROC analysis to multiclass problems, state-of-the-art proposals have many practical important issues, such as the computational complexity and representational comprehensibility,2 that precludes their use in practice. 1

A scoring classifier provides a real-valued score on each instance and class. Probabilistic classifiers are a special case of scoring classifiers. The original advantage provided by the graphical two dimensional representation of bi-class ROC curves is lost among the increased number of dimensions to plot. 2

64

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

From a different perspective, Lift charts are a graphical metrics that plot the true positive patterns against the dataset size. They are closely related to the ROC curves, and, therefore, inherit all their advantages and shortcomings. They are traditionally used in business problems [2]. Precision-Recall (PR) curves are similar to ROC curves and Lift charts as they study the trade-off between the well classified positive and the number of misclassified negative patterns plotting the precision of the model as a function of its recall [11]. Cost curves are a graphical technique and an alternative to ROC curves for visualizing the performance of binary classifiers [13]. Brier curves are a different way of visualizing the classifier performance in the cost space [27,28]. Another visualization technique in two dimensions is presented in [1] where the authors use the metrics defined in [7] and apply the multidimensional scaling technique to analyze the results of the classifier with different metrics and domains. Basically, research papers analyzed as an alternative to ROC curves have almost the same problems than those presented in the ROC analysis section: they are very effective in binary problems but they have no natural or straightforward extension to multi-class classification problems. Finally, the minimum of the accuracies obtained for each class (minimum sensitivity MS) is considered as a complementary measure for the C metric in [17]. These two measures were represented in the unit-square and simultaneously optimized in different ways [17,19,25]. This measurement, however, has clear limitations of its own: (i) MS only carries the information about one of the classes, focusing the attention only on the worst classified class and (ii) MS is an excessively conservative measurement. 3. The proposed framework for evaluating classifiers 3.1. Defining accuracy and dispersion for multiclass classification problems Accuracy C is the most commonly used measure for the evaluation of the success of a classifier. In a problem with Q classes and N instances, where Nq of those instances belong to the qth class (q = 1, . . . , Q), the accuracy of a classifier is given by:

Q

q=1 Cq

C=

N

,

(12)

where Cq is the number of correctly classified instances that belong to class q. C can therefore be rewritten as:

C=

Q Q   Cq Nq = cq pq , Nq N q=1

where cq =

Cq Nq

(13)

q=1

is again the accuracy for class q and pq =

Nq N

is the probability (based on the available dataset) an instance

has to belong to that same class. Thus, C can be interpreted as the weighted average of the accuracies per class, where the weights are those class probabilities just mentioned. Since C is a weighted measure, it seems natural for it to be presented along with its corresponding variance (or typical deviation), also weighted. Being an average for C, the associated variance informs of its validity. The pair defined by accuracy and this extra measure allows for a more detailed evaluation of the quality of a classifier, while serving as a profile for the comparison of different classifiers. It must remain clear, however, that this interpretation of C as an average only complements the usual interpretation. It’s an alternative way of looking at the same value: • When understood in the usual manner, C is the average success of all the instances in a classification process. The deviation does not add value to this particular definition. • When understood as the average success of classes, variability does indeed provide useful information for the decider. Using the terms presented in the previous section, the variance D2 and the deviation D of a classifier are defined as:

D2 =

Q 

(cq − C )2 pq .

D=



D2 .

(14)

q=1

A close to zero value of D2 (or D) corresponds to a greater representativity of C. The optimal outcome takes place when D2 (or D) is exactly zero. This only happens when the success rate is equal in all classes. On the other hand, D2 (or D) increases with the differences between success rates on each class. It must therefore be very useful to have such a measure for problems where a homogeneous success rate per class is important. On the question of whether to use D or D2 , the advantage D presents is that it has the same scale as C, while D2 has squared units. Since they both have values within [0, 1], values of D2 will always be smaller than those of D. This makes the differentiations made by D much more apparent. D is therefore chosen for the analysis and visualization of results in this article. However, from algebraic point of view and because of its very definition, D2 is easier to use. Therefore, D2 is chosen for the theoretical study of the measure. The final value provided one way or another, will be presented as D. Along with C, this value allows for the representation of any classifier in the two dimensional space (C, D). This representation will allow us to rank classifiers with similar values

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

65

Table 1 Analysis of differences among the most promising metrics for imbalanced classification. CM(A ) = CM(B ) = 0.40 GMM(A ) = GMM(B ) = MSM(A ) = MSM(B ) = 0.00 ACM(A) ≈ ACM(B) (≈0.16) D M(A) = D M(B)



⎜ ⎜ ⎜ M (A ) = ⎜ ⎜ ⎝

80 5 5 5 5 100

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0





0 19 ⎜ 5 0⎟ ⎟ ⎜ 0⎟ ⎜ 5 ⎟; M (B ) = ⎜ 5 0⎟ ⎜ ⎝ 4 0⎠ 0 0

20 0 0 0 0 0

41 0 0 0 0 1

0 0 0 0 0 38

0 0 0 0 1 1

0 0 0 0 0 60

⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

Fig. 1. Illustrative example of how a set different classifiers could be ordered in the (C, D) space.

for C, where those with lower values for D will be preferred. Although the (C, D) plane is not a completely ordered space, the numerical and graphical information they provide can still be useful, bearing in mind that, in the worst case scenario for a decision to be made, C alone can always prevail as the final decider.

3.2. A preliminary comparison of the merits of existing metrics As previously mentioned, the main goal of this paper is to propose a two dimensional framework for evaluating classifiers, especially when practitioners have to address an imbalanced classification problem [23]. Therefore, the first step in the proposal should be to show that the existing measures are not able to capture existing differences in classification performance for the type of problem selected in certain scenarios. With this purpose in mind, we have created a synthetic scenario. Two classifiers (A and B) produce the following Ms (Q = 6, N = 200) in a certain task. Table 1 presents the results of C and those metrics that are not sensitive to class frequency changes (AC, MS and GM). Note that C, AC, MS and GM were unable to detect any performance difference between classifiers A and B. D is the only metric able to discriminate the best classification model for the situation exposed (even AC fails at detecting differences in classification). One would expect that a valid measure of performance would output a better performance of classifier B compared to A, as the classifier B is able to better discriminate a greater amount of classes, however this is not possible using the state-of-the-art metrics. This synthetic situation is more common than expected due to the fact that, when classifying imbalanced datasets, classifiers are not specifically trained to optimize MS, and GM metrics tend to report a performance of zero on those two metrics. Thus, D will usually provide a different information on the classifier. Hence, D promotes the improvement of the minority class accuracies as a whole, generating a more uniformed distribution of the accuracies per class.

3.3. Ordering classifiers in the (C, D) space In order to illustrate the advantages (and uncertainties) that arise from the definition of the dispersion as an additional measure to accuracy, let us consider the scenario represented in Fig. 1, where five classifiers are represented. From the accuracy point of view, these classifiers would be ordered according to the sequence Q, P, (R,S), T. Although a statistical quantification of the significance of the differences would also be necessary, it can be preliminarily said that classifiers P and Q performed more poorly than the other three. Not much else could be said when taking only C into consideration. The additional information provided by D allows a refinement of the interpretation: S is better than R, since it has a lower dispersion. The same can be said about T with respect to R, since it is better both in accuracy and dispersion. However, the comparison between S and T is not possible, since neither is better than the other under both criteria. To solve this issue, it is necessary to estimate the upper boundary of D for each dataset to compare the results with respect to C using the same scale.

66

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

4. On the upper boundaries for D One of the aims of this work is to present an upper boundary for D, for each value of C. This section intends to prove that such a boundary is determined by both accuracy and the probabilities pq . A global boundary for all datasets will be presented first. It will then be used as a canvas for the definitive one. Result 1. For each value 0 ≤ C ≤ 1 is:

D2  ∗ (C ),

(15)

where ∗ (C ) = C − C 2 . The equality is met if and only if each and every accuracy per class is 0 or 1. Proof. Since

D2 =

Q

c p q=1 q q

Q 

= C and

(cq − C )2 pq =

q=1

=

Q 

Q

q=1

Q 

pq = 1:

cq2 pq − 2C

q=1

cq2 pq − C 2 

q=1

Q 

Q 

cq pq + C 2

q=1

Q 

pq

q=1

cq pq − C 2 = C − C 2 = ∗ (C ).

(16)

q=1

The inequality is derived from 0 ≤ cq ≤ 1, and the equality between cq and cq2 happens only when they are 0 or 1. Once this first boundary is established, π ∗ (C ) = the problem to be solved can now be made.



∗ (C ) is defined as the equivalent boundary for D. The definition of

Definition. For every 0 ≤ C ≤ 1 let us define:

2 (C ) = max D2 (c1 , . . . , cQ ),

(17)

c∈FC

where FC = {c : cq pq = C, 0  cq  1}. The existence of this maximum, that allows for a good definition of 2 , is ensured 2 by D being continuous in its arguments as a function of the weights per class, and by FC being a compact set. Result 2. For each 0 ≤ C ≤ 1, the value for 2 (C) is met at the frontier of FC .





Proof. It is immediate to verify that D2 c1 , . . . , cQ is a strictly convex function in FC , with only one critical point c1 = . . . = cQ = C, that is a minimum of the function. Therefore the maximum will be found at the frontier of the feasible region. Such frontier is characterized in the next section, along with the points that are potentially optimal in it. 4.1. Vertices for FC

Given the equations for FC , it can be seen that it is the intersection of the unit Q-cube with the hyperplane cq pq = C. It is therefore a polyhedric set with a frontier determined by linear segments. On each one of these segments, the convexity of the optimized function will lead to finding the maximum in one of its two extremes, that are nothing but vertices of the feasible region. Thus, it will suffice to characterize such vertices and and then limit the optimization to them. 4.1.1. Characterization

Let V be a vertex of FC , obtained as the intersection of an edge of the Q-cube with the hyperplane cq pq = C. All the coordinates of V except one will be 0 or 1 (boundary condition), and the remaining one will be determined by the intersection with the hyperplane as long as a feasible value (0 ≤ cq ≤ 1) is met. For each one of these vertices, the pair (I, i) is defined. I represents the set of indexes for cq = 1, while i is the index where the value for cq is not necessarily extreme (0 by default if ci = 1). The pair (I, i) will be used from now on as the representation of the vertex it refers to. Let us consider Fig. 2, made for a 3-class problem. The vertex V1 is obtained as the intersection of the edge c1 = c3 = 1 and the plane c1 p1 + c2 p2 + c3 p3 = C, which means that c2 = p1 (C − c1 p1 − c1 p1 ). Therefore, in this example, V1 would be 2 identified by the pair ({1, 3}, 2). If C increased to the point of placing the vertex in the plane c2 = 1, then V1 = (1, 1, 1 ) would be identified as I = ({1, 2, 3}, 0 ). 4.1.2. Redefinition of the problem The elements presented on the previous subsection allow for the problem to be redefined. The search space can now be reduced to the finite set of vertices of FC . The pair (I, i) that defines these vertices is therefore assigned by (C, I, i) to the variance calculated in the represented vertex.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

67

Fig. 2. Characterization of the vertices of FC .

Definition. The set of extremes of the problem is the set:



S=

s (I ) =





p q : I ⊆ {1 , . . . , Q }

(18)

I

Result 3. It is verified that:



(C, I, i ) =

s(I ) + (C−spi(I )) − C 2 ∗ (C )

i > 0, s ( I )  C  s ( I ) + pi i=0

(19)

Proof. Let V be the vertex associated to the pair (I, i), i.e., (C, I, i ) = D2 (V ). Suppose i > 0. Therefore, using the definitions of I and i:

(C, I, i ) =



cq2 pq − C 2 =





pq +

I

C−

I

pi

pq

2

pi − C 2 = s ( I ) +

(C − s(I ) )2 pi

− C2

(20)

The condition over the value for C is an immediate consequence of the feasibility condition 0 ≤ ci ≤ 1. For i = 0, all the components in V take one of its extreme values, and the outcome can be obtained by using Result 1. If we combined all these results, the following is obtained. Result 4. For each 0 < C < 1 we have:



max

 (C ) = (I, i ) i = 0 ∗ (C ) 2

{(C, I, i ) : s(I ) < C < s(I ) + pi } C ∈/ S

(21)

C∈S

Function 2 (and from it, , the boundary for D) will be obtained, given this result, by calculating for each one of the feasible pairs (I, i), the family of curves (C, I, i) and from them the maximum at each point. 4.2. Building the curve (C) 4.2.1. Introduction and preliminary results This subsection analyzes the maximization problem proposed in Result 4. Let us begin by describing the aspect of each one of the elements that make this family and show a first approximation for the calculation of the boundary being searched for. Property. If i = 0, (C, I, i) restricted to the feasible values for C is a decreasing parabolic arc when C < s(I) and increasing when C > s(I ) + pi . Proof. It is immediate, by its own expression, that it is indeed a parabola. As far as the growth of the parabola is concerned, it is easy to verify that:



 d (C, I, i ) dC

C=s(I )

= −2s(I ) < 0

68

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

Fig. 3. An example of the relative position of curves.

Fig. 4. Representation of the (C, D) space for a binary classification problem with probabilities p 

1 2

and 1 − p.



 d (C, I, i ) dC

= 2 ( 1 − ( s ( I ) + pi ) ) > 0

(22)

C=s(I )+ pi

Then, because  is a parabola, the sign of its derivate is negative to the left of s(I ) and positive to the right of s (I ) + pi and the result holds. Fig. 3 shows the appearance of one of these curves, π (C, I, i ) = (C, I, i ) being the boundary of the deviation. The properties of such paraboles are described in Appendix A. In order to obtain the curve  it will suffice with the representation of each one of the arcs, choosing the highest one for a given value of C, as it is shown in the following example. Example 1. Suppose a problem with two classes with probabilities p  and pi are: I

i

s (I )

pi

O O {1} {2}

1 2 2 1

0 0 p 1− p

p 1− p 1− p p

1 2

and 1 − p (Fig. 4). The possible values for I, i, s(I)

Bearing in mind that the expression of each curve 2 is:

2 (C ) =

⎧ 2 C − C2 ⎪ p ⎪ ⎪ ⎨ (C−p)2 p+

1−p 2

−C ⎪ ⎪ ⎪ ⎩1 − p + C2 1−p

0C  p − C2

(C−(1−p)) p

pC  2

1 2

− C2

1 2

(23)

C 1− p

1− pC 1

Although the plot procedure for all curves could be applied to the resolution of any problem, its complexity quickly increases with Q. Let us bear in mind that the feasible set has in general Q2Q−1 possible vertices. This makes the calculation almost impossible for a too large number of classes. This leads us to the search of an alternative procedure. An example with three classes will present it and help, along with the previous example, the later proof and properties the method is based on. Example 2. Let us consider a 3-class problem with p1 = 0.2, p2 = 0.35 and p3 = 0.45 as probabilities per class. Applying the previous procedure (without going again over it step by step), the following curve  is obtained (Fig. 5). Figs. 4 and 5 share the following common features: (i) symmetry and (ii) between two consecutive extremes, curve  is, either one parabola (as it happens on the first part of the curve), or the combination of two parabolas that meet at an intermediate point. Once both results are proven, a procedure that is based on them and makes a feasible upper boundary plot is proposed. Result 5. Function (C) is symmetric with respect to C =

1 2.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

69

Fig. 5. Representation of the (C, D) space for a three class classification problem with probabilities with p1 = 0.2, p2 = 0.35 and p3 = 0.45.

Proof. The result is immediate in the set of extreme points, since in them the curve coincides with π ∗ , that is symmetric. For the rest of the interval [0, 1], (C) is defined by:

(C ) =

max

(I, i ) i = 0

{π (C, I, i ) : s(I ) < C < s(I ) + pi }

(24)

Therefore, in order to prove the symmetry, it would suffice to verify that for every feasible election of I, i and C it is  possible to find indexes J and j such that π (C, I, i ) = π (1 − C, J, j ) in the interval s(J ), s(J ) + p j . Let J = {1, . . . , Q } − {I, i} and j = i. It is immediate to verify that s(J ) = 1 − (s(I ) + pi ) and therefore s(J ) + p j = 1 − s(I ), hence deducing the restriction on the interval. The coincidence of the value of the function is only a matter of arithmetics. In order to enunciate and prove the following result, as well as the later algorithm for the calculation of curve , the following elements related to the set of extrema are defined. Definition. Given s ∈ S, let us define:

Is = {I ⊂ {1, . . . , Q } : s(I ) = s} i(s ) = min {i ∈ I : I ∈ / Is } j (s ) = min {i ∈ I : I ∈ Is }

(25)

where Is is the set of all possible index combinations with a sum of probabilities s, i(s) is the index of the lowest probability that can be omitted in order to obtain the sum s and j(s) is the minimum needed for the calculation. Since these definitions will later on be used, let us clarify their meaning with an example. Example 3. Suppose a five class problem with probabilities p1 = p2 = p3 = 0.1; p4 = 0.3 and p5 = 0.4 and let s = 0.5. Thus,

I0.5 = {{1, 2, 4}, {1, 3, 4}, {2, 3, 4}, {1, 5}, {2, 5}, {3, 5}}, are all the combinations with a 0.5 sum of probabilities. Therefore, i(0.5 ) = 1, since the lowest index can be omitted in order to obtain the desired result (for instance, using combination {2, 3, 4}) and j (0.5 ) = 1 since it is the lowest index that can be used to obtain this value by taking for example {1, 2, 4}). Notice also that since the per-class probabilities are ordered, the lowest index coincides with the least probability included in one or another case. Let us suppose from this point, that this is the general case, p1  . . .  pQ which can be so by simply relabeling classes. Result 6. Let us consider the increasing ordered set S of extremes of the problem, and let s and t be consecutive elements of it. For each C ∈ [s, t] it can be verified that: 1. If s + pi(s ) = t, (C ) = π (C, I, i(s ) ) where I is any element of Is verifying i(s) ∈ I. 2. If s + pi(s ) > t,

 C  C∗ π (C, I, i(s ) ) (C ) = π (C, J − { j (t )}, j (t ) ) C  C ∗

(26)

where I is defined as in 1 and J is any element of It where the minimum j(t) is reached, and C∗ is the only intersection point of both curves in the considered interval. Proof. In order to prove each one these statements, let us indicate some necessary aspects of them. Since s is an extreme and  is continuous (maximum in a finite set of continuous functions), its expression in a certain range to the right of s will be:

(C ) = max{π (C, I, i ) : s(I ) = s}

(27)

i∈ /I

C−s 2

Since the expression of the curves is given by (C, I, i ) = s + ( p ) − C 2 , the maximum will be reached when pi is minii mum. Since the set of probabilities is ordered increasingly, it will be i = i(s ) by its very definition. With a similar reasoning, expression  in a range to the left of t will be given by:

(C ) = π (C, J − { j (t )}, j (t ) ),

(28)

70

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

with J under the conditions indicated in the statement. Finally, given any two curves of the family analyzed, say π (C, A, a) and π (C, B, b), the discriminant of the equation (C, A, a ) − (C, B, b) = 0 is:

4 ( s ( A ) − s ( B ) ) ( s ( A ) + pa − ( s ( B ) + pb ) ) pa pb

(29)

with a sign that depends only on the relative position of the intervals that define both curves: [s(A ), s(A ) + pa ] and [s(B ), s(B ) + pb ]. Let us now analyze each one of the two cases proposed. 1. Let us consider any other element π (C, J, j) defined in the interval [s, t]. Since there are no other extremes between s and t, it will have to be s(J) ≤ s and s(J ) + p j  t. Under these terms, the discriminant will be zero or negative depending on whether the previous inequality is at the equality or not, with no intersection point between both curves except for, at most, s or t. Since π (C, I, i(s)) dominates in s, it will do so during the entire [s, t] interval, hence proving the result. 2. Let us consider J under the terms in Result 2, which implies s(J ) = t < s(I ) + pi(s ) and s(J − { j (t )} ) < s (otherwise s(J − { j (t )} ) = s and j(t) < i(s) would imply 2 (C ) = π (C, J − { j (t )}, j (t ) ) for the right side values near s). Under these terms, the equation:

(C, J − { j (t )}, j (t ) ) − (C, I, i(s ) ) = 0

(30)

has a positive discriminant, and there are two intersection points. Let us verify that one and only one of them is in the [s, t] interval. The fact that each one of the curves dominates the other guarantees, because of the continuity of both, that at least one of the solutions is found between both extremes, because the difference between both functions is continuous and changes its sign in the extremes of the interval. Let C∗ be its corresponding accuracy. The derivative of the difference (C, J − { j (t )}, j (t ) ) − (C, I, i(s ) ) is given by:

 

d ((C, J − { j (t )}, j (t ) ) − (C, I, i(s ) )) = −2 C dC

1 1 − pi ( s ) p j (t )



+

s s (J ) − p j (t ) pi ( s )



(31)

that evaluated in both extremes takes the values:



 d (π (C, J − { j (t )}, j (t ) ) − π (C, I, i(s ) )) dC

=−

2 (s (J ) − s ) > 0 p j (t )

 d (π (C, J − { j (t )}, j (t ) ) − π (C, I, i(s ) )) dC

=−

   2 t − s + pi ( s ) > 0 p j (t ) pi(s )

C=s C=t

(32)

Since the derived function is a line in C and positive at both extremes of the interval, it will be positive in the entire interval. Therefore the difference between both functions is increasing, proving that C∗ is the accuracy for the only intersection point of curves s and t. From the definition of the curves, the domination in each one of the subintervals that this point defines can be inferred. This is not valid if pi(s ) = p j (t ) , since in that case the difference would not be quadratic, but linear. It is immediate to verify that under those circumstances, the only intersection point is exactly the mid-point on interval [s, t]. These results justify the validity of the following procedure for making curve . 4.2.2. Algorithm for the construction of curve  Using the symmetry, and knowing the characteristics of curve  between each two consecutive extremes, its construction will be doing going along the C axes, extreme by extreme, from 0 to 21 , obtaining the remaining half of the curve by symmetry. The step by step procedure is as follows: 1. Determine and order the subset of extremes with values lower than 2. For each extreme calculate:

i(t ) = i(st )

1 2.

j (t ) = j (st )

Let 0 = s0 < s1 < . . . < sT be such extremes.

(33)

bearing in mind that it will not make sense to calculate the second amount if t = 0, since s0 = 0 has no preceding extreme. Take t = 0. 3. Repeat, for each t: If st+1 = st + pi(t ) ,

2 (C ) = st + Otherwise:

 (C ) = 2

until t = T − 1.



(C − st )2 pi(t )

− C 2 st  C  st+1

2 t) st + (C−s − C2 pi t

()

st+1 − p j (t+1 ) +

(34)

st  C  C ∗

(C−(st+1 −p j(t+1) )) − C 2 C ∗  C  s t+1 p j (t+1 ) 2

(35)

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

4. If sT <

71

1 2:

(C ) = sT +

(C − sT )2 pi ( T )

− C 2 sT  C 

1 2

(36)

5. Finally, (C) is defined as:

1 C 1 2

(C ) = (1 − C )

(37)

Example 4. Applying it to Example 2, remember that p1 = 0.2, p2 = 0.35 and p3 = 0.45. Step 1. In this case T = 3 and:

0 = s0 < 0.2 = s1 < 0.35 = s2 < 0.45 = s3 <

1 2

Step 2. It is immediate that: t

i (t )

j (t )

0 1 2 3

1 2 1 1

1 2 3

Step 3. • Since s1 = s0 + pi(0 ) :

2 (C ) =

C2 − C 2 0  C  0.2 0.2

• Since s2 = s1 + pi(1 ) :

 2 (C ) = 0C.22 + 0.35

(C−0,2 )2

−C

0.35 2

− C2

0.2  C  0.275 0.275  C  0.35

C ∗ = 0, 275 is the intersection point between both curves. • Since s3 = s2 + pi(2 ) :



+  (C ) = 0C.35 2 2

0.45

(C−0.35 )2

− C2

0.2

− C2

0.35  C  0.4055 0.4055  C  0.45

C ∗ = 0.4055 is in this case the intersection point. Step 4. Since s3 <

1 2:

(C ) = 0.45 +

(C − 0.45 )2 0.2

− C 2 0.45  C 

1 2

Step 5. The rest of the curve is defined by symmetry. 4.3. Ordering classifiers in the (C, D) space Once C and D are calculated and represented for different classifiers, they can be represented along with the  curve associated to the dataset that the classifiers have tried to tackle. This representation is a tool to visualize the basic elements of the problem as well as the main characteristics and their possible solutions. Considering the example in Fig. 6, the following conclusions could be obtained: • Algorithms P and Q are equivalent in terms of accuracy, but not with respect to dispersion. The same happens with R and S, that are more precise than P and Q. • In terms of dispersion, Q is preferred over P and S preferred over R. • S is therefore the best of all four algorithms, while P is the worst. The comparison of the remaining two algorithms is more complicated, since although R slightly improves Q on accuracy, the dispersion is the same for both. R is almost on the limit of the maximum dispersion possible (very close to the  curve, while Q is further from this limit).

72

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

Fig. 6. Illustrative example of how a set different classifiers are in the (C, D) space using the D upper bound.

Fig. 7. Example of representation of c2 and DN in space against c1 .

4.4. Normalized dispersion Definition. Given the pair (C, D) (0 < C < 1) the normalized dispersion DN is defined as:

DN =

D

(38)

(C )

By definition, the normalized dispersion verifies 0 ≤ DN ≤ 1. Except for optimal classifiers (C = 1) or the opposite (C = 0), both uninteresting for their straightforwardness (all instances either correctly or incorrectly classified), the spatial representation of the pair (C, DN ) is the unit square. The best classifier is found for the highest accuracy and the lowest dispersion. No longer the comparison is subject to the boundaries of the (C, D) plane, but according to the upper limits of . Example 5. Let us consider the 2-class problem with probability p  12 for the first of the two classes. Let there be a classifier returning an accuracy C; a value from which the condition C  1 − pcan be assumed without the loss of generality, since such a result is obtained by simply assigning all instances to the second class. Let us consider all possible combinations of feasible accuracies per class c1 and c2 , that are bound by the condition pc1 + (1 − p)c2 = C, which allows for c2 to be a function of c1 and C. Bearing in mind that 0 ≤ c1 , c2 ≤ 1 the following is obtained:

c2 =

1 −C C − pc1 1−  c1  1 1− p p

(39)

and in order to calculate the DN associated to each combination, D2 and2 are needed:

D2 (C ) = p(c1 − C )2 + (1 − p)(c2 − C )2 = p(c1 − C )2 + 2 (C ) = 1 − p +

p2 p (c1 − C )2 = (c1 − C )2 1− p 1− p

(C − (1 − p))2 p

− C2 =

1− p (1 − C )2 p

(40)

(41)

hence:

DN (C ) =

p |c1 − C | 1 − p 1 −C

(42)

Fig. 7 shows the simultaneous representation of c2 and DN in space against c1 . It can be appreciated how the value for the normalized dispersion reaches its maximum value when class 2, being the majority class, is perfectly classified and the classification success is minimum for the minority class. The reduction of accuracy in the second class implies an increase in the first and a reduction in the dispersion until a zero is reached where both coincide (as well as C). From that point, dispersion increases again.

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

73

Fig. 8. Graphical representation of the correspondence between classifiers in the (C, D) space and the (C, DN ) space.

On the other hand, Fig. 8 shows how the normalization for D (in Fig. 6) clarifies the comparisons between algorithms in terms of their dispersion: R is slightly worse than P according to this criterion, despite having a lower absolute dispersion, and very inferior to Q, with which it had the same value. However, the normalization also has as a consequence the loss of information about the problem, since sight is lost as to the real possibilities of how to optimize D. In the initial representation, the low value of dispersion for the accuracy achieved by C D indicates that the possibilities of improving this criterion are reduced. On the other hand, considering P and Q, the representation shows that there is room for improvement. This matter, relevant as it is in order to establish goals for the improvement of classifiers, is omitted when DN is analyzed. The joint representation of both the regular, and the normalized plot, can be the solution. 5. Computational experiments This section presents the design of the experimental study followed in this paper (Section 5.1), the results obtained for classification datasets (Section 5.2), and its corresponding discussion (Section 5.3). 5.1. Experimental study In this subsection, the experiments are clearly described, defining the datasets and algorithms considered to validate the proposal and the parameters to be optimized. The performance measures used for evaluating the proposal are those described in Section 2.1. 5.1.1. Datasets selected Table 2 shows the characteristics of the 29 datasets, including the number of patterns, attributes and classes, and also the class distribution (number of patterns per class). The publicly available classification datasets were obtained from benchmark repositories (UCI [3] and mldata.org [39]). The synthetic toy dataset was generated as proposed in [10] with 300 patterns. Saureus4 (S4) corresponds to a real predictive microbiology problem of discriminating the growth/no growth of Staphylococcus Aureus and it was obtained from [20]. The selected datasets include four binary problems and twenty five multi-class problems and present different numbers of instances, features and classes (see Table 2). One of the main goals when choosing the datasets to consider was ensuring that they were imbalanced. To measure imbalance in multi-class datasets, we use the “coefficient of variation” (C.V.) as recommended by Wu et al. [45]. Specifically, C.V. is the proportion of the deviation in the observed number of examples for each class versus the expected number of examples in each class. For our purposes, datasets with a C.V. above 0.7071 (a class ratio of 3:1 on a binary dataset) are considered highly imbalanced datasets. This evenly divides our pool of available datasets into 14 highly imbalanced datasets and 15 datasets with a class ratio below 3:1. Finally, it is also important to mention that all nominal attributes were transformed into as many binary attributes as the number of categories and all the datasets were property standardized, considering only the training set to obtain the mean and standard deviation for each variable. 5.1.2. Algorithm selected to validate the proposal and model selection Recently, 179 classifiers arising from 17 families with 121 data sets (the whole UCI data base excluding the large-scale problems and other known real problems) were evaluated to determine the most competitive classifiers of the current literature in [18]. 
The Extreme Learning Machine (KELM) [30] method achieved the highest value of Probability of Achieving the Maximum Accuracies (PAMA, in %) with a total probability of 13.2%. Furthermore, the KELM method also showed a competitive performance in the Percentage of the Maximum Accuracy (PMA) for each dataset (averaged over all the datasets) and in the probability of achieving 95% (P95) of the maximum accuracy over all the datasets. For all that and for its extreme simplicity, the KELM method was selected to validate our proposal. Extreme Learning Machine (ELM) is an efficient algorithm that determines the output weights of a Single Layer Feedforward Neural Network (SLFNN) using an analytical solution instead of the standard gradient descent algorithm [31]. During

74

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80 Table 2 Characteristics of the twenty nine datasets used for the experiments: number of instances (#Pat.), total number of inputs (#Attr.), number of classes (#Classes), per-class distribution of the instances and C.V. Dataset

#Pat.

#Attr.

#Classes

Class distribution

C.V.

Hepatitis (HP)a Breast-cancer (BC) Haberman (HB) Card (CR)

155 286 306 690

19 15 3 51

2 2 2 2

(32, 123) (201, 85) (225, 81) (307, 383)

0.8303 0.5736 0.6655 0.1558

Contact-lenses (CL)a Pasture (PA) Squash-stored (SS) Squash-unstored (SU) Tae (TA) Newthyroid (NT)a Balance-scale (BS)

24 36 52 52 151 215 625

6 25 51 52 54 5 4

3 3 3 3 3 3 3

(15, 5, 4) (12, 12, 12) (23, 21, 8) (24, 24, 4) (49, 50, 52) (30, 150, 35) (288, 49, 288)

0.7603 0.0 0 0 0 0.4699 0.6662 0.1329 0.9472 0.6623

Lymph (LY)a Saureus4 (S4)a Vehicle (VH) SWD (SW) Car (CA)a

148 287 946 10 0 0 1728

38 3 18 10 21

4 4 4 4 4

(2, 81, 61, 4) (117, 45, 12, 113) (199, 212, 218, 218) (32, 352, 399, 217) (1210, 384, 69, 65)

1.0840 0.7213 0.0423 0.6582 1.2495

Bondrate (BO)a Toy (TO) Eucalyptus (EU) Anneal (AN)a LEV (LE)a

57 300 736 898 10 0 0

37 2 91 59 4

5 5 5 5 5

(6, 33, 12, 5, 1) (35, 87, 79, 68, 31) (180, 107, 130, 214, 105) (8, 99, 684, 67, 40) (93, 280, 403, 197, 27)

1.1141 0.4265 0.3263 1.5811 0.7458

Automobile (AU) Glass (GL)a Winequality-red (WR)a

205 214 1599

71 9 11

6 6 6

(3, 22, 67, 54, 32, 27) (70, 76, 17, 13, 9, 29) (10, 53, 681, 638, 199, 18)

0.6734 0.8339 1.1717

Zoo (ZO)a Segmentation (SG)

101 2310

16 19

7 7

(41, 20, 5, 13, 4, 8, 10) (330, 330, 330, 330, 330, 330, 330)

0.8937 0.0 0 0 0

Ecoli (EC)a

336

7

8

(143, 77, 52, 35, 20, 5, 2, 2)

1.1604

ESL (ES)a

488

4

9

0.9457

ERA (ER)

10 0 0

4

9

(2, 12, 38, 100, 116, 135, 62, 19, 4) (92, 142, 181, 172, 158, 118, 88, 31, 18)

a

0.5303

Symbol used to denote highly imbalanced datasets (C.V. ≥ 0.7071).

the training process, ELM determines its training parameters, β, by minimizing a Least Squared Error function. The output function of the kernelized ELM (KELM) [30] for the pattern x is defined as:

 f ( x ) = K ( x )T

I

γ

−1 + ELM

Y,

(43)

where Y ∈ RN × RQ is the target output matrix (N is the number of patterns and Q the number of classes), K(x ) : RK → RN is the vector of kernel functions K(x )T = [K (x, x1 ), . . . , K (x, xN )] (K is the number of attributes in the original dataset) and γ is a regularization parameter. The Gaussian kernel function here considered is

K (x, xi ) = exp(−k||x − xi ||2 ),

i = 1, . . . , N

(44)

where k ∈ R is the kernel parameter. Similarly the kernel matrix ELM = [i, j ]i, j=1,...,N is defined element by element as

i, j = K (xi , x j ).

(45)

The experimental design was conducted using 30 random stratified splits of 75% and 25% of the patterns for the training and test sets respectively (as suggested in [40]). The KELM classifier was run using the implementation available in the Extreme Learning Machines webpage.3 The optimal two hyperparameter values for the KELM model were selected using a nested five fold cross-validation over the training set and the criteria for selecting the best configuration was the C metric (k ∈ {10−3 , 10−2 , . . . , 103 } and γ ∈ {10−3 , 10−2 , . . . , 103 }). Finally, each pair of metrics are compared by means of the Correlation test using a level of significance of α = 0.1.

3

http://www.ntu.edu.sg/home/egbhuang/

M. Carbonero-Ruz et al. / Information Sciences 382–383 (2017) 60–80

75

Fig. 9. Dendograms for the number of classes analysis.

5.2. Results Six different correlation matrices are analyzed aiming to study the dependencies or relationships among the different metrics considered. It is important to mention that we compute the correlation for each dataset, i.e., we analyze the results provided by the KELM method for one dataset and the corresponding 30 combinations of the cross-validation procedure. Not merging results from different datasets is crucial, since measure values are influenced in very different ways depending on the dataset, e.g. number of classes, imbalance, problem difficulty, etc [21]. Consequently, we construct one correlation matrix per dataset and then, we average the 29 correlation matrices (considering the absolute values of the correlation coefficients). Since there are seven averaged correlation matrices4 , and they are difficult to understand at a glance, we will use dendrograms (as it was done in [21]); where the linkage distance is defined as (1 - correlation). A dendrogram shows clusters of performance measures according to how strongly correlated the metrics are. This kind of diagrams has several advantages: we can easily visualize the clusters formed by the measures, as well as the linkage distance among clusters. The first scenario is constructed separating the datasets in two subsets: the first one where datasets with less than five classes were included and the second one where only datasets with five or more classes were included. The second one considers the C performance of the KELM as the test variable and the last one is built by splitting the datasets according to their imbalanced ratio. The analysis was done for each of the three cases separately in order to take into account the effect of the variable in the correlation results. According to Fig. 9, the pairs of measures that are less correlated (they are not significantly correlated) are C − DN and W Kappa − DN when datasets with less than five classes are considered, and C − DN , AC − DN , W kappa − DN and Fβ =1 − DN when datasets with five or more classes are considered. Fig. 9 shows how DN is grouped with the metrics designed for imbalanced classification problems (GM, AC or GM) when Q < 5 (Fig. 9a) while it is considered an independent metric to all the others when Q ≥ 5 (Fig. 9b). On the other hand, if the accuracy provided by the model is taken as a variable test (Fig. 10), C is not significantly correlated to MS, GM, WKappa and DN on datasets where the classifier reported a low accuracy are considered. C is not significantly correlated to GM and DN when datasets with a high accuracy reported are taken into account. These facts are also shown in Fig. 10 where DN is grouped with the metrics designed for imbalanced classification problems when the datasets are not specially challenging (C ≥ 0.7700) (Fig. 10b). DN is considered an independent metric (along with the WKappe metric) to all the others when challenging datasets are being analyzed (Fig. 10a). Finally, Fig. 11 shows the correlation dendograms when the C.V. is considered as the variable test. The pairs of metrics that are less correlated are DN − C, DN − AC, DN − W Kappa, DN − Fβ =1 when datasets with a C.V. value above 0.7071 are considered (Fig. 11b) and DN − C, DN − AC, DN − Fβ =1 , GM − W Kappa, MS − W Kappa when datasets with a C.V. below 0.7071 are tested (Fig. 11a).

4 Corresponding to the two correlation matrices per scenario and the general correlation matrix considering all the datasets.
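As an illustration of how such dendrograms can be produced, the following sketch (not the authors' code) assumes that the 30 metric values for one dataset are stored as rows of an array of shape (n_metrics, 30); the metric labels and the average-linkage choice are assumptions, since the linkage method is not stated in the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

METRICS = ["C", "MS", "GM", "AC", "WKappa", "F1", "DN"]   # illustrative labels


def averaged_abs_correlation(results_per_dataset):
    """Each element is an (n_metrics, 30) array of results for one dataset.
    One correlation matrix is computed per dataset and their absolute
    values are averaged, as described in the text."""
    return np.mean([np.abs(np.corrcoef(r)) for r in results_per_dataset], axis=0)


def metric_dendrogram(results_per_dataset, title):
    corr = averaged_abs_correlation(results_per_dataset)
    dist = 1.0 - corr                     # linkage distance = 1 - correlation
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    dendrogram(Z, labels=METRICS)
    plt.title(title)
    plt.ylabel("1 - correlation")
    plt.show()

# e.g., one dendrogram per subset when splitting by the number of classes Q:
# metric_dendrogram([r for r, q in zip(results, n_classes) if q < 5],  "Q < 5")
# metric_dendrogram([r for r, q in zip(results, n_classes) if q >= 5], "Q >= 5")
```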


Fig. 10. Dendrograms for the accuracy analysis.

Fig. 11. Dendrograms for the C.V. analysis.

5.3. Discussion

As can be seen in Fig. 9, the DN metric provides significantly different information from that provided by the remaining measures, especially when datasets with a high number of classes are considered: DN is not significantly correlated to any of the measures considered, whereas GM and MS are both significantly correlated to C. These correlations are expected because datasets with a high number of classes are also highly imbalanced, and DN, GM and MS are all measures for imbalanced datasets. DN shows its superiority over GM and MS because it is the only measure that is not significantly correlated to C (the most common measure used in standard classification).

On the other hand, Fig. 10 did not provide particularly interesting information, as DN was correlated to all metrics except WKappa when datasets with a high accuracy are considered, whereas it was correlated only to WKappa when datasets with a low accuracy are considered. According to this information, the proposed metric is not particularly useful when straightforward datasets are evaluated, but it is a competitive measure when more challenging datasets are tested.

Fig. 11 shows once more the importance of the proposed metric for imbalanced datasets, as DN was not significantly correlated to any of the metrics considered except GM and MS, which again were correlated to C in both cases. We hypothesize that these correlations of GM and MS with C are due to the fact that, in imbalanced datasets, classifiers that are not specifically trained to optimize these two metrics tend to return a zero performance value for them. Note that if the accuracy of one class is zero, then both MS and GM are zero.
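The collapse of GM and MS described above can be illustrated with a small numerical sketch (the numbers and helper function below are purely illustrative and do not come from the experiments); C and D are computed from the per-class accuracies cq and the a priori probabilities pq, while GM and MS follow their usual definitions as the geometric mean and the minimum of the per-class accuracies.

```python
import numpy as np

def class_level_metrics(c, p):
    """c: per-class accuracies c_q; p: a priori class probabilities p_q."""
    c, p = np.asarray(c, float), np.asarray(p, float)
    C = np.sum(c * p)                      # accuracy as a weighted average
    D = np.sqrt(np.sum(c**2 * p) - C**2)   # dispersion of the per-class accuracies
    GM = np.prod(c) ** (1.0 / len(c))      # geometric mean of the sensitivities
    MS = np.min(c)                         # minimum sensitivity
    return C, D, GM, MS

# Two hypothetical classifiers on a 3-class problem with priors (0.6, 0.3, 0.1),
# both of which misclassify every pattern of the minority class:
p = [0.6, 0.3, 0.1]
print(class_level_metrics([0.95, 0.80, 0.00], p))   # GM = MS = 0, although C = 0.81
print(class_level_metrics([0.90, 0.85, 0.00], p))   # GM = MS = 0 again: GM and MS cannot
                                                    # distinguish the two models, C and D can
```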


Fig. 12. Dendrogram of standard correlations between the metrics for all datasets.

On the other hand, DN usually provides additional information about the classifier. The scenario shown in Table 1, where DN is the only metric capable of discriminating the best classification model, has already been discussed. DN promotes the improvement of the accuracies of the minority classes as a whole, generating a more uniform distribution of the accuracies per class. MS and GM are grouped in the dendrogram because they both share a high degree of restrictiveness.

Finally, the overall correlation dendrogram considering all the datasets and metrics is included in Fig. 12, again reinforcing the importance of DN as the only measure that is not significantly correlated to C. According to Fig. 12, the pair of measures that is less correlated is C − DN. In our opinion, this justifies the proposal of the new metric as a complement to the C metric, especially for challenging, highly imbalanced datasets with a high number of classes. Furthermore, the selected pair (C − DN) would be ideal to guide, for example, the evolution of a multi-objective algorithm, since the two measures have a low linear correlation and may implicitly be non-cooperative objectives.

6. Conclusions

This paper presents a new approach for evaluating the performance of multiclass classifiers based on a 2-D performance measure. The proposed composite metric is made up of the overall accuracy and the deviation of the accuracies per class. Thus, accuracy C is analyzed here as a weighted average of the classification rate of each class, while the dispersion D is the corresponding deviation of the individual accuracies per class. To our knowledge, none of the performance measures for evaluating classifiers proposed in the machine learning literature so far tackles the problem of a global optimization of accuracy C in which the differences among class accuracies are minimized. Please note that other approaches with two values to measure the performance of a classifier, like precision-recall or sensitivity-specificity, are designed only for two-class problems.

The mathematical relationship between the two measures is studied in depth and determined for each problem. Specifically, the optimum upper bound of D for each C and for each classification problem is estimated. This boundary allows for the definition of the normalized measure DN, with values between 0 and 1, which analyzes the dispersion of the results per class obtained by a classifier in a more intuitive way. Moreover, a heuristic procedure to build the (C, DN) region for each classification problem is proposed in this manuscript. The relationship between C and DN defines a region in the (C, DN) space where each classifier can be depicted. Thus, in our opinion, one of the main advantages of the proposal is that multiclass classifiers are represented in this region in a natural, intuitive and straightforward way, since the performance of the classifiers is visualized in a two dimensional space, independent of the number of classes. This contrasts with what happens in ROC analysis, which is specifically designed for binary classification problems.

To verify the validity of the proposal, we have tested a competitive classifier (the Kernel Extreme Learning Machine method) on 29 benchmark datasets, and the performances reported by the state-of-the-art metrics relying only on values coming from the confusion matrix were extracted and analyzed in depth.
We have studied the existing correlations among the metrics considered in this research work and obtained the following findings: (i) DN appears to be the metric least correlated with C when all the datasets are considered; (ii) DN seems to be a competitive metric (providing additional information to the remaining metrics) when challenging, highly imbalanced datasets or datasets with a high number of classes are considered.


Therefore, the (C, DN) pair could very well be considered for parameter selection when models need to be tested in such classification environments.

Finally, other applications of this novel two-dimensional measure include:

• The measure enables an assessment of classifier performance over the full operating range (of possible scores) when dealing with a scoring classifier, allowing us to visualize the behavior of a classifier across its operating range.
• The approach presented in this paper could be used to determine the margin of improvement that exists in the result obtained by a classifier on a particular problem. This allows us to estimate the distance from each classifier to the optimal solution.

Acknowledgments

This work was partially supported by the TIN2014-54583-C2-1-R project of the Spanish Ministry of Economy and Competitiveness (MINECO), FEDER funds and the P2011-TIC-7508 project of the “Junta de Andalucía” (Spain).

Appendix A. Properties to specify the nature of the parabolas

In order to specify the nature of these parabolas, we will need some of their properties.

Property 1. Δ(C) ≤ Δ*(C) = C − C². Equality is reached when and only when C = s(I) for a certain I. From now on, these values will be called extremes.

Proof. It suffices to consider that:

$$
D^2 = \sum_{q=1}^{Q} c_q^2\, p_q - C^2 \;\le\; \sum_{q=1}^{Q} c_q\, p_q - C^2 \;=\; C - C^2 \;=\; \Delta^*(C)
$$

since 0 ≤ cq ≤ 1. The equality is reached only if cq = cq² for each q, which means that each value must be either 0 or 1. We call I the set of indexes q for which cq = 1.

Property 2. For each I and each i we have Δ(s(I)) = Δ(s(I), I, i) = Δ*(s(I)).

Proof. The proof is an immediate consequence of the previous result: those points are precisely the only ones where the upper boundary is reached. In terms of the classification problem, the interpretation is that the maximum variance, and therefore the worst case scenario for a given C, is reached when each class is either perfectly classified (cq = 1) or always incorrectly classified (cq = 0).

Property 3. Δ(C, I, i) is, as a function of C, a decreasing parabola to the left of s(I) and an increasing one to the right of s(I) + pi, where it intersects Δ*(C). As a consequence, its value is strictly lower than that of the boundary within the interval defined by those extremes, and strictly greater than it outside of that interval (see Fig. A.13).

Fig. A.13. Illustration of Property 3.


Proof. An immediate consequence of Property 2 is that these are the only two intersection points, while the decreasing and increasing behavior follows from the convexity of the function and the sign of its derivative with respect to C:

$$
\frac{d\Delta}{dC}\bigl(s(I), I, i\bigr) = -2\, s(I) < 0, \qquad
\frac{d\Delta}{dC}\bigl(s(I) + p_i, I, i\bigr) = 2\bigl(1 - (s(I) + p_i)\bigr) > 0
$$
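A quick numerical sanity check of Property 1 (not part of the original appendix) can be obtained by sampling random per-class accuracy vectors and verifying that D² never exceeds C − C², with equality only at the extreme configurations in which every cq is 0 or 1:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 4
p = rng.dirichlet(np.ones(Q))                 # a priori probabilities, summing to 1

for _ in range(100_000):
    c = rng.random(Q)                         # per-class accuracies in [0, 1]
    C = np.sum(c * p)
    D2 = np.sum(c**2 * p) - C**2
    assert D2 <= C - C**2 + 1e-12             # Property 1: D^2 <= Delta*(C) = C - C^2

c = rng.integers(0, 2, size=Q).astype(float)  # an extreme point: every c_q is 0 or 1
C = np.sum(c * p)
assert abs((np.sum(c**2 * p) - C**2) - (C - C**2)) < 1e-12   # equality is attained
```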

Property 4. Given a second set of indexes J with an additional index j ∈ J, the relative position of the curves Δ(C, I, i) and Δ(C, J, j) is exclusively determined by s(I), pi, s(J) and pj.

Proof. It is immediate: only these four values define the parabolas.

References

[1] R. Alaiz-Rodríguez, N. Japkowicz, P. Tischer, Visualizing classifier performance on different domains, in: Tools with Artificial Intelligence, 2008. ICTAI'08. 20th IEEE International Conference on, vol. 2, 2008, pp. 3–10.
[2] C. Apte, E. Bibelnieks, R. Natarajan, E. Pednault, F. Tipu, D. Campbell, B. Nelson, Segmentation-based modeling for advanced targeted marketing, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 408–413.
[3] A. Asuncion, D. Newman, UCI machine learning repository, 2007. URL http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[4] R. Baeza-Yates, B. Ribeiro-Neto, et al., Modern Information Retrieval, vol. 463, ACM Press, New York, 1999.
[5] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (7) (1997) 1145–1159.
[6] T. Byrt, J. Bishop, J.B. Carlin, Bias, prevalence and kappa, J. Clin. Epidemiol. 46 (5) (1993) 423–429.
[7] R. Caruana, A. Niculescu-Mizil, Data mining in metric space: an empirical analysis of supervised learning performance criteria, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 69–78.
[8] J. Cohen, A coefficient of agreement for nominal scales, Educ. Psychol. Meas. 20 (1) (1960) 37–46.
[9] M. Cruz-Ramirez, C. Hervas-Martinez, J. Sanchez-Monedero, P. Gutierrez, A preliminary study of ordinal metrics to guide a multi-objective evolutionary algorithm, in: Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on, 2011, pp. 1176–1181.
[10] J.F.P. da Costa, H. Alonso, J.S. Cardoso, The unimodal model for the classification of ordinal data, Neural Netw. 21 (2008) 78–91.
[11] J. Davis, M. Goadrich, The relationship between precision-recall and ROC curves, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 233–240.
[12] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[13] C. Drummond, R. Holte, Cost curves: an improved method for visualizing classifier performance, Mach. Learn. 65 (1) (2006) 95–130.
[14] R.M. Everson, J.E. Fieldsend, Multi-class ROC analysis from a multi-objective optimisation perspective, Pattern Recognit. Lett. 27 (8) (2006) 918–927.
[15] T. Fawcett, ROC graphs: notes and practical considerations for researchers, Mach. Learn. 31 (2004) 1–38.
[16] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.
[17] J.C. Fernández Caballero, F.J. Martínez, C. Hervás, P.A. Gutiérrez, Sensitivity versus accuracy in multiclass problems using memetic Pareto evolutionary neural networks, IEEE Trans. Neural Netw. 21 (5) (2010) 750–770.
[18] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (1) (2014) 3133–3181.
[19] F. Fernández-Navarro, C. Hervás-Martínez, C. García-Alonso, M. Torres-Jimenez, Determination of relative agrarian technical efficiency by a dynamic over-sampling procedure guided by minimum sensitivity, Expert Syst. Appl. 38 (10) (2011) 12483–12490.
[20] F. Fernández-Navarro, C. Hervás-Martínez, P.A. Gutiérrez, A dynamic over-sampling procedure based on sensitivity for multi-class problems, Pattern Recognit. 44 (8) (2011) 1821–1833.
[21] C. Ferri, J. Hernández-Orallo, R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognit. Lett. 30 (1) (2009) 27–38.
[22] P. Flach, H. Blockeel, C. Ferri, J. Hernández-Orallo, J. Struyf, Decision support for data mining, in: Data Mining and Decision Support, The Springer International Series in Engineering and Computer Science, vol. 745, 2003, pp. 81–90.
[23] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets, Inf. Sci. 354 (2016) 178–196.
[24] L.A. Goodman, W.H. Kruskal, Measures of association for cross classifications, J. Am. Stat. Assoc. 49 (268) (1954) 732–764.
[25] P.A. Gutiérrez, C. Hervás-Martínez, F.J. Martínez-Estudillo, M. Carbonero, A two-stage evolutionary algorithm based on sensitivity and accuracy for multi-class problems, Inf. Sci. 197 (2012) 20–37.
[26] D.J. Hand, R.J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Mach. Learn. 45 (2) (2001) 171–186.
[27] J. Hernández-Orallo, P. Flach, C. Ferri, Brier curves: a new cost-based visualisation of classifier performance, in: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011, pp. 585–592.
[28] J. Hernández-Orallo, P. Flach, C. Ferri, A unified view of performance metrics: translating threshold choice into expected classification loss, J. Mach. Learn. Res. 13 (1) (2012) 2813–2869.
[29] G. Hripcsak, A.S. Rothschild, Agreement, the F-measure, and reliability in information retrieval, J. Am. Med. Inform. Assoc. 12 (3) (2005) 296–298.
[30] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B 42 (2) (2012) 513–529.
[31] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: IEEE International Conference on Neural Networks - Conference Proceedings, vol. 2, 2004, pp. 985–990.
[32] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011.
[33] L.A. Jeni, J.F. Cohn, F. De La Torre, Facing imbalanced data–recommendations for the use of performance metrics, in: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, IEEE, 2013, pp. 245–251.
[34] T.C.W. Landgrebe, R.P.W. Duin, Approximating the multiclass ROC by pairwise analysis, Pattern Recognit. Lett. 28 (13) (2007) 1747–1758.
[35] C. Liu, P. Frazier, L. Kumar, Comparative assessment of the measures of thematic classification accuracy, Remote Sens. Environ. 107 (4) (2007) 606–616.
[36] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
[37] M. Markatou, H. Tian, S. Biswas, G.M. Hripcsak, Analysis of variance of cross-validation estimators of the generalization error, J. Mach. Learn. Res. 6 (2005) 1127–1168.
[38] T.M. Mitchell, Machine Learning, WCB McGraw-Hill, 1997.
[39] PASCAL, Pascal (Pattern Analysis, Statistical Modelling and Computational Learning) machine learning benchmarks repository, 2011. http://mldata.org/.
[40] L. Prechelt, PROBEN1: A Set of Neural Network Benchmark Problems and Benchmarking Rules, Technical Report, Fakultät für Informatik, Universität Karlsruhe, 1994.


[41] F. Provost, T. Fawcett, Robust classification for imprecise environments, Mach. Learn. 42 (3) (2001) 203–231.
[42] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45 (4) (2009) 427–437.
[43] W. Waegeman, K. Dembczyński, A. Jachnik, W. Cheng, E. Hüllermeier, On the Bayes-optimality of F-measure maximizers, J. Mach. Learn. Res. 15 (2014) 3333–3388.
[44] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[45] J. Wu, H. Xiong, J. Chen, COG: local decomposition for rare class analysis, Data Min. Knowl. Discov. 20 (2) (2010) 191–220.