Effective active learning strategy for multi-label learning

Communicated by Shiliang Sun

Oscar Reyes, Carlos Morell, Sebastián Ventura. Effective active learning strategy for multi-label learning. Neurocomputing (2017). DOI: 10.1016/j.neucom.2017.08.001. PII: S0925-2312(17)31337-1.

Received 14 September 2015; revised 18 June 2017; accepted 6 August 2017.

Oscar Reyes (a), Carlos Morell (b), Sebastián Ventura (a,c,∗)

(a) Department of Computer Science and Numerical Analysis, University of Córdoba, Spain
(b) Department of Computer Science, Universidad Central de Las Villas, Cuba
(c) Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia

∗ Corresponding author. Tel.: +34 957 212 218; fax: +34 957 218 630. Email addresses: [email protected] (Oscar Reyes), [email protected] (Carlos Morell), [email protected] (Sebastián Ventura).

Abstract


Data labelling is commonly an expensive process that requires expert handling. In multi-label data, labelling is further complicated because experts must label each example several times, as each example belongs to several categories. Active learning is concerned with learning accurate classifiers by choosing which examples will be labelled, thereby reducing the labelling effort and the cost of training an accurate model. The main challenge in performing multi-label active learning is designing effective strategies that measure the informative potential of unlabelled examples across all labels. This paper presents a new active learning strategy for multi-label data. Two uncertainty measures, based on the base classifier predictions and on the inconsistency of a predicted label set, respectively, were defined to select the most informative examples. The proposed strategy was compared to several state-of-the-art strategies on a large number of datasets. The experimental results showed the effectiveness of the proposal for better multi-label active learning.

Keywords: Multi-label active learning, multi-label classification, label ranking


1. Introduction


In recent years, the study of problems that involve data associated with more than one label at the same time has attracted a great deal of attention [1–6]. Particular multi-label problems include text categorization [7–9], classification of emotions evoked by music [10], semantic annotation of images [11–14], classification of music and videos [15–17], classification of protein and gene function [18–23], acoustic classification [24], chemical data analysis [25] and more. Multi-label learning is concerned with learning a model able to predict a set of labels for an unseen example. In multi-label learning, two tasks have been studied [26–28]: multi-label classification and label ranking. The multi-label classification task aims to find a model where, for a given test example, the label space is divided into relevant and irrelevant label sets. The label ranking task, on the other hand, aims to provide, for a given test example, a ranking of labels according to their relevance values. Nowadays, it is more common to find multi-label learning algorithms able to produce, at the same time, both a bi-partition of the label space and a consistent ranking of labels.

Most multi-label algorithms have been proposed for supervised learning environments, i.e. scenarios where all training examples are labelled. However, data labelling is commonly a very expensive process that requires expert handling. In multi-label data, experts must label each example several times, as each example belongs to several categories. The situation is further complicated when an expert labels a dataset with a large number of examples and label classes. Consequently, several real-world scenarios nowadays contain a small number of labelled data and a large number of unlabelled data simultaneously. To date, two main areas are concerned with learning models from labelled and unlabelled data: Semi-Supervised Learning [29] and Active Learning [30].

Active Learning (AL) is concerned with learning better classifiers by choosing which examples are labelled for training. Consequently, the labelling effort and the cost of training an accurate model are reduced. AL methods are involved in the acquisition of their own training data: a selection strategy iteratively selects examples from the unlabelled set that seem to be

the most informative for the model that is being trained. In this work, we focus on AL scenarios in which a large collection of unlabelled data and a small set of labelled data are available, known as pool-based AL [31]. The usefulness and effectiveness of AL methods have been proved in several domains [32–37]. For more than a decade, a considerable number of AL methods for single-label data have been proposed; for an interesting survey see [30]. However, AL methods for multi-label data have been far less studied. The main challenge in performing AL on multi-label data is designing effective strategies that measure the unified informative potential of unlabelled examples across all labels. Most state-of-the-art multi-label AL strategies employ the Binary Relevance [26] approach to break down a multi-label problem into several binary classification problems. Multi-label AL strategies have generally been assessed on the multi-label classification task; their performance with regard to the label ranking task has rarely been considered. On the other hand, most AL strategies use informativeness-based¹ criteria to select the most useful unlabelled examples. However, strategies that only select informative examples usually do not exploit either the structure of unlabelled data or the label space information, leading to a sub-optimal performance [38].

¹ Informativeness measures the effectiveness of an example by reducing the uncertainty of a model.

In this work, an effective multi-label AL strategy was proposed, named Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores (CVIRS). Two measures, based on the base classifier predictions and on the inconsistency of predicted label sets, respectively, were defined. A rank aggregation problem was formulated to compute the unified uncertainty of an unlabelled example across all labels; it is based on the probabilities with which the base classifier predicts whether or not an example belongs to a certain label. The inconsistency of a predicted label set, for a given unlabelled example, was computed by means of the distance between the predicted label set and the label sets of the labelled examples. To the best of our knowledge, this paper presents the first attempt to propose a multi-label AL strategy that computes the uncertainty of unlabelled examples by means of a rank aggregation method, combining this uncertainty measure with another measure that takes the label space information into account. Moreover, in contrast to the majority of works related to multi-label AL, in this paper several AL strategies were compared over a large number of multi-label datasets, and two multi-label learning tasks were analysed: multi-label classification and label ranking. The experiments were carried out on 18 multi-label datasets. To compare the performance of the AL strategies, in addition to a visual comparison of the strategies' learning curves, the experimental study included a statistical analysis based on non-parametric tests, resulting in a more robust analysis. The experimental stage showed the effectiveness of the proposal, which obtained significantly better results than previous multi-label AL strategies.

This paper is arranged as follows: Section 2 briefly describes the multi-label learning and active learning paradigms, and the state of the art in the development of AL strategies for multi-label data. Section 3 presents the basis of our proposal. Section 4 describes the experimental set-up and shows the experimental results. Finally, Section 5 provides some concluding remarks.


2. Preliminaries

In this section, brief descriptions of the multi-label learning and active learning paradigms are given, followed by a review of the state-of-the-art multi-label AL strategies.

2.1. Multi-label learning and active learning

A multi-label problem comprises a feature space F and a label space L with cardinality equal to q (the number of labels). A multi-label example i is represented as a tuple ⟨X_i, Y_i⟩, where X_i is the feature vector and Y_i the category vector of the example i. Y_i is a binary vector with q components, where component Y_i^ℓ represents whether the example i belongs to the ℓ-th label or not. Let us say Φ is a multi-label classifier able to resolve the multi-label classification and label ranking tasks at the same time. Therefore, for a given test example, (i) Φ partitions the label space L into a relevant label set (positive labels) and an irrelevant label set (negative labels), and (ii) Φ returns a ranking of labels according to their relevance.

Multi-label learning algorithms can be organised into two main categories [28]: problem transformation methods and algorithm adaptation methods. The problem transformation methods transform a multi-label dataset into one or more single-label datasets. Afterwards, a single-label classifier is executed for each transformed dataset, and an aggregation strategy is finally performed in order to obtain the results. On the

other hand, the algorithm adaptation category comprises the algorithms designed to directly handle multi-label data.

As for active learning, it is an iterative process that aims to construct a better classifier by selecting unlabelled examples. Let us say Φ is the base classifier used in the AL process. In pool-based AL scenarios, we have a small set of labelled data L_s and a large set of unlabelled data U_s, together with an AL strategy γ that selects a set of unlabelled examples from U_s using some selection criterion, e.g. an uncertainty measure. The following steps are commonly performed in an AL process (see the sketch after this list):

1. γ selects unlabelled examples from U_s.
2. The selected unlabelled examples are labelled by an annotator (e.g. a human expert).
3. The selected examples are added to L_s and removed from U_s.
4. Φ is trained with the labelled set L_s.
5. The performance of Φ is assessed.
6. If the stop condition is not met, go to step 1.
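A minimal sketch of this loop in Python follows; the classifier interface and the `strategy`, `oracle` and `evaluate` callables are placeholders of our own, not the API of any particular library:

```python
def pool_based_al(clf, strategy, labelled, unlabelled, oracle, beta, evaluate):
    """Pool-based active learning following steps 1-6 above.
    `labelled` is a list of (x, y) pairs, `unlabelled` a list of x."""
    clf.fit(labelled)                    # train on the initial seed set
    for _ in range(beta):                # stop condition: beta iterations
        i = strategy(clf, unlabelled)    # 1. select an example from U_s
        x = unlabelled.pop(i)            # 3. remove it from U_s ...
        labelled.append((x, oracle(x)))  # 2.-3. ... query its labels, add to L_s
        clf.fit(labelled)                # 4. retrain the base classifier
        evaluate(clf)                    # 5. assess the current performance
    return clf
```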

In the AL literature, several stopping conditions have been used. Commonly, the AL process is repeated β times (a fixed number of iterations), but one can also stop once the performance of the base classifier has attained a certain level. The way the base classifier's performance is evaluated depends on the problem studied; commonly, it is assessed by classifying a test set and analysing an evaluation measure.

2.2. Related works

Multi-label AL strategies can be classified according to the manner in which the labels of unlabelled examples are queried. Most multi-label AL strategies are designed to query all the label assignments of the selected unlabelled examples [39–48]. On the other hand, there are multi-label AL strategies that query the relevance of an example-label pair [38, 49–56], i.e. the strategy analyses whether a specific label is relevant to a selected example. Strategies that query all labels of an example may lead to information redundancy and more annotation effort in real-world problems with a large number of labels. On the other hand, AL strategies that select example-label pairs avoid information redundancy, but they may ignore the interaction between labels and can obtain only limited supervision from each query [55].

Multi-label AL strategies can also be classified according to the manner in which a set of unlabelled examples is selected in each iteration. Most state-of-the-art multi-label AL strategies have been designed to select only one unlabelled example in each iteration (dubbed myopic strategies) [38–43, 45–51, 53–56]. Myopic strategies can easily select a batch of unlabelled examples, e.g. by selecting the most informative examples from the unlabelled set. However, the main drawback of selecting a set of unlabelled examples in this greedy manner lies in the fact that the selected examples may be similar, resulting in information redundancy. On the other hand, there are AL strategies that select a batch of unlabelled examples in each iteration by taking into account the diversity of the selected examples (dubbed batch-mode strategies). As for batch-mode multi-label AL, to date very few works have been proposed [44, 52]. The existing batch-mode multi-label AL works formulate the selection of the best batch of unlabelled examples as an NP-hard problem, and the methods used to resolve it have a high computational cost. Consequently, the application of these methods is difficult, practically speaking, for large-scale multi-label datasets.

Table 1 shows a summary of state-of-the-art multi-label AL strategies.

Source  Year  Type        Label assignment
[39]    2004  Myopic      All
[40]    2006  Myopic      All
[49]    2009  Myopic      EL
[50]    2009  Myopic      EL
[41]    2009  Myopic      All
[42]    2009  Myopic      All
[43]    2009  Myopic      All
[44]    2011  Batch-mode  All
[45]    2011  Myopic      All
[46]    2012  Myopic      All
[47]    2012  Myopic      All
[48]    2013  Myopic      All
[51]    2013  Myopic      EL
[52]    2014  Batch-mode  EL
[53]    2014  Myopic      EL
[38]    2014  Myopic      EL
[54]    2014  Myopic      EL
[55]    2015  Myopic      EL
[56]    2015  Myopic      EL

Table 1: Summary of state-of-the-art multi-label AL strategies. The AL strategies are ordered by year of publication. All: all the label assignments; EL: example-label pair assignment.

In this work, we focus on myopic AL strategies that query all the label assignments of the selected examples, since our proposal belongs to this category of AL strategies. Next, we summarise the most relevant myopic AL strategies that query all the label assignments.

In [39], two multi-label AL strategies, named Max Loss (ML) and Mean Max Loss (MML), were proposed. The two strategies select the unlabelled examples which have the maximum or mean loss value over the predicted labels. The MML strategy considers the multi-label

information, taking into account the loss produced in each label. The ML strategy calculates the loss value only on the label predicted with the most certainty. The effectiveness of the approaches was proved on two multi-label datasets for image classification.

In [40], the Binary Minimum (BinMin) strategy was proposed. BinMin selects the unlabelled example that, considering a target label, minimises the distance between the restricting hyperplane and the centre of the maximum-radius hyper-ball of each binary SVM classifier. The effectiveness of the approach was assessed on one multi-label dataset for text classification.

In [41], the Maximum Loss Reduction with Maximal Confidence (MMC) strategy was presented. MMC is based on the principles of Expected Error Reduction [30]; it selects those unlabelled examples that maximise the reduction rate of the expected model loss. MMC predicts the number of labels of an unlabelled example following a process named the "LR-based prediction method". In each iteration, the labelled set is transformed into a single-label dataset and a Logistic Regression classifier is trained. The prediction of the Logistic Regression classifier is used to calculate the uncertainty of the unlabelled examples. The effectiveness of the approach was proved on seven multi-label datasets for text classification.

In [42], a general framework for multi-label AL was proposed. The authors defined three dimensions: evidence, class and weight. The evidence dimension represents the type of evidence used for computing the usefulness of an unlabelled example. The class dimension represents how to combine the values of a vector of evidence. The weight dimension shows whether all labels are treated equally or not. The authors showed that the CMN strategy obtains the best results. CMN takes into account the confidence of predictions as the type of evidence (C), the minimum value of a confidence vector (M), and it treats all labels alike (N). The effectiveness of the approach was assessed on two multi-label datasets for text classification.

In [48], the Max-Margin Prediction Uncertainty (MMU) and Label Cardinality Inconsistency (LCI) strategies were proposed. MMU models the uncertainty of an example by computing the separation margin between the predicted groups of positive and negative labels. The LCI strategy, in turn, measures the uncertainty of an example as the distance between the number of predicted positive labels and the label cardinality (average number of labels per example) of the current labelled set; a sketch of this criterion is given below. The effectiveness of these AL strategies was tested on three multi-label datasets for image classification and one dataset for text classification.

The bibliographic review revealed that previous works have most often been assessed on the multi-label classification task, whereas their performance on the label ranking task has rarely been considered. Most multi-label AL strategies have been tested with BR-SVM as the base classifier, i.e. the Binary Relevance approach using binary SVM classifiers [39–41, 48]. Most AL strategies use informativeness-based criteria to select the most useful unlabelled examples. However, strategies that only select informative examples usually do not exploit the label space information, leading to a sub-optimal performance. Some strategies simply extend the binary uncertainty concept to multi-label data by aggregating the value associated with each label, e.g. taking the minimum [40] or the average value over all labels [39, 41, 43, 53].
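As an illustration, the LCI criterion just described can be sketched in a few lines of NumPy; the array shapes and names here are our own assumptions, not the authors' code:

```python
import numpy as np

def lci_uncertainty(pred_Y, labelled_Y):
    """Label Cardinality Inconsistency, as described for [48]: distance
    between the number of predicted positive labels of each pool example
    (rows of the binary matrix pred_Y) and the label cardinality of the
    current labelled set (rows of labelled_Y)."""
    cardinality = labelled_Y.sum(axis=1).mean()   # avg. labels per example
    return np.abs(pred_Y.sum(axis=1) - cardinality)
```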

3. A new multi-label active learning strategy

In this section, the basis of a new multi-label AL strategy is presented. The strategy combines two measures for selecting the most informative unlabelled example; the two measures are based on the classifier predictions and on the inconsistency of predicted label sets, respectively.

Measure based on a rank aggregation problem

Let Φ be a multi-label classifier which, for a given unseen example, returns probabilities for each possible label ℓ ∈ L. We have a probability that an example i belongs to the ℓ-th label, P_Φ(ℓ=1|i), and a probability that i does not belong to the ℓ-th label, P_Φ(ℓ=0|i). So, the difference margin in the classifier predictions with respect to whether the given example i belongs or does not belong to the ℓ-th label can be computed as

    m_Φ^{i,ℓ} = |P_Φ(ℓ=1|i) − P_Φ(ℓ=0|i)|.    (1)

A large margin value on the ℓ-th label means that the classifier Φ has a small error in predicting whether the example belongs or does not belong to this label. A small margin value on the ℓ-th label means that it is more ambiguous for the current classifier to predict whether the example belongs or does not belong to the label ℓ. So, given an unlabelled example i and a classifier Φ, we can obtain a vector of margin values M_Φ^i = [m_Φ^{i,1}, m_Φ^{i,2}, ..., m_Φ^{i,q}], one margin value for each label ℓ ∈ L. The problem is how to properly aggregate the multi-label information for computing the unified informative value of an unlabelled example.
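Before turning to the aggregation, note that Eq. (1) is immediate to compute; a minimal sketch, assuming the classifier exposes the per-label positive-class probabilities of one example as a length-q array:

```python
import numpy as np

def margin_vector(proba_pos):
    """Eq. (1): proba_pos[l] = P(l=1|i); since P(l=0|i) = 1 - P(l=1|i),
    the margin reduces to |2*P(l=1|i) - 1| for each of the q labels."""
    return np.abs(2.0 * proba_pos - 1.0)
```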


We consider that, for computing the utility of an unlabelled example, it is important to consider the information regarding all unlabelled examples. Note that we focus on pool-based AL scenarios; therefore, a vector of margin values can be computed for each unlabelled example i ∈ U_s. Given the vectors M_Φ^{i_1}, M_Φ^{i_2}, ..., M_Φ^{i_{|U_s|}} of the unlabelled examples i_1, i_2, ..., i_{|U_s|}, respectively, q rankings of examples τ_1, τ_2, ..., τ_q can be computed, one ranking for each label ℓ ∈ L. The ranking of unlabelled examples τ_ℓ is computed as

    τ_ℓ = (i_{π_1}, i_{π_2}, ..., i_{π_{|U_s|}})  such that  m_Φ^{i_{π_1},ℓ} < m_Φ^{i_{π_2},ℓ} < ... < m_Φ^{i_{π_{|U_s|}},ℓ}.    (2)

The ranking τ_ℓ is an ordering (permutation or full list) of the unlabelled examples according to their margin values on the ℓ-th label. So, we want to find a ranking of examples τ_0 by aggregating the information of the rankings τ_1, τ_2, ..., τ_q, in such a manner that the examples placed in the first positions of the final ranking τ_0 correspond to the most uncertain examples. This formulation for aggregating the margin values is equivalent to the well-known Rank Aggregation problem [57, 58]. Several rank aggregation methods have been proposed in the literature [57–59]. However, the use of sophisticated rank aggregation methods would not be practical in our situation, considering that (i) it is nowadays common to find multi-label datasets with a large number of labels, and (ii) a large number of unlabelled examples are available in pool-based AL. Consequently, in this work we used the simplest rank aggregation method, known as Borda's method [57]. Borda's method is a positional method that assigns a score to an element according to the positions in which this element appears in each ranking [57]. The advantage of positional methods is that they are computationally efficient. However, positional methods neither optimise any distance criterion nor satisfy Condorcet's criterion, which states that if an element defeats every other element in pairwise majority voting, this element should be ranked first [57]. Based on Borda's method, the score of an example i is computed as

    s(i) = Σ_{ℓ∈L} (|U_s| − τ_ℓ(i)) / (q(|U_s| − 1)),    (3)

where τ_ℓ(i) is the position of the example i in the ranking τ_ℓ. The greater the value of s(i), the greater the uncertainty of the example i taking into account the information across all labels.

Measure based on category vector inconsistency

In addition to the uncertainty measure based on the rank aggregation problem defined above, we consider it important to take the information from the label space into account when computing the uncertainty of an unlabelled example. The idea is straightforward: the labelled and unlabelled sets of data are drawn from the same underlying distribution, and it is therefore expected that the label sets predicted by the base classifier and the label sets of the labelled examples share common properties. In this work, we propose to use a measure based on category vector inconsistency, computing the difference between the predicted label set for an unlabelled example and the label sets of the current labelled examples. Table 2 shows the contingency table given the category vectors Y_i and Y_j of the examples i and j, respectively. Let a be the number of components such that Y_i^ℓ = Y_j^ℓ = 1, b the number of components such that Y_i^ℓ = 1 and Y_j^ℓ = 0, c the number of components such that Y_i^ℓ = 0 and Y_j^ℓ = 1, and d the number of components such that Y_i^ℓ = Y_j^ℓ = 0.

           Y_i = 1  Y_i = 0
  Y_j = 1     a        c
  Y_j = 0     b        d

Table 2: Contingency table between two category vectors.

Given the category vectors Y_i and Y_j, the normalised Hamming distance is computed as

    d_H(Y_i, Y_j) = (b + c)/q,    (4)

where q is the number of labels. The distance d_H returns the proportion of labels for which two examples differ in their category vectors. However, we also consider it important to measure the difference between the structures of category vectors. Structures are combinations of zeros and ones that are commonly found in a set of binary vectors. In the multi-label context, the label sets that appear more frequently in a dataset create structures that, with a high probability, can be found in the category vectors of labelled examples. To compute the differences between the structures of two binary vectors, the entropy distance defined in [60] was used. The normalised entropy distance between two category vectors Y_i and Y_j is computed as

    d_E(Y_i, Y_j) = (2H(Y_i, Y_j) − H(Y_i) − H(Y_j)) / H(Y_i, Y_j),    (5)

where the joint entropy between Y_i and Y_j is computed as

    H(Y_i, Y_j) = H_4(a/q, b/q, c/q, d/q).

According to the properties of the discrete entropy, H_4 is equal to

    H_4(a/q, b/q, c/q, d/q) = H_2((b+c)/q, (a+d)/q) + ((b+c)/q)·H_2(b/(b+c), c/(b+c)) + ((a+d)/q)·H_2(a/(a+d), d/(a+d)).

The entropy of a category vector Y is computed as

    H(Y) = H_2(w/q, s/q) = −(w/q)·log_2(w/q) − (s/q)·log_2(s/q),

where w and s are the numbers of ones (positive labels) and zeros (negative labels), respectively, of the category vector Y. Based on the d_H and d_E distance functions, the inconsistency of the predicted category vector, for a given unlabelled example i, is computed as

    v(i) = (1/|L_s|) Σ_{j∈L_s} f_u(Y_i, Y_j),    (6)

    f_u(Y_i, Y_j) = d_E(Y_i, Y_j)  if d_H(Y_i, Y_j) < 1,
    f_u(Y_i, Y_j) = 1              if d_H(Y_i, Y_j) = 1,

where Y_i is the category vector predicted by the base classifier, and Y_j is the category vector of the example j that belongs to the labelled set L_s. The greater the value of v(i), the greater the inconsistency of the category vector of the example i. The distance function d_E is more flexible than d_H; the former can recognise existing structures (patterns) shared by two binary vectors. As an example, given two category vectors Y_i = [010101] and Y_j = [101010], d_H(Y_i, Y_j) = 1, since Y_i and Y_j differ in all their components; however, d_E(Y_i, Y_j) = 0, since Y_i and Y_j have the same structure, the alternation of two symbols. Note that, for this case, we assigned the maximum value to f_u, i.e. a value equal to 1, to represent that the base classifier is predicting a category vector completely inverse to the ones existing in the labelled set. Consequently, a greater uncertainty value will be given to the corresponding example, making it more likely to be selected for querying its actual labels.

It is worth noting that this second uncertainty measure has some relationship with the Kullback-Leibler Divergence (KLD) [61]. KLD is a measure of how one probability distribution diverges from a second expected probability distribution. In the multi-label AL context, KLD has been widely used to measure the degree to which new models preserve the existing knowledge contained in old ones, commonly by using the probabilities of the instances belonging to each possible label [44, 49, 52]. In this work, however, we did not use the probabilities computed by the base classifiers; we used the information regarding label memberships instead, which is discrete by nature. In this manner, we intend to avoid some bias that can be introduced by the different manners in which multi-label classifiers compute the probabilities of an instance belonging to each label.

Active learning strategy

Uncertainty sampling is one of the simplest and most commonly used AL strategies [31]. This type of strategy selects those unlabelled examples which are least certain for the base classifier. Based on the two measures previously defined, the most informative example from U_s is selected as

    argmax_{i ∈ U_s}  s(i) · v(i).    (7)

We named this new strategy Uncertainty sampling based on Category Vector Inconsistency and Ranking of Scores (CVIRS). This strategy selects those unlabelled examples having the highest unified uncertainty, computed by means of the rank aggregation problem formulated, and, at the same time, the most inconsistent predicted category vectors. This approach can be used with any multi-label classifier that can obtain proper probability estimates from its outputs. Note that our proposal is somewhat related to Density-Weighted methods, which consider that the most informative example should not only be uncertain, but should also be "representative" of the underlying distribution [30]. According to the categories of AL strategies portrayed in Section 2, the proposed strategy can be categorised as myopic, and it queries all the label assignments.

Regarding the computational complexity of computing the score of an unlabelled example by means of the rank aggregation problem formulated, let f_ts(Φ) be the cost function of the multi-label classifier Φ to classify an unlabelled example. To compute the margin vector of each unlabelled example i ∈ U_s, O(|U_s| · f_ts(Φ)) steps are needed. To compute the q rankings of unlabelled examples, O(q · |U_s|²) steps are needed, although, if an efficient sort algorithm is used, the computational complexity can be reduced to O(q · |U_s| · log(|U_s|)) steps. In addition, in order to reduce the computational complexity of computing the q rankings, only a subset of U_s could be considered. Regarding the computational complexity of computing the inconsistency of a category vector predicted for an unlabelled example, O(q · |L_s|) steps are needed. Generally speaking, the CVIRS strategy requires O(max(|U_s| · f_ts(Φ), q · |U_s|²)) steps to determine the utility of an unlabelled example, owing to q · |U_s|² ≫ q · |L_s|.
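To make the whole selection step concrete, the following is a minimal NumPy sketch of CVIRS under our own array conventions: `proba` is an (n × q) matrix with P_Φ(ℓ=1|i) for the n pool examples, `pred_Y` holds their binary predicted category vectors, and `labelled_Y` the category vectors of L_s. The function and variable names are ours; the authors' implementation is the one distributed with the JCLAL framework [66].

```python
import numpy as np

def borda_scores(margins):
    """Eqs. (2)-(3): rank the pool per label by increasing margin
    (most uncertain first) and aggregate positions with Borda's method."""
    n, q = margins.shape
    # 0-based position of each example in the ascending ranking of each label
    positions = np.argsort(np.argsort(margins, axis=0), axis=0)
    return (n - 1 - positions).sum(axis=1) / (q * (n - 1.0))

def entropy_distance(yi, yj):
    """Eq. (5), built from the contingency counts of Table 2."""
    q = float(len(yi))
    a = np.sum((yi == 1) & (yj == 1)); b = np.sum((yi == 1) & (yj == 0))
    c = np.sum((yi == 0) & (yj == 1)); d = np.sum((yi == 0) & (yj == 0))
    h = lambda *ps: -sum(p * np.log2(p) for p in ps if p > 0)
    h_joint = h(a / q, b / q, c / q, d / q)   # H(Yi, Yj) = H4(...)
    h_i = h((a + b) / q, (c + d) / q)         # H(Yi), with w = a + b ones
    h_j = h((a + c) / q, (b + d) / q)         # H(Yj), with w = a + c ones
    if h_joint == 0.0:                        # two identical constant vectors
        return 0.0
    return (2 * h_joint - h_i - h_j) / h_joint

def inconsistency(pred, labelled_Y):
    """Eq. (6): mean f_u between a predicted category vector and the
    category vectors of the labelled set (f_u = 1 when d_H = 1)."""
    f_u = [1.0 if np.mean(pred != yj) == 1.0 else entropy_distance(pred, yj)
           for yj in labelled_Y]
    return float(np.mean(f_u))

def cvirs_select(proba, pred_Y, labelled_Y):
    """Eq. (7): index of the pool example maximising s(i) * v(i)."""
    margins = np.abs(2.0 * proba - 1.0)       # Eq. (1)
    s = borda_scores(margins)
    v = np.array([inconsistency(p, labelled_Y) for p in pred_Y])
    return int(np.argmax(s * v))
```

Using an efficient sort for the per-label rankings, this sketch realises the O(q · |U_s| · log(|U_s|)) aggregation step of the complexity analysis above.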

4. Experimental study

In this section, a description of the multi-label datasets used in the experiments, the evaluation of multi-label models and AL strategies, and the other settings used in the empirical study are given. Finally, the experimental results and the statistical analysis are presented.

4.1. Multi-label datasets

In the experiments, 18 real multi-label datasets were used². Multi-label datasets with different scales and from different application domains were included to analyse the performance of the multi-label AL strategies on datasets with different properties.

² The datasets are available to download at http://www.uco.es/grupos/kdis/kdiswiki/index.php/Resources

Dataset        n      d      q    ds    lc      ld
Flags          194    10     7    54    3.932   0.485
Emotions       593    72     6    27    1.869   0.311
Birds          645    260    19   133   1.014   0.053
Yeast          2417   103    14   198   4.237   0.303
Scene          2407   294    6    15    1.074   0.179
Cal500         502    68     174  502   26.044  0.150
Genbase        662    1186   27   32    1.252   0.046
Medical        978    1449   45   94    1.245   0.028
Enron          1702   1001   53   753   3.378   0.064
TMC2007-500    28596  500    22   1341  2.160   0.098
Corel5k        5000   499    374  3175  3.522   0.009
Corel16k       13811  500    161  4937  2.867   0.018
Bibtex         7395   1836   159  2856  2.402   0.015
Arts           7484   23146  26   599   1.654   0.064
Business       11214  21924  30   233   1.599   0.053
Entertainment  12730  32001  21   337   1.414   0.067
Recreation     12828  30324  22   530   1.429   0.065
Health         9205   30605  32   335   1.644   0.051

Table 3: Statistics of the benchmark datasets: number of examples (n), number of features (d), number of labels (q), number of different subsets of labels (ds), label cardinality (lc) and label density (ld). The datasets are ordered by their complexity, calculated as n × d × q.

Table 3 shows some statistics of the datasets. The label cardinality is the average number of labels per example. The label density is the label cardinality divided by the total number of labels. The datasets vary in size: from 194 up to 28,596 examples, from 10 up to 32,001 features, from 6 up to 374 labels, from 15 up to 4,937 different subsets of labels, from 1.014 up to 26.044 label cardinality, and from 0.009 up to 0.485 label density. In the multi-label AL context, the dataset Corel5k was previously used in [39, 48, 51]. The datasets Emotions, Enron, Medical and Genbase were used in [51]. Scene and Yeast were previously used in [49, 51, 62]. The datasets Arts, Business, Entertainment and Health were used in [41].

4.2. Evaluation of multi-label models and active learning strategies

In this work, several evaluation measures were used to assess the multi-label classifiers induced by the AL process. Multi-label evaluation measures are divided into two categories [26]: label-based measures and example-based measures. The example-based measures are further categorised into ranking-based and bipartition-based measures.

The label-based measures used in this work were the Micro-Average F1-Measure (MiF1) and the Macro-Average F1-Measure (MaF1). The micro approach aggregates the true positive, true negative, false positive and false negative values of all labels, and then calculates the F1-measure. The macro approach computes the F1-measure for each label and then averages the values over all labels. The MiF1 and MaF1 measures are defined as

    MiF_1 = F_1(Σ_{i=1}^{q} tp_i, Σ_{i=1}^{q} fp_i, Σ_{i=1}^{q} tn_i, Σ_{i=1}^{q} fn_i),    (8)

    MaF_1 = (1/q) Σ_{i=1}^{q} F_1(tp_i, fp_i, tn_i, fn_i),    (9)

where the F_1 function computes the F1-Measure given the true positive (tp), false positive (fp), true negative (tn) and false negative (fn) values, and q represents the number of labels.

As for the multi-label classification task, a classifier Φ predicts, for a given test example i, the set of labels p_i. Let us say t_i is the actual label set of the example i. The bipartition-based measures used in this work were the Hamming Loss (HL) and the Example-based F1-Measure (F1Ex). HL averages the symmetric differences between the predicted and actual label sets, while F1Ex calculates the F1-Measure over all examples in the test set:

    H_L = (1/m) Σ_{i=1}^{m} |t_i Δ p_i| / q,    (10)

    F_1Ex = (1/m) Σ_{i=1}^{m} 2|t_i ∩ p_i| / (|t_i| + |p_i|),    (11)

where Δ denotes the symmetric difference between two sets, and m is the number of test examples.
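The bipartition-based measures translate directly into code; a small sketch, assuming (m × q) binary arrays T and P of actual and predicted label sets:

```python
import numpy as np

def hamming_loss(T, P):
    """Eq. (10): the symmetric difference of two label sets is an XOR,
    so H_L is the mean disagreement over all m*q entries."""
    return float(np.mean(np.logical_xor(T, P)))

def example_based_f1(T, P):
    """Eq. (11): F1 computed per example and averaged; an example whose
    actual and predicted label sets are both empty contributes 0."""
    inter = np.logical_and(T, P).sum(axis=1)
    denom = T.sum(axis=1) + P.sum(axis=1)
    return float(np.mean(np.where(denom > 0,
                                  2.0 * inter / np.maximum(denom, 1), 0.0)))
```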

Regarding the label ranking task, a classifier Φ provides, for a given test example i, a ranking of labels R_i, where R_i(ℓ) is the rank predicted for the label ℓ. The ranking-based measures used in this work were the Ranking Loss (RL), Average Precision (AP) and One Error (OE). RL averages the proportion of label pairs that are incorrectly ordered. AP averages how many times a particular label is ranked above other labels that are in the actual label set. OE averages how many times the top-ranked label is not in the set of true labels of the example:

    R_L = (1/m) Σ_{i=1}^{m} |{(ℓ_a, ℓ_b) : R_i(ℓ_a) > R_i(ℓ_b), (ℓ_a, ℓ_b) ∈ t_i × t̄_i}| / (|t_i|·|t̄_i|),    (12)

    A_P = (1/m) Σ_{i=1}^{m} (1/|t_i|) Σ_{ℓ∈t_i} |{ℓ′ ∈ t_i : R_i(ℓ′) ≤ R_i(ℓ)}| / R_i(ℓ),    (13)

    O_E = (1/m) Σ_{i=1}^{m} δ(argmin_{ℓ∈L} R_i(ℓ)),  with δ(ℓ) = 1 if ℓ ∉ t_i and δ(ℓ) = 0 otherwise,    (14)

where t̄_i denotes the complementary set of t_i in the label space L.

As for the evaluation of the effectiveness of AL strategies, AL methods are commonly evaluated by visually comparing learning curves. A learning curve is constructed by plotting an evaluation measure as a function of the number of labelled examples in the labelled set. Through a visual comparison, a strategy is superior to the alternatives if it dominates them for most of the points along their learning curves [30]. However, visually comparing several learning curves can be very confusing, as several intersections between the learning curves may occur. In this work, in addition to a visual comparison, the Area Under the Learning Curve (AUC) was used to compare the AL strategies in a quantitative manner. To analyse and validate the results, several non-parametric statistical tests were used. Friedman's test [63] was performed to evaluate whether there were significant differences in the results. If Friedman's test indicated that the results were significantly different, the Shaffer post-hoc test [64] was used to perform all pairwise comparisons, as proposed in [65].
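A sketch of these ranking-based measures and of an AUC computation for learning curves follows. Here `scores` is an (m × q) array of relevance scores, every example is assumed to have at least one relevant and one irrelevant label, and the AUC normalisation is our own assumption, since the paper does not detail its exact computation:

```python
import numpy as np

def ranks(scores):
    """Per-example label ranks: rank 1 = most relevant label."""
    order = np.argsort(-scores, axis=1)
    r = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    r[rows, order] = np.arange(1, scores.shape[1] + 1)
    return r

def ranking_loss(T, scores):
    """Eq. (12): fraction of (relevant, irrelevant) label pairs ranked
    in the wrong order, averaged over the examples."""
    R, out = ranks(scores), []
    for t, r in zip(T.astype(bool), R):
        rel, irr = r[t], r[~t]
        out.append(np.mean(rel[:, None] > irr[None, :]))
    return float(np.mean(out))

def average_precision(T, scores):
    """Eq. (13)."""
    R, out = ranks(scores), []
    for t, r in zip(T.astype(bool), R):
        rel = r[t]
        out.append(np.mean([(rel <= rho).sum() / rho for rho in rel]))
    return float(np.mean(out))

def one_error(T, scores):
    """Eq. (14): how often the top-ranked label is not a true label."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(T[np.arange(len(T)), top] == 0))

def learning_curve_auc(n_labelled, values):
    """Area under a learning curve by the trapezoidal rule, with the
    x-axis normalised to [0, 1] so curves of equal length are comparable."""
    x = np.asarray(n_labelled, dtype=float)
    x = (x - x[0]) / (x[-1] - x[0])
    return float(np.trapz(np.asarray(values, dtype=float), x))
```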

4.3. Experimental setting

In the experimental study, our proposal, CVIRS, was compared with the most relevant myopic AL strategies that query all the label assignments (see Section 2): BinMin [40], ML [39], MML [39], MMC [41], CMN [42], MMU [48] and LCI [48]. In addition, a Random strategy, which randomly chooses examples from the unlabelled set, was considered in the comparison.

For the sake of fairness, the AL strategies were tested with the BR-SVM classifier, since most state-of-the-art multi-label AL strategies have been tested with BR-SVM as the base classifier. A linear kernel and a penalty parameter equal to 1.0 were used, as proposed in [41]. Logistic regression models were fitted to the outputs of the SVMs to obtain proper probability estimates. All strategies were evaluated by a 10-fold cross-validation on each dataset. For each fold execution, the iterative experimental protocol described in Algorithm 1 was adopted. 5% of the training set T_r was randomly selected to construct the labelled set L_s; therefore, the initial classifier was trained with few labelled examples. The non-selected examples of T_r were used to create the unlabelled set U_s. The maximum number of iterations β was set to 750. In each iteration, the effectiveness of the multi-label classifier Φ was tested by classifying the test set T_s. This experimental protocol is similar to the experimental protocols previously used in [41, 42, 48, 51].

Algorithm 1: Experimental protocol.
    Input: T_r → training set of multi-label examples
           T_s → test set of multi-label examples
           Φ → multi-label classifier
           γ → multi-label AL strategy
           θ → oracle for labelling unlabelled examples
           s → number of sampling examples
           β → maximum number of iterations
    begin
        // Construct the labelled and unlabelled sets
        L_s ← Resample(s, T_r);
        U_s ← T_r \ L_s;
        // Train Φ with L_s
        Φ ← Train(L_s, Φ);
        for iter ← 1 to β do
            // Select the most informative example from U_s
            i ← SelectInformativeExample(γ, Φ, U_s);
            // Label the selected example
            Label(θ, i);
            // Update the labelled and unlabelled sets
            L_s ← L_s ∪ {i};
            U_s ← U_s \ {i};
            // Train Φ with L_s
            Φ ← Train(L_s, Φ);
            // Evaluate Φ on T_s
            Test(T_s, Φ);
        end
    end

The AL strategies selected only one unlabelled example in each iteration, since the strategies considered in the experimental study are not optimal for working in batch-mode AL scenarios. The labelling process was done in a simulated environment, since the label sets of the examples of U_s are actually known. All strategies were implemented with the JCLAL framework [66], a class library that allows an easy implementation of any AL method. The algorithms are available as standalone runnable files in order to facilitate the replicability of the

experiments³.

³ http://www.uco.es/grupos/kdis/kdiswiki/MLAL

4.4. Results and discussion

The study was divided into two parts. In the first, a comparative study between the AL strategies on the multi-label classification task was conducted, whereas the second focussed on the label ranking task.


4.4.1. Multi-label classification task

Figures 1-4 represent the learning curves of the AL strategies on the Emotions, Medical, Yeast and TMC2007-500 datasets. Each graph plots, on the y-axis, the value obtained by the multi-label classifier for a certain evaluation measure as a function of the number of labelled examples on the x-axis.

Figure 1 shows that the CVIRS strategy obtained the best outcomes for the MiF1 and MaF1 measures on the Emotions dataset. The CMN, ML, MML and MMC strategies performed better than the Random strategy. The LCI and MMU strategies showed a poor performance on this dataset. Figure 2 shows that the MMC, ML and MML strategies obtained worse results than the Random strategy on the Medical dataset; the CVIRS, CMN, MMU and LCI strategies showed the best performance. As for the HL measure, Figure 3 shows that the CVIRS, CMN, LCI and BinMin strategies outperformed the rest of the strategies on the Yeast dataset; their learning curves were under the learning curve of the Random strategy. The ML, MML and MMC strategies showed a poor performance. For the F1Ex measure, the CVIRS, LCI, BinMin and MMU strategies obtained the best performance. Figure 4 shows that the CVIRS strategy obtained the best effectiveness on the TMC2007-500 dataset. For the F1Ex measure, the CMN, BinMin, ML, MML and MMC strategies showed a poor performance; their learning curves were dominated by the learning curve of the Random strategy. The MMU, ML, MML and MMC strategies had the lowest effectiveness at the MaF1 measure.

Figure 1: The performance of the AL strategies on the Emotions dataset: (a) the MiF1 measure; (b) the MaF1 measure.

Figure 2: The performance of the AL strategies on the Medical dataset: (a) the F1Ex measure; (b) the MiF1 measure.

Figure 3: The performance of the AL strategies on the Yeast dataset: (a) the HL measure; (b) the F1Ex measure.

Figure 4: The performance of the AL strategies on the TMC2007-500 dataset: (a) the F1Ex measure; (b) the MaF1 measure.

To compare the AL strategies in a quantitative manner, the AUC values were estimated and a statistical analysis was conducted. Tables 4-7 show the AUC results obtained by the nine strategies compared in the experimental study. In all cases, the best results are highlighted in bold typeface, "↓" indicates "the smaller the better", and "↑" indicates "the larger the better". In the tables, the last two rows show the average rank (Avg. Rank) and the ranking position (Pos.) of each strategy according to Friedman's test.

As for the label-based measures (MiF1 and MaF1), the CVIRS strategy generally showed a good performance on the 18 multi-label datasets. The MMU and LCI strategies were effective on the Medical, Enron and Yeast datasets. The CMN strategy obtained good results on the Birds, Genbase, Medical and Enron datasets. The BinMin and ML strategies performed well on the Corel5k, Corel16k, Arts, Entertainment, Recreation and Health datasets.

As for the bipartition-based measures (HL and F1Ex), the CVIRS strategy generally showed a good effectiveness on the 18 multi-label datasets. The MMU and LCI strategies were effective on the Medical and Yeast datasets. The CMN strategy performed well on the Emotions, Birds, Genbase, Medical and Enron datasets. The BinMin strategy obtained good results on the Flags, Yeast, Corel5k, Corel16k and Entertainment datasets.

According to the average rankings computed by Friedman's test, the three AL strategies that obtained the best performance for the label-based and bipartition-based measures were CVIRS, CMN and BinMin, in this order. The MMC and Random strategies had the worst outcomes. Friedman's test rejected all null hypotheses at a significance level α=0.05. Therefore, we can conclude that there were significant differences between the observed AUC values in the bipartition-based and label-based measures considered.

Afterwards, a Shaffer post-hoc test for all pairwise comparisons was carried out. In the statistical analysis, the adjusted p-values [67] were considered; the adjusted p-values take into account the fact that multiple tests are conducted, and they can be directly compared with any significance level [65]. The multiple comparisons are illustrated as a directed graph. An edge γ1 → γ2 shows that strategy γ1 outperforms strategy γ2. Each edge is labelled with the evaluation measures for which γ1 outperformed γ2, and the adjusted p-values of the Shaffer test are shown in parentheses. Figure 5 shows the results of the Shaffer test for the bipartition-based and label-based measures.

Figure 5: Significant differences between AL strategies according to the Shaffer test at the significance level α=0.05.

From a statistical point of view, our proposal, CVIRS, significantly outperformed the Random, ML, MML, MMC, LCI and MMU strategies on all label-based and bipartition-based measures.

As for the MiF1 measure, the BinMin, CMN and LCI strategies significantly outperformed the Random strategy. The CMN and BinMin strategies performed better than the MMC and MML strategies. Shaffer's test did not detect significant differences between the Random, MMC, ML, MML and MMU strategies at the significance level considered.

Regarding the MaF1 measure, the BinMin and CMN strategies significantly outperformed the Random strategy. Furthermore, the CMN strategy performed better than the ML and MMC strategies. Shaffer's test did not detect significant differences between the ML, MML, MMC, MMU, LCI and Random strategies at the significance level considered.

As for the HL measure, the CMN and BinMin strategies significantly outperformed the Random strategy. Shaffer's test did not detect significant differences between the ML, MML, MMC, MMU, LCI and Random strategies at the significance level considered.

With regard to the F1Ex measure, CMN and BinMin performed better than the MML, MMC and Random strategies. Furthermore, the LCI strategy outperformed the Random and MMC strategies. Shaffer's test did not detect significant differences between the ML, MML, MMC, MMU and Random strategies at the significance level considered.

Dataset        Random  BinMin  ML     MML    MMC    CMN    MMU    LCI    CVIRS
Flags          0.541   0.691   0.668  0.671  0.671  0.683  0.688  0.681  0.692
Emotions       0.616   0.621   0.640  0.643  0.644  0.658  0.601  0.607  0.659
Birds          0.265   0.333   0.384  0.385  0.387  0.412  0.326  0.396  0.415
Genbase        0.945   0.949   0.952  0.946  0.923  0.956  0.921  0.940  0.963
Cal500         0.330   0.336   0.331  0.330  0.332  0.332  0.329  0.328  0.346
Medical        0.648   0.648   0.570  0.556  0.609  0.665  0.665  0.665  0.667
Yeast          0.575   0.630   0.618  0.608  0.616  0.640  0.780  0.784  0.658
Scene          0.630   0.634   0.618  0.608  0.616  0.640  0.642  0.630  0.643
Enron          0.420   0.436   0.372  0.378  0.384  0.457  0.447  0.450  0.464
Corel5k        0.101   0.168   0.126  0.128  0.120  0.158  0.154  0.157  0.160
Corel16k       0.099   0.161   0.145  0.146  0.149  0.152  0.155  0.154  0.158
TMC2007-500    0.598   0.608   0.589  0.584  0.584  0.608  0.597  0.600  0.620
Bibtex         0.203   0.299   0.274  0.286  0.289  0.312  0.298  0.314  0.321
Arts           0.200   0.266   0.260  0.262  0.259  0.265  0.249  0.260  0.264
Business       0.305   0.366   0.476  0.391  0.375  0.387  0.411  0.422  0.436
Entertainment  0.259   0.343   0.323  0.304  0.298  0.332  0.334  0.333  0.350
Recreation     0.199   0.268   0.265  0.264  0.258  0.268  0.261  0.255  0.273
Health         0.301   0.359   0.347  0.332  0.315  0.347  0.357  0.341  0.371
Avg. Rank      7.806   3.583   6.000  6.528  6.639  3.278  5.056  4.722  1.389
Pos.           9       3       6      7      8      2      5      4      1

Table 4: The AUC results at the MiF1 (↑) measure. Friedman's test rejected the null hypothesis with a p-value equal to 6.121E-11.

Dataset        Random  BinMin  ML     MML    MMC    CMN    MMU    LCI    CVIRS
Flags          0.569   0.583   0.572  0.576  0.562  0.592  0.575  0.567  0.588
Emotions       0.517   0.520   0.608  0.636  0.636  0.642  0.495  0.498  0.654
Birds          0.304   0.255   0.309  0.310  0.311  0.332  0.239  0.311  0.330
Genbase        0.751   0.785   0.806  0.753  0.699  0.794  0.735  0.788  0.785
Cal500         0.161   0.156   0.154  0.154  0.151  0.162  0.156  0.146  0.170
Medical        0.352   0.348   0.310  0.312  0.317  0.376  0.370  0.369  0.383
Yeast          0.385   0.413   0.416  0.408  0.396  0.393  0.398  0.396  0.400
Scene          0.645   0.640   0.624  0.612  0.628  0.650  0.647  0.634  0.651
Enron          0.152   0.173   0.147  0.152  0.154  0.171  0.170  0.166  0.185
Corel5k        0.274   0.315   0.303  0.310  0.300  0.321  0.300  0.309  0.314
Corel16k       0.033   0.059   0.048  0.054  0.051  0.062  0.060  0.061  0.065
TMC2007-500    0.485   0.497   0.479  0.473  0.467  0.500  0.476  0.487  0.521
Bibtex         0.111   0.145   0.149  0.154  0.152  0.152  0.150  0.151  0.156
Arts           0.132   0.171   0.147  0.148  0.147  0.167  0.155  0.159  0.170
Business       0.135   0.158   0.159  0.161  0.158  0.158  0.148  0.149  0.170
Entertainment  0.154   0.200   0.191  0.195  0.187  0.197  0.190  0.194  0.201
Recreation     0.142   0.209   0.207  0.205  0.204  0.198  0.197  0.190  0.218
Health         0.123   0.188   0.171  0.169  0.155  0.174  0.188  0.185  0.194
Avg. Rank      7.528   3.972   5.778  5.306  6.667  2.972  5.750  5.389  1.639
Pos.           9       3       7      4      8      2      6      5      1

Table 5: The AUC results at the MaF1 (↑) measure. Friedman's test rejected the null hypothesis with a p-value equal to 1.029E-10.

Dataset        Random  BinMin  ML     MML    MMC    CMN    MMU    LCI    CVIRS
Flags          0.365   0.301   0.313  0.311  0.313  0.304  0.304  0.310  0.294
Emotions       0.234   0.235   0.228  0.226  0.224  0.222  0.244  0.241  0.221
Birds          0.200   0.117   0.091  0.091  0.089  0.078  0.117  0.085  0.083
Genbase        0.006   0.005   0.004  0.005  0.007  0.004  0.008  0.006  0.003
Cal500         0.196   0.197   0.200  0.201  0.196  0.199  0.199  0.190  0.185
Medical        0.019   0.018   0.021  0.023  0.020  0.018  0.018  0.019  0.018
Yeast          0.251   0.247   0.266  0.272  0.281  0.248  0.250  0.248  0.243
Scene          0.145   0.136   0.142  0.141  0.144  0.140  0.142  0.139  0.137
Enron          0.086   0.082   0.095  0.095  0.091  0.076  0.085  0.080  0.078
Corel5k        0.045   0.017   0.023  0.022  0.020  0.017  0.019  0.019  0.017
Corel16k       0.052   0.036   0.039  0.040  0.042  0.036  0.037  0.042  0.035
TMC2007-500    0.084   0.078   0.084  0.085  0.085  0.079  0.087  0.082  0.078
Bibtex         0.029   0.017   0.021  0.020  0.023  0.017  0.021  0.019  0.014
Arts           0.295   0.180   0.140  0.138  0.139  0.198  0.213  0.158  0.205
Business       0.188   0.115   0.084  0.098  0.110  0.105  0.099  0.102  0.094
Entertainment  0.245   0.170   0.168  0.172  0.194  0.182  0.177  0.178  0.165
Recreation     0.289   0.239   0.210  0.225  0.234  0.231  0.221  0.233  0.213
Health         0.187   0.128   0.112  0.124  0.129  0.133  0.154  0.161  0.119
Avg. Rank      7.639   4.000   5.083  5.556  6.389  3.667  5.806  5.028  1.833
Pos.           9       3       5      6      8      2      7      4      1

Table 6: The AUC results at the HL (↓) measure. Friedman's test rejected the null hypothesis with a p-value equal to 5.829E-9.

Dataset        Random  BinMin  ML     MML    MMC    CMN    MMU    LCI    CVIRS
Flags          0.601   0.677   0.644  0.647  0.648  0.663  0.662  0.657  0.674
Emotions       0.555   0.563   0.587  0.592  0.587  0.615  0.539  0.547  0.616
Birds          0.488   0.522   0.518  0.520  0.520  0.605  0.513  0.576  0.590
Genbase        0.955   0.958   0.958  0.953  0.930  0.958  0.941  0.950  0.960
Cal500         0.335   0.336   0.331  0.330  0.332  0.330  0.328  0.327  0.345
Medical        0.625   0.616   0.521  0.510  0.575  0.639  0.640  0.645  0.650
Yeast          0.553   0.565   0.558  0.547  0.533  0.554  0.565  0.566  0.568
Scene          0.599   0.594   0.573  0.554  0.578  0.610  0.614  0.597  0.634
Enron          0.424   0.432   0.377  0.382  0.388  0.455  0.440  0.450  0.465
Corel5k        0.099   0.158   0.124  0.123  0.113  0.146  0.130  0.147  0.158
Corel16k       0.086   0.147   0.135  0.139  0.137  0.139  0.140  0.141  0.144
TMC2007-500    0.600   0.589   0.574  0.570  0.567  0.592  0.605  0.607  0.622
Bibtex         0.203   0.271   0.268  0.271  0.269  0.290  0.274  0.283  0.295
Arts           0.198   0.272   0.255  0.256  0.253  0.273  0.254  0.272  0.275
Business       0.374   0.393   0.540  0.421  0.458  0.438  0.477  0.485  0.498
Entertainment  0.296   0.366   0.340  0.322  0.301  0.357  0.354  0.358  0.362
Recreation     0.221   0.281   0.274  0.270  0.268  0.281  0.284  0.286  0.290
Health         0.332   0.373   0.358  0.331  0.320  0.361  0.374  0.371  0.386
Avg. Rank      7.444   3.778   6.139  6.722  7.056  3.639  4.833  4.083  1.306
Pos.           9       3       6      7      8      2      5      4      1

Table 7: The AUC results at the F1Ex (↑) measure. Friedman's test rejected the null hypothesis with a p-value equal to 4.523E-11.

4.4.2. Label ranking task

Figures 6-9 represent the learning curves of the AL strategies on the Emotions, Cal500, Medical and Yeast datasets. As for the AP and RL measures, Figure 6 shows that the CVIRS and CMN strategies obtained the best results on the Emotions dataset, while the MMU strategy had worse results than the Random strategy. Figure 7 shows that the CVIRS strategy had the best performance at the AP and RL measures on the Cal500 dataset. The CMN, ML, MML and MMU strategies showed a poor performance on this dataset; their learning curves were dominated by the learning curve of the Random strategy. Figure 8 shows that the CVIRS and LCI strategies obtained the best effectiveness at the AP and RL measures on the Medical dataset. The ML, MML, BinMin and MMC strategies showed a poor performance. Figure 9 shows that the CVIRS, BinMin, LCI and CMN strategies obtained the best results at the AP and RL measures on the Yeast dataset. The MML and MMC strategies had a lower effectiveness.

Figure 6: The performance of the multi-label AL strategies on the Emotions dataset: (a) the AP measure; (b) the RL measure.

Figure 7: The performance of the multi-label AL strategies on the Cal500 dataset: (a) the AP measure; (b) the RL measure.

Figure 8: The performance of the multi-label AL strategies on the Medical dataset: (a) the AP measure; (b) the RL measure.

Figure 9: The performance of the multi-label AL strategies on the Yeast dataset: (a) the AP measure; (b) the RL measure.

Tables 8-10 show the AUC results obtained by the AL strategies at the ranking-based measures (OE, RL and AP). In all cases, the best results are highlighted in bold typeface. In the tables, the last two rows show the average rank (Avg. Rank) and the ranking position (Pos.) of each strategy according to Friedman's test.

As for the ranking-based measures, the CVIRS strategy generally performed well on the 18 multi-label datasets. The BinMin strategy obtained good results on the Flags, Corel16k and TMC2007-500 datasets. The LCI strategy had a good performance on the Arts dataset. The CMN strategy showed a good effectiveness on the Birds and Emotions datasets.

According to the average rankings computed by Friedman's test, for the OE measure the strategies that obtained the best results were CVIRS, LCI and CMN, in this order. With regard to the RL measure, the strategy that had the best performance was CVIRS. As for the AP measure, the strategies that obtained the best results were CVIRS, BinMin and CMN, in this order. Friedman's test rejected all null hypotheses at a significance level α=0.05. Thus, we can conclude that there were significant differences between the observed AUC values at the ranking-based measures considered.

Afterwards, a Shaffer post-hoc test for all pairwise comparisons was conducted. Figure 10 shows the results of the Shaffer test. Regarding the OE measure, the CVIRS strategy significantly outperformed the Random, ML, MML, MMC, MMU and BinMin strategies. Furthermore, the LCI strategy performed better than the Random strategy. The statistical test did not detect significant differences between the Random, MMC, ML, MML, MMU, BinMin and CMN strategies at the significance level considered.

As for the RL measure, the CVIRS strategy significantly outperformed all strategies, and the BinMin strategy outperformed the Random strategy. The Shaffer test did not detect significant differences between the Random, ML, MML, MMC, MMU, CMN and LCI strategies at the significance level considered.

With regard to the AP measure, the CVIRS strategy significantly outperformed the Random, MMU, LCI, ML, MML and MMC strategies. The BinMin and CMN strategies performed better than the Random strategy. Significant differences between the CVIRS, BinMin and CMN strategies were not detected. The Shaffer test did not detect significant differences between the Random, MMU, LCI, ML, MML and MMC strategies at the significance level considered.

4.4.3. Discussion

The experimental study aimed to compare our proposal with several state-of-the-art AL strategies on two multi-label learning tasks over many multi-label datasets. The evidence suggests that our proposal, CVIRS, was effective on the two tasks analysed: the multi-label classification and label ranking tasks. Analysing the average rankings returned by Friedman's test, the CVIRS, CMN and BinMin strategies had the best results for the multi-label classification task. As for the label ranking task, the strategies that obtained the best effectiveness were CVIRS, BinMin, CMN and LCI.

ACCEPTED MANUSCRIPT

Dataset         Random  BinMin  ML      MML     MMC     CMN     MMU     LCI     CVIRS
Flags           0.332   0.252   0.347   0.344   0.272   0.274   0.239   0.276   0.238
Emotions        0.324   0.322   0.305   0.305   0.299   0.290   0.321   0.322   0.276
Birds           0.901   0.853   0.785   0.784   0.782   0.776   0.845   0.801   0.775
Genbase         0.037   0.022   0.033   0.050   0.065   0.039   0.040   0.045   0.029
Cal500          0.800   0.847   0.845   0.848   0.824   0.840   0.808   0.780   0.769
Medical         0.303   0.320   0.416   0.432   0.356   0.299   0.299   0.294   0.283
Yeast           0.382   0.342   0.392   0.406   0.456   0.351   0.372   0.358   0.326
Scene           0.328   0.332   0.349   0.357   0.345   0.336   0.328   0.335   0.321
Enron           0.699   0.618   0.791   0.781   0.768   0.640   0.650   0.687   0.618
Corel5k         0.930   0.851   0.928   0.913   0.915   0.849   0.866   0.874   0.852
Corel16k        0.902   0.840   0.848   0.846   0.851   0.848   0.844   0.842   0.838
TMC2007-500     0.339   0.313   0.350   0.360   0.361   0.321   0.366   0.349   0.308
Bibtex          0.621   0.586   0.615   0.609   0.612   0.566   0.588   0.571   0.549
Arts            0.841   0.760   0.739   0.736   0.739   0.756   0.780   0.736   0.771
Business        0.799   0.726   0.500   0.654   0.700   0.679   0.659   0.675   0.641
Entertainment   0.842   0.738   0.754   0.775   0.788   0.750   0.723   0.735   0.703
Recreation      0.823   0.773   0.757   0.787   0.777   0.781   0.788   0.772   0.747
Health          0.789   0.764   0.697   0.674   0.642   0.773   0.642   0.640   0.634
Avg. Rank       7.083   4.444   5.972   6.389   6.167   4.333   4.861   4.167   1.583
Pos.            9       4       6       8       7       3       5       2       1

Table 8: The AUC results at the OE (↓) measure. Friedman's test rejected the null hypothesis with a p-value equal to 1.601E-8.

Dataset         Random  BinMin  ML      MML     MMC     CMN     MMU     LCI     CVIRS
Flags           0.302   0.261   0.293   0.282   0.279   0.288   0.266   0.289   0.266
Emotions        0.201   0.206   0.194   0.192   0.190   0.184   0.211   0.207   0.184
Birds           0.198   0.158   0.132   0.132   0.134   0.128   0.156   0.133   0.126
Genbase         0.009   0.006   0.008   0.008   0.029   0.014   0.008   0.007   0.005
Cal500          0.249   0.250   0.254   0.255   0.251   0.256   0.253   0.248   0.234
Medical         0.089   0.118   0.171   0.183   0.113   0.089   0.087   0.085   0.083
Yeast           0.229   0.220   0.231   0.237   0.254   0.220   0.226   0.221   0.216
Scene           0.131   0.137   0.147   0.154   0.145   0.133   0.127   0.136   0.123
Enron           0.189   0.225   0.188   0.185   0.183   0.214   0.192   0.218   0.174
Corel5k         0.501   0.425   0.377   0.372   0.374   0.457   0.431   0.426   0.406
Corel16k        0.399   0.346   0.351   0.348   0.350   0.345   0.349   0.356   0.341
TMC2007-500     0.088   0.078   0.084   0.085   0.085   0.078   0.079   0.077   0.077
Bibtex          0.301   0.268   0.273   0.274   0.270   0.261   0.264   0.267   0.248
Arts            0.365   0.262   0.285   0.283   0.300   0.259   0.262   0.254   0.250
Business        0.299   0.280   0.166   0.190   0.296   0.284   0.254   0.231   0.226
Entertainment   0.350   0.240   0.224   0.221   0.249   0.261   0.257   0.264   0.225
Recreation      0.311   0.248   0.258   0.267   0.274   0.258   0.254   0.255   0.240
Health          0.287   0.221   0.241   0.239   0.230   0.238   0.233   0.236   0.210
Avg. Rank       7.417   4.528   5.667   5.444   5.806   4.917   4.889   4.806   1.528
Pos.            9       2       7       6       8       5       4       3       1

Table 9: The AUC results at the RL (↓) measure. Friedman's test rejected the null hypothesis with a p-value equal to 1.732E-7.

Dataset         Random  BinMin  ML      MML     MMC     CMN     MMU     LCI     CVIRS
Flags           0.685   0.792   0.757   0.761   0.775   0.774   0.778   0.771   0.787
Emotions        0.759   0.768   0.778   0.778   0.782   0.789   0.758   0.765   0.790
Birds           0.399   0.418   0.507   0.507   0.507   0.514   0.424   0.492   0.519
Genbase         0.960   0.980   0.975   0.966   0.943   0.968   0.968   0.960   0.978
Cal500          0.345   0.340   0.332   0.331   0.340   0.332   0.338   0.349   0.367
Medical         0.757   0.725   0.642   0.628   0.704   0.754   0.755   0.764   0.775
Yeast           0.681   0.695   0.677   0.664   0.644   0.692   0.680   0.690   0.699
Scene           0.790   0.792   0.780   0.775   0.782   0.792   0.807   0.789   0.804
Enron           0.448   0.453   0.387   0.392   0.401   0.474   0.447   0.450   0.479
Corel5k         0.113   0.163   0.124   0.122   0.129   0.152   0.134   0.138   0.150
Corel16k        0.166   0.183   0.168   0.170   0.175   0.178   0.175   0.174   0.184
TMC2007-500     0.726   0.740   0.720   0.714   0.714   0.738   0.717   0.722   0.744
Bibtex          0.301   0.356   0.337   0.341   0.354   0.375   0.377   0.374   0.387
Arts            0.296   0.392   0.390   0.394   0.385   0.394   0.376   0.408   0.386
Business        0.444   0.436   0.618   0.592   0.302   0.462   0.498   0.488   0.525
Entertainment   0.364   0.439   0.431   0.395   0.374   0.424   0.428   0.439   0.460
Recreation      0.302   0.398   0.405   0.390   0.378   0.388   0.411   0.401   0.414
Health          0.356   0.412   0.443   0.421   0.400   0.400   0.402   0.410   0.425
Avg. Rank       7.250   3.806   5.667   6.417   6.611   4.139   4.833   4.556   1.722
Pos.            9       2       6       7       8       3       5       4       1

Table 10: The AUC results at the AP (↑) measure. Friedman's test rejected the null hypothesis with a p-value equal to 3.137E-9.
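The AUC values reported in Tables 8-10 summarise each learning curve as a single number. The exact integration rule is not restated here, so the sketch below assumes the common convention of trapezoidal integration of the measure over the fraction of labelled examples, normalised by the queried range; the budget and curve values are hypothetical.

```python
import numpy as np

def learning_curve_auc(fractions, scores):
    """Trapezoidal area under a learning curve, normalised by the queried
    range so that curves over different budgets remain comparable."""
    x = np.asarray(fractions, dtype=float)
    y = np.asarray(scores, dtype=float)
    area = 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))
    return area / (x[-1] - x[0])

# Hypothetical RL values measured after each batch of queries.
budget = [0.05, 0.10, 0.15, 0.20, 0.25]
rl_curve = [0.30, 0.26, 0.24, 0.23, 0.22]
print(learning_curve_auc(budget, rl_curve))   # lower is better for RL (↓)
```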

Generally speaking, the CVIRS strategy performed well on multi-label datasets with diverse characteristics. The evidence suggested that CVIRS obtained better results on multi-label datasets that have a small number of labels (e.g. the Emotions, Birds, Yeast, Arts, Business, Entertainment, Health and Recreation datasets) than on datasets that have a large number of labels (e.g. the Cal500, Corel5k and Bibtex datasets). In datasets with a large number of labels, the performance of CVIRS can be affected by the positional method used to resolve the rank aggregation problem formulated in this work.

It is worth noting that, although Shaffer's test did not detect significant differences between the CVIRS, CMN and BinMin strategies on some of the evaluation measures considered, the CVIRS strategy was ranked first in every average ranking computed by Friedman's test.

The evidence indicated that the CVIRS strategy performed well using BR-SVM as its base classifier and the experimental settings adopted in this work. However, for future research, it would be important to test our proposal under other experimental settings. It is also important to study the effectiveness of our strategy with base classifiers that do not follow the BR approach, e.g. multi-label classifiers belonging to the algorithm adaptation category.
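As a pointer for reproducing the BR-SVM setting mentioned above, the sketch below builds a binary relevance classifier with probability outputs using scikit-learn. It only approximates the experimental setup; the exact SVM parameters used in the paper are not restated here, and the variable names are placeholders.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Binary relevance: one independent binary SVM per label.
# probability=True enables Platt-scaled probability estimates, which
# uncertainty-based AL strategies need from their base classifier.
br_svm = OneVsRestClassifier(SVC(kernel="linear", probability=True))

# Hypothetical usage (X_labelled, Y_labelled, X_unlabelled are placeholders):
# br_svm.fit(X_labelled, Y_labelled)           # Y: (n_examples, n_labels) 0/1
# probs = br_svm.predict_proba(X_unlabelled)   # per-label probabilities
```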




5. Conclusions

In this work, an effective AL strategy for working on multi-label data, named CVIRS, was proposed. CVIRS selects the most informative examples by combining two measures. The first measure computes the unified uncertainty of an unlabelled example based on the base classifier predictions; to aggregate the information across all labels, a rank aggregation problem was defined, and a simple rank aggregation method was used to resolve it. The second measure computes the inconsistency of a predicted label set by taking into account information about the label space of the labelled set. The CVIRS strategy can be used with any base classifier that can obtain proper probability estimates from its outputs. CVIRS is not restricted to base classifiers that use problem transformation methods; it can also be used with multi-label learning algorithms belonging to the algorithm adaptation category.

An extensive comparison of several AL strategies was conducted over 18 multi-label datasets, showing that the CVIRS strategy is competitive with respect to the state-of-the-art multi-label AL strategies. CVIRS was effective on multi-label datasets with diverse characteristics, and it performed well on the two tasks analysed: multi-label classification and label ranking. The evidence suggested that the uncertainty measure based on the rank aggregation problem is a good approximation for computing the unified uncertainty of an unlabelled example. The results also demonstrated the benefits of combining an uncertainty measure with a measure that takes the label space information into account.

Future research will study more effective approaches to resolve the rank aggregation problem formulated in this work. In addition, we will study multi-label AL strategies for batch-mode scenarios, an area where few studies have been carried out. It would also be useful to analyse the effectiveness of our approach using incremental learning algorithms to speed up the updating of the base classifiers in each iteration.
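To make the combination of the two measures concrete, the sketch below gives one possible reading of such a strategy: per-label uncertainty ranks are aggregated with a simple positional (Borda-style) method and combined with a Hamming-based label-set inconsistency term. This is an illustration under our own assumptions, not the exact CVIRS formulation.

```python
import numpy as np

def query_index(probs, predicted_sets, labelled_sets):
    """Pick the next example to query (illustrative only)."""
    # (i) Unified uncertainty: per label, examples whose probability lies
    # closest to 0.5 are most uncertain; rank examples per label and
    # aggregate the ranks with a simple positional method.
    margins = np.abs(probs - 0.5)                     # (n, n_labels)
    ranks = margins.argsort(axis=0).argsort(axis=0)   # rank 0 = most uncertain
    uncertainty = 1.0 - ranks.mean(axis=1) / max(len(probs) - 1, 1)

    # (ii) Inconsistency: mean Hamming distance between the predicted label
    # set and the label sets already observed in the labelled pool.
    inconsistency = np.array([
        np.mean([np.mean(p != l) for l in labelled_sets])
        for p in predicted_sets
    ])
    return int(np.argmax(uncertainty * inconsistency))

# Hypothetical usage with a BR-style classifier (placeholder names):
# probs = br_svm.predict_proba(X_unlabelled)
# idx = query_index(probs, probs >= 0.5, Y_labelled.astype(bool))
```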

Acknowledgements

This research was supported by the Spanish Ministry of Economy and Competitiveness, project TIN2014-55252-P, and by FEDER funds.



Oscar Reyes was born in Holguín, Cuba, in 1984. He received the B.S. and M.Sc. degrees in Computer Science from the University of Holguín, Cuba, in 2008 and 2011, respectively. He is currently an Assistant Professor in the Department of Computer Science of the University of Holguín, Cuba, and a member of the Knowledge Discovery and Intelligent Systems Research Laboratory of the University of Córdoba, Spain. He is currently working toward the Ph.D. degree. His current research interests are in the fields of data mining, machine learning, metaheuristics, and their applications.


Carlos Morell received his B.S. degree in Computer Science and his Ph.D. in Artificial Intelligence from the Universidad Central de Las Villas, in 1995 and 2005, respectively. Currently, he is a Professor in the Department of Computer Science at the same university. In addition, he leads the Artificial Intelligence Research Laboratory. His teaching and research interests include Machine Learning, Soft Computing and Programming Languages.

Sebastián Ventura is currently an Associate Professor in the Department of Computer Science and Numerical Analysis at the University of Córdoba, where he heads the Knowledge Discovery and Intelligent Systems Research Laboratory. He received his B.Sc. and Ph.D. degrees in sciences from the University of Córdoba, Spain, in 1989 and 1996, respectively. He has published more than 150 papers in journals and scientific conferences, and he has edited three books and several special issues of international journals. He has also been engaged in twelve research projects (as coordinator of four of them) supported by the Spanish and Andalusian governments and the European Union. His main research interests are in the fields of computational intelligence, machine learning, data mining, and their applications. Dr. Ventura is a senior member of the IEEE Computer, IEEE Computational Intelligence and IEEE Systems, Man and Cybernetics Societies, as well as the Association for Computing Machinery (ACM).
