Dynamic ensemble pruning based on multi-label classification

Fotini Markatopoulou (b), Grigorios Tsoumakas (a, corresponding author), Ioannis Vlahavas (a)

(a) Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
(b) Information Technologies Institute, CERTH, 57001 Thessaloniki, Greece

Article history: Received 25 December 2013; Received in revised form 19 July 2014; Accepted 21 July 2014; Available online 6 October 2014.

Abstract

Dynamic (also known as instance-based) ensemble pruning selects a (potentially) different subset of models from an ensemble during prediction based on the given unknown instance with the goal of maximizing prediction accuracy. This paper models dynamic ensemble pruning as a multi-label classification task, by considering the members of the ensemble as labels. Multi-label training examples are constructed by evaluating whether ensemble members are accurate or not on the original training set via cross-validation. We show that classification accuracy is maximized when learning algorithms that optimize example-based precision are used in the multi-label classification task. Results comparing the proposed framework against state-of-the-art dynamic ensemble pruning approaches in a variety of datasets using a heterogeneous ensemble of 200 classifiers show that it leads to significantly improved accuracy.

Keywords: Ensemble pruning; Ensemble selection; Multi-label classification; Dynamic classifier fusion

1. Introduction

Supervised ensemble methods are concerned with the production and the combination of multiple predictive models. One dimension along which we could categorize such methods is the number of models that affect the final decision. Usually all models are taken into consideration. When models are classifiers, this is called classifier fusion. Some methods, however, select just one model from the ensemble. When models are classifiers, this is called classifier selection. A third option, standing in between these two, is to select a subset of the ensemble's models. This is mainly called ensemble pruning or ensemble selection [26].

Ensemble pruning methods can be either static, meaning that they select a fixed subset of the original ensemble for all test instances, or dynamic, also called instance-based, where a different subset of the original ensemble may be selected for each different test instance. The rationale of using dynamic ensemble pruning approaches is that different models have different areas of expertise in the instance space. Therefore, static approaches, which are forced to select a fixed subset prior to seeing an unclassified instance, may have a theoretical disadvantage compared to dynamic ones. On the other hand, static approaches lead to improved space complexity, as they typically retain a small percentage of the original ensemble, in contrast to dynamic approaches, which need to retain the complete original ensemble.


We propose a new approach to the instance-based ensemble pruning problem by modeling it as a multi-label learning task [24]. Labels correspond to classifiers, and multi-label training examples are formed based on the ability of each classifier to correctly classify each original training example. This way we can take advantage of recent advances in the area of multi-label learning and attack collectively, instead of separately, the problems of predicting whether each classifier will correctly classify a given unclassified instance.

This paper builds upon our previous work [18] and extends it in the following main directions: (a) it approximately doubles the number of datasets of the empirical comparison, providing further evidence of the effectiveness of the proposed algorithm, and (b) it employs a thresholding strategy that automatically computes the threshold that optimizes precision, leading to a fairer comparison against the state-of-the-art.

The rest of this paper is organized as follows. Section 2 reviews related work on dynamic ensemble pruning and Section 3 discusses the proposed approach. Section 4 presents the experimental setup. Section 5 presents the empirical study and Section 6 discusses the conclusions of this work.

2. Related work

Many approaches deal directly with instance-based ensemble pruning [27,4,13,10]. These are presented in Section 2.2. However, there are also some that deal with dynamic approaches to classifier selection and fusion [30,7,21,19], which can be considered as extreme cases of ensemble pruning.


Table 1. Summary of dynamic classifier selection, fusion and pruning methods.

                   Classifier selection         Ensemble pruning    Classifier fusion
k-NN-based         DS, OLA, LCA, MCB, [19,7]    DVS, KNORA, [31]    DV
Clustering-based                                [14]
Ordering-based                                  [16,32,10,4]
Other                                           [17]

In addition, some dynamic classifier selection approaches may in some cases (e.g. ties) take into consideration more than one model. We therefore discuss such methods too, in Section 2.1. Section 2.3 discusses the issues of time complexity improvement and diversity in the context of dynamic ensemble pruning methods. For ease of reference and navigation of this section, Table 1 shows the category (i.e. selection, fusion, pruning) of each method that is discussed, using either its acronym where available (e.g. KNORA) or its citation.

2.1. Dynamic classifier selection and fusion

The approach in Woods et al. [30] starts with retrieving the k nearest neighbors of a given test instance from the training set. It then classifies this test instance using the most competent classifier within this local region. In case of ties, majority voting is applied. The local performance of classifiers is assessed using two different metrics. The first one, called overall local accuracy (OLA), measures the percentage of correct classifications of a model for the examples that exist in the local region. The second one, called local class accuracy (LCA), also measures the percentage of correct classifications of a model within the local region, but only for those examples where the model had given the same output as the one it gives for the current unlabeled instance being considered. A very similar approach was proposed independently at the same time [7], also taking the distance of the k nearest neighbors into account.

The dynamic selection (DS) and dynamic voting (DV) approaches in Puuronen et al. [21,20] are in the same spirit as Woods et al. [30] and Giacinto and Roli [7]. A kNN approach is initially used to find the training instances most similar to the given test instance. DS selects the classifier with the least error within the local area of the neighbors, weighted by distance. In fact, DS is very similar to the weighted version of OLA presented in Giacinto and Roli [7]. DV is different, as it is a classifier fusion approach: it combines all models weighted by their local competence.

Yet another approach along the same lines is Giacinto and Roli [8]. After finding the k nearest neighbors of the test instance, this approach further filters the neighborhood based on the similarity of the predictions of all models for this instance and each neighbor. In this sense, this approach is similar to the LCA variation in Woods et al. [30]. It finally selects the most competent classifier in the reduced neighborhood. The predictions of all models for an instance are in this paper collectively called multiple classifier behavior (MCB).

The approach in Ortega et al. [19] estimates whether the ensemble's models will be correct/incorrect with respect to a given test instance, using a learning algorithm trained from the k-fold cross-validation performance of the models on the training set. It can be considered as a generalization of the approaches we have seen so far in this subsection, where a nearest neighbor approach was specifically used instead. The approach we propose in this paper is based on the same principle, with the difference that multi-label learning algorithms are employed and therefore the binary tasks of predicting a correct/incorrect decision for each model are viewed in a collective way.
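As an illustration of the local-accuracy computations that OLA- and LCA-style selection rely on, the following minimal sketch (ours, not the authors' code) assumes a cached matrix train_preds of base-model predictions on the training set and plain Euclidean nearest neighbors; the function names and the tie-breaking rule are illustrative choices.

```python
import numpy as np

def local_accuracies(x, X_train, y_train, train_preds, k=10):
    """OLA-style competence: accuracy of each base model on the k training
    instances nearest to x. train_preds[i, j] is model j's prediction for
    training instance i."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dist)[:k]                          # local region of competence
    return (train_preds[nn] == y_train[nn, None]).mean(axis=0)

def ola_select(x, X_train, y_train, train_preds, k=10):
    """Dynamic classifier selection: index of the most competent model in the
    local region (ties broken by the lowest index)."""
    return int(np.argmax(local_accuracies(x, X_train, y_train, train_preds, k)))

def lca_competence(x, X_train, y_train, train_preds, preds_for_x, k=10):
    """LCA-style competence: accuracy of each model restricted to those
    neighbours for which it predicted the same class it now predicts for x
    (preds_for_x[j] is model j's prediction for x)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dist)[:k]
    scores = np.zeros(train_preds.shape[1])
    for j in range(train_preds.shape[1]):
        mask = train_preds[nn, j] == preds_for_x[j]
        if mask.any():
            scores[j] = (train_preds[nn[mask], j] == y_train[nn[mask]]).mean()
    return scores
```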

2.2. Dynamic ensemble selection

Similar to the dynamic classifier selection methods, the majority of the dynamic ensemble selection methods start by retrieving the k nearest neighbors of a given test instance from the training set, in order to construct a new set of instances known as the local region of competence [31]. The selection algorithms decide on the appropriate subset of the initial ensemble based on different properties (e.g. accuracy, diversity) of the base classifiers in this local region.

Dynamic voting with selection (DVS) [27,28] is an approach that stands in between the DS and DV algorithms that were mentioned in the previous subsection. First, about half of the models in the ensemble, those with local errors that fall into the upper half of the error range of the committee, are discarded. Then, the rest are combined using DV. Since this variation eventually selects a subset of the original models, we can consider it an instance-based ensemble pruning approach.

The primary goal of k-nearest-oracles (KNORA) [13] is improving the accuracy compared to the complete ensemble. Four different versions of the basic KNORA algorithm are proposed, all based on an initial stage of identifying the k nearest neighbors of a given unclassified instance. KNORA-ELIMINATE selects those classifiers that correctly classify all k neighbors. In case no such classifier exists, the k value is decreased until at least one is found. KNORA-UNION selects those classifiers that correctly classify at least one of the k neighbors. KNORA-ELIMINATE-W and KNORA-UNION-W are variations that weight the votes of classifiers according to their Euclidean distance to the unclassified instance.

While the above methods only consider the accuracy of the ensemble within the local region, the method proposed by Xiao et al. [31] simultaneously considers both the accuracy and the diversity of the pruned ensemble. Specifically, this method utilizes the symmetric regularity criterion to measure the accuracy of the ensemble and the double-fault measure to estimate the diversity. Finally, a GMDH-based neural network describes the relationship between the class labels of the local region of competence and the test instance.

We can distinguish two more categories of dynamic ensemble selection methods that do not consider the k nearest neighbors of test instances: clustering-based methods and ordering-based methods. Clustering-based methods use clustering algorithms (k-means, Gaussian mixture models, etc.) in order to estimate the local regions (Kuncheva [14]). In contrast with the k-nearest-neighbor-based methods that generate the local regions of competence during the test phase, clustering-based methods estimate them offline during the training phase; only the selection of a winning local region, and of the corresponding classifier ensemble, takes place during the test phase. Ordering-based methods utilize statistical or probabilistic measures in order to produce a decreasing order of the base classifiers, from the most suitable for a given test instance to the least suitable. The method proposed by Li et al. [16] assumes that base classifiers not only make a classification decision but also return a confidence score that shows their belief that their decision is correct. Dynamic ensemble selection is performed by ordering the base classifiers according to the confidence scores, and fusion is performed using weighted voting. The method proposed by Yan et al. [32] is a two-step approach. In the first step, classifiers are ordered based on their diversity using Fleiss's statistic; in the second step, classifiers in this rank are selected until a confidence threshold is reached.

Recently, a statistical approach has been proposed for instance-based pruning of homogeneous ensembles, provided that the models are produced via independent applications of a randomized learning algorithm on the same training data, and that majority voting is used [10]. It is based on the observation that, given the decisions made by the classifiers that have already been queried, the probability distribution of the remaining class predictions can be calculated via a Polya urn model. During prediction, it samples the


ensemble members randomly without replacement, stopping when the probability that the predicted class will change is below a specified level. As this approach assumes homogeneous models, it aims at speeding up the classification process without significant loss in accuracy. In contrast, our approach is directed at heterogeneous ensembles and aims primarily at improving the predictive performance compared to the full ensemble.

Another approach with the same goal as Hernández-Lobato et al. [10] is Fan et al. [4]. In that work a first stage of static pruning takes place, followed by a second stage of dynamic pruning, called dynamic scheduling. In the dynamic stage, classifiers are considered one by one in decreasing order of total benefit (the authors focused on cost-sensitive applications). This iterative process stops when the difference between the current ensemble and the complete ensemble, as estimated assuming a normal distribution with parameters calculated on the training set, is small.

Other approaches that cannot be included in any of the aforementioned categories have been proposed. For example, in Lysiak et al. [17] the dynamic selection problem is treated as an optimization problem and the proposed method uses the simulated annealing algorithm to solve it.

2.3. Time complexity and diversity

Since the output of dynamic ensemble pruning methods depends on just a subset of the models of the original ensemble, one would expect that there are always gains in terms of time complexity. However, this is not necessarily true. Some methods (e.g. the LCA variant in [30]) first query all models and then select a subset of those to combine. Others might require computations that are more expensive than querying all models. For example, locating the 10 nearest neighbors of a new unclassified instance and then querying one decision tree might be more expensive than directly querying 100 decision trees. Therefore, only computationally simple approaches to pruning, such as Hernández-Lobato et al. [10], can lead to improved time complexity in the general case, sometimes at the expense of accuracy. The primary goal of most of the other methods is improving the accuracy compared to the complete ensemble. The proposed approach falls into this latter category of methods: it aims at improved accuracy and cannot guarantee time complexity improvements, as it first needs to query a multi-label learner.

A potential point of criticism against most of the presented dynamic classifier combination methods, as well as against the proposed framework, could be that they are focused on accuracy and ignore diversity. The question is, should diversity be a desired property of the ensembles selected by instance-based ensemble pruning methods? The concept of diversity has been studied extensively [15] and it is generally accepted that diversity and accuracy are both required for a successful ensemble. Diversity is desired because models that err in different parts of the input space can help each other correct their mistakes through a voting process. In instance-based approaches


however, we are interested in the performance for a specific example, not the whole input space in general, or a part of it. In this case we would want to choose only those models that are as accurate as possible for this example, and not those models that may make mistakes in it. Our approach is based solely on accuracy to select the appropriate subset of the ensemble. Note, however, that diversity is still a necessary property of the full ensemble.
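To make the neighborhood-based selection rules of Section 2.2 concrete, the following minimal sketch (ours, not the authors' code) outlines KNORA-ELIMINATE and KNORA-UNION, again assuming a cached matrix train_preds of base-model predictions on the training set; the fall-back to the full ensemble when no oracle exists even for k=1 is an illustrative choice.

```python
import numpy as np

def knora_eliminate(x, X_train, y_train, train_preds, k=10):
    """Keep the classifiers that are correct on every one of the k nearest
    neighbours; if none qualifies, shrink the neighbourhood until at least
    one does."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dist)[:k]
    for kk in range(k, 0, -1):
        correct = train_preds[nn[:kk]] == y_train[nn[:kk], None]   # (kk, t)
        oracle = np.where(correct.all(axis=0))[0]
        if len(oracle) > 0:
            return oracle                       # indices of selected members
    return np.arange(train_preds.shape[1])      # fall back to the full ensemble

def knora_union(x, X_train, y_train, train_preds, k=10):
    """Keep every classifier that is correct on at least one of the k
    nearest neighbours."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dist)[:k]
    correct = train_preds[nn] == y_train[nn, None]
    return np.where(correct.any(axis=0))[0]
```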

3. Our approach

The main idea of this work is that instance-based ensemble pruning can be modeled as a multi-label classification task, whose input space is the same as the input space of the classification task at hand and whose label space contains one label for each classifier in the ensemble. Given a new test instance, a multi-label classification model for this task would output a subset of classifiers, which is exactly what dynamic ensemble pruning is about. Formally, consider an ensemble of $t$ models $h_j : X \to Y$, $j = 1 \ldots t$, where $X = \mathbb{R}^d$ is the input space and $Y = \{y_1, \ldots, y_k\}$ is the domain of the target variable of the classification task at hand. The multi-label classification task aims to learn a model $h_m$ that maps each instance of $X$ to a subset of $\{h_1, \ldots, h_t\}$.

To construct a training set for this task, we should have knowledge of which of the classifiers in the ensemble are correct/incorrect for each training example of the classification task at hand. This could be achieved by letting each member of the ensemble classify each training example and comparing its output with the true class of the example. If the prediction of a classifier is correct then a positive class value is used for the corresponding label, while if the prediction is incorrect then a negative class value is used. As the predictions of a classifier on its training set are biased, we suggest the use of cross-validation for constructing the multi-label training set. Formally, consider a training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}$, $i = 1 \ldots n$, where $\mathbf{x}^{(i)} \in X$ and $y^{(i)} \in Y$, and the predictions $o_j^{(i)}$ obtained via cross-validation on $D$ from the $t$ different learning algorithms and/or different parameterizations of the same algorithms that were used to produce the models $h_j$, $j = 1 \ldots t$. The multi-label training set is then $M = \{(\mathbf{x}^{(i)}, Y^{(i)})\}$, $i = 1 \ldots n$, where $Y^{(i)} = \{I(o_j^{(i)} = y^{(i)})\}$, $j = 1 \ldots t$, with $I(\mathrm{true}) = 1$ and $I(\mathrm{false}) = 0$. Fig. 1 exemplifies the process of constructing the multi-label training set for the image segmentation dataset (segment) of the European project Statlog [5]. The target attribute of this dataset has 7 values: brickface, sky, foliage, cement, window, path, grass.

Given an unlabeled instance, the multi-label classifier $h_m$ is first queried, outputting a subset of models $Z \subseteq \{h_1, \ldots, h_t\}$ that it considers will correctly classify this instance. Note that the empty set is a feasible output for certain multi-label learning algorithms. However, in our case an empty set is meaningless, as it means that all classifiers are pruned and that no prediction can be given. Therefore, for the proposed multi-label learning task, multi-label classifiers should be augmented with the constraint of non-empty predictions ($Z \neq \emptyset$).

Fig. 1. Creating the multi-label training set.
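The construction of the multi-label training set depicted in Fig. 1 can be sketched as follows, using scikit-learn's cross_val_predict to obtain the out-of-fold predictions $o_j^{(i)}$; the helper name and the list of unfitted estimators are illustrative assumptions, not part of the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict

def build_multilabel_training_set(base_learners, X, y, cv=10):
    """One binary label per ensemble member, set to 1 iff that member's
    cross-validated prediction for the instance equals the true class."""
    n, t = len(y), len(base_learners)
    Y = np.zeros((n, t), dtype=int)
    for j, learner in enumerate(base_learners):
        o_j = cross_val_predict(learner, X, y, cv=cv)   # out-of-fold predictions
        Y[:, j] = (o_j == y).astype(int)                # I(o_j == y)
    return X, Y   # features unchanged; Y is the multi-label target matrix
```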


Then, each of the models in the subset is queried and their decisions are combined via plurality voting, often called simple majority voting. Let us denote the subset of classifiers that give correct predictions as $W$. Assuming a two-class classification task, the final prediction will be correct if more than half of the models in $Z$ are correct, in other words if $|Z \cap W| / |Z| > 0.5$. The left part of this inequality is actually the definition of the precision of a multi-label prediction [9,24]. For problems with more than two class values, a precision lower than 0.5 could in some cases suffice for a correct final decision through plurality voting. This analysis suggests that learning algorithms that optimize precision should be used for the proposed multi-label training task. We could therefore consider algorithms that minimize the following loss functions:

$$\mathrm{loss}(Z, W) = \begin{cases} 0 & \text{if } |Z \cap W| / |Z| > 0.5 \\ 1 & \text{otherwise} \end{cases} \qquad (1)$$

$$\mathrm{loss}(Z, W) = 1 - \frac{|Z \cap W|}{|Z|} \qquad (2)$$

However, not many multi-label learning algorithms accept a specific loss function as a parameter. Furthermore, some multi-label learning algorithms learn models that can only output a score vector for each label [2,11], so they cannot be used without modifications in our case, where a bipartition of the labels into relevant (correct classifiers) and irrelevant (incorrect classifiers) is required. In addition, most state-of-the-art multi-label classification algorithms of the recent literature (i.e., those that do output a bipartition) [33,25,22,23] actually learn models that primarily output a score vector and employ a thresholding method in order to produce bipartitions [12]. Given that high precision is crucial for the success of our approach, all the above learning algorithms could be used in conjunction with a thresholding method that optimizes the losses presented above.
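The prediction step implied by this analysis can be sketched as follows. Here label_scores are the per-classifier scores output by the multi-label model for one instance and base_predictions the class predictions of the ensemble members for that instance; falling back to the single top-scored classifier when the thresholded selection is empty is one simple way to enforce $Z \neq \emptyset$, and is our own illustrative choice rather than the paper's prescription.

```python
import numpy as np

def example_precision_loss(Z, W):
    """Loss of Eq. (2): 1 - |Z intersect W| / |Z|, for a non-empty selection Z."""
    Z, W = set(Z), set(W)
    return 1.0 - len(Z & W) / len(Z)

def prune_and_vote(label_scores, base_predictions, threshold=0.8):
    """OneThreshold-style pruning for one instance followed by plurality voting
    among the selected ensemble members."""
    Z = np.where(label_scores > threshold)[0]
    if len(Z) == 0:                                   # keep Z non-empty
        Z = np.array([int(np.argmax(label_scores))])
    votes = base_predictions[Z]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)], Z              # predicted class, selected subset
```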

4. Experimental setup

4.1. Datasets

We experimented on 40 data sets from the UCI Machine Learning repository [1]. Table 2 presents the details of these data sets.

Table 2. Details of data sets: folder in the UCI server, number of instances, classes, continuous and discrete attributes, and percentage of missing values.

id    UCI folder        Instances   Classes   Continuous   Discrete   Missing (%)
d1    anneal            798         6         9            29         0.00
d2    arrhythmia        452         16        206          73         0.32
d3    audiology         226         24        0            69         20.33
d4    autos             205         7         15           10         11.51
d5    balance-scale     625         3         4            0          0.00
d6    breast-cancer     286         2         0            9          0.35
d7    breast-w          699         2         0            2          0.00
d8    car               1728        4         0            6          0.00
d9    cmc               1473        3         2            7          0.00
d10   colic             368         2         7            15         23.80
d11   credit-a          690         2         6            9          0.65
d12   credit-g          1000        2         7            13         0.00
d13   dermatology       366         6         1            33         0.00
d14   diabetes          768         2         8            0          0.00
d15   ecoli             336         8         7            0          0.00
d16   eucalyptus        736         5         14           5          3.20
d17   flags             194         8         2            27         0.00
d18   glass             214         7         9            0          0.00
d19   haberman          306         2         3            0          0.00
d20   heart-c           303         5         6            7          0.18
d21   heart-h           294         5         6            7          20.46
d22   heart-statlog     270         2         13           0          0.00
d23   heart-v           200         5         6            7          26.85
d24   hepatitis         155         2         6            13         5.67
d25   hill              607         2         100          0          0.00
d26   hypothyroid       3772        4         7            30         5.40
d27   ionosphere        351         2         34           0          0.00
d28   kr-vs-kp          3196        2         0            36         0.00
d29   liver-disorders   345         2         6            0          0.00
d30   lymph             148         4         3            15         0.00
d31   primary-tumor     339         2         0            17         0.00
d32   segment           2310        7         19           0          0.00
d33   sick              3772        2         7            23         5.40
d34   sonar             195         2         60           0          0.00
d35   soybean           683         19        0            35         0.00
d36   tic-tac-toe       958         2         0            9          0.00
d37   vehicle           846         4         18           0          0.00
d38   vote              435         2         0            16         5.63
d39   wine              178         3         13           0          0.00
d40   zoo               101         7         1            16         0.00

4.2. Ensemble construction

We constructed a heterogeneous ensemble of 200 models, by using different learning algorithms with different parameters on the training set. The WEKA machine learning library [29] was used as the source of learning algorithms. We trained 40 multilayer perceptrons (MLPs), 60 k Nearest Neighbors (kNNs), 80 support vector machines (SVMs) and 20 decision trees (DT) using the C4.5 algorithm. The different parameters used to train the algorithms were the following (default values were used for the rest of the parameters):

- MLPs: We used 5 values for the nodes in the hidden layer {1, 2, 4, 8, 16}, 4 values for the momentum term {0.0, 0.2, 0.5, 0.9} and 2 values for the learning rate {0.3, 0.6}.
- kNNs: We used 20 values for k distributed evenly between 1 and the plurality of the training instances. We also used 3 weighting methods: no-weighting, inverse-weighting and similarity-weighting.
- SVMs: We used 8 values for the complexity parameter {10^-5, 10^-4, 10^-3, 10^-2, 0.1, 1, 10, 100}, and 10 different kernels: 2 polynomial kernels (of degree 2 and 3) and 8 radial kernels (gamma in {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2}).
- Decision trees: We constructed 10 trees using post-pruning with 5 values for the confidence factor {0.1, 0.2, 0.3, 0.5} and 2 values for Laplace smoothing {true, false}, 8 trees using reduced error pruning with 4 values for the number of folds {2, 3, 4, 5} and 2 values for Laplace smoothing {true, false}, and 2 unpruned trees using 2 values for the minimum objects per leaf {2, 3}.
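A rough scikit-learn analogue of such a heterogeneous pool is sketched below; the paper used WEKA, and several WEKA options (C4.5's confidence factor, the kNN weighting schemes, the exact k grid) have no direct scikit-learn counterpart, so the grids here only loosely mirror Section 4.2 and are indicative rather than a reproduction of the experimental setup.

```python
from itertools import product
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_pool():
    pool = []
    # 5 hidden-layer sizes x 4 momentum values x 2 learning rates = 40 MLPs
    for h, m, lr in product([1, 2, 4, 8, 16], [0.0, 0.2, 0.5, 0.9], [0.3, 0.6]):
        pool.append(MLPClassifier(hidden_layer_sizes=(h,), momentum=m,
                                  learning_rate_init=lr, solver="sgd", max_iter=500))
    # kNNs: illustrative k grid; scikit-learn offers only uniform/distance weights
    for k, w in product(range(1, 60, 3), ["uniform", "distance"]):
        pool.append(KNeighborsClassifier(n_neighbors=k, weights=w))
    # 8 complexity values x (2 polynomial + 8 radial kernels) = 80 SVMs
    kernels = [("poly", {"degree": 2}), ("poly", {"degree": 3})] + [
        ("rbf", {"gamma": g}) for g in [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2]]
    for C in [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
        for kernel, kw in kernels:
            pool.append(SVC(C=C, kernel=kernel, **kw))
    # decision trees: cost-complexity pruning as a rough stand-in for C4.5 pruning
    for alpha in [0.0, 0.001, 0.005, 0.01]:
        pool.append(DecisionTreeClassifier(ccp_alpha=alpha))
    return pool
```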

4.3. Dynamic ensemble pruning methods

We compare the proposed approach against most of the methods that were presented in Section 2 and that deal with the issue of accuracy improvement: OLA, LCA, MCB, DV, DS, DVS and KNORA. These methods start by finding the k nearest neighbors of a test instance in the training set. We set the number of neighbors to 10 and used the Euclidean distance to estimate the nearest neighbors. The ELIMINATE version of KNORA was used, as it was found to produce the overall best results in Ko et al. [13]. We also calculate the performance of the complete ensemble of 200 models using simple majority voting for model combination (MV) as a baseline method.

We did not compare the proposed approach against [10,4], which are methods with the different goal of speeding up the classification process without significant loss in accuracy. As our approach is based solely on accuracy to select the appropriate subset of the ensemble, we further refrained from comparing against approaches that consider diversity as well [17,31,32]. This would be an interesting future work direction.

Table 3. Accuracy percentage and standard deviation of our approach instantiated with CLR under the following thresholding strategies: (a) its default thresholding, (b) OneThreshold with fixed thresholds from 0.50 to 0.90 in steps of 0.05, and (c) OneThreshold with the threshold computed automatically by 5-fold cross-validation so as to optimize precision. The row beginning "Optimized Threshold" below lists, for datasets d1-d40 in order, the threshold selected by strategy (c); the row of dataset ids follows; the eleven rows of values after that give the accuracies for, in order, the default strategy, the fixed thresholds 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85 and 0.90, and the automatically optimized threshold, again for d1-d40 in order.

Optimized Threshold 0.94 7 00.04 0.86 7 00.01 0.89 7 00.03 0.64 7 00.10 0.96 7 00.01 0.707 00.07 0.78 7 00.08 0.90 7 00.04 0.82 7 00.07 0.72 700.08 0.747 00.10 0.767 00.04 0.92 7 00.03 0.81 7 00.04 0.477 00.12 0.85 7 00.03 0.84 7 00.04 0.79 7 00.03 0.50 7 00.17 0.66 7 00.07 0.477 00.09 0.60 7 00.07 0.79 7 00.12 0.53 7 00.14 0.94 7 00.01 0.93 7 00.01 0.79 7 00.06 0.85 7 00.04 0.83 7 00.06 0.707 00.08 0.89 7 00.02 0.89 7 00.02 0.917 00.01 0.64 7 00.13 0.917 00.01 0.96 7 00.05 0.90 7 00.02 0.79 7 00.03 0.89 7 00.07 0.93 7 00.03

d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40

98.78 7 00.92 69.717 07.41 76.64 7 15.07 44.64 7 19.23 87.05 7 03.32 74.85 7 06.78 96.14 704.29 83.10 706.31 27.727 19.25 85.34 7 05.00 81.45 7 17.69 74.70 7 04.34 96.447 03.04 75.65 7 07.64 65.02 7 33.45 53.85 7 11.89 63.84 7 09.30 74.35 7 10.73 72.90 7 08.47 84.16 706.78 79.75 7 17.67 83.34 7 03.80 34.007 08.89 79.717 09.90 54.62 7 02.86 99.39 7 00.29 88.917 09.55 94.43 7 05.01 62.69 7 19.98 81.05 7 05.91 40.98 7 07.39 97.53 7 01.42 98.59 7 00.37 43.88 7 16.78 63.15 714.79 64.52 733.88 80.14 703.68 96.777 02.59 95.007 05.24 95.007 06.71

98.89 7 00.99 60.43 7 09.77 71.82 712.47 47.60 7 24.30 88.97 703.10 74.85 706.96 96.577 03.68 83.90 7 08.50 26.56 7 21.88 84.53 7 05.78 80.87 7 17.80 75.00 704.56 97.00 701.89 77.217 06.33 65.04 7 33.62 56.317 14.19 62.82 7 08.62 68.79 7 11.11 72.89 708.99 84.477 05.99 80.447 16.59 83.34 7 04.46 36.00 708.89 79.00 712.40 55.28 7 03.00 93.96 7 01.35 89.78 7 08.40 93.83 7 06.03 58.87 7 22.41 80.38 7 08.18 44.817 06.40 96.23 7 01.79 97.147 00.87 46.74 715.94 84.30 7 10.77 67.03 7 31.42 79.197 04.06 96.777 02.38 95.56 7 04.84 96.00 704.90

99.22 7 00.71 62.65 7 09.52 74.01 712.39 47.64 7 24.18 88.81 7 03.23 74.50 7 07.22 96.28 7 03.74 84.77 708.76 27.38 722.01 83.99 7 06.76 81.167 17.77 74.707 04.69 97.28 702.10 77.20 7 06.32 65.94 7 31.98 57.13 714.84 62.84 7 06.29 69.747 10.46 72.89 7 08.99 83.477 06.00 80.447 16.59 83.34 7 04.46 36.00 708.31 79.67 711.89 55.127 02.62 94.09 7 01.27 89.49 7 08.19 94.277 05.82 60.34 7 22.28 81.05 7 08.39 45.99 7 06.44 96.717 01.87 97.38 700.88 47.217 14.70 86.50 7 09.77 69.65 7 28.17 79.917 04.13 96.77 702.38 95.56 7 04.84 95.00 705.00

99.22 7 00.71 65.75 7 09.16 74.45 712.93 48.60 7 24.25 88.97 703.26 74.14 706.91 96.28 7 03.74 84.89 7 09.14 28.197 21.89 83.43 7 06.55 81.02 7 17.76 75.00 7 05.42 97.55 7 01.90 76.94 7 05.71 67.767 30.29 56.58 7 14.37 63.377 07.27 71.157 11.40 72.56 708.78 83.477 06.00 80.78 7 16.60 83.34 7 03.80 35.00 707.42 80.337 11.70 56.44 702.42 94.27 701.30 90.34 7 07.29 94.68 7 05.58 61.517 21.25 81.727 07.43 46.87 7 06.18 96.97 701.68 97.59 7 00.80 47.69 7 14.94 87.38 7 09.04 72.157 26.94 80.73 703.29 96.777 02.38 95.56 7 04.84 95.00 7 05.00

99.337 00.74 67.06 709.00 77.10 712.38 48.127 23.75 89.29 7 03.53 73.78 7 07.13 96.28 7 03.74 84.66 709.31 29.28 7 21.74 83.69 7 06.42 81.02 718.14 75.30 7 05.60 97.56 7 01.89 77.34 7 06.10 68.377 29.00 56.727 14.02 62.84 7 07.45 74.87 7 11.36 73.23 7 08.93 83.48 705.97 81.13 7 17.30 83.337 03.42 35.007 09.22 80.337 11.70 56.94 7 02.53 94.75 7 01.39 90.63 7 07.27 95.127 05.18 60.63 7 19.61 82.38 7 06.93 47.48 7 04.86 97.36 7 01.67 97.83 7 00.72 47.747 15.97 88.26 7 08.60 74.137 26.89 80.85 7 03.61 96.777 02.38 96.11 704.34 95.007 05.00

99.337 00.74 68.407 09.36 77.96 711.64 48.607 23.87 89.93 703.81 72.72 707.41 96.28 703.68 85.357 08.85 29.82 720.99 83.69 705.71 81.02 7 18.47 75.20 705.42 97.56 7 01.89 77.20 705.59 69.88 727.39 57.26 7 14.47 63.87 707.93 73.46 7 11.08 72.247 08.64 83.47 706.54 80.78 718.04 82.59 704.70 36.007 08.89 81.04 7 08.90 58.017 04.03 95.94 701.09 91.48 7 07.37 95.65 704.51 60.92 719.31 83.05 706.99 46.86 705.86 97.40 7 01.63 98.077 00.70 49.717 16.44 89.58 708.41 75.497 26.73 81.217 03.34 96.777 02.38 96.117 04.34 95.007 05.00

99.56 7 00.74 69.717 07.38 77.977 13.45 48.62 7 22.87 91.06 704.20 72.39 7 06.98 96.147 03.62 86.517 08.46 30.98 7 20.45 84.23 7 06.19 81.167 18.48 75.107 05.36 97.56 701.89 76.82 7 05.71 68.07 7 28.47 57.52 7 14.05 66.45 7 07.90 72.55 7 11.63 72.25 7 08.38 83.157 06.44 80.447 17.90 82.59 7 04.70 34.50 7 09.60 82.377 09.62 59.49 7 04.13 97.99 700.67 91.76 706.38 95.90 7 04.38 63.55 7 18.48 81.05 7 08.39 46.28 7 05.66 97.62 701.39 98.30 7 00.78 50.62 7 15.35 89.29 7 08.60 76.017 26.39 81.577 03.64 96.77 702.38 96.117 04.34 94.007 06.63

99.337 00.74 71.04 7 07.15 78.42 712.66 48.177 22.21 93.78 704.57 71.707 07.61 96.28 703.68 87.50 7 08.49 30.917 20.32 83.68 7 07.01 81.167 18.45 75.90 705.59 97.56 7 01.89 77.477 05.72 67.18 728.54 56.977 14.55 67.477 07.82 71.62 7 11.78 71.61 708.19 83.47 706.71 81.137 16.75 82.22 704.63 35.50 7 09.07 82.38 708.11 60.737 03.73 99.47 700.17 92.337 06.45 96.09 703.80 63.247 17.06 83.05 709.19 46.577 06.06 97.53 7 01.24 98.59 700.60 50.60 716.26 89.58 709.19 77.79 725.62 82.277 03.06 96.777 02.38 95.56 704.84 94.007 08.00

99.117 00.67 71.917 04.59 75.34 7 15.73 48.67 7 20.75 96.33 703.11 72.04 706.89 96.28 7 03.68 88.08 7 08.10 31.86 7 19.11 84.767 05.80 81.017 18.35 74.90 7 05.91 97.55 701.90 76.30 7 05.55 68.07 7 28.47 56.83 7 13.44 66.42 7 10.11 70.677 12.33 71.28 7 08.05 82.80 7 06.00 82.167 17.27 82.59 7 04.70 35.007 09.75 81.00 7 10.67 61.977 04.39 99.42 7 00.31 93.177 04.95 95.96 7 04.05 65.82 7 15.07 81.717 09.53 47.477 04.97 97.40 701.33 98.737 00.71 53.48 7 14.34 89.737 08.98 82.92 7 25.85 83.577 03.45 96.54 7 02.39 95.56 7 04.84 95.007 06.71

99.22 700.71 71.03 7 06.01 74.45 7 15.97 48.127 18.69 97.13 703.16 72.75 7 07.37 96.577 03.51 88.60 706.99 32.75 719.14 83.677 06.91 81.30 7 18.62 75.30 705.22 98.107 01.73 75.917 05.44 69.28 727.33 58.477 12.77 65.90 710.13 70.177 12.58 71.93 7 09.29 83.46 704.55 81.82 7 15.85 80.007 04.45 32.50 708.14 83.007 09.61 63.377 04.14 99.42 700.31 94.59 704.31 95.90 704.14 67.87 712.02 81.717 08.01 47.167 06.63 97.717 01.38 98.75 700.55 55.88 712.63 89.87 709.65 87.51 724.39 83.92 704.30 96.54 702.39 95.56 704.84 95.007 06.71

99.11 700.97 72.137 05.33 75.32 7 16.38 49.107 23.09 97.44 702.77 73.077 07.15 96.28 7 03.68 90.16 706.57 30.30 7 20.93 82.87 7 06.21 80.737 18.50 75.30 7 05.60 97.83 7 02.03 77.08 7 05.92 65.04 733.78 58.337 13.04 66.45 710.05 71.62 7 12.35 72.55 7 09.26 83.47 706.54 79.40 7 17.09 83.337 04.76 32.50 7 07.50 79.007 13.10 63.04 704.13 99.42 7 00.31 93.177 04.95 95.90 7 04.14 67.26 7 11.66 83.05 7 09.66 46.59 7 05.22 97.49 7 01.52 98.75 7 00.55 48.747 16.14 89.58 7 09.54 90.23 7 24.85 84.16 704.26 96.777 02.38 96.11 705.00 95.007 06.71

Average rank across the 40 datasets: default 7.7; 0.50: 8.1; 0.55: 7.7; 0.60: 6.9; 0.65: 5.7; 0.70: 5.4; 0.75: 5.5; 0.80: 4.7; 0.85: 5.3; 0.90: 4.4; optimized threshold: 4.8.


Table 4. Accuracy percentage and standard deviation of our approach instantiated with ML-kNN under the following thresholding strategies: (a) its default thresholding, (b) OneThreshold with fixed thresholds from 0.50 to 0.90 in steps of 0.05, and (c) OneThreshold with the threshold computed automatically by 5-fold cross-validation so as to optimize precision. The threshold selected by strategy (c) for datasets d1-d40, in order, was: 0.94±0.01, 0.62±0.03, 0.64±0.04, 0.37±0.06, 0.95±0.00, 0.60±0.07, 0.71±0.06, 0.72±0.13, 0.29±0.05, 0.74±0.07, 0.63±0.06, 0.68±0.02, 0.92±0.02, 0.67±0.03, 0.30±0.08, 0.52±0.03, 0.61±0.04, 0.64±0.03, 0.44±0.11, 0.73±0.06, 0.36±0.06, 0.72±0.08, 0.33±0.06, 0.55±0.12, 0.53±0.04, 0.94±0.00, 0.73±0.05, 0.72±0.11, 0.64±0.05, 0.74±0.05, 0.35±0.02, 0.86±0.01, 0.86±0.03, 0.41±0.11, 0.53±0.08, 0.63±0.14, 0.73±0.02, 0.90±0.02, 0.80±0.15, 0.89±0.04. The eleven rows of values that follow the list of dataset ids give the accuracies for, in order, the default strategy, the fixed thresholds 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85 and 0.90, and the automatically optimized threshold, each listing d1-d40 in order.

d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40

97.22 7 01.33 67.05 7 07.57 74.497 14.82 44.19 716.63 88.187 03.73 73.79 704.85 96.577 04.00 87.38 7 07.22 32.28 7 15.33 83.42 7 05.63 82.32 7 17.23 74.60 705.06 96.73 701.64 76.95 7 05.58 66.24 7 32.56 53.147 09.48 64.40 7 08.70 70.63 7 11.59 72.90 709.20 83.83 7 05.66 80.79 7 14.62 82.59 7 03.33 34.00 7 05.39 78.337 13.20 54.30 7 02.47 93.617 01.43 88.077 08.82 83.337 16.89 60.65 7 16.57 82.43 7 08.55 44.54 705.10 96.677 01.81 96.217 01.08 47.677 17.08 81.707 10.26 82.39 7 24.50 78.84 7 04.22 95.62 7 03.20 94.44 704.30 96.00 7 04.90

97.22 7 01.33 67.05 707.57 74.49 7 14.82 44.197 16.63 88.187 03.73 73.79 7 04.85 96.577 04.00 87.38 7 07.22 32.28 715.33 83.42 705.63 82.32 717.23 74.60 7 05.06 96.737 01.64 76.95 705.58 66.24 732.56 53.14 709.48 64.40 708.70 70.63 7 11.59 72.90 7 09.20 83.83 705.66 80.79 714.62 82.59 703.33 34.007 05.39 78.337 13.20 54.30 702.47 93.617 01.43 88.077 08.82 83.337 16.89 60.65 716.57 82.43 708.55 44.54 7 05.10 96.677 01.81 96.21 701.08 47.67 717.08 81.70 710.26 82.39 724.50 78.84 704.22 95.62 703.20 94.447 04.30 96.007 04.90

97.89 7 01.16 67.50 7 08.80 74.01 712.28 43.747 15.55 88.187 03.38 74.14 706.53 96.43 7 04.05 88.07 707.59 32.767 14.83 83.69 7 05.83 82.03 7 17.68 75.50 7 05.22 97.007 01.45 77.20 7 05.94 65.04 7 33.31 50.277 10.40 63.32 7 09.21 67.81 7 11.43 72.58 7 08.35 83.177 06.03 81.157 14.95 82.59 7 03.33 35.50 707.23 79.00 713.10 57.93 705.76 93.69 7 01.33 88.64 7 08.55 84.39 7 16.17 61.82 716.05 82.43 7 08.55 45.417 05.63 96.80 7 01.66 96.42 7 01.02 47.217 17.65 80.67 710.73 82.39 7 24.63 79.55 7 04.74 96.08 7 02.96 95.00 704.61 96.00 704.90

98.007 01.09 70.80 7 07.11 73.127 11.98 44.747 13.15 88.187 03.38 74.84 7 07.17 96.43 7 04.05 87.79 707.14 33.23 7 14.87 83.69 7 05.95 82.03 7 17.68 75.60 7 04.69 97.007 01.45 76.82 7 06.13 65.04 733.31 54.11 710.18 63.84 7 08.72 69.22 7 11.34 72.58 7 07.42 83.177 06.03 80.46 7 14.70 82.59 7 03.33 34.50 7 06.87 81.717 08.01 61.157 05.81 93.727 01.35 89.50 7 08.35 85.52 7 14.74 63.59 7 15.26 83.09 7 07.49 44.82 7 05.73 97.06 7 01.57 96.63 7 01.23 47.247 17.84 80.52 7 10.77 82.80 7 24.63 80.37 704.57 96.317 03.15 95.007 04.61 96.007 04.90

98.33 700.74 70.13 709.17 73.127 10.51 42.747 11.25 88.18 703.38 73.81 7 07.67 96.577 03.68 88.65 7 06.62 33.577 14.96 83.70 705.92 82.32 7 17.81 75.70 703.77 97.007 01.45 76.68 7 06.21 65.34 7 33.27 53.43 7 11.59 62.84 7 08.20 66.417 09.61 71.93 7 08.81 83.517 05.74 81.157 14.38 82.59 7 03.33 34.50 7 06.87 82.377 07.55 62.54 7 05.00 93.93 7 01.45 89.78 7 07.65 86.36 7 14.10 65.03 7 12.54 83.09 7 07.49 44.827 05.73 96.80 7 01.59 96.717 01.14 47.22 7 15.97 81.12 710.46 83.747 24.18 80.377 04.60 95.84 7 03.57 95.007 04.61 96.007 04.90

98.44 700.89 69.69 7 07.64 73.107 09.42 42.717 12.78 88.337 03.59 72.747 07.34 96.577 03.68 88.777 06.54 32.90 7 14.44 83.42 7 06.24 82.187 17.77 74.60 704.32 97.00 701.45 77.99 7 06.22 65.34 7 33.08 53.43 7 11.08 63.32 7 08.90 68.79 7 11.89 72.25 707.73 83.17 705.84 81.15 713.79 82.22 7 03.63 34.50 7 06.87 81.04 709.85 63.86 7 06.26 94.017 01.41 89.78 7 07.65 86.55 7 14.12 63.23 7 09.49 83.09 7 07.49 43.92 7 06.86 96.88 7 01.52 96.93 7 01.10 50.577 15.27 81.55 7 10.10 84.167 24.05 81.20 7 05.60 96.08 7 03.76 96.117 04.34 96.007 04.90

98.67 700.97 69.917 07.52 74.82 7 10.02 44.247 11.06 89.147 03.71 73.107 07.46 96.43 7 03.63 89.007 06.75 32.56 7 14.46 84.26 7 05.98 82.477 17.64 75.30 7 03.69 97.277 01.71 77.217 05.18 65.66 7 31.89 53.29 7 11.52 64.34 7 07.16 68.36 7 12.31 71.95 7 08.42 82.83 7 05.76 80.137 14.84 82.60 7 03.72 35.007 06.71 78.377 12.91 64.43 7 05.70 94.147 01.36 90.07 707.44 86.80 7 14.34 63.23 7 07.73 82.38 7 07.55 43.33 706.29 97.06 7 01.39 97.32 701.16 50.107 14.98 81.707 09.73 85.20 7 24.24 81.09 7 04.90 95.84 7 03.57 96.117 04.34 96.007 04.90

99.007 01.05 70.36 7 07.22 74.41 709.52 41.767 12.35 91.86 7 04.39 70.65 7 07.53 96.147 04.38 88.54 7 07.12 32.69 7 14.42 83.98 7 05.70 82.61 717.54 73.70 705.27 97.27 701.71 76.30 7 05.38 65.65 7 32.85 53.56 7 11.46 62.29 7 08.74 66.977 11.65 71.28 7 09.12 82.49 7 05.82 80.46 7 15.56 82.22 7 03.99 35.00 706.71 79.71 710.76 64.76 706.00 94.417 01.42 89.50 7 07.85 88.187 11.81 63.247 09.31 83.71 706.99 43.03 7 06.69 97.23 7 01.46 97.69 7 00.85 48.19 715.47 81.56 7 09.94 85.417 24.30 82.63 7 04.18 95.84 7 03.57 96.117 04.34 96.007 04.90

98.89 701.31 70.14 707.40 74.417 09.12 42.247 12.66 93.93 704.24 69.98 706.77 95.86 704.26 88.42 706.70 32.76 714.36 84.247 06.40 82.47 717.51 73.30 7 05.00 97.55 7 02.25 77.60 705.29 66.257 32.47 53.707 11.46 62.29 708.74 66.517 11.50 69.977 10.15 81.48 7 06.43 79.777 14.33 82.977 03.78 35.007 06.71 79.047 11.33 65.09 705.72 94.80 701.61 90.34 708.23 89.21 710.65 62.36 709.23 81.727 07.43 43.337 06.53 97.58 7 01.27 97.777 00.85 48.64 7 15.73 81.85 7 10.06 85.417 24.35 82.14 7 04.30 96.31 702.98 96.117 04.34 96.007 04.90

99.337 00.74 70.147 07.40 73.95 709.08 43.197 14.19 96.017 02.77 69.27 707.24 95.717 03.61 88.88 7 06.58 32.767 14.36 83.43 7 06.72 82.187 17.27 72.50 705.18 98.107 02.12 77.73 705.19 65.95 7 32.17 53.707 11.46 62.29 7 08.74 66.517 11.50 69.96 7 10.48 80.17 707.34 80.12 714.19 80.00 705.55 35.007 06.71 80.38 7 10.55 65.107 05.28 95.79 7 01.26 91.47 707.31 89.96 7 10.64 62.36 7 09.23 82.43 7 09.06 43.337 06.53 97.717 01.13 97.96 7 00.81 51.07 713.62 80.53 7 09.15 86.137 24.53 82.26 7 03.58 96.317 02.58 96.117 04.34 95.00 706.71

99.007 01.05 68.81 7 08.07 76.29 7 13.69 44.247 17.53 97.45 7 01.45 73.09 7 07.36 96.147 04.38 88.53 7 07.39 26.29 7 22.11 83.707 05.92 82.32 7 17.71 75.707 04.15 97.82 7 02.03 77.337 06.25 65.92 7 33.66 51.377 11.43 63.32 7 09.21 66.43 7 11.56 73.24 708.80 82.84 7 05.72 80.78 7 15.94 82.22 7 03.63 33.00 707.81 80.37 711.36 57.85 7 04.40 97.54 7 00.87 90.077 07.44 86.71 714.21 62.39 7 14.54 82.38 7 07.55 45.99 7 05.73 97.717 01.13 97.96 7 00.97 43.38 7 16.82 81.417 09.85 82.28 7 24.85 81.68 704.86 96.08 7 03.12 95.55 7 04.16 96.007 04.90

Average rank across the 40 datasets: default 7.5; 0.50: 7.5; 0.55: 7.0; 0.60: 5.9; 0.65: 5.9; 0.70: 5.6; 0.75: 5.0; 0.80: 5.7; 0.85: 5.2; 0.90: 5.4; optimized threshold: 5.5.


Finally, as we already mentioned in Section 2, the approach by Ortega et al. [19] is a generalized version of OLA: in OLA, kNN is used, while in Ortega et al. [19] any learning algorithm could be used. In this sense, we are comparing to that work too.

We instantiate the proposed approach with two multi-label learning algorithms. The first one is ML-kNN [34], which was selected because it follows an instance-based approach, similar to the rest of the competing methods, apart from MV. The number of neighbors is also set to 10 here, which happens to be the default setting too. The second one is Calibrated Label Ranking (CLR) [6], a multi-label learner that is known to achieve high precision. The C4.5 learning algorithm with default settings is used for the binary classification tasks considered by CLR.

Although the above methods do output bipartitions, they primarily output a score vector containing one score for each label. ML-kNN's scores are probability estimates and it uses a 0.5 threshold to output the bipartition.

Fig. 2. Scatterplots and correlation coefficients for the optimal threshold of our approach instantiated with CLR. The x-axis corresponds to the optimal threshold and the y-axis to the four different dataset properties: accuracy (r=0.29), number of classes (r=0.18), size in samples (r=0.37) and missing values (r=-0.21).

Fig. 3. Scatterplots and correlation coefficients for the optimal threshold of our approach instantiated with ML-kNN. The x-axis corresponds to the optimal threshold and the y-axis to the four different dataset properties: accuracy (r=0.85), number of classes (r=-0.073), size in samples (r=0.34) and missing values (r=-0.24).


CLR uses an artificial calibration label to break the ranking of the labels and produce the bipartition. Apart from using these default methods to obtain the bipartition, we also pair the algorithms with a simple thresholding strategy, called OneThreshold, which considers a label as positive if its score is higher than a single threshold t used for all labels [12].

4.4. Evaluation methodology

We use 10-fold cross-validation to evaluate the different approaches. For each of the 10 train/test splits, we perform an inner 10-fold cross-validation on the training set in order to gather meta-data concerning the performance of the algorithms, which are required by all dynamic ensemble pruning approaches. Threshold tuning is again based only on the training data, so at no point are test data seen by any of the approaches.
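The protocol just described can be outlined in code as follows; this is a sketch under the assumption of a hypothetical `approach` object exposing fit(X, y, inner_cv=...) and predict(X), not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate(approach, X, y, n_outer=10, n_inner=10):
    """Outer 10-fold CV for accuracy estimation; an inner CV on each training
    split gathers the meta-data (base-model correctness) and tunes the
    threshold, so the test fold is never seen during training or tuning."""
    accs = []
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=0)
    for train_idx, test_idx in outer.split(X, y):
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=1)
        approach.fit(X[train_idx], y[train_idx], inner_cv=inner)
        accs.append(np.mean(approach.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```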

5. Experiments

Section 5.1 investigates how thresholding affects the precision of the multi-label learners and consequently the accuracy of the proposed approach. Then, Section 5.2 investigates the relationship of the optimal threshold for the proposed approach with four dataset properties: difficulty (via the accuracy of our approach), size in samples, missing values (percentage) and number of classes. Section 5.3 presents the results of comparing the proposed approach with other state-of-the-art approaches. Section 5.4 comments on the number and type of selected models for the competitive dynamic selection approaches. Finally, Section 5.5 examines the relationship of the relative performance of the proposed method with dataset size.

Table 5. Accuracy percentage and standard deviation of the competing approaches for all datasets, together with their average rank. The ten rows of values that follow the list of dataset ids give the results for, in order, DS, DV, DVS, KNORA, LCA, MCB, CLR, ML-kNN, MV and OLA; each row lists datasets d1-d40 in order.

d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40

98.89 701.11 68.60 7 08.36 73.62 7 11.63 39.88 715.03 96.48 7 02.64 70.657 08.76 96.007 02.98 88.83 707.12 34.88 712.31 81.80 7 06.84 81.167 15.55 72.50 7 04.90 95.63 702.50 75.14 705.76 66.86 7 30.96 49.047 12.02 62.34 710.14 66.38 7 15.08 72.20 7 05.94 79.82 707.31 80.47 712.88 78.89 703.33 33.507 08.96 80.42 709.62 64.767 05.25 96.32 701.43 90.89 705.95 94.21 705.83 66.45 7 09.72 85.717 05.87 43.34 706.05 94.59 702.07 97.62 7 00.91 50.60 715.66 59.63 714.48 88.25 724.33 80.977 04.24 96.077 02.57 93.337 11.60 97.007 06.40

96.44 701.55 58.197 10.16 69.17 714.12 27.647 10.80 86.42 7 03.73 74.84 707.65 95.86 7 05.25 78.17 709.40 18.717 13.62 83.147 05.82 81.31 717.61 75.40 7 05.30 97.56 7 03.30 73.69 7 04.69 58.08 7 41.15 39.69 7 17.75 29.34 7 11.05 55.80 7 17.41 73.55 7 09.47 84.497 05.76 76.78 7 11.68 83.34 7 03.80 15.00 705.92 81.717 08.54 53.317 03.41 93.22 7 01.30 85.217 11.13 89.36 7 12.22 57.69 7 23.86 84.43 7 06.10 32.45 7 07.09 92.82 7 02.58 95.92 7 01.09 56.26 7 11.21 34.02 7 23.69 70.457 34.49 72.22 7 05.89 95.85 7 03.86 92.78 7 06.11 93.007 07.81

96.107 02.18 57.52 7 10.33 49.597 13.99 23.45 7 13.55 86.90 7 05.31 71.727 07.42 96.43 7 02.49 83.44 708.59 29.767 19.20 82.87 7 06.97 81.60 7 17.66 73.20 705.27 86.90 7 04.01 74.357 05.86 59.65 7 36.67 40.497 13.78 40.53 7 13.33 58.05 7 11.20 68.297 10.27 82.187 04.20 79.477 14.10 83.337 04.76 25.007 09.49 81.04 708.90 52.89 7 03.51 94.677 01.24 86.63 7 08.63 92.87 7 05.58 64.61 713.57 83.767 06.86 29.80 7 07.30 83.38 7 02.94 96.02 7 01.12 54.86 7 16.30 37.09 7 18.31 87.727 24.44 64.207 07.54 96.08 7 02.55 87.22 7 12.68 82.007 13.27

99.22 7 00.71 64.587 08.01 75.32 7 08.63 46.147 17.73 97.60 7 02.04 71.667 08.21 95.28 7 03.73 89.00 706.32 32.167 11.36 85.58 7 05.25 80.72 714.33 73.30 702.93 96.45 7 03.00 75.00 706.46 64.11 735.60 48.477 14.01 60.767 08.03 70.177 12.11 69.94 7 07.22 79.84 7 07.69 76.73 714.07 80.00 705.79 38.50 7 10.01 77.79 7 09.83 63.86 7 04.32 97.487 00.92 91.79 707.45 94.27 704.86 62.93 7 09.58 85.107 05.14 42.447 06.16 97.53 7 01.29 98.147 00.65 47.677 16.12 50.707 14.07 86.357 24.56 80.377 03.89 95.84 7 02.91 97.22 7 04.48 95.00 709.22

97.78 701.22 60.87 7 09.72 54.497 10.03 44.337 21.32 91.05 7 01.88 74.517 07.03 96.43 7 04.00 83.32 7 08.82 31.87 7 19.33 83.417 05.22 82.767 16.21 74.60 7 04.18 92.60 7 04.78 77.34 7 06.28 61.76 735.11 47.727 11.67 57.68 712.33 62.25 7 11.07 72.25 7 08.00 83.197 07.75 80.50 7 11.90 81.487 04.38 29.50 7 08.79 80.38 7 09.67 60.157 03.85 96.107 01.62 88.36 7 08.50 93.03 7 06.51 63.52 7 10.63 83.767 05.42 40.127 05.43 97.197 01.21 97.83 700.72 48.247 19.19 46.90 7 17.08 77.88 7 25.22 76.12 704.43 95.38 7 03.45 96.117 05.00 76.00 712.00

98.89 7 01.31 55.99 7 10.25 63.85 7 13.75 39.38 7 17.17 94.09 7 04.09 72.37 707.44 96.147 03.67 90.047 06.83 31.87 7 16.65 83.99 7 05.44 82.03 7 17.99 74.707 04.24 96.46 7 03.01 77.08 7 05.41 61.42 7 36.02 49.06 7 12.73 57.66 709.60 67.34 7 12.20 71.90 7 07.52 83.45 7 05.86 81.84 7 14.37 82.96 7 03.78 31.50 7 09.76 83.047 07.60 59.99 7 05.66 94.67701.14 88.35708.60 95.03 7 04.99 66.147 10.48 84.38 7 07.53 38.047 09.56 97.45 701.05 97.777 00.57 51.53 713.96 71.93 7 10.92 85.94 7 23.86 81.32 7 05.71 96.317 02.38 97.22 704.48 93.007 07.81

99.117 00.97 72.137 05.33 75.32 7 16.38 49.10 723.09 97.447 02.77 73.07707.15 96.28 7 03.68 90.167 06.57 30.30 7 20.93 82.87 7 06.21 80.73 718.50 75.30 7 05.60 97.83 7 02.03 77.08 7 05.92 65.047 33.78 58.337 13.04 66.457 10.05 71.62 7 12.35 72.55 709.26 83.477 06.54 79.40 7 17.09 83.337 04.76 32.50 7 07.50 79.007 13.10 63.047 04.13 99.42 7 00.31 93.17 704.95 95.90 7 04.14 67.26 7 11.66 83.05 7 09.66 46.59 7 05.22 97.497 01.52 98.75 7 00.55 48.747 16.14 89.58 7 09.54 90.23 7 24.85 84.167 04.26 96.777 02.38 96.117 05.00 95.007 06.71

99.007 01.05 68.817 08.07 76.29 7 13.69 44.247 17.53 97.45 701.45 73.09 7 07.36 96.147 04.38 88.53 7 07.39 26.29 7 22.11 83.70705.92 82.32 7 17.71 75.70704.15 97.82 702.03 77.33706.25 65.92 7 33.66 51.377 11.43 63.32 7 09.21 66.437 11.56 73.247 08.80 82.84 7 05.72 80.78 7 15.94 82.22 7 03.63 33.007 07.81 80.377 11.36 57.85 7 04.40 97.54 700.87 90.07707.44 86.717 14.21 62.39 7 14.54 82.38 7 07.55 45.99 7 05.73 97.717 01.13 97.96 700.97 43.38 7 16.82 81.417 09.85 82.28 7 24.85 81.687 04.86 96.08 7 03.12 95.55 7 04.16 96.007 04.90

91.107 02.42 54.217 10.76 25.30 711.39 44.26 7 24.75 86.25 703.73 71.707 06.03 95.147 05.72 70.017 12.90 13.50 7 10.13 82.87 705.48 82.177 15.70 70.007 04.65 93.157 05.39 72.007 04.48 56.007 41.88 35.87 7 15.18 47.34 7 11.92 53.017 18.44 73.55 7 09.47 83.517 05.93 80.09 718.09 82.96 702.96 30.007 07.07 79.677 12.96 53.067 03.56 92.29 701.42 78.36 715.20 68.907 20.13 57.39 7 23.61 81.717 06.12 28.58 707.71 91.737 02.98 93.88 700.96 35.14 718.97 25.917 19.15 63.96 744.02 69.62 705.37 95.39 703.44 87.197 10.53 79.007 13.00

99.11 701.09 64.177 08.06 64.31 714.25 39.38 7 16.85 92.337 04.97 72.377 07.44 96.147 03.38 89.46 7 07.04 31.26 717.30 83.147 07.01 81.88 717.95 74.20 7 04.66 95.92 7 02.49 76.82 7 05.67 64.15 733.17 50.277 11.82 59.187 09.27 68.29 710.10 71.59 7 07.58 83.45 7 05.86 80.83 7 12.34 83.707 03.78 31.50 709.76 83.04 707.60 62.54 7 04.04 96.40 7 01.46 89.50 7 07.75 95.59 7 04.49 66.717 10.20 84.38 7 07.53 41.88 7 05.94 97.36 7 01.17 98.127 00.70 54.45 7 13.70 76.55 7 16.27 86.157 23.86 81.56 705.54 96.317 02.38 97.22 7 04.48 93.007 07.81

Average rank across the 40 datasets: DS 5.1; DV 6.9; DVS 7.4; KNORA 4.9; LCA 5.9; MCB 4.8; CLR 3.1; ML-kNN 4.1; MV 8.6; OLA 4.2.

Table 6. Wins, ties and losses (w:t:l) of the row method against the column method, for all pairs of methods.

          CLR       DS        DV        DVS       KNORA     LCA       MCB       MLkNN     MV        OLA
CLR       -         30:0:10   30:0:10   32:1:7    29:2:9    30:1:9    31:1:8    26:0:14   35:0:5    30:0:10
DS        10:0:30   -         30:0:10   30:0:10   21:0:19   26:0:14   18:0:22   13:0:27   33:0:7    15:0:25
DV        10:0:30   10:0:30   -         20:0:20   12:0:28   12:0:28   10:1:29   8:0:32    34:1:5    8:1:31
DVS       7:1:32    10:0:30   20:0:20   -         11:0:29   8:0:32    5:0:35    10:0:30   29:0:11   3:0:37
KNORA     9:2:29    19:0:21   28:0:12   29:0:11   -         28:0:12   20:1:19   17:0:23   33:0:7    20:1:19
LCA       9:1:30    14:0:26   28:0:12   32:0:8    12:0:28   -         14:0:26   14:0:26   33:0:7    9:0:31
MCB       8:1:31    22:0:18   29:1:10   35:0:5    19:1:20   26:0:14   -         16:0:24   35:1:4    13:8:19
MLkNN     14:0:26   27:0:13   32:0:8    30:0:10   23:0:17   26:0:14   24:0:16   -         36:0:4    23:0:17
MV        5:0:35    7:0:33    5:1:34    11:0:29   7:0:33    7:0:33    4:1:35    4:0:36    -         4:0:36
OLA       10:0:30   25:0:15   31:1:8    37:0:3    19:1:20   31:0:9    19:8:13   17:0:23   36:0:4    -


5.1. Thresholding the multi-label learners

Tables 3 and 4 present the accuracy and standard deviation of the framework instantiated with CLR and ML-kNN respectively, for all datasets, under the following thresholding schemes: (a) their default thresholding strategy, (b) OneThreshold with threshold values ranging from 0.5 to 0.9 using a step of 0.05, and (c) OneThreshold using 5-fold cross-validation to automatically compute the threshold value that optimizes the loss function in Eq. (2); the tables also report the actual threshold value that was selected. The highest accuracy at each dataset is underlined. As suggested in Demsar [3], we base the discussion of the results on the average ranks of the different versions of the proposed approach (last row of the tables).

For CLR, we notice that threshold values higher than 0.55 lead to better average ranks compared to the default thresholding strategy. Higher threshold values lead to the greatest improvement, with 0.80, 0.85 and 0.90 being the three values with the best performance compared to the rest. This is in line with the analysis in Section 3, as higher thresholds improve the precision of the learners. Automatically selecting a threshold via cross-validation performs relatively well, but not better than a fixed high threshold.

As already discussed, the default version of ML-kNN uses an implicit 0.5 threshold to obtain the bipartition; therefore the first two columns are identical. We notice that all threshold values examined lead to better accuracy compared to the standard threshold. Again, higher thresholds lead to better results, but the best performance is achieved for a threshold value of 0.75. As in the case of CLR, automatically selecting a threshold via cross-validation performs relatively well, but not better than the fixed threshold values of 0.75, 0.85 and 0.90.

The general conclusion is that tuning the threshold improves the performance of the multi-label learners. This comes from the improvement of the precision of the multi-label learners, which in turn comes from the fact that the learners are forced to select fewer classifiers, but classifiers whose correctness we are more confident about. This assumes correct probability estimates and hence a correct ranking of the classifiers, which of course is not always the case. That is why we do not see the accuracy of our approach monotonically increasing with the threshold values.


5.2. Relationship of optimal threshold with dataset properties

Next we investigate whether the optimal threshold of our approach is correlated with properties of the given dataset, such as its difficulty (via the accuracy of our approach), size in samples, missing values (percentage) and number of classes. Figs. 2 and 3 show scatterplots contrasting the optimal threshold and each of the four dataset properties mentioned above for our approach instantiated with CLR and ML-kNN respectively. No strong relationships are noticed, with the exception of a 0.85 correlation between the accuracy of our approach using ML-kNN and the optimal threshold. This means that the easier the learning task (i.e. the higher the accuracy of our approach), the higher the optimal threshold of our approach. This suggests that ML-kNN's confidence in the accuracy of the ensemble's members is a bit optimistic in easier datasets, which is not surprising. This is less pronounced in the case of CLR, with a 0.29 correlation. Easier problems mean many more positive examples per label than negative ones, i.e. a class imbalance which is opposite to the typical one found in multi-label learning problems, where the positive examples are scarce. We hypothesize that CLR is better at handling this class imbalance compared to ML-kNN and does not require setting a much higher threshold. In conclusion, the optimal threshold does not appear to be predictable from dataset properties in the general case. We therefore suggest tuning the threshold via a validation set.

5.3. Accuracy

Table 5 shows the accuracy percentage and standard deviation of the competing methods in all datasets. The highest accuracy at each dataset is underlined. We again base the analysis of the results on the average ranks, as suggested in Demsar [3]. We notice that the best overall results are achieved by the proposed approach using CLR. This is in accordance with our prior knowledge that CLR is an algorithm that achieves good precision performance. On the other hand, using ML-kNN led to slightly worse results and an overall rank of 4.1. Almost the same performance was achieved by OLA, with an average rank of 4.2, while close to this were also MCB and KNORA. DV and DVS did not perform that well, while the worst results were those of the static classifier fusion approach of majority voting.

Table 7. Average number of models queried by dynamic ensemble selection methods.

Dataset id   DVS   ML-kNN   CLR   KNORA
d1           116   155       11    87
d2            79    93       15    44
d3           106    54       21    46
d4           135   100       67   103
d5           134    60        8    91
d6           124   167       56   104
d7           101   173       35    41
d8            95   138       15    48
d9           102    96       30    45
d10          144   150       51    91
d11           98   131       47    53
d12          151   127       41   127
d13          105   116       16    56
d14          119   127       29    55
d15          113   120      108    29
d16          112    50       20    31
d17          168    47       21   134
d18          107    48       30    60
d19          136   191      101   114
d20          172   142       57   153
d21          131   171      107    51
d22          176   144       73   159
d23          137    68       30    43
d24           94   173       94    56
d25          167    72        8    91
d26          112   138       14    57
d27          134   147       33   103
d28          120   100       26   134
d29          134    68       21   141
d30          141   140       52   114
d31          145    69       14    96
d32           99   115       15    25
d33          141   182       15   139
d34          156   112       64   129
d35          102    26       15    48
d36           89   108        8    78
d37          131    65       12   107
d38          177   155       33   173
d39          111    78       18    50
d40           79   115       13   109
Mean         125   113       36    85


Note that automatic thresholding is used in these experiments by our approach, for a fair comparison with the competing approaches.

We also examine the results in terms of the wins, ties and losses for each pair of algorithms, which are shown in Table 6. We refrain from reporting only the statistically significant wins and losses, following the recommendation in Demsar [3]. We notice again the high performance of CLR, which loses in at most 14 datasets (35% of all datasets) against any method. Notably, this maximum number of losses occurs only when CLR is compared with our second approach, ML-kNN. ML-kNN presents high performance too, losing in at most 26 datasets against any method. Once again the main competitor seems to be the OLA algorithm. We also performed the Wilcoxon signed-ranks test between our approach using CLR and its main competitor, namely OLA. The test gave a p-value of 0.0022, indicating that our approach is significantly better than OLA at the 99% level.
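The average-rank and significance analysis used in this section can be reproduced along the following lines; this is an illustrative sketch in which `acc` is a hypothetical 40 x 10 accuracy matrix with one column per method, not the authors' script.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

def average_ranks(acc):
    """acc: (n_datasets, n_methods) accuracy matrix. Rank 1 is the best method
    on a dataset; ties receive average ranks, as in the Demsar [3] setup."""
    per_dataset = np.array([rankdata(-row) for row in acc])
    return per_dataset.mean(axis=0)

# Wilcoxon signed-ranks test between two methods over the 40 datasets,
# e.g. the CLR-based approach versus OLA (column indices are placeholders):
# stat, p = wilcoxon(acc[:, idx_clr], acc[:, idx_ola])
```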

5.4. Number and type of selected models

Table 7 shows the average number of models selected by our approach (with CLR and ML-kNN), KNORA and DVS across all test instances of each dataset; the mean across all datasets appears in the last row of the table. The lowest mean number of selected models, 36, is attained by our approach with CLR. It is therefore clear that one of the reasons for the success of CLR within our approach is the selection of a small number of accurate models. ML-kNN, on the other hand, selects a larger number of models, 113 on average, which is consistent with its lower accuracy. The average number of models selected by KNORA is 85, while that of DVS is even larger, 125, again in line with their respective lower performance.

Fig. 4 shows the frequency of selection of each member of the ensemble across all datasets and test examples. We notice that decision tree models are selected most often, followed by good versions of SVMs (the higher the cost parameter, the better) and good versions of MLPs (the larger the number of hidden-layer nodes, the better). kNNs are selected with lower frequency, and among them, the lower the number of nearest neighbors, the better.
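The quantities in Table 7 and Fig. 4 are simple aggregates over which classifiers each method selected per test instance. A minimal sketch, assuming the selections are stored as a boolean matrix (the variable names are illustrative), follows.

```python
import numpy as np


def selection_statistics(selected):
    """selected: boolean array of shape (n_test_instances, n_models),
    True where a model was chosen for that instance."""
    avg_models_per_instance = selected.sum(axis=1).mean()  # entries of Table 7
    selection_frequency = selected.sum(axis=0)             # per-model counts of Fig. 4
    return avg_models_per_instance, selection_frequency
```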

5.5. Relationship of improvement with respect to dataset size

In this section we investigate whether the relative performance of our approach is correlated with the dataset size in samples. The relative performance is measured via the rank of our approach on each of the 40 datasets when compared with the other nine competing methods of Table 5. Fig. 5 shows scatter plots contrasting this rank with the dataset size for our approach instantiated with ML-kNN (left) and CLR (right). No relationship is noticeable for CLR. On the other hand, a weak negative correlation appears for ML-kNN: the larger the dataset, the better the performance of our approach using ML-kNN (i.e. the smaller its rank). This finding is justified as follows: ML-kNN searches for the nearest neighbors of a test instance in the training set, and a larger training set represents the true data distribution more accurately, allowing the algorithm to generalize better to unknown instances.
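The correlation coefficients shown in Fig. 5 are plain Pearson correlations between per-dataset ranks and dataset sizes; a minimal sketch with assumed array names follows.

```python
import numpy as np


def rank_size_correlation(ranks, dataset_sizes):
    """Pearson correlation between the per-dataset rank of a method
    (e.g. taken from Table 5) and the dataset sizes in samples."""
    return np.corrcoef(ranks, dataset_sizes)[0, 1]

# e.g. rank_size_correlation(ranks_mlknn, dataset_sizes) should reproduce
# the r = -0.27 reported for ML-kNN in Fig. 5 (array contents are assumptions)
```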

Fig. 4. Model frequencies.

Fig. 5. Scatterplots and correlation coefficients for the rank of our approach instantiated with ML-kNN (left, r = −0.27) and CLR (right, r = 0.013). The x-axis corresponds to the rank of our method and the y-axis to the dataset size in samples.


6. Conclusions and future work

This paper presented a new approach to instance-based ensemble pruning based on multi-label learning. Each classifier of an ensemble corresponds to a label, and a multi-label model is learned to output the correct classifiers for a given instance. We showed theoretically that, in order to achieve a correct prediction after plurality voting among the subset of classifiers output by the multi-label model, this model should exhibit high example-based precision.

We instantiated this framework with two multi-label learning algorithms and investigated the effect of thresholding on the accuracy of the framework. As expected, the use of higher thresholds leads to improved precision and higher accuracy. We further investigated a strategy that automatically determines an appropriate threshold by optimizing the precision of the models. The performance of the proposed framework was compared with a variety of state-of-the-art competing methods. Based on the results, we concluded that the proposed framework performs significantly better on the large collection of classification datasets analyzed in this work. We would also like to stress that the performance of the framework depends on the multi-label learner and the thresholding strategy used. Therefore, it has the potential to achieve even better results than those presented here, which are based on the CLR and ML-kNN algorithms and the OneThreshold strategy.

We also investigated whether the optimal threshold of our approach is correlated with properties of the given dataset, such as its difficulty (via the accuracy of our approach), size in samples, percentage of missing values and number of classes. No strong relationships were noticed, with the exception of a strong correlation between the accuracy of our approach using ML-kNN and the optimal threshold.

In future work it would be interesting to compare the proposed approach against methods that simultaneously consider both the diversity and the accuracy of the pruned ensemble. In addition, we would like to examine the benefit of our approach against popular homogeneous ensemble methods, such as boosting and random forests. For such an experiment to be fair, however, the random forest would also have to be added to our heterogeneous ensemble of models, in which case we would expect our framework to outperform the random forest by itself. Furthermore, we would like to examine the benefit of the proposed approach when applied to homogeneous ensembles. Finally, we would like to verify the conclusion of previously published papers that dynamic selection methods perform better than static ones, by comparing our approach with methods that statically select subsets of the initial ensemble or statically select the single best model from it. Such experiments could also reveal which properties make dynamic selection methods outperform their static counterparts.
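To make the prediction step summarized above concrete, the following minimal Python sketch selects the classifiers whose confidence exceeds a threshold and combines them by plurality voting. The interfaces (predict_confidences, the fallback to the full ensemble) are assumptions for illustration and are not tied to a particular library or to the exact implementation used in our experiments.

```python
from collections import Counter


def predict_with_pruned_ensemble(x, classifiers, multilabel_selector, threshold=0.5):
    """classifiers: list of trained base models, each with a .predict(x) method.
    multilabel_selector: multi-label model returning one confidence score per
    base classifier for instance x (assumed interface)."""
    confidences = multilabel_selector.predict_confidences(x)  # assumed interface
    selected = [clf for clf, c in zip(classifiers, confidences) if c >= threshold]
    if not selected:
        selected = classifiers  # assumed fallback for the sketch: use the full ensemble
    votes = Counter(clf.predict(x) for clf in selected)
    return votes.most_common(1)[0][0]  # plurality vote among the selected classifiers
```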

Acknowledgments

We would like to thank the anonymous reviewers for their useful comments and suggestions.

References

[1] A. Asuncion, D. Newman, UCI Machine Learning Repository, 2007. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
[2] K. Crammer, Y. Singer, A family of additive online algorithms for category ranking, J. Mach. Learn. Res. 3 (2003) 1025–1058.
[3] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[4] W. Fan, F. Chu, H. Wang, P.S. Yu, Pruning and dynamic scheduling of cost-sensitive ensembles, in: Eighteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 2002, pp. 146–151.
[5] C. Feng, A. Sutherland, R. King, S. Muggleton, R. Henery, Comparison of machine learning classifiers to statistics and neural networks, in: Proceedings of the Third International Workshop on Artificial Intelligence and Statistics, 1993, pp. 41–52.
[6] J. Fürnkranz, E. Hüllermeier, E.L. Mencia, K. Brinker, Multilabel classification via calibrated label ranking, Mach. Learn. 73 (2) (2008) 133–153.
[7] G. Giacinto, F. Roli, Adaptive selection of image classifiers, in: Proceedings of the 9th International Conference on Image Analysis and Processing (ICIAP'97), vol. I, 1997, pp. 38–45.
[8] G. Giacinto, F. Roli, Dynamic classifier selection based on multiple classifier behaviour, Pattern Recognit. 34 (9) (2001) 1879–1881.
[9] S. Godbole, S. Sarawagi, Discriminative methods for multi-labeled classification, in: Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004), 2004, pp. 22–30.
[10] D. Hernández-Lobato, G. Martínez-Muñoz, A. Suárez, Statistical instance-based pruning in ensembles of independent classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 364–369.
[11] E. Hüllermeier, J. Fürnkranz, W. Cheng, K. Brinker, Label ranking by learning pairwise preferences, Artif. Intell. (2008).
[12] M. Ioannou, G. Sakkas, G. Tsoumakas, I. Vlahavas, Obtaining bipartitions from score vectors for multi-label classification, in: Proceedings of the 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2010), IEEE Computer Society, Los Alamitos, CA, USA, 2010, pp. 409–416.
[13] A.H. Ko, R. Sabourin, A. Souza Britto Jr., From dynamic classifier selection to dynamic ensemble selection, Pattern Recognit. 41 (2007) 1718–1731.
[14] L. Kuncheva, Clustering-and-selection model for classifier combination, in: Proceedings of the Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, vol. 1, 2000, pp. 185–188.
[15] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2) (2003) 181–207.
[16] L. Li, B. Zou, Q. Hu, X. Wu, D. Yu, Dynamic classifier ensemble using classification confidence, Neurocomputing 99 (2013) 581–591.
[17] R. Lysiak, M. Kurzynski, T. Woloszynski, Optimal selection of ensemble classifiers using measures of competence and diversity of base classifiers, Neurocomputing 126 (2014) 29–35.
[18] F. Markatopoulou, G. Tsoumakas, I. Vlahavas, Instance-based ensemble pruning via multi-label classification, in: 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, 2010, pp. 401–408.
[19] J. Ortega, M. Koppel, S. Argamon, Arbitrating among competing classifiers using learned referees, Knowl. Inf. Syst. 3 (4) (2001) 470–490.
[20] S. Puuronen, V.Y. Terziyan, A. Tsymbal, A dynamic integration algorithm for an ensemble of classifiers, in: Z.W. Ras, A. Skowron (Eds.), ISMIS, Lecture Notes in Computer Science, vol. 1609, Springer, 1999, pp. 592–600.
[21] S. Puuronen, V.Y. Terziyan, A. Tsymbal, Dynamic integration of multiple data mining techniques in a knowledge discovery management system, in: Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery, 1999, pp. 128–139.
[22] J. Read, B. Pfahringer, G. Holmes, Multi-label classification using ensembles of pruned sets, in: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08), 2008, pp. 995–1000.
[23] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, in: Proceedings of the 20th European Conference on Machine Learning (ECML 2009), 2009, pp. 254–269.
[24] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, 2nd edition, Springer, US, 2010, pp. 667–685 (Chapter 34).
[25] G. Tsoumakas, I. Katakis, I. Vlahavas, Random k-labelsets for multi-label classification, IEEE Trans. Knowl. Data Eng. 23 (2011) 1079–1089.
[26] G. Tsoumakas, I. Partalas, I. Vlahavas, An ensemble pruning primer, in: O. Okun, G. Valentini (Eds.), Supervised and Unsupervised Methods and Their Applications to Ensemble Methods (SUEMA 2009), Springer-Verlag, Berlin, Heidelberg, 2009, pp. 1–13.
[27] A. Tsymbal, Decision committee learning with dynamic integration of classifiers, in: Proceedings of the 2000 ADBIS-DASFAA Symposium on Advances in Databases and Information Systems, 2000.
[28] A. Tsymbal, S. Puuronen, Bagging and boosting with dynamic integration of classifiers, in: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2000), Lyon, France, 2000, pp. 195–206.
[29] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, 2005.
[30] K. Woods, W.P. Kegelmeyer Jr., K. Bowyer, Combination of multiple classifiers using local accuracy estimates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (4) (1997) 405–410.
[31] J. Xiao, C. He, X. Jiang, D. Liu, A dynamic classifier ensemble selection approach for noise data, Inf. Sci. 180 (18) (2010) 3402–3421.
[32] Y. Yan, X.-C. Yin, Z.-B. Wang, X. Yin, C. Yang, H.-W. Hao, Sorting-based dynamic classifier ensemble selection, in: 12th International Conference on Document Analysis and Recognition, IEEE, 2013, pp. 673–677.
[33] M.-L. Zhang, Z.-H. Zhou, Multi-label neural networks with applications to functional genomics and text categorization, IEEE Trans. Knowl. Data Eng. 18 (10) (2006) 1338–1351.
[34] M.-L. Zhang, Z.-H. Zhou, ML-kNN: a lazy learning approach to multi-label learning, Pattern Recognit. 40 (7) (2007) 2038–2048.


Fotini Markatopoulou received a B.Sc. in Informatics from the Aristotle University of Thessaloniki in 2011 and an M.Sc. in Machine Learning, Data Mining and High Performance Computing from the University of Bristol. Her research interests include various aspects of machine learning, such as preference learning, ensemble methods and multi-label learning, as well as concept-based image and video annotation and retrieval. She has been working as a research assistant at the Informatics & Telematics Institute (ITI) of the Centre of Research & Technology Hellas (CERTH) since December 2012.

Grigorios Tsoumakas is an assistant professor at the Department of Informatics of the Aristotle University of Thessaloniki (AUTH). He received a degree in Informatics from AUTH in 1999, an M.Sc. in Artificial Intelligence from the University of Edinburgh in 2000 and a Ph.D. in Informatics from AUTH in 2005. His research interests include various aspects of machine learning, knowledge discovery and data mining, such as ensemble methods, distributed data mining, text classification and multi-label learning.

Ioannis Vlahavas is a professor at the Department of Informatics of the Aristotle University of Thessaloniki. He received his Ph.D. degree in Logic Programming Systems from the same university in 1988. During the first half of 1997 he was a visiting scholar at the Department of Computer Science at Purdue University. He specializes in logic programming, knowledge-based and AI systems, and he has published over 200 papers and book chapters and co-authored 8 books in these areas. He teaches logic programming, AI, expert systems and DSS. He has been involved in more than 27 research and development projects, leading most of them. He was the chairman of the 2nd Hellenic Conference on AI and the host of the 2nd International Summer School on AI Planning. He leads the Logic Programming and Intelligent Systems Group (LPIS Group, lpis.csd.auth.gr); more information is available at www.csd.auth.gr/~vlahavas.