Dynamic classifier selection for one-class classification


Bartosz Krawczyk, Michał Woźniak
PII: S0950-7051(16)30159-9
DOI: 10.1016/j.knosys.2016.05.054
Reference: KNOSYS 3551

To appear in: Knowledge-Based Systems
Received date: 3 August 2015
Revised date: 5 May 2016
Accepted date: 27 May 2016

Please cite this article as: Bartosz Krawczyk, Michał Woźniak, Dynamic Classifier Selection for One-Class Classification, Knowledge-Based Systems (2016), doi: 10.1016/j.knosys.2016.05.054



Highlights

• Introduction of dynamic classifier selection for one-class classification

• Three novel competence measures for one-class classifiers

• Gaussian approach for extending competence over the entire decision space

• Results indicating that dynamic selection is a good alternative to static ensembles


Dynamic Classifier Selection for One-Class Classification

Bartosz Krawczyk a,∗, Michał Woźniak a

a Department of Systems and Computer Networks, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland

Abstract


One-class classification is among the most difficult areas of contemporary machine learning. The main problem lies in selecting the model for the data, as we do not have any access to counterexamples and cannot use standard methods for estimating classifier quality. Therefore, ensemble methods that can use more than one model are a highly attractive solution. With an ensemble approach, we prevent the situation of choosing the weakest model and usually improve the robustness of our recognition system. However, one cannot assume that all classifiers available in the pool are in general accurate - they may have local competence areas in which they should be employed. In this work, we present a dynamic classifier selection method for constructing efficient one-class ensembles. We propose to calculate the competencies of all classifiers for a given validation example and use them to estimate their competencies over the entire decision space with the Gaussian potential function. We introduce three measures of classifier competence designed specifically for one-class problems. A comprehensive experimental analysis, carried out on a number of benchmark datasets and backed up with a thorough statistical analysis, proves the usefulness of the proposed approach.


Keywords: one-class classification, classifier ensemble, machine learning, dynamic classifier selection, competence measure.

1. Introduction


One-class classification (OCC) is among the most difficult, but very promising, areas of contemporary machine learning. It works under the assumption that during the training phase only objects originating from a single class are at our disposal. This may be caused by cost restraints, difficulties or ethical

∗Corresponding author. Email addresses: [email protected] (Bartosz Krawczyk), [email protected] (Michał Woźniak)



implications of collecting some samples, or simply a complete lack of ability to access or generate objects. This is common in many real-life applications, such as intrusion detection systems, fault diagnosis or object detection [23]. As we have no access to any counterexamples during the training phase, constructing an efficient model and selecting optimal parameters for it becomes a very demanding task. Among possible solutions, combined classification seems a promising direction [31]. In the last decade, we have seen a significant development of algorithms known as ensemble classifiers or multiple classifier systems [47]. Their success lies in the ability to tackle complex tasks by decomposition, utilizing different properties of each model and taking advantage of collective classification abilities. A classifier ensemble can work properly only if the base classifiers are characterized by high individual quality and are at the same time mutually complementary. This characteristic, known as diversity, can be achieved in the following ways. The most common approach is known as static selection. It concentrates on creating a pool of base learners and then establishing a combination method for efficient exploitation of the pool members. Several different proposals exist on how to produce diverse classifiers for one problem - obtaining a heterogeneous classifier ensemble is easier (as one only needs to train several different models) [38], while a homogeneous ensemble requires applying some kind of perturbation (e.g., using different training parameters, selecting sub-groups of objects or partitioning the feature space) [27] to obtain initial diversity among members. After preparing a set of classifiers, one needs to choose an appropriate combination method, which can range from relatively simple voting procedures on discrete outputs, through aggregating discriminant functions, to more sophisticated trained combination rules [20]. The second group is known as dynamic selection [26]. In this approach, one assumes that the structure of the ensemble varies for each new incoming example. This is based on the assumption that each classifier has its own local area of competence [34]. Therefore, for classifying a new object the most competent model(s) should be delegated. To establish the competence, a dedicated measure based on the correctness of classification is needed [33]. As it can only be calculated locally for examples provided during the training phase, one must extend it over the entire decision space. Dynamic ensembles are divided into two categories: dynamic classifier selection (DCS) [21] and dynamic ensemble selection (DES) [35]. The first model assumes that for each new example the single classifier with the highest competence is selected and the decision of the ensemble is based on the output of this individual classifier. In DES systems, the l most competent classifiers are selected to construct a local sub-ensemble. DCS systems have become very popular in the last decade [5]. This is due to their ability to work efficiently with both heterogeneous and homogeneous pools of classifiers, by exploiting their local areas of competence [8]. Just as static ensembles, they aim at using classifiers that are locally specialized [19]. This way, each learner from the pool has to tackle a simplified task, instead of trying to achieve competence over the entire decision space. However, DCS


systems allow one to efficiently choose the appropriate classifier for each incoming example, thus offering a flexible structure of the ensemble [14]. This cannot be achieved by static selection methods, as they need to establish the committee structure during the training phase - it remains identical regardless of what kind of objects will be labeled during its exploitation. Such a characteristic can be especially useful in the case of non-stationary problems or cases in which our knowledge about the true object distribution is limited. DCS and DES systems have recently been applied with success to data stream classification, where they can cope with continuously arriving and shifting data [24; 50]. However, one can see that these properties can also be useful for handling cases in which we have limited access to information about what kind of objects can appear in the exploitation phase. Here, one can prepare a set of diverse classifiers and use the flexible DCS system to avoid selecting non-competent learners for given data. This idea leads us to the proposition of a DCS system dedicated to the specific nature of the one-class classification task. To the best of the authors' knowledge, this is the first work on dynamic ensembles for the one-class classification task. So far, there are some works on ensembles for one-class classification, but they concentrate on static selection approaches [7; 16; 30–32; 40]. DCS systems have the potential to efficiently handle the difficulty embedded in the nature of OCC problems - the lack of any information about the nature of counterexamples. A good one-class classifier should at the same time display generalization abilities on the target class and be prepared to discriminate outliers that can potentially appear anywhere in the decision space. Thus, an adaptable structure of the ensemble can be a great advantage, as one may use at the same time different learning paradigms offered by various one-class classifiers. In this work we formulate the dynamic classifier selection problem from the one-class classification perspective and propose three efficient measures tailored for the specific task of calculating the competence of given one-class classifiers. Then, we use the Gaussian potential function to estimate the competence of the classifier pool over the entire decision space and to gain generalization abilities for our ensemble. Below, let us enlist the main contributions of our paper.


• We introduce a dynamic classifier selection system for problems without access to counterexamples during the training phase. To the best of our knowledge, this is the first attempt to use a DCS system in one-class classification. We formulate the background for this task in a single-class learning scenario.

• We propose three measures of classifier competence, based on different heuristics and tailored for the specific nature of one-class classification tasks. They are independent of the classification model used, and hence flexible enough to work with any type of one-class classifier.

• We show how to extend the local competencies of base classifiers calculated at validation points into global competencies spanning the entire decision space. For this task, we employ the normalized Gaussian potential function.


• We experimentally validate our proposal over an extensive number of benchmark datasets and with a thorough statistical analysis. The experiments prove that a one-class DCS system can often outperform state-of-the-art methods of classifier combination [12].

The remaining part of this manuscript is organized as follows. The next section gives the necessary background on the nature of the one-class classification task. In Section 3, we present our one-class DCS system in detail. Section 4 describes the experimental analysis and a discussion of the obtained results, while Section 5 concludes the paper.

2. One-Class Classification

In this section, an introduction to the nature of OCC problems is given. Additionally, the current state of the art in one-class ensemble models is discussed.

2.1. Learning in the Absence of Counterexamples


One-class classification (OCC) is a specific area of machine learning. It aims at distinguishing a given single class from a broader set of classes. This class is known as the target concept and denoted as ωT. All other objects that do not satisfy the conditions of ωT are labeled as outliers ωO. This may seem like a binary classification problem, but the biggest difference lies in the learning procedure [43]. OCC assumes that counterexamples are unavailable during training; therefore, an OCC training procedure needs to estimate the classification rules without access to counterexamples. At the same time, it must display good generalization properties, as during the exploitation phase both objects from the target concept and unseen outliers may appear. OCC aims at finding a trade-off between capturing the properties of the target class (a too tight or too loose boundary may lead to high false rejection / false acceptance rates) and maintaining good generalization (as overfitting is likely to occur when having only objects from a single class for training). The idea of learning in the absence of counterexamples is depicted in Figure 1.

Let us formulate the model of a one-class classifier. We assume that the model works in an n-dimensional feature space X ⊆ Rⁿ and deals with a one-class problem described by a set of class labels {ωT, ωO}. A one-class classifier is a model

Ψ : X → {ωT, ωO},   (1)

that produces a pair of support functions (F(x, ωT), F(x, ωO)) for a given object, characterized by a vector of features x.


Figure 1: Example one-class toy problem with three main stages: (left) during the training phase only positive objects are available; (center ) a one-class classifier is trained on the data, enclosing all the relevant samples, while not being overfitted to data; (right) during the exploitation phase new objects appear, that can be labeled as the target concept (positive samples that should be accepted) or outliers (negative objects that should be rejected).


The classification algorithm Ψ makes a decision using the following rule:

Ψ(x) = ωT if F(x, ωT) ≥ F(x, ωO), and Ψ(x) = ωO otherwise.   (2)


To apply the above decision rule, we require knowledge of the support function values of each individual classifier in the pool. However, not all one-class classifiers can output them directly - some of them (like the Nearest Neighbor Data Description [17] or the One-Class Support Vector Machine [32]) work on the basis of the distance between the new sample and the decision boundary (also known as the reconstruction error). Therefore, to conduct the combination step, we propose to use a heuristic mapping:

F(x, ωT) = exp(−dst(x, ωT)/s),   (3)


where dst(x, ωT) stands for a distance (usually the Euclidean metric is used) between the considered object x and the decision boundary for ωT (this depends on the nature of the used classifier, e.g., support vectors or the nearest neighbor may be used) and s is a scale parameter that should be fitted to the target class distribution. This scale factor is related to how spread out the data points are. When the distances between objects tend to be very high (e.g., in high-dimensional spaces), a small value of s is used to control the stability of the mapping. Therefore, in most cases s = d1. This mapping has the advantage that the outputted support value is always bounded between 0 and 1.
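To make the mapping and the decision rule concrete, the short sketch below converts a distance-based one-class output into a bounded support value following Eq. (3) and then applies a decision in the spirit of Eq. (2). It is only an illustration: the function names, the scale value and the use of a fixed threshold in place of F(x, ωO) are our own assumptions, not part of the original formulation.

```python
import numpy as np

def support_from_distance(dist_to_boundary, s=1.0):
    """Heuristic mapping of Eq. (3): turn a distance (or reconstruction error)
    returned by a one-class classifier into a support value bounded in (0, 1]."""
    return np.exp(-np.asarray(dist_to_boundary, dtype=float) / s)

def one_class_decision(support_target, outlier_support=0.5):
    """Decision in the spirit of Eq. (2): accept the object as the target concept
    when its target support is at least the support assigned to the outlier class
    (here simplified to a fixed value)."""
    return np.where(support_target >= outlier_support, "target", "outlier")

# Toy usage: three objects at increasing distance from the decision boundary.
distances = np.array([0.1, 0.8, 2.5])
supports = support_from_distance(distances, s=1.0)
print(supports)                      # [0.905 0.449 0.082]
print(one_class_decision(supports))  # ['target' 'outlier' 'outlier']
```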

One-class classification finds its applications in situations where gathering counterexamples is difficult, costly, unethical or simply impossible. One may view intrusion detection for computer systems (IDS/IPS) [25] as an example. The target class covering the normal and safe messages is unchanged, but the malicious messages (intrusions) are constantly changing, because malicious users are trying to mislead security systems. Therefore, one cannot predict beforehand what kinds of attacks can occur. If we concentrate on normal messages only, as the target class, then we may be able to train classifiers which are capable of distinguishing normal messages from malicious ones without knowledge about the outlier class [23]. It could protect our security system against the so-called "zero day attack" as well.

In the last decade a plethora of one-class classifiers have been proposed. They can be divided into three major groups:


• The first approach comprises methods based on density estimation of a target class. This is a simple, yet surprisingly effective method for handling concept learning. However, this approach has limited application as it requires a high number of available samples and the assumption of a flexible density model [39]. The most widely used methods from this group are the Gaussian model, the mixture of Gaussians [51], and the Parzen Density data description [11].


• The second group is known as reconstruction methods. They were originally introduced as a tool for data modeling. These algorithms estimate the structure of the target class, and their usage in OCC tasks is based on the idea that unknown outliers differ significantly from this established positive class structure. The most popular techniques are the k-means algorithm [9], self-organizing maps [44], and auto-encoder neural networks [36].


• The third group consists of boundary methods. Estimating the complete density or structure of a target concept in a one-class problem can very often be too demanding or even impossible. Boundary methods instead concentrate on estimating only the closed boundary for the given data, assuming that such a boundary will sufficiently describe the target class [28]. The main aim of these methods is to find the optimal size of the volume enclosing the given training points [42], in order to find trade-off between robustness to outliers and generalization over positive examples. Boundary methods require a smaller number of objects to estimate the decision criterion correctly compared with the two previous groups of methods. The most popular methods in this group include the support vector data description [41] and the one-class support vector machine [10].


2.2. One-Class Classifier Ensemble

Each of the described one-class classifiers uses a different learning paradigm for the data description task. Thus, the resulting shapes of the decision boundary can differ significantly among one-class learners, as seen in Figure 2. At the same time, more than one model may have desirable properties for the considered problem, especially in the case of complex datasets. It is almost impossible to undoubtedly select the proper model without any knowledge about the possible characteristics of negative objects. Therefore, utilizing more than a single model in an ensemble may significantly improve the robustness of the constructed system and prevent us from choosing a weaker model [47]. There exists a big variety of ensemble combination techniques for multi-class problems [47]: based on class labels or support functions [6], trained or untrained [48], and static or dynamic [18].



Figure 2: Exemplary differences between decision boundaries created by different one-class classifiers: (a) Support Vector Data Description, (b) Parzen Density Data Description, (c) Minimum Spanning Tree Data Description, (d) Nearest Neighbor Data Description, (e) Mixture of Gaussians Data Description and (f ) Principal Component Data Description.


However, only static combination methods for one-class ensembles have so far been investigated in the literature [7; 16; 30; 31; 37; 40; 45; 49]. One may distinguish two approaches to forming ensembles of classifiers: homogeneous and heterogeneous. The first one assumes that a committee is constructed with the usage of a single type of classifier, while the second one utilizes base learners trained using different types of classifiers. For homogeneous ensembles we require a way to ensure diversity among classifiers in the pool in order for the ensemble to work. In OCC this is commonly achieved by manipulating inputs in the form of varying training objects (selected, e.g., through Bagging or clustering) [32] or using reduced feature spaces [13]. For heterogeneous ensembles the diversity is achieved to some extent by the simple fact of the base classifiers being trained with different algorithms [29]. Heterogeneous ensembles have not been widely explored in the context of OCC, but offer a potentially more flexible data description when combined with a proper selection algorithm. Let us assume that the pool consists of L one-class classifiers Π = {Ψ1, Ψ2, ..., ΨL}. After performing the mapping, one may use the following proposed combination rules [40], which can be applied to both heterogeneous and homogeneous one-class ensembles:

Mean vote, which combines votes in the form of support functions of one-class classifiers. It is expressed by:

Fmv(x, ωT) = (1/L) Σ_{k=1}^{L} I(Fk(x, ωT) ≥ θk),   (4)


where Fk(x, ωT) stands for the discriminant function value returned by the k-th individual classifier for a given observation x and class ωT, I(·) is the indicator function and θk is a classification threshold. When a threshold equal to 0.5 is applied, this rule transforms into a majority vote for binary problems.

Mean weighted vote, which introduces the weighting of base classifiers by fTk, where fTk is the fraction of target class objects accepted by the k-th individual classifier:

Fmwv(x, ωT) = (1/L) Σ_{k=1}^{L} ( fTk I(Fk(x, ωT) ≥ θk) + (1 − fTk) I(Fk(x, ωT) ≤ θk) ),   (5)

which is a smoothed version of the mean vote method.

Product of the weighted votes, which is defined as:

Fpwv(x, ωT) = Π_{k=1}^{L} fTk I(Fk(x, ωT) ≥ θk) / [ Π_{k=1}^{L} fTk I(Fk(x, ωT) ≥ θk) + Π_{k=1}^{L} (1 − fTk) I(Fk(x, ωT) ≤ θk) ],   (6)

Mean of the estimated probabilities, which is expressed by:

ymp(x) = (1/L) Σ_k Fk(x, ωT).   (7)

Product combination of the estimated probabilities, which is expressed by:

ypc(x) = Π_k Fk(x, ωT) / [ Π_k Fk(x, ωT) + Π_k θk ].   (8)
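To make these fixed rules tangible, the sketch below implements the mean vote of Eq. (4), the mean of estimated probabilities of Eq. (7) and the product combination of Eq. (8) for a single object scored by a pool of L classifiers. The variable names and the example support/threshold values are our own assumptions for illustration, not part of the original formulation.

```python
import numpy as np

def mean_vote(supports, thresholds):
    """Eq. (4): fraction of pool members whose target support exceeds their threshold."""
    supports, thresholds = np.asarray(supports), np.asarray(thresholds)
    return float(np.mean(supports >= thresholds))

def mean_of_probabilities(supports):
    """Eq. (7): average of the continuous target supports of the pool."""
    return float(np.mean(supports))

def product_combination(supports, thresholds):
    """Eq. (8): product of target supports normalized by the product of thresholds."""
    num = np.prod(supports)
    return float(num / (num + np.prod(thresholds)))

# Toy pool of L = 4 one-class classifiers scoring the same object.
supports = [0.72, 0.55, 0.80, 0.40]
thresholds = [0.50, 0.50, 0.60, 0.50]
print(mean_vote(supports, thresholds))            # 0.75
print(mean_of_probabilities(supports))            # 0.6175
print(product_combination(supports, thresholds))  # approximately 0.63
```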

3. One-Class Ensemble with Dynamic Classifier Selection


As mentioned in the previous section, one often cannot easily select the best model for a one-class problem. This is often accompanied by the observation that more than one type of classifier can be useful for analyzing the given dataset. These assumptions are the basis of our proposal of a one-class dynamic classifier selection (OCDCS) system, in which we prepare a pool of heterogeneous models and delegate one of them dynamically to the decision area in which it is the most competent one. This idea is based on an assumption that there is always a trade-off between the flexibility and generalization abilities of a given ensemble system. By having a larger pool of diverse models we may potentially obtain improved classification accuracy and be prepared for various types of outliers to appear. However, with the growth of the ensemble size we usually lose the individual beneficial characteristics of each of its base learners. This is due to the fact that not all classifiers are competent in a given region of the decision space. With an increasing size of the classifier pool, the probability of having non-competent learners within it also increases. Therefore, by combining competent classifiers with non-competent ones we may harm the overall system's performance. Based on this observation a proper classifier selection seems most necessary. However, as we do not have any knowledge about the nature of potential outliers, preparing a static selection of learners seems risky. In the OCC scenario one cannot fully determine which classifiers will be useful for dealing with unknown objects. Therefore, dynamic selection seems a more attractive approach. Here we assume that we store all classifiers in the pool and for each new object to be classified we delegate a single classifier that is estimated to be the most competent in the given region of the decision space. This allows us to maintain both ensemble flexibility and discriminatory power. Dynamic classifier selection requires the following components: a pool of base classifiers, an independent validation set to measure their local competencies, a dedicated measure of competence and a function that will allow us to extend the measured competence from single examples over the entire decision space. During the ensemble training stage, we should have at our disposal a training set TRS and a validation set VS, both with objects described by a feature vector x and with known true class labels. In the case of one-class classification both the training and validation sets consist only of objects belonging to the target concept ωT. Let us now describe each of the components of the proposed one-class dynamic classifier selection system in detail.


3.1. Measuring Classifier's Competence

Using the provided VS, one may calculate the competence measure C(Ψ|x). It reflects the competence of a given classifier to correctly classify a given example x ∈ X. Therefore, it may reflect how much certainty a given one-class classifier has in this specific point of the decision space. Please note that the source competence is calculated with the use of a validation set VS for a classifier already trained on TRS. A separate validation set is required to obtain an unbiased estimation of a classifier's competence in a given area. If one would use the same objects for classifier training and competence evaluation, then the measurements would be strongly biased. This is due to the fact that a classifier trained with a given sample is most likely to return a very high competence for it. Such a fact is intuitive, but does not give us any outlook on the actual competence and generalization abilities of a given classifier. Only using new, unseen samples for estimating the competence allows us to obtain meaningful measurements. In the case of multi-class problems, all objects from TRS and VS should have class labels supplied by an expert or labeling system. However, this is not necessary in the case of OCC problems, as they are often viewed as unsupervised learning (we assume that all objects at our disposal during the ensemble design stage are from the target concept ωT). Now, one needs to define the method for calculating the source competence CSRC(Ψ|x). We propose three different measures, based on selected heuristics for estimating how competent a given one-class classifier is for each object from VS, inspired by contemporary works on DCS systems for multi-class problems [3].


3.1.1. Minimal Difference Measure

This measure is based on the degree of uncertainty of a base classifier. In most real-life problems extreme cases of classifier confidence (equal to 1 if the classified object is considered as the target concept or equal to 0 if it is considered as an outlier) are very rare. Most often, the classifier's support value lies somewhere between 0 and 1. Thus we must deal with a decision that is unreliable to some extent. The Minimal Difference (MD) measure is calculated according to both the correctness and the degree of uncertainty of the analyzed classifier Ψ at a given validation object xk ∈ VS, where jk is its correct class label:

CSRC(Ψ|xk) = (2[Ψ(xk) = jk] − 1) (F(xk, jk) − F(xk, j)),   (9)

where j denotes the opposite class label and [·] denotes the Iverson bracket. If the considered classifier Ψ correctly classifies the validation point xk, then CSRC(Ψ|xk) will be high and this classifier is considered as competent. The loss of competence is directly related to a drop in the value of CSRC(Ψ|xk). In both cases the returned MD value relies on the degree of uncertainty of the decision returned by Ψ. This measure allows us to calculate the differences between the expected and obtained supports for each class, thus directly linking the increase in decision uncertainty to a decrease in the classifier's local competence.


3.1.2. Full Competence Measure

In the proposed Full Competence (FC) measure, the source competence of a classifier in a given validation point xk is defined as the following full difference between the support functions:

CSRC(Ψ|xk) = (2 · F(xk, jk) − 1) / |F(xk, ωO) − F(xk, ωT)|.   (10)

This simple measure reflects how strongly a given classifier opts for one of the considered classes. It favors classifiers that have high values of support pointing to the correct class and penalizes classifiers with high values of support for the incorrect class. Therefore, it is most effective for evaluating learners with a high potential certainty of decision.


3.1.3. Entropy Measure

The Entropy Measure (EM) is based on the popular entropy criterion. The competence measure is a combination of two factors, one determining the absolute value of the competence, and the other determining its sign. The value is inversely proportional to the normalized entropy of the discriminant values of a given one-class classifier, while the sign of the competence is determined by the correctness of the classification of the validation point xk. With this, one may calculate the entropy-based competence function as follows:

CSRC(Ψ|xk) = (2[Ψ(xk) = jk] − 1) ( − Σ_{j∈{ωO,ωT}} F(xk, j) log2 F(xk, j) ).   (11)

Higher entropy means lower competence of a given classifier in the analyzed decision area. This way we may evaluate the stability of the given learner in extended decision regions.
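For clarity, the sketch below evaluates the MD and FC source competences of Eqs. (9) and (10) for a single validation object; the EM measure of Eq. (11) is obtained analogously by replacing the support difference with the entropy of the two support values. The supports are assumed to be already mapped into [0, 1], and all names are our own illustrative choices rather than the authors' implementation.

```python
def source_competence_md_fc(f_target, f_outlier, true_is_target, predicted_target):
    """Return the MD (Eq. 9) and FC (Eq. 10) source competences for one
    validation object, given its supports for the target and outlier classes."""
    sign = 1.0 if predicted_target == true_is_target else -1.0  # 2[Psi(x_k) = j_k] - 1
    f_correct = f_target if true_is_target else f_outlier
    f_wrong = f_outlier if true_is_target else f_target

    md = sign * (f_correct - f_wrong)                         # Minimal Difference
    fc = (2.0 * f_correct - 1.0) / abs(f_outlier - f_target)  # Full Competence
    return md, fc

# Toy usage: a target-class validation object that was correctly accepted.
print(source_competence_md_fc(f_target=0.8, f_outlier=0.2,
                              true_is_target=True, predicted_target=True))
# (0.6, 1.0)
```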


CE

The presented measures only gives us an outlook on the performance of the classifier in specific validation points. As new objects may appear anywhere in the feature space we require a method that will allow us to measure how this local competence is related to the global one in areas not covered by validation points. By extending the calculated competencies over the entire decision space one may select the most competent classifier for a given new, unknown objects. Following the suggestion for multi-class problems, we use a two-step procedure for estimating the competence of a given classifier for the entire decision space: • calculate a source competence CSRC (Ψ|xk ) for each xk ∈ VS; • extend these source competencies for the entire decision space according to normalized Gaussian potential function [46].


The normalized Gaussian potential function allows us to estimate the competence of the l-th one-class classifier over the entire space (and not only the part described by the validation objects). This can be formulated as:

C(Ψ|xn) = Σ_{xk∈VS} CSRC(Ψ|xk) exp(−dist(xn, xk)²) / Σ_{xk∈VS} exp(−dist(xn, xk)²),   (12)

where xn is the new, incoming example to be classified and dist(xn, xk) is the Euclidean distance between the new object and an object from VS with an already known source competence. Here, we should note that all three proposed measures satisfy the properties required for Gaussian-based estimation of competence values over the decision space:


1. for CSRC(Ψ|xk) < 0 the considered classifier is deemed as incompetent;
2. for CSRC(Ψ|xk) = 0 the considered classifier is deemed as neutral;
3. for CSRC(Ψ|xk) > 0 the considered classifier is deemed as competent.
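A minimal sketch of the second step of this procedure is given below. It assumes Euclidean distances and the plain normalized Gaussian potential of Eq. (12), with array-based names of our own choosing, so it should be read as an illustration rather than the authors' implementation.

```python
import numpy as np

def extend_competence(x_new, val_points, src_competences):
    """Eq. (12): extend source competences measured at the validation points
    to an arbitrary point of the feature space using a normalized Gaussian
    potential function."""
    val_points = np.asarray(val_points, dtype=float)
    src_competences = np.asarray(src_competences, dtype=float)
    d2 = np.sum((val_points - np.asarray(x_new, dtype=float)) ** 2, axis=1)
    weights = np.exp(-d2)            # Gaussian potential of each validation object
    return float(np.dot(weights, src_competences) / np.sum(weights))

# Toy usage: two validation objects with known source competences.
val_points = [[0.0, 0.0], [1.0, 1.0]]
src = [0.6, -0.2]
print(extend_competence([0.2, 0.1], val_points, src))  # about 0.44; the nearer point dominates
```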


3.3. Dynamic Classifier Selection

We can easily extend the presented competence functions to a classifier ensemble system. Let us assume that for a given pattern classification problem we have a pool Π of L one-class classifiers at our disposal. We can improve the robustness and diversity of the one-class classification system by utilizing a pool of different one-class models and dynamically selecting, for each incoming object, the single most competent classifier. This is done according to the following rule:

Ψ(xn) = Ψl(xn) ⟺ C(Ψl|xn) = max_{i∈{1,2,...,L}} C(Ψi|xn),   (13)


where Ψ(xn) is the ensemble decision for a new, incoming object after dynamic classifier selection.
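In code, the selection of Eq. (13) reduces to an argmax over the pool, as sketched below. The pool members are assumed to expose a scikit-learn-like predict method, and `competence_of` is a placeholder for any competence estimate (for instance the `extend_competence` sketch above); both are our assumptions, not interfaces defined in the paper.

```python
def dynamic_selection(x_new, pool, competence_of):
    """Eq. (13): delegate the classification of x_new to the single pool member
    with the highest estimated competence at that point of the decision space."""
    best = max(pool, key=lambda clf: competence_of(clf, x_new))
    return best.predict([x_new])[0]
```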


4. Experimental Analysis


The following experiments were designed in order to reach two goals:

• To investigate the usefulness of applying a DCS system in one-class classification problems and its potential to outperform single-model approaches and static ensemble fusion methods;

• To assess the quality and areas of applicability of the three introduced competence measures.


Table 1: Details of one-class classification models and their parameters that were used as a base for the considered ensembles.

1. Auto-Encoder Neural Network (AENN): hidden units = 10, frac. rejected = 0.1
2. K-means Data Description (KMDD): k = 5, frac. rejected = 0.15
3. Minimum Spanning Tree Data Description (MSTDD): max. path = 20, frac. rejected = 0.1
4. Mixture of Gaussians Data Description (MoGDD): clusters = 4, arbitrary shaped clusters, frac. rejected = 0.05
5. Nearest Neighbor Data Description (NNDD): frac. rejected = 0.05
6. One-Class Support Vector Machine (OCSVM): kernel = RBF, σ = 0.3, frac. rejected = 0.05
7. Parzen Density Data Description (PDDD): frac. rejected = 0.05
8. Principal Component Data Description (PCADD): variance = 0.95, frac. rejected = 0.05
9. Self-Organizing Map Data Description (SOMDD): map = [5,5], frac. rejected = 0.1
10. Support Vector Data Description (SVDD): kernel = polynomial, d = 2, frac. rejected = 0.1

Table 2: One-class ensemble systems used for the experimental study.

1. Single Best (SB): using the single best model from the pool with the highest accuracy
2. Mean Vote (MV): voting based on support functions
3. Mean Weighted Vote (MWV): votes weighted by the rejection threshold of each classifier
4. Product of Weighted Votes (PWV): product of votes weighted by the rejection threshold of each classifier
5. Mean of Estimated Probabilities (MEP): average of continuous outputs of each classifier
6. Product Combination of the Estimated Probabilities (PEP): product of continuous outputs of each classifier
7. Minimum Difference Selection (OCDCS-MD): OCDCS with the minimum difference competence function
8. Full Competence Selection (OCDCS-FC): OCDCS with the full competence function
9. Entropy Measure Selection (OCDCS-EM): OCDCS with the entropy-based competence function

4.1. Used Classification Models

To apply the proposed OCDCS system, we need to have a pool of base classifiers. In the experiments, we decided to use a heterogeneous set of models in order to exploit the differences in their learning paradigms for tackling single-class problems. We propose to form the ensemble of 10 heterogeneous models, which are presented together with their parameters in Table 1. Please note that each group of one-class learners is represented. We use different rejection thresholds in order to ensure additional diversity among the committee members. To put the obtained results into context, we compared our proposed OCDCS systems with reference one-class static ensembles. A description of the used ensemble systems is presented in Table 2.

4.2. Datasets


We have chosen 20 binary datasets in total, originating from four different repositories: UCI1 , KEEL2 [1], TUDelft3 [43] and TunedIT4 . Details of the chosen datasets are given in Table 3.

1 https://archive.ics.uci.edu/ml/index.html
2 http://sci2s.ugr.es/keel/datasets.php
3 http://homepage.tudelft.nl/n9d04/occ/index.html
4 http://tunedit.org/challenge/QSAR


Table 3: Details of the datasets used in the experimental investigation. Indexes stand for the repository from which a given benchmark is taken (1 - UCI, 2 - KEEL, 3 - TUDelft, 4 - TunedIT). Numbers in parentheses indicate the number of objects in the minority class.

1. Arrhythmia (3) - objects: 420 (183), features: 278, classes: 2
2. Biomed (3) - objects: 194 (67), features: 5, classes: 2
3. Breast-cancer (1) - objects: 286 (85), features: 9, classes: 2
4. Breast-Wisconsin (1) - objects: 699 (241), features: 9, classes: 2
5. Colic (1) - objects: 368 (191), features: 22, classes: 2
6. Diabetes (1) - objects: 768 (268), features: 8, classes: 2
7. Delft Pump 2x2 noisy (3) - objects: 240 (64), features: 64, classes: 2
8. Glass0123vs456 (2) - objects: 214 (51), features: 9, classes: 2
9. Heart-statlog (1) - objects: 270 (120), features: 13, classes: 2
10. Hepatitis (3) - objects: 155 (32), features: 19, classes: 2
11. Ionosphere (1) - objects: 351 (124), features: 34, classes: 2
12. Liver (1) - objects: 345 (145), features: 6, classes: 2
13. New-thyroid2 (2) - objects: 215 (37), features: 5, classes: 2
14. p53 Mutants (1) - objects: 16772 (3354), features: 5409, classes: 2
15. Pima (1) - objects: 768 (268), features: 8, classes: 2
16. Spambase (3) - objects: 4601 (1813), features: 57, classes: 2
17. Sonar (1) - objects: 208 (97), features: 60, classes: 2
18. Yeast3 (2) - objects: 1484 (163), features: 8, classes: 2
19. Voting records (1) - objects: 435 (168), features: 16, classes: 2
20. CYP2C19 isoform (4) - objects: 837 (181), features: 242, classes: 2


Due to the lack of one-class benchmarks, we use binary ones. The training set was composed of a part of the objects from the target class (according to cross-validation rules), while the testing set consisted of the remaining objects from the target class and the outliers (to check both the false acceptance and false rejection rates). The majority class was used as the target concept and the minority class as outliers. This scheme is popularly used in works on one-class classifiers [7; 31; 32].
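The sketch below illustrates this evaluation protocol for a binary dataset, with the majority class used as the target concept; the array names, the split ratio and the seed handling are our own choices for the example.

```python
import numpy as np

def one_class_split(X, y, target_label, train_frac=0.5, seed=0):
    """Train only on a part of the target (majority) class; test on the remaining
    target objects plus all outliers, so that both false acceptance and false
    rejection can be measured."""
    rng = np.random.default_rng(seed)
    target_idx = np.flatnonzero(y == target_label)
    outlier_idx = np.flatnonzero(y != target_label)
    rng.shuffle(target_idx)
    n_train = int(train_frac * len(target_idx))
    test_idx = np.concatenate([target_idx[n_train:], outlier_idx])
    return X[target_idx[:n_train]], X[test_idx], y[test_idx]
```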


4.3. Set-up and Statistical Tests

To present the results of a comparison between several classification models on varied benchmark datasets, predictive accuracy is popularly used. However, it assumes that all of the appearing classes have roughly balanced distributions and are equally important in the classification step. For examining one-class classifiers, we have only objects from the target class for training, but objects from both the target and outlier classes for testing. Moreover, we usually have more examples from the target class than from the outlier class at our disposal, as can be seen in Table 3. Therefore, one needs a more appropriate measure that allows one to properly report the trade-off between generalization abilities on the target concept and discriminative power on outliers. Following the methodology from the binary imbalanced domain, we propose to evaluate our one-class ensembles with the usage of the G-mean [4], as to achieve a good performance according to this measure, the accuracy on both classes should be maximized at the same time. In order to present a detailed comparison among a group of machine learning algorithms, one must use statistical tests to prove that the reported differences among classifiers are significant [22]. We use both pairwise and multiple comparison tests. Pairwise tests give us an outlook on the specific performance of methods for a given dataset, while multiple comparisons allow us to gain a global perspective on the performance of the algorithms over all benchmarks. With this, we get full statistical information about the quality of the examined classifiers.
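A compact sketch of the G-mean computation used here is given below: the geometric mean of the accuracy on the target concept and the accuracy on the outliers. The label convention is our own assumption for the example.

```python
import numpy as np

def g_mean(y_true, y_pred, target_label):
    """Geometric mean of the true-acceptance rate on the target class and the
    true-rejection rate on the outlier class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_target = (y_true == target_label)
    acc_target = np.mean(y_pred[is_target] == target_label)    # accuracy on target concept
    acc_outlier = np.mean(y_pred[~is_target] != target_label)  # accuracy on outliers
    return float(np.sqrt(acc_target * acc_outlier))
```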


• We use a 5x2 CV combined F-test [2] for simultaneous training/testing and pairwise statistical analysis (a sketch of this statistic follows the list below). It repeats two-fold cross-validation five times. The combined F-test is conducted by a comparison of all versus all. As a test score, the probability of rejecting the null hypothesis is adopted, i.e. that the classifiers have the same error rates. As an alternative hypothesis, it is conjectured that the tested classifiers have different error rates. A small difference in the error rate implies that the different algorithms construct two similar classifiers with similar error rates; thus, the hypothesis should not be rejected. For a large difference, the classifiers have different error rates and the hypothesis should be rejected.

• To calculate the competence for the OCDCS system, we need to have a validation set VS. Therefore, for each iteration of 5x2 CV, we separate 20% of the training data for validation purposes, having 40% of the objects in the training set TRS, 10% in the validation set VS and 50% in the testing set TS. This procedure additionally ensures that these sets are disjoint: TRS ∩ VS ∩ TS = ∅.

• All reference static ensembles are trained with all of the available training examples, as they do not require a separate validation set.


• Demsar’s critical difference plots based on Nemenyi test are used for visual comparison among tested methods. One must remember that although this is very illustrative Nemenyi test is very conservative and has a low power. Therefore, additional testing is required.

• For assessing the ranks of classifiers over all examined benchmarks, we use the Friedman ranking test [15]. It checks whether the assigned ranks are significantly different from assigning an average rank to each classifier.


• We use the Shaffer post-hoc test [22] to find out which of the tested methods are distinctive in an n x n comparison. The post-hoc procedure is based on a specific value of the significance level α. Additionally, the obtained p-values should be examined in order to check how different two given algorithms are. We fix the significance level α = 0.05 for all comparisons.
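As a reference for the pairwise comparison described above, a compact sketch of the combined 5x2 CV F statistic [2] follows; it assumes the ten fold-wise error-rate differences between two classifiers are already available and only uses scipy to obtain the p-value.

```python
import numpy as np
from scipy.stats import f as f_distribution

def combined_5x2cv_f_test(diffs):
    """Combined F-test of Alpaydin [2]. `diffs` holds the error-rate differences
    between two classifiers: one value per fold of each of the five two-fold
    cross-validation replications (shape 5 x 2)."""
    diffs = np.asarray(diffs, dtype=float).reshape(5, 2)
    fold_means = diffs.mean(axis=1, keepdims=True)
    s2 = np.sum((diffs - fold_means) ** 2, axis=1)     # variance estimate per replication
    f_stat = np.sum(diffs ** 2) / (2.0 * np.sum(s2))
    p_value = 1.0 - f_distribution.cdf(f_stat, 10, 5)  # approximately F(10, 5) under H0
    return float(f_stat), float(p_value)
```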


4.4. Results of the Experimental Analysis

The results are presented in Table 4, Nemenyi's critical difference plot is depicted in Figure 3, and the results of the Shaffer post-hoc test are given in Tables 5-8. Averaged classification times are presented in Table 9.


4.5. Discussion of Results

From the presented results one may clearly see that the proposed dynamic classifier selection system for one-class classification problems returns the most satisfactory performance. The OCDCS method outperformed all other reference methods for 15 out of 20 benchmark datasets. Let us have a closer look at these experimental findings.


4.5.1. Dynamic Classifier Selection versus Static Ensembles

In only two cases (the Hepatitis and Voting records datasets) the OCDCS system was unable to outperform the single best classifier from its pool. This can be explained by a situation in which we have a single dominant model (a strong classifier). In such cases this model outputs the best performance over the entire decision space and applying dynamic selection cannot improve the ensemble performance. Furthermore, by estimating the competence over the entire decision space, we may incorrectly select a different classifier for new data. But this is quite a rare situation in one-class classification. Usually the structure of the target class is too complex to be handled by only one specific type of classifier model.


Figure 3: Demsar’s critical difference plot for the proposed OCDCS systems and reference methods over 20 benchmark datasets.


Table 4: Results of the experimental study with respect to the G-mean [%] and results of the pairwise 5x2 CV combined F-test. Small numbers under the proposed methods stand for the indexes of models to which the considered one is statistically superior.

Single SB1

MV2

Arrythmia

80.12

77.68

Static Classifier Combination MWV3 PWV4 MEP5 PEP6 78.86

80.02

80.89

Biomed

91.07

86.93

90.71

90.95

91.07

Breast-cancer

63.98

59.40

60.36

60.91

62.32

Breast-Wisconsin

83.07

82.35

83.64

83.64

82.66

Colic

73.28

71.65

71.73

71.85

72.93

61.04

59.23

Delft Pump 2x2 noisy

86.13

81.07

Glass0123vs456

81.65

77.68

Heart-statlog

82.19

83.56

Hepatitis

55.63

52.87

75.26

Liver

60.11

p53 Mutants Pima

87.61

Sonar

61.48

2,5

1,2,3,4,5,8

76.01

77.09

1,2,3,4,5,6

1,2,3,4,5,6,7

ALL

60.18

61.98

60.87

62.18

83.98

84.18

53.06

52.87

53.06

53.17

78.77

65.39 89.62

80.79

81.66

81.24

81.66

82.49

63.76

62.41

62.41

62.41

61.27

62.96

79.63

85.27

80.70

83.72

81.09 84.06

81.09 84.16

80.03 83.72

80.78 85.96

74.79

72.62

73.08

73.08

73.27

74.79

Voting records

91.59

90.30

90.72

89.48

87.92

88.63

CE

Yeast3

CYP2C19 isoform

6.00

84.96 7.97

85.91 6.55

AC

Avg. ranks

82.38

85.20 6.70

86.38 5.80

67.41

1,2,5

83.98

88.13

ALL

65.06

74.71

83.56

87.90

2

66.35

73.65

82.78

88.42

1,2,3,4,5,6,8

ALL

82.36

65.39

94.63

84.55

77.92

63.03

91.55

83.80

77.92

63.48

ALL

93.11

1,2,3,4,5,6

84.99

76.92

87.24 4.05



2,3,4,5,7,9

2,3,4,5

89.67

88.35

91.72

1,2,3,4,5,6,8

1,2,3,4,5,6

ALL

84.93

85.42

86.89

1,2,3,4,5,6

1,2,3,4,5,6

ALL

87.82

85.83

87.82

1,2,3,4,5,6,8

1,2,3,4,5,6

1,2,3,4,5,6,8

55.38

53.92

55.38

2,3,4,5,6,8

2,4

2,3,4,5,6,8

79.36

79.91

80.93

1,2,3,4,5

1,2,3,4,5,6

1,2,3,4,5,6,7

64.51

63.17

64.51

1,2,3,4

1

1,2,3,4

87.14

86.48

87.61

1

1

1,7

83.20

86.51

84.33

1,2,3,4,5

ALL

1,2,3,4,5,6,7

64.07

66.15

65.29

2,3,4,5,6

ALL

1,2,3,4,5,6,7

82.24

84.52

83.40

1,2,3,4,5,6

ALL

1,2,3,4,5,6,7

88.58

86.82

85.96

1,2,3,4,5,6,8

1,2,3,4,5

ALL

75.92

77.81

76.48

91.59

90.72

91.59

2,3,4,5,6,8

4,5,6

2,3,4,5,6,

90.35

92.13

92.13

1,2,3,4,5,6

1,2,3,4,5,6,7

1,2,3,4,5,6,7

3.03

2.92

1.97

In four cases (Breast-Wisconsin, Diabetes, Liver and New-thyroid2) the OCDCS ensemble was similar to or slightly worse than the static combination rules. This can be explained by a situation in which the competence measure becomes very sparse (for objects located far from the validation examples). This shows us that there is a need for developing different methods of competence estima-

19

84.04

84.00

84.16

76.48

83.02

1,2,3,4,5,6

1,2,3,4,5,6,8

82.40

76.48

82.34

1,2,3,4,5,6

84.75

78.86

PT

Spambase

83.79

63.48

ED

New-thyroid2

76.18

59.48

62.60

82.74

M

Ionosphere

59.05

90.95

AN US

Diabetes

81.01

Dynamic Classifier Selection OCDCS-MD7 OCDCS-FC8 OCDCS-EM9

CR IP T

Dataset


Table 5: Shaffer test for comparison between the one-class dynamic classifier selection via the minimum difference competence function (OCDCS-MD) and reference static ensemble methods. Symbol '+' stands for a situation in which the proposed OCDCS system is superior.

hypothesis / p-value
OCDCS-MD vs SB: + 0.000011
OCDCS-MD vs MV: + 0.000000
OCDCS-MD vs MWV: + 0.000047
OCDCS-MD vs PWV: + 0.000022
OCDCS-MD vs MEP: + 0.001354
OCDCS-MD vs PEP: + 0.046584

Table 6: Shaffer test for comparison between the one-class dynamic classifier selection via the full competence function (OCDCS-FC) and reference static ensemble methods. Symbol '+' stands for a situation in which the proposed OCDCS system is superior.

hypothesis / p-value
OCDCS-FC vs SB: + 0.000008
OCDCS-FC vs MV: + 0.000000
OCDCS-FC vs MWV: + 0.000028
OCDCS-FC vs PWV: + 0.000013
OCDCS-FC vs MEP: + 0.000901
OCDCS-FC vs PEP: + 0.033931


Table 7: Shaffer test for comparison between the one-class dynamic classifier selection via the entropy measure competence function (OCDCS-EM) and reference static ensemble methods. Symbol '+' stands for a situation in which the proposed OCDCS system is superior.

hypothesis / p-value
OCDCS-EM vs SB: + 0.000003
OCDCS-EM vs MV: + 0.000000
OCDCS-EM vs MWV: + 0.000000
OCDCS-EM vs PWV: + 0.000000
OCDCS-EM vs MEP: + 0.000010
OCDCS-EM vs PEP: + 0.016575


Table 8: Shaffer test for comparison between the different proposed competence measures for one-class dynamic classifier selection. Symbol '+' stands for a situation in which the method on the left is superior, '-' for vice versa, and '=' represents a lack of statistically significant differences.

hypothesis / p-value
OCDCS-MD vs OCDCS-FC: = 0.908073
OCDCS-MD vs OCDCS-EM: - 0.040346
OCDCS-EM vs OCDCS-FC: + 0.045656


Table 9: Comparison of averaged classification times [s] per 50 objects for the single best one-class classifier, static ensembles and the proposed dynamic classifier selection systems.

Dataset: SB / Static / OCDCS-MD / OCDCS-FC / OCDCS-EM
Arrythmia: 1.79 / 2.43 / 3.02 / 3.85 / 4.17
Biomed: 0.35 / 0.64 / 0.93 / 1.27 / 1.61
Breast-cancer: 0.41 / 0.69 / 0.95 / 1.32 / 1.49
Breast-Wisconsin: 0.40 / 0.71 / 0.82 / 0.93 / 1.30
Colic: 0.81 / 1.03 / 1.20 / 1.62 / 2.38
Diabetes: 0.48 / 0.74 / 0.86 / 1.18 / 1.55
Delft Pump 2x2 noisy: 1.37 / 2.04 / 3.13 / 4.36 / 4.73
Glass0123vs456: 0.51 / 1.03 / 1.68 / 2.01 / 2.34
Heart-statlog: 0.62 / 0.77 / 0.90 / 1.19 / 1.38
Hepatitis: 0.80 / 1.02 / 1.27 / 1.68 / 1.99
Ionosphere: 1.17 / 2.56 / 3.28 / 4.03 / 4.75
Liver: 0.38 / 0.63 / 0.73 / 0.85 / 0.98
New-thyroid2: 0.19 / 0.38 / 0.49 / 0.62 / 0.88
p53 Mutants: 7.53 / 16.64 / 19.92 / 23.58 / 26.49
Pima: 0.51 / 0.74 / 1.00 / 1.11 / 1.40
Spambase: 1.06 / 2.36 / 3.28 / 4.27 / 5.03
Sonar: 1.10 / 2.24 / 2.89 / 3.56 / 3.86
Yeast3: 0.41 / 0.68 / 0.81 / 0.89 / 1.28
Voting records: 0.57 / 0.93 / 0.99 / 1.05 / 1.17
CYP2C19 isoform: 1.94 / 3.16 / 3.64 / 4.48 / 4.92

AC

CE

PT

ED

M

tion that can handle such situations. When investigating the different static combination methods, one may see that the mean vote, mean weighted vote and product of weighted votes are always inferior to the dynamic selection methods. These are simple combination rules that display no robustness to the presence of weak classifiers in the pool. That is why their performance can be easily degraded by incompetent classifiers, as they have no means to prevent such models from influencing the final decision. The weighting scheme applied to the combination of votes does not significantly improve the classification accuracy - this shows that weighted voting does not perform well for one-class problems. The only methods that are able to outperform OCDCS are based on support functions. As they use more information from the classifiers in the pool (taking into account their degree of uncertainty), they are able to deliver better combination results than voting methods. This observation gives us another insight into the reasons behind the high effectiveness of dynamic selection. All of the proposed competence measures work directly on the continuous support functions and can process this additional information in order to more efficiently grade the usefulness of different models for a given sample. According to both the Friedman and Shaffer tests, all three proposed OCDCS systems display a statistically significant improvement over the reference methods when considering their performance over multiple datasets. These observations are highly interesting, as they prove that despite OCDCS systems using fewer training objects (as 20% of examples are delegated to forming VS) they are able to deliver improved performance due to precise

ACCEPTED MANUSCRIPT

CR IP T

estimation of the local competencies of the classifiers in the pool. When analyzing Table 9, we can see that DCS systems require more classification time than static ensembles, due to the requirement of calculating competencies for each object to be classified. However, the proposed measures are simple ones and do not impose a high computational cost on the system. Additionally, we assume a static classification scenario in which classification time is not a crucial factor. Nevertheless, such methods can also be used, e.g., for high-speed data streams by using efficient computational architectures like GPU-based processing that will offer a speed-up for calculating the competence measures.

AC

CE

PT

ED

M

AN US

4.5.2. Comparison of Proposed Competence Measures

When comparing the three proposed competence measures, one may clearly see their areas of potential applicability. The Minimal Difference measure can be considered as the simplest and most straightforward one. It was able to outperform the static combiners in most cases, but turned out to be inferior to the two remaining measures. OCDCS-MD gave the best results for only two cases (Heart-statlog and Voting records) - however, in each of those cases it tied with OCDCS-EM. Seeing that there is no clear situation in which MD can be successfully applied, it is recommended to apply the other measures of competence for one-class tasks. The Friedman test places this method in the last position, and the Shaffer test clearly proves that it is inferior to the Entropy Measure. The Full Competence measure returns a more complex performance. At first sight it returns a performance similar to the MD measure (a close rank position in the Friedman test and a rejected hypothesis in the Shaffer test). However, when analyzing its performance in depth one can see a clear correlation between the obtained accuracy and the size of the training set. The OCDCS-FC system outputs the lowest accuracy of all the proposed dynamic ensembles in cases of very small datasets. This can be explained by the fact that with a small number of training examples it cannot properly estimate the crisp differences between support values. Therefore, it obtains too little information to work properly. On the other hand, for larger datasets (p53 Mutants, Pima, Spambase, Yeast3 and CYP2C19 isoform) it outperforms the two remaining competence measures in a statistically significant way. This allows us to conclude that OCDCS-FC, despite its simplicity, is recommended for datasets with a large number of samples, as they allow it to properly estimate the full crisp difference between classifier supports. This also shows us the importance of pairwise statistical tests in the analysis of machine learning experiments. Tests for multiple comparisons (Friedman and Shaffer) pointed out that there are no significant differences between MD and FC, and that FC is inferior to EM. However, with pairwise analysis we were able to extract the specific area of applicability of this method. The Entropy Measure was superior to the two other competence measures in 12 cases. This, combined with excellent results from the Friedman and Shaffer tests, allows us to conclude that for small and standard datasets the EM measure is the most recommended one to apply in the OCDCS system.

22

ACCEPTED MANUSCRIPT

5. Conclusions

AN US

CR IP T

In this paper, we have presented a novel approach for constructing ensembles for one-class classification problems based on dynamic classifier selection. We described a complete dynamic selection system designed for the purpose of learning in the absence of counterexamples. To properly calculate the competence of one-class methods, we introduced three competence measures based on different heuristics. To extend the competence of each classifier from the pool over the entire decision space, we proposed to use a Gaussian potential function. These steps allowed us to create an efficient ensemble system for one-class classification that was able to exploit the local competencies of the classifiers from the pool and delegate them to local decision areas. We concluded with recommendations to use the Entropy Measure of competence for small and standard datasets, and the Full Competence measure for large datasets. We showed that our ensemble works very well with a pool of heterogeneous classifiers, but there are no restrictions that prohibit using OCDCS with homogeneous classifiers. In the future, we plan to examine other methods for estimating competence over the entire decision space (that will be robust to rare objects lying far from validation points) and to extend our framework to a one-class dynamic ensemble selection system and to high-speed non-stationary scenarios.

M

Acknowledgment

This work was supported by the Polish National Science Centre under the grant PRELUDIUM number DEC-2013/09/N/ST6/03504 realized in the years 2014-2016.

References

PT

[1] J. Alcal´ a-Fdez, A. Fern´ andez, J. Luengo, J. Derrac, and S. Garc´ıa. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2011.

CE

[2] E. Alpaydin. Combined 5 x 2 cv f test for comparing supervised classification learning algorithms. Neural Computation, 11(8):1885–1892, 1999.

AC

[3] B. Antosik and M. Kurzyński. New measures of classifier competence - heuristics and application to the design of multiple classifier systems. In R. Burduk, M. Kurzyński, M. Woźniak, and A. Żołnierek, editors, Computer Recognition Systems 4, volume 95 of Advances in Intelligent and Soft Computing. Springer Berlin Heidelberg, 2011.

[4] R. Barandela, R. M. Valdovinos, and J. S. S´ anchez. New applications of ensembles of classifiers. Pattern Anal. Appl., 6(3):245–256, 2003.

23

ACCEPTED MANUSCRIPT

[5] A.S. Britto, R. Sabourin, and L.E.S. Oliveira. Dynamic selection of classifiers - a comprehensive review. Pattern Recognition, 47(11):3665–3680, 2014.

CR IP T

[6] R. Burduk. Classifier fusion with interval-valued weights. Pattern Recognition Letters, 34(14):1623–1629, 2013.

[7] P. Casale, O. Pujol, and P. Radeva. Approximate polytope ensemble for one-class classification. Pattern Recognition, 47(2):854–864, 2014.

[8] P.R. Cavalin, R. Sabourin, and C.Y. Suen. Dynamic selection approaches for multiple classifier systems. Neural Computing and Applications, 22(34):673–688, 2013.

AN US

[9] B. Chen, A. . Feng, S. . Chen, and B. Li. One-cluster clustering based data description. Jisuanji Xuebao/Chinese Journal of Computers, 30(8):1325– 1332, 2007. [10] Y. Chen, X. S. Zhou, and T. S. Huang. One-class svm for learning in image retrieval. In IEEE International Conference on Image Processing, volume 1, pages 34–37, 2001.

M

[11] G. Cohen, H. Sax, and A. Geissbuhler. Novelty detection using one-class parzen density estimator. an application to surveillance of nosocomial infections. In Studies in Health Technology and Informatics, volume 136, pages 21–26, 2008.

ED

[12] L. Cordella, P. Foggia, C. Sansone, F. Tortorella, and M. Vento. A cascaded multiple expert system for verification. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 330–339. Springer Berlin / Heidelberg, 2000.

PT

[13] B. Cyganek. Color image segmentation with support vector machines: Applications to road signs detection. International journal of neural systems, 18(4):339–345, 2008.

[14] Q. Dai. The build of a dynamic classifier selection ICBP system and its application to pattern recognition. Neural Computing and Applications, 19(1):123–137, 2010.

[15] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

[16] C. Désir, S. Bernard, C. Petitjean, and L. Heutte. One class random forests. Pattern Recognition, 46(12):3490–3506, 2013.

[17] V. Di Gesù and G. Lo Bosco. Combining one class fuzzy KNN's. In Applications of Fuzzy Sets Theory, 7th International Workshop on Fuzzy Logic and Applications, WILF 2007, Camogli, Italy, July 7-10, 2007, Proceedings, pages 152–160, 2007.

[18] L. Didaci, G. Giacinto, F. Roli, and G.L. Marcialis. A study on the performances of dynamic classifier selection based on local accuracy estimation. Pattern Recognition, 38(11):2188–2191, 2005.

[19] E.M. Dos Santos, R. Sabourin, and P. Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recognition, 41(10):2993–3009, 2008.

[20] R.P.W. Duin. The combining classifier: to train or not to train? In Proceedings of the 16th International Conference on Pattern Recognition, volume 2, pages 765–770, 2002.

[21] M. Galar, A. Fernández, E. Barrenechea Tartas, H. Bustince Sola, and F. Herrera. Dynamic classifier selection for one-vs-one strategy: Avoiding non-competent classifiers. Pattern Recognition, 46(12):3412–3424, 2013.

[22] S. García, A. Fernández, J. Luengo, and F. Herrera. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci., 180(10):2044–2064, 2010.

[23] G. Giacinto, R. Perdisci, M. Del Rio, and F. Roli. Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Inf. Fusion, 9:69–82, January 2008.

[24] J.B. Gomes, M.M. Gaber, P.A.C. Sousa, and E. Menasalvas. Collaborative data stream mining in ubiquitous environments using dynamic classifier selection. International Journal of Information Technology and Decision Making, 12(6):1287–1308, 2013.

[25] W. Hu, J. Gao, Y. Wang, O. Wu, and S. Maybank. Online AdaBoost-based parameterized methods for dynamic distributed network intrusion detection. IEEE Transactions on Cybernetics, 44(1):66–82, 2014.

[26] A. Hung-Ren Ko, R. Sabourin, and A. de Souza Britto Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5):1718–1731, 2008.

[27] K. Jackowski and M. Woźniak. Algorithm of designing compound recognition system on the basis of combining classifiers with simultaneous splitting feature space into competence areas. Pattern Analysis and Applications, 12(4):415–425, 2009.

[28] P. Juszczak, D. M. J. Tax, E. Pekalska, and R. P. W. Duin. Minimum spanning tree based one-class classifier. Neurocomputing, 72(7-9):1859–1869, 2009.

[29] S. Kang, S. Cho, and P. Kang. Multi-class classification via heterogeneous ensemble of one-class classifiers. Eng. Appl. of AI, 43:35–43, 2015.

[30] B. Krawczyk. One-class classifier ensemble pruning and weighting with firefly algorithm. Neurocomputing, 150:490–500, 2015.

[31] B. Krawczyk and M. Woźniak. Diversity measures for one-class classifier ensembles. Neurocomputing, 126:36–44, 2014.

[32] B. Krawczyk, M. Woźniak, and B. Cyganek. Clustering-based ensembles for one-class classification. Information Sciences, 264:182–195, 2014.

[33] M. Krysmann and M. Kurzyński. Methods of learning classifier competence applied to the dynamic ensemble selection. In Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013, Milkow, Poland, 27-29 May 2013, pages 151–160, 2013.

[34] C. Lin, W. Chen, C. Qiu, Y. Wu, S. Krishnan, and Q. Zou. LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing, 123:424–435, 2014.

[35] R. Lysiak, M. Kurzyński, and T. Woloszynski. Optimal selection of ensemble classifiers using measures of competence and diversity of base classifiers. Neurocomputing, 126:29–35, 2014.

[36] L. Manevitz and M. Yousef. One-class document classification via neural networks. Neurocomputing, 70(7-9):1466–1481, 2007.

[37] G. Rätsch, S. Mika, B. Schölkopf, and K.-R. Müller. Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1184–1199, 2002.

[38] M. Smetek and B. Trawiński. Selection of heterogeneous fuzzy model ensembles using self-adaptive genetic algorithms. New Generation Computing, 29:309–327, 2011.

[39] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

[40] D. M. J. Tax and R. P. W. Duin. Combining one-class classifiers. In Proceedings of the Second International Workshop on Multiple Classifier Systems, MCS ’01, pages 299–308, London, UK, 2001. Springer-Verlag.

[41] D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.

[42] D. M. J. Tax, P. Juszczak, E. Pekalska, and R. P. W. Duin. Outlier detection using ball descriptions with adjustable metric. In Proceedings of the 2006 Joint IAPR International Conference on Structural, Syntactic, and Statistical Pattern Recognition, SSPR'06/SPR'06, pages 587–595, Berlin, Heidelberg, 2006. Springer-Verlag.

[43] D. M. J. Tax and R. P. W. Duin. Characterizing one-class datasets. In Proceedings of the Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, pages 21–26, 2005.

[44] O. Taylor and J. MacIntyre. Adaptive local fusion systems for novelty detection and diagnostics in condition monitoring. In Proceedings of SPIE - The International Society for Optical Engineering, volume 3376, pages 210–218, 1998.

[45] T. Wilk and M. Woźniak. Soft computing methods applied to combination of one-class classifiers. Neurocomputing, 75:185–193, January 2012.

[46] T. Woloszynski and M. Kurzyński. On a new measure of classifier competence applied to the design of multiclassifier systems. In Image Analysis and Processing - ICIAP 2009, 15th International Conference, Vietri sul Mare, Italy, September 8-11, 2009, Proceedings, pages 995–1004, 2009.

[47] M. Woźniak, M. Graña, and E. Corchado. A survey of multiple classifier systems as hybrid systems. Information Fusion, 16(1):3–17, 2014.

[48] M. Woźniak and M. Zmyslony. Designing combining classifier with trained fuser - analytical and experimental evaluation. Neural Network World, 20(7):925–934, 2010.

[49] Y. Zhang, B. Zhang, F. Coenen, J. Xiao, and W. Lu. One-class kernel subspace ensemble for medical image classification. EURASIP Journal on Advances in Signal Processing, 2014(1), 2014.

[50] X. Zhu, X. Wu, and Y. Yang. Effective classification of noisy data streams with attribute-oriented dynamic classifier selection. Knowledge and Information Systems, 9(3):339–363, 2006.

[51] H. Zuo, O. Wu, W. Hu, and B. Xu. Recognition of blue movies by fusion of audio and video. In 2008 IEEE International Conference on Multimedia and Expo, ICME 2008 - Proceedings, pages 37–40, 2008.
