Combining localized fusion and dynamic selection for high-performance SVM




Expert Systems with Applications 42 (2015) 9–20



Jun-Ki Min, Jin-Hyuk Hong, Sung-Bae Cho *
Department of Computer Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Republic of Korea

Article history: Available online 25 July 2014

Abstract

To resolve class-ambiguity in real-world problems, we previously presented two different ensemble approaches with support vector machines (SVMs): multiple decision templates (MuDTs) and dynamic ordering of one-vs.-all SVMs (DO-SVMs). MuDTs is a classifier fusion method that models intra-class variations as subclass templates. DO-SVMs, on the other hand, is an ensemble method that dynamically selects proper SVMs to classify an input sample based on its class probability. In this paper, we newly propose a hybrid scheme of these two approaches to exploit their complementary properties. The localized fusion approach of MuDTs increases the variance of the classification models, while the dynamic selection scheme of DO-SVMs reduces the unbiased-variance, which causes incorrect prediction. We show the complementary properties of MuDTs and DO-SVMs on several benchmark datasets and verify the performance of the proposed method. We also test how much our method can improve on its baseline accuracy by comparing it with other combinatorial ensemble approaches. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Hybrid approach; Classifier fusion; Dynamic selection; Sub-class modeling; Support vector machines

1. Introduction

Expert systems in various domains have employed classifier ensemble approaches to deal with the complex prediction and classification problems of real-world applications (García-Pedrajas & García-Osorio, 2011; Nanni & Lumini, 2009). A classifier ensemble incorporates multiple classifiers to achieve highly reliable and accurate classification performance (Jain, 2000). Fusion and selection are the two main streams in ensembling: the purpose of fusion is to combine diverse classifiers (Brown, Wyatt, Harris, & Yao, 2005; Windeatt, 2004), while that of selection is to choose a competent (locally accurate) classifier for a test sample from the base classifiers (Didaci, Giacinto, Roli, & Marcialis, 2005; Giacinto & Roli, 2001; Woods, Kegelmeyer, & Bowyer, 1997) (see Fig. 1). The selection approach is often further categorized into static or dynamic methods according to whether the selection regions of the sample space are specified during the training phase or the operation phase, respectively (Kuncheva, 2002). In general, a fusion approach can be less adaptive to an incoming sample than a selection method, since fusion uses the same combination of base classifiers for all incoming samples. On the other hand, the selection approach might produce biased results by relying on a chosen classifier. To address those weaknesses, recent studies have introduced combinatorial ensemble approaches that select competent classifiers and combine their decisions (Cavalin, Sabourin, & Suen, 2013; Ruta & Gabrys, 2005; Woloszynski & Kurzynski, 2011).

* Corresponding author. Tel.: +82 2 2123 2720; fax: +82 2 365 2579. E-mail addresses: [email protected] (J.-K. Min), [email protected] (J.-H. Hong), [email protected] (S.-B. Cho).
http://dx.doi.org/10.1016/j.eswa.2014.07.028 0957-4174/© 2014 Elsevier Ltd. All rights reserved.

Ensemble approaches have also been applied to combine multiple binary classifiers such as support vector machines (SVMs) (García-Pedrajas & Ortiz-Boyer, 2011; Qi, Tian, & Shi, 2013; Wang et al., 2009). In our previous work, we presented two different ensemble methods with SVMs: multiple decision templates (MuDTs; Min, Hong, & Cho, 2010) and dynamic ordering of one-vs.-all SVMs (DO-SVMs; Hong, Min, Cho, & Cho, 2008). MuDTs is a classifier fusion method that models the intra-class variations of a given problem by using subclass templates (called localized templates). DO-SVMs, on the other hand, is an ensemble method that dynamically selects proper SVMs for an input sample based on the sample's class probability. In this paper, we further investigate these two methods in terms of bias-variance analysis and present a new hybrid scheme. The proposed method can model intra-class variations while minimizing unbiased-variance errors by dynamically deciding the evaluation order of the localized templates. The contributions of this work are threefold. First, we evaluate the different properties of MuDTs and DO-SVMs by estimating their bias and variance errors. Second, we present their hybrid method and verify it on benchmark datasets, showing that the hybrid can enhance the performance of the individual approaches. Third, we


Fig. 1. Concepts of two ensemble approaches of (a) fusion and (b) selection, and their (c) hybrid.

compare our approach with several combinatorial ensemble approaches, including dynamic classifiers-ensemble selection (Ko, Sabourin, & Britto, 2008; Xiao, He, Jiang, & Liu, 2010), over-produce and choose (Banfield, Hall, Bowyer, & Kegelmeyer, 2005), modified divide-and-conquer (Frosyniotis, Stafylopatis, & Likas, 2003), and switching between selection and fusion (Kuncheva, 2002).

This paper is organized as follows. Section 2 introduces representative approaches of the combinatorial ensemble and compares their properties in terms of how they build an ensemble. The section also gives the background of multiclass classification with SVMs. Section 3 presents the details of the proposed method, including its theoretical assumptions based on bias-variance error reduction and the training/testing processes. Experimental results and discussion are presented in Section 4, and the conclusion is given in Section 5.

2. Background

2.1. Combinatorial ensemble methods

There exist many studies on combining classifier fusion and selection, or on merging these two different concepts of classifier ensemble to maximize classification accuracy (Banfield et al., 2005; Giacinto & Roli, 2001; Ko et al., 2008; Partridge & Yates, 1996; Xiao, Xie, He, & Jiang, 2012; Xiao et al., 2010). In this paper, we refer to this approach as a combinatorial ensemble. One of the early approaches to the combinatorial ensemble is over-produce and choose (also called thinning), which constructs an initially large set of base classifiers and then chooses a diverse subset of the models during the training phase (see Fig. 2c) (Banfield et al., 2005; Giacinto & Roli, 2001; Partridge & Yates, 1996). Giacinto and Roli proposed a thinning scheme that groups similar classifiers (those that make similar errors) together and builds an ensemble by selecting the most accurate classifier from each group (Giacinto & Roli, 2001). Banfield et al.
suggested a concurrency thinning method that takes into account the performance of the ensemble as well as the accuracy of each classifier (Banfield et al., 2005). Contrary to the thinning approach, a more recent strategy called dynamic classifier-ensemble selection (DCES) selects a set of base classifiers dynamically for an incoming pattern (see Fig. 2d). The basic idea of DCES is to combine the most "locally" accurate classifiers, where the accuracy is estimated on the training space surrounding a test sample. Santos et al. suggested a two-stage method that builds candidate ensembles by using a genetic algorithm and then uses the one with the highest confidence for a test pattern (Santos, Sabourin, & Maupin, 2008). Ko et al. presented a K-nearest-oracles (KNORA) scheme, which combines the classifiers that correctly predicted the neighbor examples (training samples) of a test

pattern (Ko et al., 2008). Xiao et al. employed a group method of data handling (GMDH) neural network to combine base classifiers, considering both accuracy and diversity in the process of ensemble selection (Xiao et al., 2010). Some approaches modify conventional ensemble schemes to increase their accuracy. For example, the modified divide-and-conquer method (Frosyniotis et al., 2003) clusters a sample space into overlapping sub-regions and combines their local-expert classifiers based on the test sample's fuzzy membership values for the sub-regions (see Fig. 2e). Kuncheva et al. presented two hybrid schemes: switching between selection and fusion (Kuncheva, 2002) and random linear oracles (Kuncheva & Rodriguez, 2007). The former selects the best classifier for the training region around a test sample; if there is no dominant classifier for the region, the method combines all the models instead (see Fig. 2f). The latter divides a sample space into a set of paired regions by using random hyperplanes and trains a base classifier for each region, where the hyperplanes select the classifiers to be used for a test sample (see Fig. 2g). The combinatorial ensemble methods can be characterized by the way they build base classifiers and the way they select classifiers (see Table 1). In the building part, a divide-and-conquer approach can generate diverse and less error-correlated classifiers (local experts) by training each of them on a sub-region of a sample space. A local expert, however, can be biased because of the fewer training samples in the corresponding sub-region (Ko et al., 2008). Base classifiers built on the entire training space (let us refer to them as general models) are able to cope with the bias issue, but an ensemble often needs to combine dozens of such general classifiers to guarantee an improvement in accuracy, even with efforts to create diversity among them such as bagging (Opitz & Maclin, 1999).
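The KNORA idea described above (keep only the classifiers that were correct on a test sample's nearest validation neighbors, then vote) can be sketched as follows. This is our own minimal illustration of the KNORA-Eliminate variant, not code from any of the cited papers; the toy classifiers and data are hypothetical.

```python
from collections import Counter

def knora_eliminate(x, validation, classifiers, k=3):
    """Sketch of KNORA-Eliminate (Ko et al., 2008): select the classifiers
    that correctly classify all k nearest validation samples of x, then
    combine the selected classifiers by majority vote.
    `validation` is a list of (features, label) pairs; `classifiers` is a
    list of functions mapping features -> predicted label."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # k nearest validation samples by squared Euclidean distance
    neighbors = sorted(validation, key=lambda s: dist(s[0], x))[:k]
    # keep the "oracles": classifiers correct on every neighbor
    oracles = [c for c in classifiers
               if all(c(f) == y for f, y in neighbors)]
    # fall back to the full pool if no classifier qualifies
    pool = oracles or classifiers
    votes = Counter(c(x) for c in pool)
    return votes.most_common(1)[0][0]

# toy 1-D example: two threshold "classifiers"
clfs = [lambda f: int(f[0] > 0.5), lambda f: int(f[0] > 0.9)]
val = [([0.1], 0), ([0.6], 1), ([0.7], 1), ([0.95], 1)]
print(knora_eliminate([0.65], val, clfs, k=2))  # → 1
```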
For the selection part, a dynamic selection scheme is more adaptive to a test sample since it chooses the classifiers based on the incoming pattern. As a combinatorial approach of classifier fusion and selection, we present a hybrid method of localized fusion and dynamic selection. Instead of training local experts with a subset of the training data, we first build general models (SVMs) on the entire training space and then build localized templates by clustering the outputs of the general models. Our approach also adaptively decides the evaluation order of the localized templates for the incoming sample (see Fig. 2h and Table 1). 2.2. Multiclass classification with SVMs Although SVM has emerged as a popular technique in many pattern-recognition problems, we cannot directly apply it to multiclass classification problems since it is originally designed


Fig. 2. Structures of different ensemble approaches: (a) fusion, (b) selection, (c) over-produce and choose (Banfield et al., 2005; Giacinto & Roli, 2001; Partridge & Yates, 1996), (d) dynamic classifiers-ensemble selection (Ko et al., 2008; Santos et al., 2008; Xiao et al., 2010), (e) modified divide-and-conquer (Frosyniotis et al., 2003), (f) switching between selection and fusion (Kuncheva, 2002), (g) random linear oracles (Kuncheva & Rodriguez, 2007), and (h) the proposed method (C, R, and T denote a classifier, sample space, and template, respectively).

Table 1. Comparison of the properties of hybrid methods.

| Author (year) | Approach | Base models^a | Selection method |
|---|---|---|---|
| Giacinto and Roli (2001), Banfield et al. (2005) | Over-produce and choose | General classifiers | N/A |
| Ko et al. (2008), Xiao et al. (2010) | Dynamic classifiers-ensemble selection | Local-expert classifiers | Dynamic selection |
| Frosyniotis et al. (2003) | Modified divide-and-conquer | General classifiers | Fuzzy membership |
| Kuncheva (2002) | Switching between selection and fusion | Local-expert classifiers | Static selection |
| Kuncheva and Rodriguez (2007) | Random linear oracles | Localized templates | Random hyperplane |
| Proposed method | Hybrid of MuDTs (Min et al., 2010) and DO-SVMs (Hong et al., 2008) | General classifiers | Dynamic ordering |

^a In base models, a general classifier refers to a model built on the entire sample space, while a local-expert denotes a model trained on a sub-region of the sample space.

for binary classification (Cortes & Vapnik, 1995). Alternatively, a decomposition strategy is used to resolve a multiclass problem as a set of binary problems. There are several popular decomposition strategies. One-versus-all (OVA; Rifkin & Klautau, 2004) trains M binary models, each of which classifies samples of the corresponding class against the remaining classes. Pair-wise (PW, or one-versus-one; Kreßel, 1999) constructs M(M − 1)/2 SVMs by pairing all classes two by two, each of which discriminates two specific classes. Complete-code (COM; Rifkin & Klautau, 2004) takes into account all possible binary combinations of classes to support a better error-correcting property than the others, despite a high computational cost.
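As a rough illustration of these decomposition strategies (our own sketch, not code from the paper), the following builds the OVA and pairwise coding matrices in a rows-index-classes, columns-index-SVMs convention:

```python
from itertools import combinations

def ova_code_matrix(M):
    """One-vs.-all: M binary problems; class m is the positive class (+1)
    of the mth SVM and negative (-1) for all the others."""
    return [[1 if m == l else -1 for l in range(M)] for m in range(M)]

def pw_code_matrix(M):
    """Pairwise (one-vs.-one): M(M-1)/2 binary problems; for the SVM on
    classes (a, b), a is positive (+1), b negative (-1), the rest unused (0)."""
    cols = []
    for a, b in combinations(range(M), 2):
        col = [0] * M
        col[a], col[b] = 1, -1
        cols.append(col)
    return [list(row) for row in zip(*cols)]  # rows = classes, columns = SVMs

print(ova_code_matrix(3))          # 3 SVMs for a 3-class problem
print(len(pw_code_matrix(4)[0]))   # 4*(4-1)/2 = 6 pairwise SVMs
```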

After building multiple SVMs based on a decomposition strategy, their multiple outputs have to be combined to produce a final result. Winner-takes-all (WTA; Min et al., 2010), error-correcting output codes (ECOC; Rifkin & Klautau, 2004), and directed acyclic graph-SVM (DAGSVM; Platt, Cristianini, & Shawe-Taylor, 2000) are common methods for this. Based on the outputs of the SVMs, WTA categorizes a sample into the class corresponding to the SVM producing the highest value. ECOC constructs a coding matrix E ∈ {−1, 0, 1}^{M×L} for an M-class problem with L SVMs, where the (m, l)th entry is +1, 0, or −1 if class m is regarded as a positive class, a neutral class (not used), or a negative class for the lth SVM, respectively. With this coding matrix, ECOC classifies a sample into


the class whose codeword best corresponds to the outputs of the SVMs. DAGSVM is a rooted binary directed acyclic graph with M leaves, where the branches are PW SVMs and the leaves refer to the M classes. Starting from the root node, it moves either left or right depending on the result of the corresponding PW SVM until it reaches one of the leaves, which indicates the predicted class. Besides these indirect approaches, there are direct methods that reformulate the SVM as a multiclass classifier; exploiting common binary SVMs, however, has two practical advantages (Hashemi, Yang, Mirzamomen, & Kangavari, 2009; Rifkin & Klautau, 2004). First, with the indirect approach it is often easier to include a new class in an existing multiclass system without rebuilding all the classifiers. Second, each SVM is only required to model its own training set, resulting in low error-correlation with (high variation from) the others, and thus their combination can lead to high classification accuracy.
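A minimal sketch of ECOC decoding under the coding-matrix convention above. The margin-based loss used here is one common decoding variant chosen for illustration; the text does not fix a particular decoding rule, and the function name is ours.

```python
def ecoc_decode(svm_outputs, E):
    """Pick the class row of coding matrix E whose codeword best agrees
    with the real-valued SVM outputs; zero entries (unused SVMs) are
    ignored. A hinge-like loss penalizes outputs that disagree with the
    code bit; Hamming decoding on the output signs is another option."""
    best, best_loss = None, float("inf")
    for m, row in enumerate(E):
        loss = sum(max(0.0, 1.0 - e * f)
                   for e, f in zip(row, svm_outputs) if e != 0)
        if loss < best_loss:
            best, best_loss = m, loss
    return best

# 3-class OVA coding matrix and one sample's SVM outputs
E = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
print(ecoc_decode([0.8, -0.3, -0.9], E))  # → 0 (class 0 matches best)
```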

3. Dynamic selection of localized templates

The proposed method combines the localized fusion approach of MuDTs (Min et al., 2010) with the dynamic selection scheme of DO-SVMs (Hong et al., 2008) (see Fig. 3). In the training phase, the method builds base classifiers (SVMs) and then generates multiple localized templates for each class by clustering the outputs of the SVMs. Here, our method uses the entire data to build the base classifiers in order to keep their generality and prevent overfitting to outlier clusters. The method also trains an NB classifier for the dynamic selection module. In the test phase, the method (1) estimates the class probabilities of a test sample by using the NB classifier, (2) produces the output matrix of the SVMs for the sample, and (3) matches the matrix (called a decision profile, DP; Kuncheva, Bezdek, & Duin, 2001) against the localized templates in decreasing order of the class probabilities.

3.1. Expected loss for the 0/1 loss function

In his seminal work, Domingos (2000) identified two types of variance in a machine-learning model: unbiased-variance V_u, which increases a model's classification errors by deviating from the correct prediction, and biased-variance V_b, which decreases the errors by deviating from the incorrect prediction. Let Y be the class most frequently predicted for an input x by multiple models trained with L training sets {D_1, \ldots, D_L}, and let y_i be the prediction of the model trained with the ith training set D_i. With the true label p, V_u and V_b are calculated as

V_u(x) = \frac{1}{L} \sum_{i=1}^{L} \| (Y = p) \text{ and } (Y \neq y_i) \|    (1)

and

V_b(x) = \frac{1}{L} \sum_{i=1}^{L} \| (Y \neq p) \text{ and } (Y \neq y_i) \|,    (2)

respectively, where the bias for x, B(x), is

B(x) = \begin{cases} 1 & \text{if } Y \neq p \\ 0 & \text{if } Y = p \end{cases}.    (3)

With Domingos' definition, the average loss for an input x in the noise-free case is a simple algebraic sum of B, V_u, and V_b:

L_D(x) = B(x) + V_u(x) - V_b(x).    (4)
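Equations (1)-(4) can be checked numerically. The sketch below (our own illustration, with hypothetical predictions) computes B, V_u, and V_b for one sample from a list of model predictions and verifies that the average 0/1 loss decomposes exactly:

```python
from collections import Counter

def domingos_decomposition(predictions, true_label):
    """Domingos' (2000) 0/1-loss decomposition for a single sample.
    `predictions` holds the labels predicted by L models trained on
    different training sets. Returns (B, Vu, Vb) such that, in the
    noise-free case, average loss = B + Vu - Vb (Eq. 4)."""
    L = len(predictions)
    Y = Counter(predictions).most_common(1)[0][0]  # main prediction
    B = 1 if Y != true_label else 0                # bias, Eq. (3)
    Vu = sum(Y == true_label and Y != y for y in predictions) / L  # Eq. (1)
    Vb = sum(Y != true_label and Y != y for y in predictions) / L  # Eq. (2)
    return B, Vu, Vb

preds = ['a', 'a', 'b', 'a', 'c']   # main prediction Y = 'a'
B, Vu, Vb = domingos_decomposition(preds, 'a')
avg_loss = sum(y != 'a' for y in preds) / len(preds)
print(B + Vu - Vb == avg_loss)  # → True: the loss decomposes exactly
```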

In this section, we estimate the expected errors of our model in terms of V_u and V_b based on the 0/1 loss function (Valentini & Dietterich, 2004). Let t_c be a template (or a model) for class c and R_c its corresponding region in the output space of the OVA SVMs, where a test sample x in R_c will be classified as c (i.e., t_c is the template most similar to the samples in R_c). For n test samples in R_c, the expected V_b of t_c, E[V_b(x|R_c)], is calculated as follows:

E[V_b(x|R_c)] = \frac{1}{n} \sum_{j=1}^{n} V_b(x_j|R_c).    (5)

Assume that x_b is a biased sample predicted as c by t_c, whose true label is class p. We could correctly predict the sample if we build K localized templates (sub-class templates) for classes c and p such that

\min_{e=1,\ldots,K} |t_{c,e} - x_b| > \min_{f=1,\ldots,K} |t_{p,f} - x_b|,    (6)

where t_{c,e} is the eth localized template for class c, and |t − x| is the matching distance between the template t and the sample x (the

Fig. 3. Overview of the proposed method.


more similar t and x are, the smaller the distance between them). A model that utilizes localized templates satisfying (6) can have a larger V_b than a model that uses a single template per class:

V_b(x_j|R_{c,e}) \geq V_b(x_j|R_c), \quad \bigcup_{e=1,\ldots,k} R_{c,e} = R_c \quad \text{and} \quad R_{c,e} \cap R_{c,f} = \emptyset \ (e \neq f),    (7)

where R_{c,e} is the eth sub-region of R_c. The expected V_b values are then calculated as

E_K[V_b(x|R_c)] = \frac{1}{nk} \sum_{e=1}^{k} \sum_{j=1}^{n} V_b(x_j|R_{c,e}) \geq E[V_b(x|R_c)].    (8)

The localized templates, however, also have a large V_u because of the outliers in the clustering result (even though we build the localized templates from generally trained models). To decrease V_u, we adopt the dynamic selection scheme of DO-SVMs, which uses the class probability of a test sample. Let t be the template nearest to x and t_c the nearest template to x among those of class c. Instead of always using the nearest template, we select t_m such that

V_u(x|t_m) < V_u(x|t_c).    (9)

For this, the proposed method uses naïve Bayes classifiers.
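The class-probability estimate used for this ordering (Eq. (13) in Section 3.3) can be sketched with smoothed frequency counts over discretized features. This is our own minimal illustration, not the authors' implementation; the toy data are hypothetical.

```python
from collections import Counter

def nb_posteriors(features, train):
    """Naive Bayes posterior sketch: P(c|F) ∝ P(c) · Π_i P(f_i|c),
    estimated from discretized training pairs (feature_tuple, label)
    with add-one (Laplace) smoothing, then normalized over classes."""
    class_counts = Counter(y for _, y in train)
    n = len(train)
    scores = {}
    for c, nc in class_counts.items():
        score = nc / n  # prior P(c)
        for i, f in enumerate(features):
            match = sum(1 for x, y in train if y == c and x[i] == f)
            vals = len({x[i] for x, _ in train})  # smoothing denominator
            score *= (match + 1) / (nc + vals)    # smoothed P(f_i|c)
        scores[c] = score
    z = sum(scores.values())  # the P(F) normalizer is constant across classes
    return {c: s / z for c, s in scores.items()}

train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 1), 'b'), ((1, 0), 'b')]
post = nb_posteriors((0, 0), train)
print(max(post, key=post.get))  # → 'a'
```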

3.2. Localized fusion approach of MuDTs

MuDTs (Min et al., 2010) is a fusion method that classifies a sample by comparing the outputs of the base classifiers with a number of localized templates. The templates are produced from the training outputs of the base classifiers (in this study, we use OVA SVMs as the base classifiers). Let d_j(x) be the output of the jth SVM for a sample x. In an OVA scheme with an M-class problem, a decision profile (DP) for x is represented as

DP(x) = \begin{bmatrix} d_1(x) \\ \vdots \\ d_M(x) \end{bmatrix}.    (10)

After constructing the DPs for the training samples, each class is divided into several distinctive subclasses by applying the k-means clustering algorithm (Jain & Dubes, 1988) to the DPs. Let e_{c,i} be the central point of the ith subclass S_{c,i}, where c denotes a class index and e_{c,i}(m) is the mth element of e_{c,i}. For each class, the algorithm randomly picks k DPs as the initial centers of the k subclasses {e_{c,1}, \ldots, e_{c,k}} and assigns each DP to one of the subclasses according to the minimum-squared-distance criterion:

S_{c,i} = \{ DP(x) : \| DP(x) - e_{c,i} \| \leq \| DP(x) - e_{c,j} \| \text{ for all } j = 1, \ldots, k \text{ and } x \in c \},
\text{where } \| DP(x) - e_{c,i} \| = \sqrt{ \sum_{m=1}^{M} ( d_m(x) - e_{c,i}(m) )^2 }.    (11)

Each central point is then updated as the average of the DPs of its new subclass:

e_{c,i} = \frac{1}{|S_{c,i}|} \sum_{DP(x) \in S_{c,i}} DP(x).    (12)

This partitioning and updating process is repeated until no more change in the centers is found. Finally, we obtain the subclasses and their centers e_{c=1..M, i=1..k}, which serve as the localized decision templates t_{c=1..M, i=1..k}.

3.3. Dynamic selection scheme of DO-SVMs

Since a number of local models (the localized templates) are produced, they have to be integrated. For this, we build a naïve Bayes classifier for our dynamic selection process. The NB calculates the posterior probability of each class for an input sample x with feature vector F = {f_1, \ldots, f_v}. According to the independence assumption of the NB, the probability of class c, P(c|F), is calculated as

P(c|F) = \frac{P(c) P(F|c)}{P(F)} = \frac{P(c) P(f_1|c) P(f_2|c) \cdots P(f_v|c)}{P(F)} = \frac{P(c)}{P(F)} \prod_{i=1}^{v} P(f_i|c) \propto P(c) \prod_{i=1}^{v} P(f_i|c).    (13)

Note that P(F) does not change across classes. The dynamic selection process compares the DP of the input sample with the templates of the classes in decreasing order of probability. The minimum distance between the DP and the templates T_c = {t_{c,1}, \ldots, t_{c,k}} of class c is calculated as

dst(T_c, DP(x)) = \min_{i=1,\ldots,k} \big( dst(t_{c,i}, DP(x)) \big) = \min_{i=1,\ldots,k} \left( \sqrt{ \sum_{m=1}^{M} ( t_{c,i}(m) - d_m(x) )^2 } \right),    (14)

where t_{c,i}(m) denotes the mth element of t_{c,i}. The input sample x is classified as c if dst(T_c, DP(x)) is smaller than a threshold; otherwise, the comparison is repeated with the templates of the class with the next-highest probability. When no templates match the input sample, the proposed method classifies it as the class with the highest NB probability (Hong et al., 2008). Fig. 4 shows the pseudo code of our method, where the thresholds and the number of clusters are systematically determined by the accuracy on the training samples.

4. Experiments

To verify the proposed method, we first evaluated our original methods, MuDTs and DO-SVMs, highlighting their characteristics, and analyzed how their integration improved the overall classification performance. Experiments were conducted with ten-fold cross-validation, where the learning set and validation set were further divided for each training fold to find optimal parameters. We also compared our method with other combinatorial ensemble methods by measuring the difference between the ensemble accuracy and the baseline accuracy (the accuracy of a base classifier).

4.1. Datasets

In order to validate our hybrid scheme and its components on different types of features, eight multiclass datasets were used, as shown in Table 2. The feature types include image, characteristics of elements, bioinformatics, and time-series features. The FingerCode (FC) dataset is a problem of classifying fingerprint images into five types (arch, tented arch, left loop, right loop, and whorl) based on features extracted from the NIST4 database (Jain, Prabhakar, & Hong, 1999). Since it is often hard to define fingerprints exactly with these five classes, the ambiguity causes classification errors; e.g., arch and tented arch are similar to each other, while left loop, right loop, and whorl can be considered a sub-group. As shown in Fig. 5, many arch and tented-arch samples are highly confusable; sometimes multiple labels are even assigned to a single fingerprint (Jain et al., 1999). As another image dataset, we used the Segmentation (SEG)


Fig. 4. Pseudo code of the proposed method.
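Fig. 4 itself is not reproduced here, but the training/testing loop it describes can be roughly sketched as below, under our own simplifications: plain k-means over per-class decision profiles (Eqs. (11) and (12)) and threshold-based dynamic ordering by class probability (Eq. (14)). All function names and the toy data are ours, not the authors'.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between a decision profile and a template (Eq. 14)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_templates(dps, k, iters=20, seed=0):
    """Localized templates for one class: k-means over the class's
    decision profiles (Eqs. 11-12); returns the k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(dps, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in dps:
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def classify(dp, templates, class_probs, threshold):
    """Dynamic ordering: visit classes by decreasing NB probability and
    accept the first class whose nearest localized template lies within
    `threshold` (Eq. 14); otherwise fall back to the most probable class."""
    order = sorted(class_probs, key=class_probs.get, reverse=True)
    for c in order:
        if min(dist(t, dp) for t in templates[c]) < threshold:
            return c
    return order[0]

# toy 2-class example with 2-output decision profiles
templates = {
    0: make_templates([[0.9, -1.0], [1.1, -0.8], [0.8, -1.1], [1.0, -0.9]], k=2),
    1: make_templates([[-1.0, 0.9], [-0.9, 1.1], [-1.1, 0.8], [-0.8, 1.0]], k=2),
}
print(classify([0.95, -0.95], templates, {0: 0.4, 1: 0.6}, threshold=0.5))  # → 0
```

Even though class 1 has the higher NB probability here, no class-1 template is within the threshold, so the search falls through to class 0, mirroring the rejection step of the pseudo code.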

Table 2. Summary of datasets used in the experiment.

| Feature type | Data set | #Sample | #Feature | #Class | SVM parameter |
|---|---|---|---|---|---|
| Image | FingerCode (FC) | 3937 | 192 | 5 | Rbf, g = 0.06 |
| Image | Segmentation (SEG) | 2310 | 19 | 7 | Rbf, g = 0.81 |
| Characteristics of elements | Iris | 150 | 4 | 3 | Rbf, g = 0.81 |
| Characteristics of elements | Wine | 178 | 13 | 3 | Linear, c = 2 |
| Bioinformatics | Brain cancer1 (BC1) | 90 | 5920 | 5 | Linear, c = 2 |
| Bioinformatics | Brain cancer2 (BC2) | 50 | 10,367 | 4 | Linear, c = 2 |
| Time-series | 1-Accelerometer-gesture (G1) | 1000 | 60 | 20 | Rbf, g = 0.06 |
| Time-series | 5-Accelerometers-gesture (G5) | 120 | 300 | 24 | Rbf, g = 0.06 |

dataset from the UCI machine-learning repository (Merz & Murphey, 2003), which consists of seven classes of hand-segmented outdoor images: brick-face, sky, foliage, cement, window, path, and grass. The Iris and Wine datasets (also from the UCI machine-learning repository; Merz & Murphey, 2003) consist of a small number of features that describe the characteristics of a sample's elements, such as the length and width of an iris' sepal, or the alcohol and color intensity of a wine. The two brain-cancer datasets (BC1 and BC2, from http://www.gems-system.org/) have very few samples, as shown in Fig. 5, so it is difficult to train a classification model for them because of the ambiguity between in-class samples and outliers. In addition, the large number of DNA-microarray features may include much noise. As the last type of data, two time-series gesture datasets were collected in this study by using 3D accelerometers (XSens XBus Kit, http://www.xsens.com). In the first gesture dataset (G1), 20 one-hand gestures, such as drawing a circle or an arrow, were captured with one 3D accelerometer on the user's dominant hand. The second gesture dataset (G5) includes 24 types of upper-body gestures, such as greet and congratulate, taken from the MS agent genie's behavior set (http://msagentring.org/), where five accelerometers were attached to the head, both upper arms, and both

wrists of the user. Since the time-series features have different lengths, we transformed them into fixed-length vectors. In this paper, all features were normalized to the range −1.0 to 1.0 for the training of the OVA SVMs (the parameters shown in Table 2 were systematically determined by the accuracy on the training samples); real-valued features were linearly discretized for the NB, except for the FC, BC1, and BC2 datasets, where the same feature-extraction techniques as in our previous studies were used (Hong & Cho, 2008; Hong et al., 2008). All experiments were conducted with ten-fold cross-validation, and the averages are reported in the experimental results.

4.2. Experimental results

We tested five methods on the eight classification problems: two classifiers, OVA SVMs (combined by a winner-takes-all strategy) and NB, which are used as base classifiers in the proposed approach; two ensemble methods, MuDTs and DO-SVMs, as the localized fusion and dynamic selection components of the proposed method, respectively; and the proposed method (the hybrid of MuDTs and DO-SVMs). Table 3 shows the averaged accuracy, where the results of the base classifiers, OVA SVMs and NB, provide the baseline accuracy for


Fig. 5. Sample distributions of the confusing classes on eight datasets. Axes denote the values of two most discriminative features selected by using Pearson correlation (FC: Feature #140 and #183, Seg: #13 and #12, Iris: #1 and #2, Wine: #12 and #3, BC1: #2609 and #4821, BC2: #9901 and #4976, G1: #35 and #43, and G5: #250 and #246).

the ensemble approaches. Both MuDTs and DO-SVMs produced higher classification accuracy than the baselines. For example, on the FC dataset, OVA SVMs and NB produced accuracies of 90.9% and 84.6%, while MuDTs and DO-SVMs yielded 92.4% and 91.9%, respectively (more comparisons of the ensemble methods with other conventional approaches are given in Hong et al. (2008) and Min et al. (2010)). The hybrid showed much better accuracy than the baselines (8.4% higher than OVA SVMs on the G5 dataset and 14.4% higher than NB on the BC1 dataset, in the best cases) and also achieved higher performance than the individual ensembles over most of the datasets (1.7% and 6.7% higher accuracy than MuDTs and DO-SVMs, respectively, on the BC1 dataset, in the best cases). Table 4 shows the error rate for each case of misclassification by the fusion component, by the selection component, and by both of them. On the FC dataset, for example, the proposed method resolved 2.3% and 10.3% of the errors caused by the fusion and selection components, respectively. The proposed method also addressed some errors of both components (as shown in the results for the FC, SEG, and BC1 datasets) when it produced the minimum matching distance to the wrong class with a lower probability and a larger distance to the wrong class with the highest probability. Table 5 shows examples correctly classified by the proposed method but not by its individual components. Since both the OVA SVMs and the NB incorrectly predicted the first example, DO-SVMs could hardly classify it. The proposed method, however, correctly classified this case based on the small matching distance between

Table 4. Error rate (%) for each case of misclassification: incorrectly predicted by the fusion component (O, X); incorrectly predicted by the selection component (X, O); and misclassified by both of them (X, X).

| Dataset | Case | NB, MuDTs | Proposed |
|---|---|---|---|
| FC | O, X | 4.4 | 2.1 |
| FC | X, O | 11.7 | 1.4 |
| FC | X, X | 3.7 | 3.5 |
| SEG | O, X | 0.7 | 0.4 |
| SEG | X, O | 9.1 | 0.8 |
| SEG | X, X | 1.9 | 1.7 |
| Iris | O, X | 0.7 | 0.0 |
| Iris | X, O | 4.7 | 0.0 |
| Iris | X, X | 2.7 | 2.7 |
| Wine | O, X | 0.0 | 0.0 |
| Wine | X, O | 1.1 | 0.0 |
| Wine | X, X | 1.1 | 1.1 |
| BC1 | O, X | 2.2 | 0.0 |
| BC1 | X, O | 13.3 | 0.0 |
| BC1 | X, X | 7.8 | 6.7 |
| BC2 | O, X | 4.0 | 0.0 |
| BC2 | X, O | 10.0 | 0.0 |
| BC2 | X, X | 14.0 | 14.0 |
| G1 | O, X | 0.0 | 0.0 |
| G1 | X, O | 0.6 | 0.0 |
| G1 | X, X | 0.2 | 0.2 |
| G5 | O, X | 6.7 | 0.0 |
| G5 | X, O | 2.5 | 0.0 |
| G5 | X, X | 3.3 | 3.3 |

the sample and the true-class template of the MuDTs component. Note that the smaller the matching distance, the more similar the sample and the template are to each other. The second example in Table 5 shows the opposite case, where MuDTs inevitably misclassified the sample, yet its confusion among classes 1, 3, and 5 was resolved by the selection result within the proposed scheme. In our study, DT did not obtain any significant improvement over OVA SVMs, although it showed good performance in Kuncheva's study (Kuncheva et al., 2001). Fig. 6 presents two examples

Table 3. Accuracy (%) and standard deviation of the methods over the different types of datasets.

| Dataset | OVA SVM | NB | DT | MuDTs | DO-SVMs | Hybrid (Proposed) |
|---|---|---|---|---|---|---|
| FC | 90.9 ± 1.8 | 84.6 ± 1.5 | 90.4 ± 1.6 | 92.4 ± 1.5 | 91.9 ± 1.8 | 93.0 ± 1.4 |
| SEG | 96.2 ± 1.3 | 89.1 ± 2.7 | 95.7 ± 1.6 | 96.5 ± 1.4 | 97.4 ± 1.0 | 97.1 ± 1.2 |
| Iris | 96.0 ± 5.6 | 92.7 ± 4.9 | 95.3 ± 5.5 | 96.7 ± 4.7 | 96.7 ± 4.7 | 97.3 ± 3.4 |
| Wine | 98.3 ± 2.7 | 97.8 ± 2.9 | 98.3 ± 2.7 | 98.9 ± 2.3 | 98.9 ± 2.3 | 98.9 ± 2.3 |
| BC1 | 90.0 ± 8.2 | 78.9 ± 16.1 | 90.0 ± 8.2 | 92.2 ± 9.1 | 90.0 ± 8.2 | 93.3 ± 7.8 |
| BC2 | 78.0 ± 22.0 | 76.0 ± 20.7 | 80.0 ± 18.9 | 86.0 ± 19.0 | 82.0 ± 17.5 | 86.0 ± 19.0 |
| G1 | 99.3 ± 0.8 | 99.2 ± 0.8 | 99.3 ± 0.7 | 99.5 ± 0.7 | 99.8 ± 0.4 | 99.8 ± 0.4 |
| G5 | 88.3 ± 11.9 | 94.2 ± 10.4 | 90.0 ± 7.7 | 95.0 ± 10.5 | 90.0 ± 7.7 | 96.7 ± 5.8 |
| Avg. | 92.13 ± 7.0 | 89.06 ± 8.6 | 92.38 ± 6.2 | 94.65 ± 4.4 | 93.34 ± 6.0 | 95.26 ± 4.4 |


Table 5
Examples of the error resolution by integrating two components, where the prediction of each component is marked in boldface in the original. The hybrid scheme rejected the selection result for (a) and accepted it for (b) based on the matching distance between the sample and templates.

(a) SEG, the 14th sample in the 1st fold (true class is C4)
Class label                                    1      2      3      4      5–7
Class probability by NB                        0.80   0.00   0.01   0.00   ...
Matching distance to the localized templates   2.32   2.39   2.00   1.04   ...

(b) BC1, the 8th sample in the 1st fold (true class is C5)
Class label                                    1      2      3      4      5
Class probability by NB                        0.00   0.00   0.00   0.01   0.99
Matching distance to the localized templates   1.82   2.15   1.83   2.52   1.99
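The matching distances above come from comparing a sample's decision profile (the vector of OVA SVM outputs) with each class's localized templates, taking the minimum distance over a class's sub-class templates. A minimal sketch with hypothetical profile and template values (Euclidean matching, in the spirit of decision-template fusion; the paper's exact distance measure and clustering are not reproduced here):

```python
import numpy as np

def matching_distance(profile, templates):
    # Distance of a decision profile to a class: the minimum Euclidean
    # distance over that class's localized (sub-class) templates.
    return min(np.linalg.norm(profile - t) for t in templates)

def classify(profile, class_templates):
    # Predict the class whose nearest localized template is closest.
    dists = {c: matching_distance(profile, ts)
             for c, ts in class_templates.items()}
    return min(dists, key=dists.get), dists

# Hypothetical 3-class problem: decision profiles are the three OVA SVM
# outputs; class 2 keeps two sub-class templates, the others one each.
class_templates = {
    1: [np.array([0.9, 0.1, 0.2])],
    2: [np.array([0.1, 0.8, 0.3]), np.array([0.2, 0.9, 0.7])],
    3: [np.array([0.1, 0.2, 0.9])],
}
sample = np.array([0.15, 0.85, 0.65])
pred, dists = classify(sample, class_templates)
print(pred)  # class 2: its second sub-class template is the closest
```

With a single template per class this reduces to conventional DT matching; the gain of MuDTs comes from the minimum over sub-class templates, which lets one template of a class fit an atypical sample.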

that show how the localized fusion of the proposed method manages intra-class ambiguity more effectively than DT. Sample profiles and class templates are drawn as graphs whose vertices represent the outputs of the OVA SVMs. The first example, from the FC dataset, belongs to class 3 but has a DP similar to both the 3rd and 5th classes. Our method produces multiple templates, one of which (see template t3,2) can represent the pattern of the input sample, whereas conventional DT fails to describe the variation within class 3 and misclassifies the sample into class 5, whose matching distance (1.67) is shorter than that to the correct template of class 3 (2.06). The second example, from the SEG dataset, likewise shows the improvement of the proposed method over the conventional one. By keeping multiple localized templates per class instead of a single template, the proposed method improves the capability to model the variety and ambiguity of classes.

For DO-SVMs and the proposed method, NB successfully selected appropriate models based on its output probability. As an individual classifier, however, NB failed to capture the complex dependencies in the data and thus showed the lowest accuracy among the methods.

In Fig. 7, rank scores were calculated to measure the overall dominance among the methods for the different characteristics of the datasets. On each dataset we assigned a rank score from one (the lowest) to six (the highest) to each method according to its place among the others, and summed the scores over the datasets. As shown in

Fig. 7b, the localized-fusion scheme yielded good performance on the FC and SEG datasets, which contain enough training samples to model the intra-class variations with multiple templates. For the small datasets with complex features shown in Fig. 7d, however, the sub-class modeling approach could not efficiently discriminate meaningful clusters from outliers. The dynamic selection method, on the other hand, performed well in these cases by reducing the unbiased-variance (Vu) and the bias of the models. By combining their complementary properties, the hybrid method ranked at the top over all types of datasets. Paired t-tests confirmed that the improvement of the hybrid method on the datasets was statistically significant against both MuDTs (p < 0.001) and DO-SVMs (p < 0.01).

To analyze the errors of each method, we decomposed them into bias, Vu (unbiased-variance), and Vb (biased-variance) terms, as shown in Fig. 8. Note that Vu increases error by deviating from correct predictions, while Vb decreases error by deviating from incorrect predictions, as described in Section 3.1 (Valentini & Dietterich, 2004). Since MuDTs constructs sub-class models, it has high variation, which can produce the Vu points shown in the figure. However, MuDTs also has a larger number of Vb points and thus shows lower error than DT and the base classifiers (SVMs and NB). DO-SVMs, in contrast, has significantly lower Vu than MuDTs and keeps the bias at a lower level. Moreover, the proposed method, which dynamically selects localized templates, has much lower bias than DO-SVMs. In conclusion, the proposed method that hybridizes MuDTs and DO-SVMs

Fig. 6. Examples of class-ambiguous samples (center) and the matched templates (left: DT; right: MuDTs of the proposed method). Gray box indicates the most similar template to the given sample.


Fig. 7. Sum of the rank scores on (a) all datasets, (b) T1: FC and Seg, (c) T2: Iris and Wine, (d) T3: BC1 and BC2, and (e) T4: G1 and G5. The higher the score the better the performance.

Fig. 8. The average bias/variance of methods on the eight datasets.
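The bias, Vu, and Vb terms averaged in Fig. 8 follow the 0/1-loss decomposition of Domingos (2000), as applied to SVMs by Valentini and Dietterich (2004): the bias is the error of the main (majority) prediction over resampled models, Vu is the deviation rate from a correct main prediction, and Vb the deviation rate from an incorrect one. A minimal sketch of that bookkeeping (the paper's exact estimation protocol is not reproduced here):

```python
from collections import Counter

def bias_variance_01(preds_per_model, y_true):
    # preds_per_model: one prediction list per resampled model.
    n_models, n_points = len(preds_per_model), len(y_true)
    bias = vu = vb = 0.0
    for i, y in enumerate(y_true):
        votes = Counter(p[i] for p in preds_per_model)
        main = votes.most_common(1)[0][0]      # main (majority) prediction
        var = 1.0 - votes[main] / n_models     # P(model deviates from main)
        if main == y:
            vu += var                          # deviation hurts: unbiased variance
        else:
            bias += 1.0
            vb += var                          # deviation helps: biased variance
    return bias / n_points, vu / n_points, vb / n_points

# Toy case: three models, three test points; only the third point's
# main prediction is wrong.
b, vu, vb = bias_variance_01(
    [[0, 1, 1], [0, 1, 0], [0, 0, 0]], [0, 1, 1])
```

In this toy case one of three main predictions is wrong (bias 1/3), and the single dissenting vote on each of the last two points contributes equally to Vu and Vb.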

effectively decreases bias and Vu, leading to an increase in overall classification accuracy.

4.3. Comparison with combinatorial ensemble methods

To compare the proposed hybrid method with representative combinatorial ensemble techniques (introduced in Section 2.1 and Table 1), we conducted an additional experiment with benchmark

datasets used in those previous studies (Banfield et al., 2005; Frosyniotis et al., 2003; Ko et al., 2008; Kuncheva, 2002; Xiao et al., 2010). Note that we excluded Giacinto and Roli (2001) and Kuncheva and Rodriguez (2007) from the comparison, since the former used unpublished datasets and the latter reported rank scores without accuracy. Nine benchmark datasets were employed, including four binary classification problems (see Table 6). Three datasets, SEG, Iris, and


Table 6
Datasets used for the comparison of hybrid methods.

Dataset                                                                 #Sample  #Feature  #Class  SVM parameter
Anneal (a) (Banfield et al., 2005; Kuncheva & Rodriguez, 2007;
  Xiao et al., 2010)                                                        798        38       6  Rbf, g = 0.06
Breast-cancer-Wisconsin-diagnostic (BCWD) (a) (Ko et al., 2008;
  Kuncheva, 2002)                                                           569        30       2  Linear, c = 2
Hepatitis (a) (Kuncheva & Rodriguez, 2007; Xiao et al., 2010)               155        19       2  Rbf, g = 0.06
Iris (a) (Banfield et al., 2005; Kuncheva & Rodriguez, 2007;
  Xiao et al., 2010)                                                        150         4       3  Rbf, g = 0.81
Phoneme (b) (Banfield et al., 2005; Frosyniotis et al., 2003;
  Kuncheva, 2002)                                                          5404         5       2  Rbf, g = 0.22
Pima-diabetes (P-diabetes) (a) (Ko et al., 2008; Kuncheva, 2002;
  Kuncheva & Rodriguez, 2007; Xiao et al., 2010)                            768         8       2  Poly, d = 2
Segmentation (SEG) (a, b) (Banfield et al., 2005; Frosyniotis et al.,
  2003; Ko et al., 2008; Kuncheva & Rodriguez, 2007)                       2310        19       7  Rbf, g = 0.81
Waveform-noise (a) (Banfield et al., 2005; Kuncheva & Rodriguez, 2007;
  Xiao et al., 2010)                                                       5000        40       3  Poly, d = 2
Wine (a) (Ko et al., 2008; Xiao et al., 2010)                               178        13       3  Linear, c = 2

(a) UCI machine learning repository (Merz & Murphey, 2003).
(b) ELENA project (ELENA Project, 2003).

Fig. 9. Comparative results of the hybrid techniques over different numbers of training sets. For a study that presented several models, the highest accuracy among them is shown as its representative performance. Note that Banfield et al. (2005) is the over-produce-and-choose approach; Ko et al. (2008) and Xiao et al. (2010) are dynamic classifier ensemble selection; Frosyniotis et al. (2003) is the modified divide-and-conquer; and Kuncheva (2002) is switching between selection and fusion.
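The varying training-set sizes compared in Fig. 9 correspond to two-, three-, five-, and ten-fold cross-validation. The fold construction can be sketched with the standard library alone (the 150-sample count is just an illustration):

```python
import random

def kfold_indices(n_samples, k, seed=0):
    # Shuffle indices once, then deal them round-robin into k folds;
    # each fold serves as the test set exactly once.
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Vary k as in the comparison, e.g. for a 150-sample dataset:
splits = {k: list(kfold_indices(150, k)) for k in (2, 3, 5, 10)}
```

Smaller k means smaller training sets (half the data at k = 2, nine tenths at k = 10), which is what makes the baselines in Fig. 9 shift with the fold count.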

Wine were already explored in the previous section, and six datasets were newly included in this additional experiment. In order to compare with the previous studies, which experimented with different sizes of training and test sets, we

measured the performance of the proposed method with two-, three-, five-, and ten-fold cross-validation. The proposed method integrates two to seven OVA SVMs (depending on the number of classes of the given problem), where its baseline performance was


Table 7
Averaged accuracy over the datasets, where the performance of the proposed method in each column was calculated by averaging its accuracies on the datasets of the corresponding study (e.g., the first column averages the results of Ko et al. (2008) and the proposed method on the same datasets as Ko et al. (2008)).

                  Ko et al.   Banfield et al.   Xiao et al.   Frosyniotis et al.   Kuncheva
                  (2008)      (2005)            (2010)        (2003)               (2002)
Other hybrid      91.6%       93.5%             88.1%         79.9%                83.8%
Proposed method   92.7%       94.5%             90.6%         88.9%                88.4%

measured by OVA SVMs with the WTA fusion strategy. For the other methods, we used the majority-voting results of their base classifiers as baselines, as given in their reported results. Fig. 9 shows the accuracy of the combinatorial ensembles and their baselines. In general, the overall performance is determined by the accuracy of the base classifiers (denoted as baseline), while large improvements were achieved by the proposed method on Hepatitis, by Ko et al. (2008) on Wine, by Banfield et al. (2005) on Iris, by Kuncheva (2002) on Phoneme, and by Frosyniotis et al. (2003) on SEG. For most datasets, the proposed method gained a larger improvement over its baseline than the other techniques, as shown in Fig. 9. On the P-diabetes dataset, Ko et al. (2008) reported an accuracy of over 96%, far above the 75–80% of the others, but its improvement over its own baseline was only 0.35%. A summary of comparative performance over the entire set of datasets is given in Table 7, where the proposed method obtained higher accuracies than the others.

5. Concluding remarks

To exploit the complementary properties of fusion and selection in ensembling, this paper presented a novel method that integrates multiple SVMs through a hybrid scheme of MuDTs and DO-SVMs. To model intra-class variation, localized templates were estimated with a clustering algorithm, and the templates were dynamically selected to classify each incoming sample.

In our experiments, we tested the method on benchmark datasets from various applications, showing that it is feasible for different kinds of expert systems. On the eight multi-class problems with different types of features, we thoroughly compared the proposed method with other techniques, including its individual components, MuDTs and DO-SVMs. The experimental results showed the usefulness of the proposed method: its MuDTs component effectively models intra-class variations, and its DO-SVMs component decreases unbiased-variance errors. In the additional experiment on nine benchmark datasets, the proposed method also showed superior performance against other combinatorial ensemble techniques.

Based on our experiments, we draw three implications for building expert systems with applications. First, exploiting the complementary characteristics of different models or schemes is useful. We showed that a hybrid of two different ensemble approaches outperforms the individual schemes; this approach can provide a new way to increase the performance of existing expert systems. Second, real-world problems such as bioinformatics and activity recognition often have small amounts of labeled samples with unbalanced classes. In general, the more complex the model, the larger the training set required; in this respect, we need to further investigate the complexity of our method and its performance over different sizes of training data with imbalanced classes.
Finally, we can apply the same classification scheme to different kinds of applications, but we have to use different types of features and/or models to fit the scheme to each application. For example,

NB shows good performance on semantic features, while SVM is powerful when it takes lower-level features as input. To build an optimized expert system, selecting appropriate models and features based on domain knowledge is as important as choosing a good classification scheme.

Acknowledgement

This work was supported by the Industrial Strategic Technology Development Program, 10044828, Development of augmenting multisensory technology for enhancing significant effect on service industry, funded by the Ministry of Trade, Industry & Energy (MI, Korea).

References

Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2005). Ensemble diversity measures and their application to thinning. Information Fusion, 6(1), 49–62.
Brown, G., Wyatt, J., Harris, R., & Yao, X. (2005). Diversity creation methods: A survey and categorisation. Information Fusion, 6(1), 5–20.
Cavalin, P. R., Sabourin, R., & Suen, C. Y. (2013). Dynamic selection approaches for multiple classifier systems. Neural Computing and Applications, 22(3–4), 673–688.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Didaci, L., Giacinto, G., Roli, F., & Marcialis, G. L. (2005). A study on the performances of dynamic classifier selection based on local accuracy estimation. Pattern Recognition, 38(11), 2188–2191.
Domingos, P. (2000). A unified bias-variance decomposition and its applications. In Proc. of the 17th int. conf. on machine learning (pp. 231–238).
ELENA Project. (2003).
Frosyniotis, D., Stafylopatis, A., & Likas, A. (2003). A divide-and-conquer method for multi-net classifiers. Journal of Pattern Analysis Application, 6(1), 32–40.
García-Pedrajas, N., & García-Osorio, C. (2011). Constructing ensembles of classifiers using supervised projection methods based on misclassified instances. Expert Systems with Applications, 38(1), 343–359.
García-Pedrajas, N., & Ortiz-Boyer, D. (2011). An empirical study of binary classifier fusion methods for multiclass classification. Journal of Information Fusion, 12(2), 111–130.
Giacinto, G., & Roli, F. (2001). An approach to the automatic design of multiple classifier systems. Pattern Recognition Letters, 22(1), 25–33.
Hashemi, S., Yang, Y., Mirzamomen, Z., & Kangavari, M. (2009). Adapted one-versus-all decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering, 21(5), 624–637.
Hong, J.-H., & Cho, S.-B. (2008). A probabilistic multi-class strategy of one-versus-rest support vector machines for cancer classification. Neurocomputing, 71, 3275–3281.
Hong, J.-H., Min, J.-K., Cho, U.-K., & Cho, S.-B. (2008). Fingerprint classification using one-vs-all support vector machines dynamically ordered with naive Bayes classifiers. Pattern Recognition, 41(2), 662–671.
Jain, A. K. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4–37.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
Jain, A. K., Prabhakar, S., & Hong, L. (1999). A multichannel approach to fingerprint classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4), 348–359.
Ko, A. H. R., Sabourin, R., & Britto, A. S. (2008). From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5), 1718–1731.
Kreßel, U. H.-G. (1999). Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 255–268). Cambridge, MA: MIT Press.
Kuncheva, L. I. (2002). Switching between selection and fusion in combining classifiers: An experiment. IEEE Transactions on Systems, Man, and Cybernetics. Part B, Cybernetics, 32(2), 146–156.
Kuncheva, L. I., Bezdek, J. C., & Duin, R. P. W. (2001). Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition, 34(2), 299–314.


Kuncheva, L. I., & Rodriguez, J. J. (2007). Classifier ensembles with a random linear oracle. IEEE Transactions on Knowledge and Data Engineering, 19(4), 500–508.
Merz, C. J., & Murphey, P. M. (2003). UCI repository of machine learning databases.
Min, J.-K., Hong, J.-H., & Cho, S.-B. (2010). Fingerprint classification based on subclass analysis using multiple templates of support vector machines. Intelligent Data Analysis, 14(3), 369–384.
Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 36, 3028–3033.
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198.
Partridge, D., & Yates, W. B. (1996). Engineering multiversion neural-net systems. Neural Computation, 8(4), 869–893.
Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In Proc. neural information processing systems (pp. 547–553).
Qi, Z., Tian, Y., & Shi, Y. (2013). Structural twin support vector machine for classification. Knowledge-Based Systems, 43, 74–81.
Rifkin, R. M., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5(1), 101–141.
Ruta, D., & Gabrys, B. (2005). Classifier selection for majority voting. International Journal of Information Fusion, 6(1), 63–81.

Santos, E. M. D., Sabourin, R., & Maupin, P. (2008). A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recognition, 41(10), 2993–3009.
Valentini, G., & Dietterich, T. G. (2004). Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research, 5, 725–775.
Wang, S.-J., Mathew, A., Chen, Y., Xi, L.-F., Ma, L., & Lee, J. (2009). Empirical analysis of support vector machine ensemble classifiers. Expert Systems with Applications, 36(3, Part 2), 6466–6476.
Windeatt, T. (2004). Diversity measures for multiple classifier system analysis and design. International Journal of Information Fusion, 6(1), 21–36.
Woloszynski, T., & Kurzynski, M. (2011). A probabilistic model of classifier competence for dynamic ensemble selection. Pattern Recognition, 44(10–11), 2656–2668.
Woods, K., Kegelmeyer, W. P., & Bowyer, K. (1997). Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 405–410.
Xiao, J., He, C., Jiang, X., & Liu, D. (2010). A dynamic classifier ensemble selection approach for noise data. Information Sciences, 180(18), 3402–3421.
Xiao, J., Xie, L., He, C., & Jiang, X. (2012). Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Systems with Applications, 39(3), 3668–3675.