A spectral clustering based ensemble pruning approach


Neurocomputing 139 (2014) 289–297


Huaxiang Zhang a,b,*, Linlin Cao a,b

a Department of Computer Science, Shandong Normal University, Jinan 250014, Shandong, China
b Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, China
* Corresponding author at: Department of Computer Science, Shandong Normal University, Jinan 250014, Shandong, China. E-mail address: [email protected] (H. Zhang).

Article info

Abstract

Article history:
Received 16 March 2013
Received in revised form 1 January 2014
Accepted 23 February 2014
Available online 13 April 2014
Communicated by A.M. Alimi

This paper introduces a novel bagging ensemble classifier pruning approach. Most investigated pruning approaches employ heuristic functions to rank the classifiers in the ensemble and select part of them from the ranked list, so redundancy may exist among the selected classifiers. Based on the idea that the selected classifiers should be accurate and diverse, we define classifier similarity according to predictive accuracy and diversity, and introduce a Spectral Clustering based classifier selection approach (SC). SC groups the classifiers into two clusters based on the classifier similarity, and retains one cluster of classifiers in the ensemble. Experimental results show that SC is competitive in terms of classification accuracy.

Keywords: Ensemble pruning; Classifier similarity; Spectral clustering

1. Introduction

Classification techniques are commonly used to reveal data patterns hidden in large datasets, and have been extensively studied in the field of machine learning. Various algorithms have been developed for constructing classifiers [1], but research shows that no single algorithm outperforms the others theoretically or empirically in all scenarios [2], so it is often unclear which algorithm to use when facing a practical classification problem. To deal with this issue, ensemble classifiers have been proposed. An ensemble consists of a group of classifiers and classifies instances based on the decisions of all its members. Research shows that an ensemble of simple classifiers may achieve better classification performance than any single sophisticated classifier [3,4], and many ensemble approaches have been proposed [5–9].

Both accuracy and diversity play important roles in constructing ensemble classifiers, and many works focus on obtaining a group of accurate classifiers. Among the ensemble approaches, bagging [5] and boosting [6,10–12] are effective and have been extensively studied. Bagging adopts different bootstrap samples to generate diverse classifiers, and boosting constructs ensemble classifiers from the original training data with instance weights updated for each classifier. Ensemble classifiers can achieve remarkable performance, but they may contain redundancy, and running a large number of classifiers requires a large amount of memory and slows down classification.


If only part of the ensemble is invoked when classifying newly arriving samples, the computational cost can be reduced. Many research works have therefore tried to select a subset of the ensemble classifiers without sacrificing performance. Zhou et al. [13] proved the "many-could-be-better-than-all" theorem, and studies show that it is possible to obtain a small yet strong ensemble [14–20]. Selecting an optimal classifier subset is difficult, since it requires a combinatorial search with exponential time complexity. In this paper, we propose a method that chooses part of the generated weak classifiers while simultaneously considering, in one model, both the accuracy of each classifier and the diversity between pairs of classifiers.

Learning systems are very common on the internet, and learning from different sources or teachers has existed for a long time. The behaviors or teaching styles of teachers differ considerably, and some may have negative effects on learning, so it is necessary to distinguish the positive teachers from the negative ones. If we consider each classifier as a teacher, then a classifier ensemble can be regarded as a multiple-teacher system in which each classifier is responsible for labeling newly arriving samples. Teachers can be categorized into responsible and irresponsible ones based on their influence on students. All responsible teachers are very similar, since they try to convey the same correct semantic concept, and thus form a cluster. Similarly, the classifiers can be partitioned into similar and dissimilar ones; the similar ones make positive contributions to the ensemble with high probability and should therefore form a cluster. Based on this assumption, we consider accuracy and diversity in one model for evaluating classifier similarity, and adopt clustering techniques to partition the ensemble classifiers.


This paper is organized as follows. Section 2 details the related works in the literature. Section 3 defines the classifier similarity, and introduces how the spectral clustering approach is used to cluster classifiers. Section 4 explains the experimental results, and Section 5 concludes the study.

2. Related works

Since a large number of weak classifiers in an ensemble incurs computational and storage costs, and the "many-could-be-better-than-all" theorem has been proved [13], many approaches have been proposed for selecting an optimal classifier subset [21–23]. Ensemble classifier selection approaches can be categorized as static or dynamic based on whether the selected classifier subset changes when different patterns are classified: approaches that keep the subset unchanged are static, and approaches that employ different classifier subsets to classify different patterns are dynamic [24,25]. Tsoumakas et al. [26] categorized ensemble pruning approaches differently into four types: search based methods, clustering based methods, optimization based methods and other methods. Clustering based algorithms rely on a notion of distance to cluster the constructed classifiers.

Martínez-Muñoz et al. [16] analyzed several ordered aggregation based ensemble pruning techniques and evaluated their performance on benchmark datasets. They concluded that ordered aggregation techniques can sometimes generate effective pruned bagging ensembles. The investigated pruning approaches employ specific metrics to rank classifiers, or perform a heuristic search in the classifier space while evaluating the collective merit of a candidate classifier subset [19,27–30]. Other ensemble pruning techniques employ genetic algorithms [31], a randomized greedy selective strategy with ballot [32], or semi-definite programming [20] to perform classifier pruning. Rokach [33] took into account the predictive capability of classifiers along with the degree of redundancy among them, and selected a high-accuracy, low inter-agreement classifier subset. The approach implements a best-first search strategy in a huge search space of size 2^n (n being the number of classifiers), and reports that over half of the original classifiers can be pruned. Aksela and Laaksonen [34] proposed a classifier selection method based on an exponential diversity error measure, and evaluated their approach on handwritten character patterns. Meynet and Thiran [35] took classifier diversity and accuracy into account in the definition of an information theoretic score (ITS), and selected the classifier subset with the optimal ITS; the ITS is built by selecting one classifier at each iteration to maximize its value, it is not differentiable, and its calculation incurs a large time complexity.

Different from static approaches, Xiao et al. [36] proposed a dynamic classifier selection approach for noisy data classification, and introduced several data handling methods for dynamic classifier selection. Statistical analysis and experimental results show that their approach has stronger noise immunity than several other strategies. Many other dynamic approaches have also been proposed [24,37–39]. Zhu [40] integrated data envelopment analysis and stacking, and described a hybrid approach to classifier selection. Bakker and Heskes [41] proposed a clustering method for ensemble classifier extraction, in which a small collection of representative entities is used to represent a large entity collection; the small representative model set is extracted by clustering the constructed models. Different from most classifier pruning approaches, these small representative models are not part of the original ones.

Based on an analysis of the relationship between their on-line allocation algorithm and boosting, Freund and Schapire [42] proposed variants of AdaBoost and proved an error bound for each variant. To handle the classification of datasets with overlapping patterns from different classes, Verma and Rahman [43] first clustered the data and used a group of component classifiers to learn the decision boundaries between pairs of clusters; a fusion classifier then makes the class decision based on the decisions of the component learners. Different ensemble construction approaches may lead to quite different performance, and many classifier ensemble approaches have been reported. In this paper, we are only interested in how to prune an already constructed ensemble to improve its performance.

3. The proposed approach

The research works mentioned above show that, if both accuracy and diversity are taken into account in the process of classifier selection, the performance of the pruned ensemble may be improved. In this study, we focus on static classifier selection approaches and propose a classifier similarity concept. The classifiers are used to construct a graph with weighted edges, and spectral clustering is employed to analyze classifier aggregation, based on the assumption that highly similar classifiers aggregate into one group and lowly similar classifiers aggregate into another.

3.1. Classifier similarity

For each classifier, we calculate its classification accuracies on the training datasets used to construct the weak classifiers, and use the obtained accuracies to construct an accuracy vector. Let D_1, D_2, ..., D_n denote the n training datasets, and let classifier h_i be modeled on D_i. Given h_i's classification accuracies a_i^1, a_i^2, ..., a_i^n on D_1, D_2, ..., D_n, we use a_i = (a_i^1, a_i^2, ..., a_i^n)^T to represent the accuracy vector of h_i. Since we evaluate the prediction performance on different samples drawn from the original training data, the resulting estimates may suffer from training bias. This is not a serious problem, because every sample has an equal chance to influence a classifier's performance when its accuracy vector is calculated, so the effect of training bias is attenuated across the vector entries.

For each pair of classifiers h_i and h_j with corresponding accuracy vectors a_i and a_j, the accuracy similarity between them is defined as

S_{ij}(a) = \begin{cases} \sqrt{(a_i \cdot a_j)/n}, & i \neq j \\ 0, & i = j \end{cases}   (1)

where "\cdot" denotes the scalar product of two vectors. S_ij(a) ∈ [0,1] is large if both classifiers perform well on all the sampled datasets, and small otherwise. It is also symmetric for each pair of classifiers, since S_ij(a) = S_ji(a).

Diversity also plays an important role in the success of an ensemble [44], and it can be viewed as a measure of dependence, complementarity or even orthogonality among classifiers [45]. Diverse classifier ensembles are preferred, and Giacinto and Roli [46] stated that ensemble classifiers should be accurate and diverse. Many diversity measures exist [47,48], and none has been proved to be the best. We employ the widely used Q-statistic [45] to calculate the diversity of a pair of classifiers. Given a dataset D, Q is calculated as

Q = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}   (2)

where N^{11} is the number of samples correctly classified by both classifiers, N^{00} is the number of samples wrongly classified by both, N^{01} is the number of samples wrongly classified by the first classifier but correctly classified by the second, and N^{10} is the number of samples correctly classified by the first classifier but wrongly classified by the second. If one of N^{11}, N^{00}, N^{01} and N^{10} equals the size of D or zero, then Q is set to 1. We calculate Q on D_i and on D_j for h_i and h_j, and use Q_ij and Q_ji to denote the values obtained on D_i and D_j, respectively. Both Q_ij and Q_ji lie in [-1, 1], and the smaller the Q value, the more diverse the two classifiers. After the transformation (3), both lie in [0, 1], and the larger the value, the more diverse the two classifiers:

Q_{ij} = (1 - Q_{ij})/2, \qquad Q_{ji} = (1 - Q_{ji})/2   (3)

We define the diversity similarity between h_i and h_j as

S_{ij}(d) = \sqrt{Q_{ij} \, Q_{ji}}   (4)

Q is obtained by applying two classifiers to one dataset, so we cannot define a diversity vector for each classifier with the Q values as entries. The transformed Q is larger when the two classifiers disagree with each other on more instances. It is obvious that S_ij(d) is also symmetric and nonnegative, since S_ij(d) = S_ji(d), and S_ii(d) equals 0 for each i according to (2)–(4). Since both S_ij(d) and S_ij(a) lie in [0, 1], we consider both of them in one model, and define the similarity between h_i and h_j as

S_{ij} = \lambda S_{ij}(a) + (1 - \lambda) S_{ij}(d)   (5)

where λ ∈ [0,1] balances the influence of the two terms of (5). S_ij ∈ [0,1] is also symmetric and nonnegative, and S_ii equals 0 for each i. We also define the similarity between h_i and h_j as the geometric mean

S_{ij} = \sqrt{S_{ij}(a) \, S_{ij}(d)}   (6)
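The definitions (1)–(5) translate directly into array code. The following is a minimal sketch, not the authors' implementation; the names acc and correct are illustrative, with acc an m x n matrix of accuracies and correct[i][k] assumed to hold the boolean correctness vector of classifier h_i on dataset D_k.

```python
import numpy as np

def accuracy_similarity(acc):
    """Eq. (1): S_ij(a) = sqrt((a_i . a_j) / n) for i != j, 0 on the diagonal.
    acc is an (m, n) array; row i holds classifier h_i's accuracies on D_1..D_n."""
    m, n = acc.shape
    S = np.sqrt(acc @ acc.T / n)
    np.fill_diagonal(S, 0.0)
    return S

def q_statistic(correct_i, correct_j):
    """Eq. (2): Q-statistic of two classifiers from their boolean correctness
    vectors on one dataset. Returns 1 when a count equals 0 or the dataset size."""
    n11 = np.sum(correct_i & correct_j)
    n00 = np.sum(~correct_i & ~correct_j)
    n01 = np.sum(~correct_i & correct_j)
    n10 = np.sum(correct_i & ~correct_j)
    if 0 in (n11, n00, n01, n10) or len(correct_i) in (n11, n00, n01, n10):
        return 1.0
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def similarity_matrix(acc, correct, lam=0.5):
    """Combine accuracy and diversity similarity as in Eq. (5)."""
    m = acc.shape[0]
    Sa = accuracy_similarity(acc)
    Sd = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            q_ij = q_statistic(correct[i][i], correct[j][i])   # Q on D_i
            q_ji = q_statistic(correct[i][j], correct[j][j])   # Q on D_j
            # Eq. (3): map to [0, 1] so that larger means more diverse
            q_ij, q_ji = (1 - q_ij) / 2, (1 - q_ji) / 2
            Sd[i, j] = Sd[j, i] = np.sqrt(q_ij * q_ji)         # Eq. (4)
    return lam * Sa + (1 - lam) * Sd                           # Eq. (5)
```

Replacing the last line with np.sqrt(Sa * Sd) would give the geometric-mean similarity of Eq. (6) used by SC-gm.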

3.2. Spectral clustering on classifiers

It is straightforward to implement clustering on the classifiers. Since the similarity between two classifiers is nonnegative, we conduct spectral clustering [49] on them. In spectral clustering, a similarity graph is constructed in which each vertex represents a data point. Two vertices are connected by an edge if the similarity between them is positive or larger than a specified threshold, and the edge is weighted by the similarity. Here each classifier is regarded as a vertex and is connected to the others by similarity-weighted edges; the weight assigned to each edge is the similarity between the corresponding two classifiers. Spectral clustering is applied to the classifier graph to divide the classifiers into groups, so that classifiers in the same group are similar while classifiers in different groups are dissimilar. The similarity matrix W has entry S_ij in the ith row and jth column, and the diagonal matrix D has entries

d_{ii} = \sum_{j=1}^{n} S_{ij}   (7)

The normalized Laplacian matrix is defined as

L = I - D^{-1/2} W D^{-1/2}   (8)

where I is an n x n identity matrix. After the eigenvalues and eigenvectors of L have been obtained, the two eigenvectors v_1 and v_2 corresponding to the two smallest eigenvalues are selected and used to build an n x 2 matrix [v_1, v_2]. The row number of [v_1, v_2] corresponds to the classifier number in the original ensemble. Each row of [v_1, v_2] is regarded as a data entry, and any classical clustering approach can be applied to the n two-dimensional data points. We choose k-means [1] to aggregate them into clusters, and set the number of clusters to 2, since we want to select one subset of classifiers for the ensemble and discard the rest. If more than two clusters were specified, say k clusters, we would have to evaluate each cluster or each combination of clusters, that is, 2^k cluster combinations. The globally optimal subset could be obtained after all combinations have been evaluated, but this is a combinatorial search problem and is time consuming, so we set the number of clusters to 2 to simplify the problem.

Spectral clustering divides the similarity graph into two sub-graphs, each corresponding to one classifier cluster. We calculate the averaged classifier similarity of each cluster as the ratio between the sum of edge weights and the number of edges of its corresponding sub-graph, and select the cluster with the larger averaged similarity. The algorithm for picking the subset of bagging ensemble classifiers (SC) is described in Table 1. When formula (5) is replaced with (6), the algorithm is renamed SC-gm.

Table 1
Spectral clustering algorithm for picking the subset of ensemble classifiers (SC).

Input: m classifiers and the parameter λ
Output: n classifiers selected from the m classifiers (n ≤ m)
1. for i, j = 1 to m: calculate the similarity S_ij between classifiers h_i and h_j according to (1)–(5).
2. Construct the normalized Laplacian matrix L using (7) and (8), and calculate its two eigenvectors v_1 and v_2 corresponding to the two smallest eigenvalues.
3. Cluster the rows of [v_1, v_2] into two groups using the k-means algorithm.
4. Calculate the averaged similarity for each group of classifiers, and choose the group with the larger averaged similarity.
5. Use the classifiers whose indices correspond to the selected rows of [v_1, v_2] to build up the pruned ensemble.

The cost of computing all the eigenvalues and eigenvectors of an n x n matrix is cubic in n. As we only need the two eigenvectors corresponding to the two smallest eigenvalues, more efficient approaches can be used to decrease the computational cost. For example, the inverse power method [50,51] can be used to find the eigenvector with the smallest eigenvalue, and has O(n^2) time complexity. Applying k-means to cluster the row vectors of [v_1, v_2] has O(knt) time complexity, where k (2 in this paper) is the number of clusters, n is the number of constructed classifiers, and t is the number of iterations for clustering. Calculating the accuracy vectors and the accuracy similarities has O(n^2) time complexity, and calculating the diversity similarities also has O(n^2) time complexity. So the classifier selection method proposed in this paper has O(n^2 + knt) time complexity.
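The pruning step of Table 1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: numpy's eigh and scikit-learn's KMeans stand in for the eigensolver and the k-means step, and the input S is a similarity matrix built from Eq. (5) or (6), for instance by the similarity_matrix helper above.

```python
import numpy as np
from sklearn.cluster import KMeans

def sc_prune(S):
    """Prune an ensemble given its m x m classifier-similarity matrix S.
    Returns the indices of the retained classifiers (steps 2-5 of Table 1)."""
    m = S.shape[0]
    d = S.sum(axis=1)                                           # Eq. (7)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))   # guard against isolated vertices
    L = np.eye(m) - D_inv_sqrt @ S @ D_inv_sqrt                 # Eq. (8), with W = S
    eigvals, eigvecs = np.linalg.eigh(L)                        # L is symmetric
    V = eigvecs[:, :2]                                          # eigenvectors of the two smallest eigenvalues
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(V)

    def avg_similarity(idx):
        # averaged edge weight of the sub-graph spanned by one cluster
        if len(idx) < 2:
            return 0.0
        sub = S[np.ix_(idx, idx)]
        n_edges = len(idx) * (len(idx) - 1) / 2
        return sub[np.triu_indices(len(idx), k=1)].sum() / n_edges

    clusters = [np.where(labels == c)[0] for c in (0, 1)]
    return max(clusters, key=avg_similarity)                    # cluster with larger averaged similarity
```

The returned indices correspond to the rows of [v_1, v_2] kept in step 5; majority voting over the retained classifiers then gives the pruned ensemble's prediction.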

4. Experiments

The common ways of generating ensemble classifiers include varying the input data and changing the learning algorithm or model parameters. We focus on the case of varying the input data, and prune bagging ensembles. Standard CART trees (pruned) [52] are commonly used inductive learning algorithms and are adopted as base classifiers. Because it is difficult to determine the optimal parameter λ and the number of base classifiers in advance, we set λ to 1, 0.8, 0.6, 0.4, 0.2 and 0, and set m to 100. Research shows that, when the number of classifiers becomes large, the classification error of bagging asymptotically tends to a constant [50], and overfitting may occur in boosting [12]. We conduct 10-fold cross validation on the data: one fold is used as testing data and the other nine folds are used as training data. The final results are the average of five 10-fold cross validation executions, so each experiment consists of 50 executions on each dataset. The experiments are conducted on 20 extensively studied benchmark datasets from the UCI repository [53]; a summary of the datasets is given in Table 2.

We evaluate the effect of λ on the performance of the pruned ensemble. If λ equals 1, only classification accuracy is considered for classifier pruning, and if λ equals 0, only classifier diversity is considered. When λ changes from 1 to 0, the influence of accuracy decreases and the influence of diversity increases. The classifier similarity defined by (6) can be considered as the geometric mean of both similarities, and we also use this definition to evaluate the proposed classifier pruning approach. In order to compare the performance of SC and SC-gm with other pruning methods proposed in the literature, the same bagging ensemble is pruned using reduce-error pruning (RE) [28], Orientation Ordering (OO) [30], semi-definite programming (SDP) [20], the Complementarity Measure (CC) [29] and Kappa pruning (Kappa) [28].
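A sketch of this experimental pipeline under the stated setup (100 bootstrap CART trees per training fold, 10-fold cross validation) is shown below. It is an assumption-laden illustration: scikit-learn's DecisionTreeClassifier stands in for the pruned CART learner (cost-complexity pruning via ccp_alpha could approximate the pruned trees), the prune callback is a placeholder for a pruning method such as SC, and X, y are numpy arrays with integer-encoded class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold

def build_bagging_ensemble(X_train, y_train, m=100, random_state=0):
    """Train m CART-style trees on bootstrap samples of the training fold."""
    rng = np.random.RandomState(random_state)
    trees, bags = [], []
    for _ in range(m):
        idx = rng.randint(0, len(X_train), size=len(X_train))   # bootstrap sample D_i
        trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))
        bags.append(idx)
    return trees, bags

def majority_vote(trees, X):
    """Unweighted vote; assumes integer class labels."""
    votes = np.array([t.predict(X) for t in trees])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def cv_error(X, y, prune=None, n_splits=10):
    """One 10-fold cross-validation run; the paper averages five such runs."""
    errors = []
    for train, test in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        trees, bags = build_bagging_ensemble(X[train], y[train])
        if prune is not None:                  # e.g. the SC pruning of Section 3
            keep = prune(trees, bags, X[train], y[train])
            trees = [trees[i] for i in keep]
        pred = majority_vote(trees, X[test])
        errors.append(np.mean(pred != y[test]))
    return np.mean(errors)
```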

Table 2
Characteristics of datasets.

Dataset       No. of instances  No. of attributes  Class
Audio         226               69                 24
Australian    690               14                 2
Breast W.     699               9                  2
Diabetes      768               8                  2
Ecoli         336               7                  8
Glass         214               9                  6
Heart         270               13                 2
Horse-colic   368               21                 2
Ionosphere    351               34                 2
Labor         57                16                 2
Liver         345               6                  2
Pendigits     10,992            16                 10
Segment       2310              19                 7
Sonar         208               60                 2
Spam          4601              57                 2
Tic-tac-toc   958               9                  2
Vehicle       846               18                 4
Votes         435               16                 2
Waveform      300               21                 3
Wine          178               13                 3

We construct 100 component classifiers on Glass, and use both SC (λ = 0.5) and SC-gm to calculate the two-dimensional vectors of [v1, v2]. The data points are clustered into two groups and are plotted in Fig. 1. Different classification errors are obtained when λ is assigned different values, and the optimal λ is different for each dataset. For performance comparison, we obtain the classification error rates of SC, RE, OO, SDP, CC and Kappa on each dataset under their corresponding conditions. The results described in Table 3 show that, on most of the 20 datasets, SC outperforms the other algorithms; it only loses on Wine and Liver. The averaged errors on the 20 datasets are also given in Table 3, and they show that SC outperforms the others, while SC-gm performs as well as SDP.

We also compare the average ranks of the algorithms on the 20 datasets. All algorithms are ranked on each dataset according to their classification errors: the best one gets a rank of 1, the second best a rank of 2, and so on. In case of ties, average ranks are assigned [54]. r_j denotes the average rank of the jth algorithm over all datasets, and is calculated as r_j = (1/n) \sum_{i=1}^{n} r_{ji}, where n is the number of datasets and r_{ji} is the rank of the jth algorithm on the ith dataset. The results described in Table 4 show that the difference between the average rank of SC and that of the others is obvious.

The spectral clustering approach is employed to analyze classifier aggregation based on classifier similarities. We make the simple assumption that good classifiers aggregate into one group and poor ones into another. Good classifiers are accurate and diverse ones, while poor classifiers are accurate but non-diverse, inaccurate but diverse, or inaccurate and non-diverse. According to the definitions of formulae (5) and (6), accurate and diverse classifiers have high similarity, and the other classifiers have low similarity. Similarity is the weight assigned to each edge of the classifier graph, and when we aggregate the nodes into two groups, those connected by large weights are clustered into one group.

A statistical test is also used to assess whether there are significant differences between SC and the other algorithms. We use the paired two-tailed t-test [55] based on the classification accuracies of the algorithms, since it is commonly used to compare classification algorithms. The difference between two algorithms is measured by the p-value of the t-test, which represents the probability that the two sets of compared samples come from distributions with the same mean.
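A small sketch of the two comparisons described above is given below; scipy's rankdata (which averages tied ranks) and ttest_rel (the paired two-tailed t-test) are assumed, and the array names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata, ttest_rel

def average_ranks(errors):
    """errors: (n_datasets, n_algorithms) array of classification errors.
    Rank each row (best error = rank 1, ties share the average rank) and
    return r_j, the mean rank of each algorithm over all datasets."""
    ranks = np.vstack([rankdata(row) for row in errors])
    return ranks.mean(axis=0)

def compare_accuracies(acc_a, acc_b):
    """Paired two-tailed t-test on per-execution accuracies of two algorithms.
    A p-value below 0.05 is read as a significant difference."""
    t_stat, p_value = ttest_rel(acc_a, acc_b)
    return p_value
```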

Fig. 1. Distribution of [v1, v2] in a two-dimensional space (panels: SC with λ = 0.5, and SC-gm; axes: v1, v2).


Table 3
Averaged classification errors and standard deviations for ensembles of CART trees (the optimal values are in boldface in the original).

Dataset      SC         SC-gm      RE         OO         SDP        CC         Kappa      Bagging
Audio        20.3±7.4   21.4±7.3   21.1±7.4   21.1±7.7   21.0±7.5   21.1±7.9   27.8±7.5   26.7±8.1
Australian   13.4±4.0   13.7±4.1   13.7±4.0   13.7±4.0   13.4±4.1   13.7±4.1   13.7±4.1   14.6±3.7
Breast W.    3.5±2.1    3.8±2.4    4.2±2.5    4.3±2.6    3.8±2.3    4.2±2.4    5.0±2.9    3.5±2.3
Diabetes     22.3±3.2   23.4±4.0   24.6±4.0   24.4±4.2   24.5±4.3   23.9±3.9   24.3±3.9   24.5±5.0
Ecoli        15.2±4.5   16.4±5.8   16.2±5.7   15.8±5.5   15.8±5.5   16.1±6.1   18.6±5.9   17.8±6.0
Glass        23.2±6.3   24.8±7.0   24.5±7.6   24.3±7.7   24.7±8.0   24.7±7.5   30.8±8.0   29.9±7.0
Heart        17.1±6.8   19.6±7.3   19.5±7.4   17.5±7.1   17.8±7.1   19.0±7.0   18.0±6.8   19.8±8.1
Horse-colic  14.5±5.9   14.6±5.3   15.7±5.7   15.0±5.7   16.5±6.0   16.4±5.8   16.8±6.1   18.0±6.0
Ionosphere   6.8±3.8    8.5±4.5    7.5±4.5    7.8±4.2    7.1±4.0    7.3±4.0    8.1±4.3    7.41±4.2
Labor        6.7±10.0   7.9±11.6   11.9±12.3  8.0±10.0   8.2±11.2   14.0±13.1  10.3±12.1  13.8±12.6
Liver        28.2±6.9   28.1±7.0   27.7±7.0   28.8±7.4   28.2±6.7   28.1±6.7   30.1±6.0   32.1±5.9
Pendigits    1.6±0.3    1.9±0.4    1.8±0.5    1.8±0.4    1.7±0.4    1.8±0.4    2.5±3.0    2.1±0.3
Segment      2.5±1.0    2.6±0.9    2.6±1.1    2.5±1.1    2.5±1.2    2.6±1.0    3.8±1.1    3.0±1.1
Sonar        18.9±9.0   21.4±8.9   21.0±9.5   21.3±9.0   19.6±9.7   21.6±9.2   24.5±9.4   25.2±10.0
Spam         6.4±1.6    6.4±1.6    6.4±1.7    6.5±1.6    6.4±1.6    6.5±1.6    6.8±1.6    7.0±1.5
Tic-tac-toc  1.1±1.1    1.1±1.0    1.6±1.2    1.5±1.1    1.2±1.1    1.6±1.2    3.1±1.6    1.8±1.2
Vehicle      22.1±3.3   24.2±3.4   25.5±4.0   25.2±4.4   25.2±4.4   25.8±4.0   29.0±4.0   29.0±3.6
Votes        3.7±2.9    3.7±2.9    4.6±3.0    5.2±3.2    4.7±3.1    4.7±3.2    4.7±3.1    4.5±3.1
Waveform     18.0±1.1   18.7±1.0   20.5±1.4   20.2±1.3   20.2±1.3   20.4±1.4   22.1±1.8   23.0±2.1
Wine         3.8±1.7    3.3±1.5    4.0±2.8    3.9±2.1    3.2±2.8    4.5±3.8    5.9±3.3    4.4±3.8
Average      12.6±4.1   13.3±4.4   13.7±4.7   13.4±4.5   13.3±4.6   13.9±4.7   15.3±4.8   15.4±4.8

Table 4
Averaged ranks for ensembles of CART trees (the optimal values are in boldface in the original; rank* denotes the average rank on the 20 datasets).

Dataset      SC    SC-gm  RE    OO    SDP   CC    Kappa  Bagging
Audio        1     6      4     4     2     4     8      7
Australian   1     5      5     5     2     5     5      8
Breast W.    1     3.5    5.5   7     3.5   5.5   8      2
Diabetes     1     2      8     5     6.5   3     4      6.5
Ecoli        1     6      5     7.5   7.5   4     8      7
Glass        1     6      3     2     4.5   4.5   8      7
Heart        1     7      6     2     3     5     4      8
Horse-colic  1     2      4     3     6     5     7      8
Ionosphere   1     8      5     6     2     3     7      4
Labor        1     2      6     3     4     8     5      7
Liver        4.5   2.5    1     6     4.5   2.5   7      8
Pendigits    1     6      4     4     2     4     8      7
Segment      2     5      5     2     2     5     8      7
Sonar        1     5      3     4     2     6     7      8
Spam         2     2      4     5.5   2     5.5   7      8
Tic-tac-toc  1.5   1.5    5.5   4     3     5.5   8      7
Vehicle      1     2      5     3.5   3.5   6     7.5    7.5
Votes        1.5   1.5    4     8     6     6     6      3
Waveform     1     2      6     3.5   3.5   5     7      8
Wine         3     2      5     4     1     7     8      6
rank*        1.4   3.9    4.7   4.5   3.5   5.0   6.9    6.7

A small p-value indicates a large difference between the two average values, and 0.05 is the typical threshold below which the difference is considered statistically significant. We report the p-values on each dataset when comparing SC with the other algorithms, and the results given in Table 5 show that SC has better generalization performance most of the time. For example, on Audio the p-value is 0.0203, which is less than 0.05; this means SC performs significantly better than Bagging on Audio at the 0.05 significance level.

The pruning rate of each algorithm is reported in Table 6. We split each dataset into 10 approximately equal partitions; each partition in turn is used for testing and the remainder for training. Each time, we draw instances with replacement from the nine-fold data t times to construct a classifier (t equals the size of the nine-fold data), repeat this 100 times to construct 100 classifiers, and obtain the pruning rate of each algorithm. The procedure is repeated 10 times so that, in the end, each fold has been used exactly once for testing, and the mean of the 10 pruning rates of each algorithm is calculated. The fractional part of the averaged pruning rates is ignored. For calculating the pruning rate of RE [28], we first add the classifier with the lowest classification error to the subensemble, and then sequentially add the remaining classifier that makes the classification error of the subensemble as low as possible; the process terminates when the classification error of the subensemble no longer decreases, and the pruning rate is the ratio between the size of the subensemble and the total number of classifiers. For calculating the pruning rate of Kappa [28], we order pairs of classifiers in descending order of the diversity between them, and incorporate the first classifiers with κ < 0.5 into the ensemble; the pruning rate is again the ratio between the size of the subensemble and the total number of classifiers. For calculating the pruning rate of CC [29], we first add the classifier with the lowest classification error, and then at each iteration add the classifier whose performance is most complementary to the subensemble; the iteration terminates when no classifier can correctly classify more than half of the examples misclassified by the current subensemble. For calculating the pruning rate of OO [30], the classifiers are ordered by increasing angle between their signature vectors and a reference vector, and classifiers whose angle is greater than π/2 are not included in the final ensemble. For calculating the pruning rate of SDP [20], k in the constraint of the programming formulation is a positive integer varied from 20 to 80; for each value of k, we solve the SDP and obtain the classification error of the selected subset on the given dataset, choose the k that corresponds to the optimal classification error, and take the pruning rate as the ratio between k and the size of the original ensemble.

The results show that the pruning rates differ across algorithms and datasets. On the 20 datasets, OO performs the best and Kappa performs the worst most of the time. The results also show that the averaged pruning rates of SC, SC-gm, OO and SDP exceed 50%, and that SC only loses to OO. Fig. 2 shows the classifier selection rate of SC on Glass and Votes when λ is assigned different values; it is clear that λ influences the classifier clustering. We evaluate the influence of λ on the classification accuracy using Glass, Ionosphere, Votes and Wine, and show the results in Fig. 3.
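The reduce-error (RE) baseline described above can be sketched as a greedy loop; this is an illustrative reading of the procedure, not the original implementation, and it assumes preds holds each classifier's integer-label predictions on the pruning data.

```python
import numpy as np

def reduce_error_prune(preds, y):
    """Greedy RE selection. preds: (m, n) array of the m classifiers' predictions
    on the pruning data, y: true labels (integer-encoded). Starts from the single
    most accurate classifier and keeps adding the classifier that lowers the
    sub-ensemble error most, stopping when no addition decreases the error."""
    def ensemble_error(idx):
        vote = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds[idx])
        return np.mean(vote != y)

    m = preds.shape[0]
    selected = [int(np.argmax([(preds[i] == y).mean() for i in range(m)]))]
    remaining = set(range(m)) - set(selected)
    best = ensemble_error(selected)
    while remaining:
        cand, err = min(((i, ensemble_error(selected + [i])) for i in remaining),
                        key=lambda t: t[1])
        if err >= best:          # stop when the error no longer decreases
            break
        selected.append(cand)
        remaining.remove(cand)
        best = err
    return selected
```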


Table 5
p-values of the paired t-test on 20 UCI datasets.

Dataset      SC/SC-gm  SC/RE   SC/OO   SC/SDP  SC/CC   SC/Kappa  SC/Bagging
Audio        0.6835    0.6127  0.6078  0.5160  0.6022  0.0124    0.0203
Australian   0.9987    0.9699  0.8643  0.9997  0.9346  0.8942    0.0457
Breast W.    0.7843    0.1026  0.1347  0.2218  0.1965  0.0438    0.8769
Diabetes     0.0728    0.0436  0.0411  0.0429  0.0683  0.0457    0.0391
Ecoli        0.2318    0.3014  0.8642  0.8750  0.2451  0.0114    0.0201
Glass        0.0489    0.0518  0.0520  0.0498  0.0500  0.0102    0.0393
Heart        0.0328    0.0229  0.7281  0.7010  0.0344  0.4687    0.0203
Horse-colic  0.9868    0.0756  0.6261  0.0417  0.0489  0.0401    0.0210
Ionosphere   0.0378    0.0589  0.0510  0.3710  0.3528  0.0469    0.2987
Labor        0.0982    0.0213  0.0500  0.4786  0.0086  0.0121    0.0095
Liver        0.9986    0.4765  0.9763  0.9679  0.9965  0.2105    0.0432
Pendigits    0.6542    0.6971  0.7013  0.8642  0.6952  0.1033    0.2349
Segment      0.8991    0.9002  0.9992  0.9435  0.9210  0.0497    0.3425
Sonar        0.0487    0.0492  0.0490  0.1753  0.4034  0.0211    0.0145
Spam         1         0.9951  0.9679  1       1       0.8973    0.8210
Tic-tac-toc  0.9998    0.5321  0.6655  0.9424  0.5235  0.0437    0.1031
Vehicle      0.0872    0.0543  0.0561  0.0550  0.0614  0.0249    0.0218
Votes        1         0.0932  0.0486  0.0843  0.0798  0.0810    0.1013
Waveform     0.6734    0.0498  0.0510  0.0511  0.0498  0.0211    0.0189
Wine         0.7426    0.8963  0.9762  0.3271  0.2764  0.0470    0.0986

Table 6
Pruning rate comparison.

Dataset      SC (%)  SC-gm (%)  RE (%)  OO (%)  SDP (%)  CC (%)  Kappa (%)
Audio        56      63         32      78      63       52      23
Australian   65      24         48      80      74       39      31
Breast W.    81      58         16      76      58       81      19
Diabetes     62      62         38      78      70       45      29
Ecoli        56      45         29      81      51       72      31
Glass        52      56         26      74      55       40      27
Heart        84      47         43      81      63       28      31
Horse-colic  58      62         39      78      52       46      42
Ionosphere   52      59         46      71      45       52      36
Labor        84      86         50      71      48       41      29
Liver        66      73         37      70      56       64      28
Pendigits    65      59         54      75      73       50      30
Segment      80      64         51      70      68       63      24
Sonar        52      64         42      72      48       39      26
Spam         66      56         39      67      57       40      31
Tic-tac-toc  75      49         30      75      49       29      25
Vehicle      75      66         37      73      46       40      20
Votes        51      14         46      74      53       35      21
Waveform     70      79         45      72      64       35      24
Wine         78      57         43      69      54       55      31
Averaged     66      57         40      74      57       47      28

Fig. 2. Classifier selection rates of SC on two datasets (Glass and Votes) for different values of λ.

Since the optimal classification error rates of SC reported in Table 3 are obtained when the optimal λ is chosen from {0, 0.2, 0.4, 0.6, 0.8, 1}, which may be a local optimum, it could be a good idea to explore different λ values. We tune λ from 0 to 1 with a small step size to learn its optimal value. Since a very small change of λ may have no noticeable influence on the similarity matrix and the classification error, we set the step size to 0.05 and change λ gradually. The results drawn in Fig. 3 show that several local minima exist for each dataset. The optimal error rate of SC reported in Table 3 may not be the global optimum, but it is near the global optimum.

In the above experiments, the optimal value of λ is found by cross validation using the training data. We partition each dataset into 10 approximately equal parts, and each part in turn is left unused while the remainder is used for training. Each time, we draw instances with replacement from the nine-fold data t times to construct a classifier (t equals the size of the nine-fold data), and repeat this 100 times to construct 100 classifiers. We calculate the similarity between two classifiers according to formula (5) for a given value of λ, and cluster the classifiers. We choose the cluster of classifiers with the larger averaged classifier similarity and evaluate it on the unused fold. The procedure is repeated 10 times so that, in the end, each fold is left out exactly once. We calculate the mean of the 10 classification errors, then change the value of λ and repeat the above procedure to obtain another mean error. The process terminates when every value of λ has been tried, and we choose the value of λ that corresponds to the smallest mean error for each dataset. Other approaches can also be used to learn the value of λ for each training dataset.

It is also a reasonable idea to explore different λ values and select the one that gives the smallest ensemble, so we run experiments that choose λ for each dataset based on the number of selected classifiers. For each dataset, the optimal λ could be obtained by tuning λ in very small steps, but this is time consuming. To simplify the exploration, we set λ to 0, 0.2, 0.4, 0.6, 0.8 and 1 in turn, and learn the pruned ensemble for each value; the value that yields the smallest ensemble is chosen. We partition each dataset into 10 approximately equal parts, and each part in turn is left unused while the remainder is used for training. Each time, we draw instances with replacement from the nine-fold data t times to construct a classifier (t equals the size of the nine-fold data), and repeat this 100 times to construct 100 classifiers.
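The λ selection by cross validation described above amounts to a simple grid search, sketched below; cv_error_for_lambda is a placeholder assumed to run the whole prune-and-evaluate procedure for one λ value.

```python
import numpy as np

def select_lambda(cv_error_for_lambda, step=0.05):
    """Try lambda = 0, step, 2*step, ..., 1 and keep the value with the
    smallest mean cross-validated error."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    errors = [cv_error_for_lambda(lam) for lam in grid]
    return grid[int(np.argmin(errors))]
```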

Fig. 3. Classification error rates as a function of λ on four datasets. (a) Glass, (b) Ionosphere, (c) Votes, (d) Wine.

Table 7
λ value that corresponds to the smallest ensemble.

Dataset      λ     SC(λ): error ± std.  SC(optimal): error ± std.
Audio        0.2   20.3±7.4             20.3±7.4
Australian   0.4   13.7±3.8             13.4±4.0
Breast W.    0.4   3.7±2.6              3.5±2.1
Diabetes     0.2   22.3±3.2             22.3±3.2
Ecoli        0.8   16.1±6.0             15.2±4.5
Glass        0.4   24.7±7.6             23.2±6.3
Heart        0.8   17.9±7.1             17.1±6.8
Horse-colic  0.4   15.3±5.5             14.5±5.9
Ionosphere   0.4   7.4±4.5              6.8±3.8
Labor        0.2   6.7±10.1             6.7±10.1
Liver        0.8   30.1±7.2             28.2±6.9
Pendigits    0.6   1.7±0.4              1.6±0.3
Segment      0.4   2.6±0.9              2.5±1.0
Sonar        0.6   18.9±9.0             18.9±9.0
Spam         0.8   6.4±1.8              6.4±1.6
Tic-tac-toc  0.4   1.3±1.2              1.1±1.1
Vehicle      0.2   24.1±4.1             22.1±3.3
Votes        0.2   3.7±2.9              3.7±2.9
Waveform     0.4   19.9±1.8             18.0±1.1
Wine         0.2   3.8±1.7              3.8±1.7

We calculate the similarity between two classifiers according to formula (5) for the given value of λ, cluster the classifiers, choose the smaller cluster of classifiers, and count the number of classifiers in it. The procedure is repeated 10 times so that, in the end, each fold is left out exactly once. We calculate the mean number of classifiers in the obtained 10 smaller clusters, then change the value of λ and use the above-mentioned approach to obtain another mean number of classifiers.

The process terminates when every value of λ has been tried. We choose the value that corresponds to the smallest mean number of classifiers for each dataset, and report the results in Table 7. The averaged classification errors and standard deviations of SC are also reported for the selected λ. The results reveal that the selected λ may differ from one dataset to another, and the corresponding classification errors are greater than or equal to the results of SC under optimal conditions. That means the smallest ensemble sometimes may not perform optimally in terms of classification error.

5. Conclusions

It is known that an ensemble using all the weak classifiers at hand may not outperform a subset of them, and many studies on choosing an optimal classifier subset have been carried out. Different from the existing heuristic and programming approaches, this paper takes into account both the classification accuracy of each individual classifier and the diversity of pairs of classifiers simultaneously in one model to evaluate similarities among classifiers, and applies spectral clustering techniques to prune the ensemble. Two similarity concepts are proposed to evaluate the component classifiers, and the component classifiers are viewed as vertices of a graph connected to each other by similarity-weighted edges.


Based on the assumption that similar classifiers should be grouped together, we employ spectral clustering to group the component classifiers into two clusters, and choose the cluster with the larger averaged classifier similarity. The computational cost of the proposed algorithm has been analyzed, and experiments on benchmark datasets have been carried out to evaluate its performance. The results show that the proposed algorithms SC and SC-gm outperform the other ordered aggregation algorithms most of the time, and the average ranks also reveal that the differences between the proposed algorithm and the other algorithms are obvious. The pruning rate of each algorithm is different and varies from one dataset to another; it depends on the number of initial classifiers and the experimental setup [16]. Given 100 initial classifiers, we use 10-fold cross validation to evaluate the pruning rate of each algorithm, and the averaged pruning rate of the proposed algorithm shows that more than half of the initial classifiers can be pruned. The effect of the parameter λ on the performance of the proposed algorithm has also been evaluated, and the results show that λ affects its performance. It is difficult to determine the value of λ in advance; to find a good λ, we simply evaluate values from a finite set by cross validation on the training data and keep the value that gives the best performance. It is reasonable to assume that an optimal λ exists for each dataset, and this problem needs further investigation.

Acknowledgments

The authors thank the editors and the anonymous reviewers for their helpful comments and suggestions. This research is partially supported by the National Natural Science Foundation of China (Nos. 61170145, 61373081), the Specialized Research Fund for the Doctoral Program of Higher Education of China (20113704110001), the Natural Science Foundation of Shandong (No. ZR2010FM021), the Scientific Technology and Development Project of Shandong (No. 2013GGX10125) and the Taishan Scholar Project of Shandong, China.

References

[1] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York Inc., New York, NY, 2006.
[2] J.V. Hansen, Combining predictors: comparison of five meta machine learning methods, Inf. Sci. 119 (1–2) (1999) 91–105.
[3] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Mach. Learn. 40 (2000) 139–158.
[4] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[5] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[6] Y. Freund, R. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 325–332.
[7] C.J. Merz, Using correspondence analysis to combine classifiers, Mach. Learn. 36 (1) (1999) 33–58.
[8] N. García-Pedrajas, Constructing ensembles of classifiers by means of weighted instance selection, IEEE Trans. Neural Netw. 20 (2) (2009) 258–277.
[9] T. Windeatt, Accuracy/diversity and ensemble MLP classifier design, IEEE Trans. Neural Netw. 17 (5) (2006) 1194–1211.
[10] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Mach. Learn. 36 (1–2) (1999) 105–139.
[11] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 161–168.
[12] G. Rätsch, T. Onoda, K.-R. Müller, Soft margins for AdaBoost, Mach. Learn. 42 (3) (2001) 287–320.
[13] Z.-H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artif. Intell. 137 (1–2) (2002) 239–263.
[14] S. Singh, M. Singh, A dynamic classifier selection and combination approach to image region labeling, Signal Process. Image Commun. 20 (3) (2005) 219–231.
[15] A. Ulas, M. Semerci, O.T. Yıldız, E. Alpaydın, Incremental construction of classifier and discriminant ensembles, Inf. Sci. 179 (9) (2009) 1298–1318.

[16] G. Martínez-Muñoz, D. Hernández-Lobato, A. Suárez, An analysis of ensemble pruning techniques based on ordered aggregation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 245–259.
[17] D. Hernández-Lobato, G. Martínez-Muñoz, A. Suárez, Statistical instance-based pruning in ensembles of independent classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 364–369.
[18] L. Chen, M.S. Kamel, A generalized adaptive ensemble generation and aggregation approach for multiple classifier systems, Pattern Recognit. 42 (2009) 629–644.
[19] G. Martínez-Muñoz, A. Suárez, Using boosting to prune bagging ensembles, Pattern Recognit. Lett. 28 (1) (2007) 156–165.
[20] Y. Zhang, S. Burer, W.N. Street, Ensemble pruning via semi-definite programming, J. Mach. Learn. Res. 7 (2006) 1315–1338.
[21] G. Martínez-Muñoz, A. Suárez, Switching class labels to generate classification ensembles, Pattern Recognit. 38 (10) (2005) 1483–1494.
[22] R.E. Banfield, L.O. Hall, K.W. Bowyer, W.P. Kegelmeyer, A comparison of decision tree ensemble creation techniques, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 173–180.
[23] C.D. Stefano, G. Folino, F. Fontanella, A. Scotto di Freca, Using Bayesian networks for selecting classifiers in GP ensembles, Inf. Sci. 258 (2014) 200–216.
[24] L. Kuncheva, Switching between selection and fusion in combining classifiers: an experiment, IEEE Trans. Syst. Man Cybern. Part B 32 (2) (2002) 146–156.
[25] R. Lysiak, M. Kurzynski, T. Woloszynski, Optimal selection of ensemble classifiers using measures of competence and diversity of base classifiers, Neurocomputing 126 (2014) 29–35.
[26] G. Tsoumakas, I. Partalas, I. Vlahavas, A taxonomy and short review of ensemble selection, in: Proceedings of the Workshop on Supervised and Unsupervised Ensemble Methods and their Applications, 2008, pp. 41–46.
[27] H. Zhou, X. Zhao, X. Wang, An effective ensemble pruning algorithm based on frequent patterns, Knowl. Based Syst. 56 (2014) 79–85.
[28] D.D. Margineantu, T.G. Dietterich, Pruning adaptive boosting, in: Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 211–218.
[29] G. Martínez-Muñoz, A. Suárez, Aggregation ordering in bagging, in: Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, 2004, pp. 258–263.
[30] G. Martínez-Muñoz, A. Suárez, Pruning in ordered bagging ensembles, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 609–616.
[31] Z.-H. Zhou, W. Tang, Selective ensemble of decision trees, in: Q. Liu, Y. Yao, A. Skowron (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, Springer, 2003, pp. 476–483.
[32] Q. Dai, A novel ensemble pruning algorithm based on randomized greedy selective strategy and ballot, Neurocomputing 122 (2013) 258–265.
[33] L. Rokach, Collective-agreement-based pruning of ensembles, Comput. Stat. Data Anal. 53 (2009) 1015–1026.
[34] M. Aksela, J. Laaksonen, Using diversity of errors for selecting members of a committee classifier, Pattern Recognit. 39 (2006) 608–623.
[35] J. Meynet, J.-P. Thiran, Information theoretic combination of pattern classifiers, Pattern Recognit. 43 (2010) 3412–3421.
[36] J. Xiao, C. He, X. Jiang, D. Liu, A dynamic classifier ensemble selection approach for noise data, Inf. Sci. 180 (2010) 3402–3421.
[37] K. Woods, W.P. Kegelmeyer, K. Bowyer, Combination of multiple classifiers using local accuracy estimates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (4) (1997) 405–410.
[38] G. Giacinto, F. Roli, Dynamic classifier selection based on multiple classifier behavior, Pattern Recognit. 34 (9) (2001) 1879–1881.
[39] E.M.D. Santos, R. Sabourin, P. Maupin, A dynamic overproduce-and-choose strategy for the selection of classifier ensembles, Pattern Recognit. 41 (10) (2008) 2993–3009.
[40] D. Zhu, A hybrid approach for efficient ensembles, Decis. Support Syst. 48 (2010) 480–487.
[41] B. Bakker, T. Heskes, Clustering ensembles of neural network models, Neural Netw. 16 (2003) 261–269.
[42] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[43] B. Verma, A. Rahman, Cluster-oriented ensemble classifier: impact of multicluster characterization on ensemble classifier learning, IEEE Trans. Knowl. Data Eng. 24 (4) (2012) 605–618.
[44] H. Zouari, L. Heutte, Y. Lecourtier, Controlling the diversity in classifier ensembles through a measure of agreement, Pattern Recognit. 38 (2005) 2195–2199.
[45] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2) (2003) 181–207.
[46] G. Giacinto, F. Roli, Design of effective neural network ensembles for image classification purposes, Image Vis. Comput. 19 (9–10) (2001) 699–707.
[47] G. Yule, On the association of attributes in statistics, Biometrika 2 (1903) 121–134.
[48] E.K. Tang, P.N. Suganthan, X. Yao, An analysis of diversity measures, Mach. Learn. 65 (1) (2006) 247–271.
[49] U.V. Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[50] G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press, Baltimore, Maryland, US, 1996.


[51] U. Helmke, P.A. Fuhrmann, Controllability of matrix eigenvalue algorithms: the inverse power method, Syst. Control Lett. 41 (1) (2000) 57–66.
[52] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Chapman and Hall, UK, 1984.
[53] A. Asuncion, D. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[54] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[55] E. Alpaydin, Introduction to Machine Learning, The MIT Press, Cambridge, MA, USA, 2004.

Huaxiang Zhang received his Ph.D. from Shanghai Jiaotong University in 2004, and is now a professor and Ph.D. supervisor at the Department of Computer Science, Shandong Normal University, Shandong, China. His research interests include machine learning, pattern recognition, evolutionary computation, and web information processing.

Linlin Cao received her B.Sc. degree from Shandong Normal University, China, in 2010, and is currently pursuing a Master's degree at the School of Information Science and Engineering of the same university. Her research fields include support vector machines (SVMs), manifold learning and related applications.