A spectral clustering based ensemble pruning approach


Neurocomputing 139 (2014) 289–297


Huaxiang Zhang a,b,*, Linlin Cao a,b

a Department of Computer Science, Shandong Normal University, Jinan 250014, Shandong, China
b Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, China
* Corresponding author at: Department of Computer Science, Shandong Normal University, Jinan 250014, Shandong, China. E-mail address: [email protected] (H. Zhang).

Article info

Abstract

Article history:
Received 16 March 2013
Received in revised form 1 January 2014
Accepted 23 February 2014
Available online 13 April 2014
Communicated by A.M. Alimi

This paper introduces a novel bagging ensemble classifier pruning approach. Most investigated pruning approaches employ heuristic functions to rank the classifiers in the ensemble and select part of them from the ranked list, so redundancy may exist among the selected classifiers. Based on the idea that the selected classifiers should be accurate and diverse, we define classifier similarity according to predictive accuracy and diversity, and introduce a Spectral Clustering based classifier selection approach (SC). SC groups the classifiers into two clusters based on the classifier similarity, and retains one cluster of classifiers in the ensemble. Experimental results show that SC is competitive in terms of classification accuracy.

Keywords: Ensemble pruning; Classifier similarity; Spectral clustering

1. Introduction

Classification techniques are commonly used to reveal data patterns hidden in large datasets, and have been extensively studied in the field of machine learning. Various algorithms have been developed for constructing classifiers [1], but research shows that no single algorithm outperforms the others theoretically or empirically in all scenarios [2], so it is often unclear which algorithm to use when facing a practical classification problem. To deal with this issue, ensemble classifiers have been proposed. An ensemble consists of a group of classifiers and classifies instances based on the decisions of all its members. Research shows that an ensemble of simple classifiers may achieve better classification performance than any single sophisticated classifier [3,4], and many ensemble approaches have been proposed [5–9].

Both accuracy and diversity play important roles in constructing ensemble classifiers, and many works focus on obtaining a group of accurate classifiers. Among the ensemble approaches, bagging [5] and boosting [6,10–12] are effective and have been extensively studied. Bagging adopts different bootstrap samples to generate diverse classifiers, and boosting constructs ensemble classifiers from the original training data with instance weights updated for each classifier. Ensemble classifiers can achieve remarkable performance, but they may contain redundancy, and running a large number of classifiers requires a large amount of memory and slows down classification.


If only part of the ensemble is invoked when classifying newly arriving samples, the computational cost can be reduced. Many research works have therefore tried to select a subset of the ensemble classifiers without sacrificing performance. Zhou et al. [13] proved the "many-could-be-better-than-all" theorem, and studies show that it is possible to obtain a small yet strong ensemble [14–20]. Selecting an optimal classifier subset is difficult, since it requires a combinatorial search with exponential time complexity. In this paper, we propose a method that chooses part of the generated weak classifiers while simultaneously considering, in one model, both the accuracy of each classifier and the diversity between pairs of classifiers.

Learning systems are very common on the internet, and learning from different sources or teachers has existed for a long time. The behaviors or teaching styles of teachers differ considerably, and some may have negative effects on learning, so it is necessary to distinguish the positive teachers from the negative ones. If we consider each classifier as a teacher, then a classifier ensemble can be regarded as a multiple-teacher system in which each classifier is responsible for labeling newly arriving samples. Teachers can be categorized into responsible and irresponsible ones based on their influence on students. All responsible teachers are very similar, since they try to convey the same correct semantic concept, and thus form a cluster. Similarly, the classifiers can be partitioned into similar and dissimilar ones; the similar ones make positive contributions to the ensemble with high probability and should therefore form a cluster. Based on this assumption, we consider accuracy and diversity in one model for evaluating classifier similarity, and adopt clustering techniques to partition the ensemble classifiers.


This paper is organized as follows. Section 2 details the related works in the literature. Section 3 defines the classifier similarity, and introduces how the spectral clustering approach is used to cluster classifiers. Section 4 explains the experimental results, and Section 5 concludes the study.

2. Related works

Since a large number of weak classifiers in an ensemble incurs computational and storage costs, and the "many-could-be-better-than-all" theorem has been proved [13], many approaches have been proposed for selecting an optimal classifier subset [21–23]. Ensemble classifier selection approaches can be categorized as static or dynamic based on whether the selected classifier subset changes when different patterns are classified: approaches that keep the subset unchanged are static, and approaches that employ different classifier subsets to classify different patterns are dynamic [24,25]. Tsoumakas et al. [26] categorized ensemble pruning approaches differently into four types: search based methods, clustering based methods, optimization based methods and other methods. Clustering based algorithms rely on a notion of distance to cluster the constructed classifiers.

Martínez-Muñoz et al. [16] analyzed several ordered aggregation based ensemble pruning techniques and evaluated their performance on benchmark datasets. They concluded that ordered aggregation techniques can sometimes generate effective pruned bagging ensembles. The investigated pruning approaches employ specific metrics to rank classifiers, or perform a heuristic search in the classifier space while evaluating the collective merit of a candidate classifier subset [19,27–30]. Other ensemble pruning techniques employ genetic algorithms [31], a randomized greedy selective strategy with ballot [32], or semi-definite programming [20] to perform classifier pruning. Rokach [33] took into account the predictive capability of classifiers along with the degree of redundancy among them, and selected a high-accuracy, low inter-agreement classifier subset. The approach implements a best-first search strategy in a huge search space of size 2^n (n being the number of classifiers), and reports that over half of the original classifiers can be pruned. Aksela and Laaksonen [34] proposed a classifier selection method based on an exponential diversity error measure, and evaluated their approach on handwritten character patterns. Meynet and Thiran [35] took classifier diversity and accuracy into account in the definition of an information theoretic score (ITS), and selected the classifier subset with the optimal ITS; the ITS is built by selecting one classifier at each iteration to maximize its value, it is not differentiable, and its calculation incurs a large time complexity.

Different from static approaches, Xiao et al. [36] proposed a dynamic classifier selection approach for noisy data classification, and introduced several data handling methods for dynamic classifier selection. Statistical analysis and experimental results show that their approach has stronger noise immunity than several other strategies. Many other dynamic approaches have also been proposed [24,37–39]. Zhu [40] integrated data envelopment analysis and stacking, and described a hybrid approach to classifier selection. Bakker and Heskes [41] proposed a clustering method for ensemble classifier extraction, in which a small collection of representative entities is used to represent a large entity collection; the small representative model set is extracted by clustering the constructed models. Different from most classifier pruning approaches, these small representative models are not part of the original ones.

Based on an analysis of the relationship between their on-line allocation algorithm and boosting, Freund and Schapire [42] proposed variants of AdaBoost and proved an error bound for each variant. To handle the classification of datasets with overlapping patterns from different classes, Verma and Rahman [43] first clustered the data and used a group of component classifiers to learn the decision boundaries between pairs of clusters; a fusion classifier then makes the class decision based on the decisions of the component learners. Different ensemble construction approaches may lead to quite different performance, and many classifier ensemble approaches have been reported. In this paper, we are only interested in how to prune an already constructed ensemble to improve its performance.

3. The proposed approach

The research works mentioned above show that, if both accuracy and diversity are taken into account in the process of classifier selection, the performance of the pruned ensemble may be improved. In this study, we focus on static classifier selection approaches and propose a classifier similarity concept. The classifiers are used to construct a graph with weighted edges, and spectral clustering is employed to analyze classifier aggregation, based on the assumption that highly similar classifiers aggregate into one group and lowly similar classifiers aggregate into another.

3.1. Classifier similarity

For each classifier, we calculate its classification accuracies on the training datasets used to construct the weak classifiers, and use the obtained accuracies to construct an accuracy vector. Let D_1, D_2, ..., D_n denote the n training datasets, and let classifier h_i be modeled on D_i. Given h_i's classification accuracies a_i^1, a_i^2, ..., a_i^n on D_1, D_2, ..., D_n, we use a_i = (a_i^1, a_i^2, ..., a_i^n)^T to represent the accuracy vector of h_i. Since we evaluate the prediction performance on different samples drawn from the original training data, the resulting estimates may suffer from training bias. This is not a serious problem, because every sample has an equal chance to influence a classifier's performance when its accuracy vector is calculated, so the effect of training bias is attenuated across the vector entries.

For each pair of classifiers h_i and h_j with corresponding accuracy vectors a_i and a_j, the accuracy similarity between them is defined as

S_{ij}(a) = \begin{cases} \sqrt{(a_i \cdot a_j)/n}, & i \neq j \\ 0, & i = j \end{cases}   (1)

where "\cdot" denotes the scalar product of two vectors. S_ij(a) ∈ [0,1] is large if both classifiers perform well on all the sampled datasets, and small otherwise. It is also symmetric for each pair of classifiers, since S_ij(a) = S_ji(a).

Diversity also plays an important role in the success of an ensemble [44], and it can be viewed as a measure of dependence, complementarity or even orthogonality among classifiers [45]. Diverse classifier ensembles are preferred, and Giacinto and Roli [46] stated that ensemble classifiers should be accurate and diverse. Many diversity measures exist [47,48], and none has been proved to be the best. We employ the widely used Q-statistic [45] to calculate the diversity of a pair of classifiers. Given a dataset D, Q is calculated as

Q = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}   (2)

where N^{11} is the number of samples correctly classified by both classifiers, N^{00} is the number of samples wrongly classified by both, N^{01} is the number of samples wrongly classified by the first classifier but correctly classified by the second, and N^{10} is the number of samples correctly classified by the first classifier but wrongly classified by the second. If one of N^{11}, N^{00}, N^{01} and N^{10} equals the size of D or zero, then Q is set to 1. We calculate Q on D_i and on D_j for h_i and h_j, and use Q_ij and Q_ji to denote the values obtained on D_i and D_j, respectively. Both Q_ij and Q_ji lie in [-1, 1], and the smaller the Q value, the more diverse the two classifiers. After the transformation (3), both lie in [0, 1], and the larger the value, the more diverse the two classifiers:

Q_{ij} = (1 - Q_{ij})/2, \qquad Q_{ji} = (1 - Q_{ji})/2   (3)

We define the diversity similarity between h_i and h_j as

S_{ij}(d) = \sqrt{Q_{ij} \, Q_{ji}}   (4)

Q is obtained by applying two classifiers to one dataset, so we cannot define a diversity vector for each classifier with the Q values as entries. The transformed Q is larger when the two classifiers disagree with each other on more instances. It is obvious that S_ij(d) is also symmetric and nonnegative, since S_ij(d) = S_ji(d), and S_ii(d) equals 0 for each i according to (2)–(4). Since both S_ij(d) and S_ij(a) lie in [0, 1], we consider both of them in one model, and define the similarity between h_i and h_j as

S_{ij} = \lambda S_{ij}(a) + (1 - \lambda) S_{ij}(d)   (5)

where λ ∈ [0,1] balances the influence of the two terms of (5). S_ij ∈ [0,1] is also symmetric and nonnegative, and S_ii equals 0 for each i. We also define the similarity between h_i and h_j as the geometric mean

S_{ij} = \sqrt{S_{ij}(a) \, S_{ij}(d)}   (6)
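The definitions (1)–(5) translate directly into array code. The following is a minimal sketch, not the authors' implementation; the names acc and correct are illustrative, with acc an m x n matrix of accuracies and correct[i][k] assumed to hold the boolean correctness vector of classifier h_i on dataset D_k.

```python
import numpy as np

def accuracy_similarity(acc):
    """Eq. (1): S_ij(a) = sqrt((a_i . a_j) / n) for i != j, 0 on the diagonal.
    acc is an (m, n) array; row i holds classifier h_i's accuracies on D_1..D_n."""
    m, n = acc.shape
    S = np.sqrt(acc @ acc.T / n)
    np.fill_diagonal(S, 0.0)
    return S

def q_statistic(correct_i, correct_j):
    """Eq. (2): Q-statistic of two classifiers from their boolean correctness
    vectors on one dataset. Returns 1 when a count equals 0 or the dataset size."""
    n11 = np.sum(correct_i & correct_j)
    n00 = np.sum(~correct_i & ~correct_j)
    n01 = np.sum(~correct_i & correct_j)
    n10 = np.sum(correct_i & ~correct_j)
    if 0 in (n11, n00, n01, n10) or len(correct_i) in (n11, n00, n01, n10):
        return 1.0
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def similarity_matrix(acc, correct, lam=0.5):
    """Combine accuracy and diversity similarity as in Eq. (5)."""
    m = acc.shape[0]
    Sa = accuracy_similarity(acc)
    Sd = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            q_ij = q_statistic(correct[i][i], correct[j][i])   # Q on D_i
            q_ji = q_statistic(correct[i][j], correct[j][j])   # Q on D_j
            # Eq. (3): map to [0, 1] so that larger means more diverse
            q_ij, q_ji = (1 - q_ij) / 2, (1 - q_ji) / 2
            Sd[i, j] = Sd[j, i] = np.sqrt(q_ij * q_ji)         # Eq. (4)
    return lam * Sa + (1 - lam) * Sd                           # Eq. (5)
```

Replacing the last line with np.sqrt(Sa * Sd) would give the geometric-mean similarity of Eq. (6) used by SC-gm.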

3.2. Spectral clustering on classifiers

It is straightforward to implement clustering on the classifiers. Since the similarity between two classifiers is nonnegative, we conduct spectral clustering [49] on them. In spectral clustering, a similarity graph is constructed in which each vertex represents a data point. Two vertices are connected by an edge if the similarity between them is positive or larger than a specified threshold, and the edge is weighted by the similarity. Here each classifier is regarded as a vertex and is connected to the others by similarity-weighted edges; the weight assigned to each edge is the similarity between the corresponding two classifiers. Spectral clustering is applied to the classifier graph to divide the classifiers into groups, so that classifiers in the same group are similar while classifiers in different groups are dissimilar. The similarity matrix W has entry S_ij in the ith row and jth column, and the diagonal matrix D has entries

d_{ii} = \sum_{j=1}^{n} S_{ij}   (7)

The normalized Laplacian matrix is defined as

L = I - D^{-1/2} W D^{-1/2}   (8)

where I is an n x n identity matrix. After the eigenvalues and eigenvectors of L have been obtained, the two eigenvectors v_1 and v_2 corresponding to the two smallest eigenvalues are selected and used to build an n x 2 matrix [v_1, v_2]. The row number of [v_1, v_2] corresponds to the classifier number in the original ensemble. Each row of [v_1, v_2] is regarded as a data entry, and any classical clustering approach can be applied to the n two-dimensional data points. We choose k-means [1] to aggregate them into clusters, and set the number of clusters to 2, since we want to select one subset of classifiers for the ensemble and discard the rest. If more than two clusters were specified, say k clusters, we would have to evaluate each cluster or each combination of clusters, that is, 2^k cluster combinations. The globally optimal subset could be obtained after all combinations have been evaluated, but this is a combinatorial search problem and is time consuming, so we set the number of clusters to 2 to simplify the problem.

Spectral clustering divides the similarity graph into two sub-graphs, each corresponding to one classifier cluster. We calculate the averaged classifier similarity of each cluster as the ratio between the sum of edge weights and the number of edges of its corresponding sub-graph, and select the cluster with the larger averaged similarity. The algorithm for picking the subset of bagging ensemble classifiers (SC) is described in Table 1. When formula (5) is replaced with (6), the algorithm is renamed SC-gm.

Table 1
Spectral clustering algorithm for picking the subset of ensemble classifiers (SC).

Input: m classifiers and the parameter λ
Output: n classifiers selected from the m classifiers (n ≤ m)
1. for i, j = 1 to m: calculate the similarity S_ij between classifiers h_i and h_j according to (1)–(5).
2. Construct the normalized Laplacian matrix L using (7) and (8), and calculate its two eigenvectors v_1 and v_2 corresponding to the two smallest eigenvalues.
3. Cluster the rows of [v_1, v_2] into two groups using the k-means algorithm.
4. Calculate the averaged similarity for each group of classifiers, and choose the group with the larger averaged similarity.
5. Use the classifiers whose indices correspond to the selected rows of [v_1, v_2] to build up the pruned ensemble.

The cost of computing all the eigenvalues and eigenvectors of an n x n matrix is cubic in n. As we only need the two eigenvectors corresponding to the two smallest eigenvalues, more efficient approaches can be used to decrease the computational cost. For example, the inverse power method [50,51] can be used to find the eigenvector with the smallest eigenvalue, and has O(n^2) time complexity. Applying k-means to cluster the row vectors of [v_1, v_2] has O(knt) time complexity, where k (2 in this paper) is the number of clusters, n is the number of constructed classifiers, and t is the number of iterations for clustering. Calculating the accuracy vectors and the accuracy similarities has O(n^2) time complexity, and calculating the diversity similarities also has O(n^2) time complexity. So the classifier selection method proposed in this paper has O(n^2 + knt) time complexity.
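The pruning step of Table 1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: numpy's eigh and scikit-learn's KMeans stand in for the eigensolver and the k-means step, and the input S is a similarity matrix built from Eq. (5) or (6), for instance by the similarity_matrix helper above.

```python
import numpy as np
from sklearn.cluster import KMeans

def sc_prune(S):
    """Prune an ensemble given its m x m classifier-similarity matrix S.
    Returns the indices of the retained classifiers (steps 2-5 of Table 1)."""
    m = S.shape[0]
    d = S.sum(axis=1)                                           # Eq. (7)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))   # guard against isolated vertices
    L = np.eye(m) - D_inv_sqrt @ S @ D_inv_sqrt                 # Eq. (8), with W = S
    eigvals, eigvecs = np.linalg.eigh(L)                        # L is symmetric
    V = eigvecs[:, :2]                                          # eigenvectors of the two smallest eigenvalues
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(V)

    def avg_similarity(idx):
        # averaged edge weight of the sub-graph spanned by one cluster
        if len(idx) < 2:
            return 0.0
        sub = S[np.ix_(idx, idx)]
        n_edges = len(idx) * (len(idx) - 1) / 2
        return sub[np.triu_indices(len(idx), k=1)].sum() / n_edges

    clusters = [np.where(labels == c)[0] for c in (0, 1)]
    return max(clusters, key=avg_similarity)                    # cluster with larger averaged similarity
```

The returned indices correspond to the rows of [v_1, v_2] kept in step 5; majority voting over the retained classifiers then gives the pruned ensemble's prediction.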

4. Experiments

The common ways of generating ensemble classifiers include varying the input data and changing the learning algorithm or model parameters. We focus on the case of varying the input data, and prune bagging ensembles. Standard CART trees (pruned) [52] are commonly used inductive learning algorithms and are adopted as base classifiers. Because it is difficult to determine the optimal parameter λ and the number of base classifiers in advance, we set λ to 1, 0.8, 0.6, 0.4, 0.2 and 0, and set m to 100. Research shows that, when the number of classifiers becomes large, the classification error of bagging asymptotically tends to a constant [50], and overfitting may occur in boosting [12]. We conduct 10-fold cross validation on the data: one fold is used as testing data and the other nine folds are used as training data. The final results are the average of five 10-fold cross validation executions, so each experiment consists of 50 executions on each dataset. The experiments are conducted on 20 extensively studied benchmark datasets from the UCI repository [53]; a summary of the datasets is given in Table 2.

We evaluate the effect of λ on the performance of the pruned ensemble. If λ equals 1, only classification accuracy is considered for classifier pruning, and if λ equals 0, only classifier diversity is considered. When λ changes from 1 to 0, the influence of accuracy decreases and the influence of diversity increases. The classifier similarity defined by (6) can be considered as the geometric mean of both similarities, and we also use this definition to evaluate the proposed classifier pruning approach. In order to compare the performance of SC and SC-gm with other pruning methods proposed in the literature, the same bagging ensemble is pruned using reduce-error pruning (RE) [28], Orientation Ordering (OO) [30], semi-definite programming (SDP) [20], the Complementarity Measure (CC) [29] and Kappa pruning (Kappa) [28].
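A sketch of this experimental pipeline under the stated setup (100 bootstrap CART trees per training fold, 10-fold cross validation) is shown below. It is an assumption-laden illustration: scikit-learn's DecisionTreeClassifier stands in for the pruned CART learner (cost-complexity pruning via ccp_alpha could approximate the pruned trees), the prune callback is a placeholder for a pruning method such as SC, and X, y are numpy arrays with integer-encoded class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold

def build_bagging_ensemble(X_train, y_train, m=100, random_state=0):
    """Train m CART-style trees on bootstrap samples of the training fold."""
    rng = np.random.RandomState(random_state)
    trees, bags = [], []
    for _ in range(m):
        idx = rng.randint(0, len(X_train), size=len(X_train))   # bootstrap sample D_i
        trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))
        bags.append(idx)
    return trees, bags

def majority_vote(trees, X):
    """Unweighted vote; assumes integer class labels."""
    votes = np.array([t.predict(X) for t in trees])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def cv_error(X, y, prune=None, n_splits=10):
    """One 10-fold cross-validation run; the paper averages five such runs."""
    errors = []
    for train, test in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        trees, bags = build_bagging_ensemble(X[train], y[train])
        if prune is not None:                  # e.g. the SC pruning of Section 3
            keep = prune(trees, bags, X[train], y[train])
            trees = [trees[i] for i in keep]
        pred = majority_vote(trees, X[test])
        errors.append(np.mean(pred != y[test]))
    return np.mean(errors)
```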

Table 2
Characteristics of datasets.

Dataset       No. of instances  No. of attributes  Class
Audio         226               69                 24
Australian    690               14                 2
Breast W.     699               9                  2
Diabetes      768               8                  2
Ecoli         336               7                  8
Glass         214               9                  6
Heart         270               13                 2
Horse-colic   368               21                 2
Ionosphere    351               34                 2
Labor         57                16                 2
Liver         345               6                  2
Pendigits     10,992            16                 10
Segment       2310              19                 7
Sonar         208               60                 2
Spam          4601              57                 2
Tic-tac-toc   958               9                  2
Vehicle       846               18                 4
Votes         435               16                 2
Waveform      300               21                 3
Wine          178               13                 3

We construct 100 component classifiers on Glass, and use both SC (λ = 0.5) and SC-gm to calculate the two-dimensional vectors of [v1, v2]. The data points are clustered into two groups and are plotted in Fig. 1. Different classification errors are obtained when λ is assigned different values, and the optimal λ is different for each dataset. For performance comparison, we obtain the classification error rates of SC, RE, OO, SDP, CC and Kappa on each dataset under their corresponding conditions. The results described in Table 3 show that, on most of the 20 datasets, SC outperforms the other algorithms; it only loses on Wine and Liver. The averaged errors on the 20 datasets are also given in Table 3, and they show that SC outperforms the others, while SC-gm performs as well as SDP.

We also compare the average ranks of the algorithms on the 20 datasets. All algorithms are ranked on each dataset according to their classification errors: the best one gets a rank of 1, the second best a rank of 2, and so on. In case of ties, average ranks are assigned [54]. r_j denotes the average rank of the jth algorithm over all datasets, and is calculated as r_j = (1/n) \sum_{i=1}^{n} r_{ji}, where n is the number of datasets and r_{ji} is the rank of the jth algorithm on the ith dataset. The results described in Table 4 show that the difference between the average rank of SC and that of the others is obvious.

The spectral clustering approach is employed to analyze classifier aggregation based on classifier similarities. We make the simple assumption that good classifiers aggregate into one group and poor ones into another. Good classifiers are accurate and diverse ones, while poor classifiers are accurate but non-diverse, inaccurate but diverse, or inaccurate and non-diverse. According to the definitions of formulae (5) and (6), accurate and diverse classifiers have high similarity, and the other classifiers have low similarity. Similarity is the weight assigned to each edge of the classifier graph, and when we aggregate the nodes into two groups, those connected by large weights are clustered into one group.

A statistical test is also used to assess whether there are significant differences between SC and the other algorithms. We use the paired two-tailed t-test [55] based on the classification accuracies of the algorithms, since it is commonly used to compare classification algorithms. The difference between two algorithms is measured by the p-value of the t-test, which represents the probability that the two sets of compared samples come from distributions with the same mean.
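A small sketch of the two comparisons described above is given below; scipy's rankdata (which averages tied ranks) and ttest_rel (the paired two-tailed t-test) are assumed, and the array names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata, ttest_rel

def average_ranks(errors):
    """errors: (n_datasets, n_algorithms) array of classification errors.
    Rank each row (best error = rank 1, ties share the average rank) and
    return r_j, the mean rank of each algorithm over all datasets."""
    ranks = np.vstack([rankdata(row) for row in errors])
    return ranks.mean(axis=0)

def compare_accuracies(acc_a, acc_b):
    """Paired two-tailed t-test on per-execution accuracies of two algorithms.
    A p-value below 0.05 is read as a significant difference."""
    t_stat, p_value = ttest_rel(acc_a, acc_b)
    return p_value
```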

Fig. 1. Distribution of [v1, v2] in a two-dimensional space (panels: SC with λ = 0.5, and SC-gm; axes: v1, v2).


Table 3
Averaged classification errors and standard deviations for ensembles of CART trees (the optimal values are in boldface in the original).

Dataset      SC         SC-gm      RE         OO         SDP        CC         Kappa      Bagging
Audio        20.3±7.4   21.4±7.3   21.1±7.4   21.1±7.7   21.0±7.5   21.1±7.9   27.8±7.5   26.7±8.1
Australian   13.4±4.0   13.7±4.1   13.7±4.0   13.7±4.0   13.4±4.1   13.7±4.1   13.7±4.1   14.6±3.7
Breast W.    3.5±2.1    3.8±2.4    4.2±2.5    4.3±2.6    3.8±2.3    4.2±2.4    5.0±2.9    3.5±2.3
Diabetes     22.3±3.2   23.4±4.0   24.6±4.0   24.4±4.2   24.5±4.3   23.9±3.9   24.3±3.9   24.5±5.0
Ecoli        15.2±4.5   16.4±5.8   16.2±5.7   15.8±5.5   15.8±5.5   16.1±6.1   18.6±5.9   17.8±6.0
Glass        23.2±6.3   24.8±7.0   24.5±7.6   24.3±7.7   24.7±8.0   24.7±7.5   30.8±8.0   29.9±7.0
Heart        17.1±6.8   19.6±7.3   19.5±7.4   17.5±7.1   17.8±7.1   19.0±7.0   18.0±6.8   19.8±8.1
Horse-colic  14.5±5.9   14.6±5.3   15.7±5.7   15.0±5.7   16.5±6.0   16.4±5.8   16.8±6.1   18.0±6.0
Ionosphere   6.8±3.8    8.5±4.5    7.5±4.5    7.8±4.2    7.1±4.0    7.3±4.0    8.1±4.3    7.41±4.2
Labor        6.7±10.0   7.9±11.6   11.9±12.3  8.0±10.0   8.2±11.2   14.0±13.1  10.3±12.1  13.8±12.6
Liver        28.2±6.9   28.1±7.0   27.7±7.0   28.8±7.4   28.2±6.7   28.1±6.7   30.1±6.0   32.1±5.9
Pendigits    1.6±0.3    1.9±0.4    1.8±0.5    1.8±0.4    1.7±0.4    1.8±0.4    2.5±3.0    2.1±0.3
Segment      2.5±1.0    2.6±0.9    2.6±1.1    2.5±1.1    2.5±1.2    2.6±1.0    3.8±1.1    3.0±1.1
Sonar        18.9±9.0   21.4±8.9   21.0±9.5   21.3±9.0   19.6±9.7   21.6±9.2   24.5±9.4   25.2±10.0
Spam         6.4±1.6    6.4±1.6    6.4±1.7    6.5±1.6    6.4±1.6    6.5±1.6    6.8±1.6    7.0±1.5
Tic-tac-toc  1.1±1.1    1.1±1.0    1.6±1.2    1.5±1.1    1.2±1.1    1.6±1.2    3.1±1.6    1.8±1.2
Vehicle      22.1±3.3   24.2±3.4   25.5±4.0   25.2±4.4   25.2±4.4   25.8±4.0   29.0±4.0   29.0±3.6
Votes        3.7±2.9    3.7±2.9    4.6±3.0    5.2±3.2    4.7±3.1    4.7±3.2    4.7±3.1    4.5±3.1
Waveform     18.0±1.1   18.7±1.0   20.5±1.4   20.2±1.3   20.2±1.3   20.4±1.4   22.1±1.8   23.0±2.1
Wine         3.8±1.7    3.3±1.5    4.0±2.8    3.9±2.1    3.2±2.8    4.5±3.8    5.9±3.3    4.4±3.8
Average      12.6±4.1   13.3±4.4   13.7±4.7   13.4±4.5   13.3±4.6   13.9±4.7   15.3±4.8   15.4±4.8

Table 4
Averaged ranks for ensembles of CART trees (the optimal values are in boldface in the original; rank* denotes the average rank on the 20 datasets).

Dataset      SC    SC-gm  RE    OO    SDP   CC    Kappa  Bagging
Audio        1     6      4     4     2     4     8      7
Australian   1     5      5     5     2     5     5      8
Breast W.    1     3.5    5.5   7     3.5   5.5   8      2
Diabetes     1     2      8     5     6.5   3     4      6.5
Ecoli        1     6      5     7.5   7.5   4     8      7
Glass        1     6      3     2     4.5   4.5   8      7
Heart        1     7      6     2     3     5     4      8
Horse-colic  1     2      4     3     6     5     7      8
Ionosphere   1     8      5     6     2     3     7      4
Labor        1     2      6     3     4     8     5      7
Liver        4.5   2.5    1     6     4.5   2.5   7      8
Pendigits    1     6      4     4     2     4     8      7
Segment      2     5      5     2     2     5     8      7
Sonar        1     5      3     4     2     6     7      8
Spam         2     2      4     5.5   2     5.5   7      8
Tic-tac-toc  1.5   1.5    5.5   4     3     5.5   8      7
Vehicle      1     2      5     3.5   3.5   6     7.5    7.5
Votes        1.5   1.5    4     8     6     6     6      3
Waveform     1     2      6     3.5   3.5   5     7      8
Wine         3     2      5     4     1     7     8      6
rank*        1.4   3.9    4.7   4.5   3.5   5.0   6.9    6.7

A small p-value indicates a large difference between the two average values, and 0.05 is the typical threshold below which the difference is considered statistically significant. We report the p-values on each dataset when comparing SC with the other algorithms, and the results given in Table 5 show that SC has better generalization performance most of the time. For example, on Audio the p-value is 0.0203, which is less than 0.05; this means SC performs significantly better than Bagging on Audio at the 0.05 significance level.

The pruning rate of each algorithm is reported in Table 6. We split each dataset into 10 approximately equal partitions; each partition in turn is used for testing and the remainder for training. Each time, we draw instances with replacement from the nine-fold data t times to construct a classifier (t equals the size of the nine-fold data), repeat this 100 times to construct 100 classifiers, and obtain the pruning rate of each algorithm. The procedure is repeated 10 times so that, in the end, each fold has been used exactly once for testing, and the mean of the 10 pruning rates of each algorithm is calculated. The fractional part of the averaged pruning rates is ignored. For calculating the pruning rate of RE [28], we first add the classifier with the lowest classification error to the subensemble, and then sequentially add the remaining classifier that makes the classification error of the subensemble as low as possible; the process terminates when the classification error of the subensemble no longer decreases, and the pruning rate is the ratio between the size of the subensemble and the total number of classifiers. For calculating the pruning rate of Kappa [28], we order pairs of classifiers in descending order of the diversity between them, and incorporate the first classifiers with κ < 0.5 into the ensemble; the pruning rate is again the ratio between the size of the subensemble and the total number of classifiers. For calculating the pruning rate of CC [29], we first add the classifier with the lowest classification error, and then at each iteration add the classifier whose performance is most complementary to the subensemble; the iteration terminates when no classifier can correctly classify more than half of the examples misclassified by the current subensemble. For calculating the pruning rate of OO [30], the classifiers are ordered by increasing angle between their signature vectors and a reference vector, and classifiers whose angle is greater than π/2 are not included in the final ensemble. For calculating the pruning rate of SDP [20], k in the constraint of the programming formulation is a positive integer varied from 20 to 80; for each value of k, we solve the SDP and obtain the classification error of the selected subset on the given dataset, choose the k that corresponds to the optimal classification error, and take the pruning rate as the ratio between k and the size of the original ensemble.

The results show that the pruning rates differ across algorithms and datasets. On the 20 datasets, OO performs the best and Kappa performs the worst most of the time. The results also show that the averaged pruning rates of SC, SC-gm, OO and SDP exceed 50%, and that SC only loses to OO. Fig. 2 shows the classifier selection rate of SC on Glass and Votes when λ is assigned different values; it is clear that λ influences the classifier clustering. We evaluate the influence of λ on the classification accuracy using Glass, Ionosphere, Votes and Wine, and show the results in Fig. 3.
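The reduce-error (RE) baseline described above can be sketched as a greedy loop; this is an illustrative reading of the procedure, not the original implementation, and it assumes preds holds each classifier's integer-label predictions on the pruning data.

```python
import numpy as np

def reduce_error_prune(preds, y):
    """Greedy RE selection. preds: (m, n) array of the m classifiers' predictions
    on the pruning data, y: true labels (integer-encoded). Starts from the single
    most accurate classifier and keeps adding the classifier that lowers the
    sub-ensemble error most, stopping when no addition decreases the error."""
    def ensemble_error(idx):
        vote = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds[idx])
        return np.mean(vote != y)

    m = preds.shape[0]
    selected = [int(np.argmax([(preds[i] == y).mean() for i in range(m)]))]
    remaining = set(range(m)) - set(selected)
    best = ensemble_error(selected)
    while remaining:
        cand, err = min(((i, ensemble_error(selected + [i])) for i in remaining),
                        key=lambda t: t[1])
        if err >= best:          # stop when the error no longer decreases
            break
        selected.append(cand)
        remaining.remove(cand)
        best = err
    return selected
```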


Table 5
p-values of the paired t-test on 20 UCI datasets.

Dataset      SC/SC-gm  SC/RE   SC/OO   SC/SDP  SC/CC   SC/Kappa  SC/Bagging
Audio        0.6835    0.6127  0.6078  0.5160  0.6022  0.0124    0.0203
Australian   0.9987    0.9699  0.8643  0.9997  0.9346  0.8942    0.0457
Breast W.    0.7843    0.1026  0.1347  0.2218  0.1965  0.0438    0.8769
Diabetes     0.0728    0.0436  0.0411  0.0429  0.0683  0.0457    0.0391
Ecoli        0.2318    0.3014  0.8642  0.8750  0.2451  0.0114    0.0201
Glass        0.0489    0.0518  0.0520  0.0498  0.0500  0.0102    0.0393
Heart        0.0328    0.0229  0.7281  0.7010  0.0344  0.4687    0.0203
Horse-colic  0.9868    0.0756  0.6261  0.0417  0.0489  0.0401    0.0210
Ionosphere   0.0378    0.0589  0.0510  0.3710  0.3528  0.0469    0.2987
Labor        0.0982    0.0213  0.0500  0.4786  0.0086  0.0121    0.0095
Liver        0.9986    0.4765  0.9763  0.9679  0.9965  0.2105    0.0432
Pendigits    0.6542    0.6971  0.7013  0.8642  0.6952  0.1033    0.2349
Segment      0.8991    0.9002  0.9992  0.9435  0.9210  0.0497    0.3425
Sonar        0.0487    0.0492  0.0490  0.1753  0.4034  0.0211    0.0145
Spam         1         0.9951  0.9679  1       1       0.8973    0.8210
Tic-tac-toc  0.9998    0.5321  0.6655  0.9424  0.5235  0.0437    0.1031
Vehicle      0.0872    0.0543  0.0561  0.0550  0.0614  0.0249    0.0218
Votes        1         0.0932  0.0486  0.0843  0.0798  0.0810    0.1013
Waveform     0.6734    0.0498  0.0510  0.0511  0.0498  0.0211    0.0189
Wine         0.7426    0.8963  0.9762  0.3271  0.2764  0.0470    0.0986

Table 6
Pruning rate comparison.

Dataset      SC (%)  SC-gm (%)  RE (%)  OO (%)  SDP (%)  CC (%)  Kappa (%)
Audio        56      63         32      78      63       52      23
Australian   65      24         48      80      74       39      31
Breast W.    81      58         16      76      58       81      19
Diabetes     62      62         38      78      70       45      29
Ecoli        56      45         29      81      51       72      31
Glass        52      56         26      74      55       40      27
Heart        84      47         43      81      63       28      31
Horse-colic  58      62         39      78      52       46      42
Ionosphere   52      59         46      71      45       52      36
Labor        84      86         50      71      48       41      29
Liver        66      73         37      70      56       64      28
Pendigits    65      59         54      75      73       50      30
Segment      80      64         51      70      68       63      24
Sonar        52      64         42      72      48       39      26
Spam         66      56         39      67      57       40      31
Tic-tac-toc  75      49         30      75      49       29      25
Vehicle      75      66         37      73      46       40      20
Votes        51      14         46      74      53       35      21
Waveform     70      79         45      72      64       35      24
Wine         78      57         43      69      54       55      31
Averaged     66      57         40      74      57       47      28

Fig. 2. Classifier selection rates of SC on two datasets (Glass and Votes) for different values of λ.

Since the optimal classification error rates of SC reported in Table 3 are obtained when the optimal λ is chosen from {0, 0.2, 0.4, 0.6, 0.8, 1}, which may be a local optimum, it could be a good idea to explore different λ values. We tune λ from 0 to 1 with a small step size to learn its optimal value. Since a very small change of λ may have no noticeable influence on the similarity matrix and the classification error, we set the step size to 0.05 and change λ gradually. The results drawn in Fig. 3 show that several local minima exist for each dataset. The optimal error rate of SC reported in Table 3 may not be the global optimum, but it is near the global optimum.

In the above experiments, the optimal value of λ is found by cross validation using the training data. We partition each dataset into 10 approximately equal parts, and each part in turn is left unused while the remainder is used for training. Each time, we draw instances with replacement from the nine-fold data t times to construct a classifier (t equals the size of the nine-fold data), and repeat this 100 times to construct 100 classifiers. We calculate the similarity between two classifiers according to formula (5) for a given value of λ, and cluster the classifiers. We choose the cluster of classifiers with the larger averaged classifier similarity and evaluate it on the unused fold. The procedure is repeated 10 times so that, in the end, each fold is left out exactly once. We calculate the mean of the 10 classification errors, then change the value of λ and repeat the above procedure to obtain another mean error. The process terminates when every value of λ has been tried, and we choose the value of λ that corresponds to the smallest mean error for each dataset. Other approaches can also be used to learn the value of λ for each training dataset.

It is also a reasonable idea to explore different λ values and select the one that gives the smallest ensemble, so we run experiments that choose λ for each dataset based on the number of selected classifiers. For each dataset, the optimal λ could be obtained by tuning λ in very small steps, but this is time consuming. To simplify the exploration, we set λ to 0, 0.2, 0.4, 0.6, 0.8 and 1 in turn, and learn the pruned ensemble for each value; the value that yields the smallest ensemble is chosen. We partition each dataset into 10 approximately equal parts, and each part in turn is left unused while the remainder is used for training. Each time, we draw instances with replacement from the nine-fold data t times to construct a classifier (t equals the size of the nine-fold data), and repeat this 100 times to construct 100 classifiers.
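The λ selection by cross validation described above amounts to a simple grid search, sketched below; cv_error_for_lambda is a placeholder assumed to run the whole prune-and-evaluate procedure for one λ value.

```python
import numpy as np

def select_lambda(cv_error_for_lambda, step=0.05):
    """Try lambda = 0, step, 2*step, ..., 1 and keep the value with the
    smallest mean cross-validated error."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    errors = [cv_error_for_lambda(lam) for lam in grid]
    return grid[int(np.argmin(errors))]
```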

Fig. 3. Classification error rates as a function of λ on four datasets. (a) Glass, (b) Ionosphere, (c) Votes, (d) Wine.

Table 7
λ value that corresponds to the smallest ensemble.

Dataset      λ     SC(λ): error ± std.  SC(optimal): error ± std.
Audio        0.2   20.3±7.4             20.3±7.4
Australian   0.4   13.7±3.8             13.4±4.0
Breast W.    0.4   3.7±2.6              3.5±2.1
Diabetes     0.2   22.3±3.2             22.3±3.2
Ecoli        0.8   16.1±6.0             15.2±4.5
Glass        0.4   24.7±7.6             23.2±6.3
Heart        0.8   17.9±7.1             17.1±6.8
Horse-colic  0.4   15.3±5.5             14.5±5.9
Ionosphere   0.4   7.4±4.5              6.8±3.8
Labor        0.2   6.7±10.1             6.7±10.1
Liver        0.8   30.1±7.2             28.2±6.9
Pendigits    0.6   1.7±0.4              1.6±0.3
Segment      0.4   2.6±0.9              2.5±1.0
Sonar        0.6   18.9±9.0             18.9±9.0
Spam         0.8   6.4±1.8              6.4±1.6
Tic-tac-toc  0.4   1.3±1.2              1.1±1.1
Vehicle      0.2   24.1±4.1             22.1±3.3
Votes        0.2   3.7±2.9              3.7±2.9
Waveform     0.4   19.9±1.8             18.0±1.1
Wine         0.2   3.8±1.7              3.8±1.7

We calculate the similarity between two classifiers according to formula (5) for the given value of λ, cluster the classifiers, choose the smaller cluster of classifiers, and count the number of classifiers in it. The procedure is repeated 10 times so that, in the end, each fold is left out exactly once. We calculate the mean number of classifiers in the obtained 10 smaller clusters, then change the value of λ and use the above-mentioned approach to obtain another mean number of classifiers.

The process terminates when every value of λ has been tried. We choose the value that corresponds to the smallest mean number of classifiers for each dataset, and report the results in Table 7. The averaged classification errors and standard deviations of SC are also reported for the selected λ. The results reveal that the selected λ may differ from one dataset to another, and the corresponding classification errors are greater than or equal to the results of SC under optimal conditions. That means the smallest ensemble sometimes may not perform optimally in terms of classification error.

5. Conclusions

It is known that an ensemble using all the weak classifiers at hand may not outperform a subset of them, and many studies on choosing an optimal classifier subset have been carried out. Different from the existing heuristic and programming approaches, this paper takes into account both the classification accuracy of each individual classifier and the diversity of pairs of classifiers simultaneously in one model to evaluate similarities among classifiers, and applies spectral clustering techniques to prune the ensemble. Two similarity concepts are proposed to evaluate the component classifiers, and the component classifiers are viewed as vertices of a graph connected to each other by similarity-weighted edges.


Based on the assumption that similar classifiers should be grouped together, we employ spectral clustering to group the component classifiers into two clusters, and choose the cluster with the larger averaged classifier similarity. The computational cost of the proposed algorithm has been analyzed, and experiments on benchmark datasets have been carried out to evaluate its performance. The results show that the proposed algorithms SC and SC-gm outperform the other ordered aggregation algorithms most of the time, and the average ranks also reveal that the differences between the proposed algorithm and the other algorithms are obvious. The pruning rate of each algorithm is different and varies from one dataset to another; it depends on the number of initial classifiers and the experimental setup [16]. Given 100 initial classifiers, we use 10-fold cross validation to evaluate the pruning rate of each algorithm, and the averaged pruning rate of the proposed algorithm shows that more than half of the initial classifiers can be pruned. The effect of the parameter λ on the performance of the proposed algorithm has also been evaluated, and the results show that λ affects its performance. It is difficult to determine the value of λ in advance; to find a good λ, we simply evaluate values from a finite set by cross validation on the training data and keep the value that gives the best performance. It is reasonable to assume that an optimal λ exists for each dataset, and this problem needs further investigation.

Acknowledgments

The authors thank the editors and the anonymous reviewers for their helpful comments and suggestions. This research is partially supported by the National Natural Science Foundation of China (Nos. 61170145, 61373081), the Specialized Research Fund for the Doctoral Program of Higher Education of China (20113704110001), the Natural Science Foundation of Shandong (No. ZR2010FM021), the Scientific Technology and Development Project of Shandong (No. 2013GGX10125) and the Taishan Scholar Project of Shandong, China.

References

[1] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York Inc., New York, NY, 2006.
[2] J.V. Hansen, Combining predictors: comparison of five meta machine learning methods, Inf. Sci. 119 (1–2) (1999) 91–105.
[3] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Mach. Learn. 40 (2000) 139–158.
[4] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[5] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[6] Y. Freund, R. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 325–332.
[7] C.J. Merz, Using correspondence analysis to combine classifiers, Mach. Learn. 36 (1) (1999) 33–58.
[8] N. García-Pedrajas, Constructing ensembles of classifiers by means of weighted instance selection, IEEE Trans. Neural Netw. 20 (2) (2009) 258–277.
[9] T. Windeatt, Accuracy/diversity and ensemble MLP classifier design, IEEE Trans. Neural Netw. 17 (5) (2006) 1194–1211.
[10] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Mach. Learn. 36 (1–2) (1999) 105–139.
[11] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 161–168.
[12] G. Rätsch, T. Onoda, K.-R. Müller, Soft margins for AdaBoost, Mach. Learn. 42 (3) (2001) 287–320.
[13] Z.-H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artif. Intell. 137 (1–2) (2002) 239–263.
[14] S. Singh, M. Singh, A dynamic classifier selection and combination approach to image region labeling, Signal Process. Image Commun. 20 (3) (2005) 219–231.
[15] A. Ulas, M. Semerci, O.T. Yıldız, E. Alpaydın, Incremental construction of classifier and discriminant ensembles, Inf. Sci. 179 (9) (2009) 1298–1318.

[16] G. Martínez-Muñoz, D. Hernández-Lobato, A. Suárez, An analysis of ensemble pruning techniques based on ordered aggregation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 245–259.
[17] D. Hernández-Lobato, G. Martínez-Muñoz, A. Suárez, Statistical instance-based pruning in ensembles of independent classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 364–369.
[18] L. Chen, M.S. Kamel, A generalized adaptive ensemble generation and aggregation approach for multiple classifier systems, Pattern Recognit. 42 (2009) 629–644.
[19] G. Martínez-Muñoz, A. Suárez, Using boosting to prune bagging ensembles, Pattern Recognit. Lett. 28 (1) (2007) 156–165.
[20] Y. Zhang, S. Burer, W.N. Street, Ensemble pruning via semi-definite programming, J. Mach. Learn. Res. 7 (2006) 1315–1338.
[21] G. Martínez-Muñoz, A. Suárez, Switching class labels to generate classification ensembles, Pattern Recognit. 38 (10) (2005) 1483–1494.
[22] R.E. Banfield, L.O. Hall, K.W. Bowyer, W.P. Kegelmeyer, A comparison of decision tree ensemble creation techniques, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 173–180.
[23] C.D. Stefano, G. Folino, F. Fontanella, A. Scotto di Freca, Using Bayesian networks for selecting classifiers in GP ensembles, Inf. Sci. 258 (2014) 200–216.
[24] L. Kuncheva, Switching between selection and fusion in combining classifiers: an experiment, IEEE Trans. Syst. Man Cybern. Part B 32 (2) (2002) 146–156.
[25] R. Lysiak, M. Kurzynski, T. Woloszynski, Optimal selection of ensemble classifiers using measures of competence and diversity of base classifiers, Neurocomputing 126 (2014) 29–35.
[26] G. Tsoumakas, I. Partalas, I. Vlahavas, A taxonomy and short review of ensemble selection, in: Proceedings of the Workshop on Supervised and Unsupervised Ensemble Methods and their Applications, 2008, pp. 41–46.
[27] H. Zhou, X. Zhao, X. Wang, An effective ensemble pruning algorithm based on frequent patterns, Knowl. Based Syst. 56 (2014) 79–85.
[28] D.D. Margineantu, T.G. Dietterich, Pruning adaptive boosting, in: Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 211–218.
[29] G. Martínez-Muñoz, A. Suárez, Aggregation ordering in bagging, in: Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, 2004, pp. 258–263.
[30] G. Martínez-Muñoz, A. Suárez, Pruning in ordered bagging ensembles, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 609–616.
[31] Z.-H. Zhou, W. Tang, Selective ensemble of decision trees, in: Q. Liu, Y. Yao, A. Skowron (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, Springer, 2003, pp. 476–483.
[32] Q. Dai, A novel ensemble pruning algorithm based on randomized greedy selective strategy and ballot, Neurocomputing 122 (2013) 258–265.
[33] L. Rokach, Collective-agreement-based pruning of ensembles, Comput. Stat. Data Anal. 53 (2009) 1015–1026.
[34] M. Aksela, J. Laaksonen, Using diversity of errors for selecting members of a committee classifier, Pattern Recognit. 39 (2006) 608–623.
[35] J. Meynet, J.-P. Thiran, Information theoretic combination of pattern classifiers, Pattern Recognit. 43 (2010) 3412–3421.
[36] J. Xiao, C. He, X. Jiang, D. Liu, A dynamic classifier ensemble selection approach for noise data, Inf. Sci. 180 (2010) 3402–3421.
[37] K. Woods, W.P. Kegelmeyer, K. Bowyer, Combination of multiple classifiers using local accuracy estimates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (4) (1997) 405–410.
[38] G. Giacinto, F. Roli, Dynamic classifier selection based on multiple classifier behavior, Pattern Recognit. 34 (9) (2001) 1879–1881.
[39] E.M.D. Santos, R. Sabourin, P. Maupin, A dynamic overproduce-and-choose strategy for the selection of classifier ensembles, Pattern Recognit. 41 (10) (2008) 2993–3009.
[40] D. Zhu, A hybrid approach for efficient ensembles, Decis. Support Syst. 48 (2010) 480–487.
[41] B. Bakker, T. Heskes, Clustering ensembles of neural network models, Neural Netw. 16 (2003) 261–269.
[42] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[43] B. Verma, A. Rahman, Cluster-oriented ensemble classifier: impact of multicluster characterization on ensemble classifier learning, IEEE Trans. Knowl. Data Eng. 24 (4) (2012) 605–618.
[44] H. Zouari, L. Heutte, Y. Lecourtier, Controlling the diversity in classifier ensembles through a measure of agreement, Pattern Recognit. 38 (2005) 2195–2199.
[45] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2) (2003) 181–207.
[46] G. Giacinto, F. Roli, Design of effective neural network ensembles for image classification purposes, Image Vis. Comput. 19 (9–10) (2001) 699–707.
[47] G. Yule, On the association of attributes in statistics, Biometrika 2 (1903) 121–134.
[48] E.K. Tang, P.N. Suganthan, X. Yao, An analysis of diversity measures, Mach. Learn. 65 (1) (2006) 247–271.
[49] U.V. Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[50] G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press, Baltimore, Maryland, US, 1996.


[51] U. Helmke, P.A. Fuhrmann, Controllability of matrix eigenvalue algorithms: the inverse power method, Syst. Control Lett. 41 (1) (2000) 57–66.
[52] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Chapman and Hall, UK, 1984.
[53] A. Asuncion, D. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[54] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[55] E. Alpaydin, Introduction to Machine Learning, The MIT Press, Cambridge, MA, USA, 2004.

Huaxiang Zhang received his Ph.D. from Shanghai Jiaotong University in 2004, and is now a professor and Ph.D. supervisor at the Department of Computer Science, Shandong Normal University, Shandong, China. His research interests include machine learning, pattern recognition, evolutionary computation, and web information processing.

Linlin Cao received her B.Sc. degree from Shandong Normal University, China, in 2010, and is currently pursuing a Master's degree at the School of Information Science and Engineering of the same university. Her research fields include support vector machines (SVMs), manifold learning and related applications.