Margin optimization based pruning for random forest


Neurocomputing 94 (2012) 54–63. doi:10.1016/j.neucom.2012.04.007


Fan Yang a, Wei-hang Lu a, Lin-kai Luo a, Tao Li b,*

a School of Information Science and Technology, Xiamen University, Xiamen 361005, China
b School of Computer Science, Florida International University, Miami, FL 33199, USA

* Corresponding author. E-mail address: taoli@cs.fiu.edu (T. Li).

Article history: Received 11 December 2011; received in revised form 5 April 2012; accepted 9 April 2012; available online 11 May 2012. Communicated by Qi Li.

Abstract

This article introduces a margin optimization based pruning algorithm which is able to reduce the ensemble size and improve the performance of a random forest. A key element of the proposed algorithm is that it directly takes into account the margin distribution of the random forest model on the training set. Four different metrics based on the margin distribution are used to evaluate the generalization ability of subensembles and the importance of individual classification trees in an ensemble. After a forest is built, the trees in the ensemble are first ranked according to the margin metrics, and subensembles of decreasing sizes are then built by recursively removing the least important trees one by one. Experiments on 10 benchmark datasets demonstrate that our proposed algorithm can significantly improve the generalization performance while reducing the ensemble size at the same time. Furthermore, empirical comparison with other pruning methods indicates that the margin distribution plays an important role in evaluating the performance of a random forest, and can be directly used to select near-optimal subensembles. © 2012 Elsevier B.V. All rights reserved.

Keywords: Random forests; Ensemble pruning; Margin optimization

1. Introduction

In a classifier ensemble, the complementarity of the individual members is required to ensure the generalization ability of the ensemble. Diversity among the members of an ensemble is a necessary (albeit not sufficient) condition for this complementarity [1]. Random forest (RF) [2], as a state-of-the-art ensemble method, has been widely used in many applications. However, RF does not explicitly encourage the generation of complementary classifiers: diversity in the ensemble is achieved by using different bootstrap samples of the training data and by random subspace selection from the original feature set for node splitting during tree construction.

Generally, the generalization error of RF first decreases as the number of trees in the ensemble increases, and then asymptotically reaches a constant level. It is computationally expensive to determine the best ensemble size (i.e., the number of base tree classifiers) of RF with cross-validation, and users tend to construct a large ensemble because RF typically does not overfit. As a consequence, common approaches may generate many redundant and similar trees, which reduces the diversity among the ensemble trees. Ensemble pruning is a possible and useful strategy to reduce the ensemble size by selecting only a subset of the classifiers from the original ensemble [3–7].


Besides reducing the complexity of classification, ensemble pruning can often obtain a more diverse subset of classifiers and outperform the original ensemble. Ensemble pruning can be viewed as searching for the optimal solution in the space of subsets of the ensemble. However, selecting the optimal subensemble via exhaustive search with cross-validation is computationally expensive, so many approximate search strategies have been proposed to identify near-optimal subensembles, such as ordered aggregation [3–5], genetic algorithms [6], semi-definite programming (SDP) [7], quadratic programming [8], and linear programming [9]. Generally speaking, according to [10], ensemble pruning methods can be divided into four main categories: ranking based, clustering based, optimization based, and other methods. There are also hybrid methods such as instance-based pruning [11] and double pruning [12]. Please refer to [10] for a relatively comprehensive review of the taxonomy and representative methods of ensemble pruning.

RF does not overfit as the number of trees increases [2], so it may seem that redundancy among the trees would not deteriorate the performance of RF. However, experiments on real-world datasets illustrate that there is still room to improve the performance of RF by optimization based pruning. In order to find the minimal size of the forest while maintaining the prediction accuracy, Zhang and Wang proposed three measures to determine the importance of a tree in a forest [25]. Their measures are mainly designed in terms of prediction performance, i.e., prediction accuracy and similarities between the prediction outcomes of two base trees. In [13], a dynamic instance-based pruning and weighting method was proposed: for each instance to be classified, its neighbors according to the random forest proximity [2] are first identified among all the training instances; the trees for which these neighbors are in the out-of-bag set are weighted according to their positive average margin, and the trees with negative average margin are discarded when classifying the instance.

Optimization based ensemble pruning [10] aims to solve an optimization problem to find the subensemble that optimizes certain objective measures (e.g., the diversity of the classifier members, or the generalization ability). The definition of diversity among classifier members is generally heuristic, and many different measures have been proposed in recent studies [1,3,24,26,27]. Performance based optimization searches for the optimal subensemble by explicitly evaluating candidate performances (such as accuracy, root mean squared error, mean cross-entropy and so on) on the training set [10]. However, an important performance evaluation measure for ensemble classifiers, the margin, is often neglected in these studies.

Previous studies have shown that the margin plays an important role in ensemble classifiers. The margin-based bound for AdaBoost provides a qualitative explanation of the effectiveness of boosting [15], and much attention has been paid to maximizing the minimum margin [16,17]. However, Arc-Gv (an arcing algorithm) always has a larger minimum margin than AdaBoost yet performs worse [18]. To address this problem, Wang et al. proposed a tighter margin bound called the Equilibrium Margin [19]. Reyzin and Schapire [14] found that a better margin distribution, i.e., the average margin and the variance of the margin, is more important than the minimum margin. Inspired by this, Shen and Li designed a new boosting algorithm called MDboost that directly optimizes the margin distribution: maximizing the average margin while minimizing the variance of the margin [20].

Despite the importance of the margin, little attention has been paid to the margin and the margin distribution in the ensemble pruning literature. Tang et al. conducted a theoretical analysis of six existing diversity measures and showed the relation between these measures and the minimum margin maximization concept [24]. Fu et al. designed a new diversity measure for bagging by combining the average margin and the variance of the margin [21]. In [3], a margin distance minimization method was used to search for the subensemble with the minimum Euclidean distance between its signature vector and an ideal vector that correctly classifies all examples, where the ith component of the signature vector is the margin of the ith example.

In this paper we explore the possibility of using the margin to prune a random forest explicitly. Inspired by previous studies, we design four margin metrics as the optimization objective and propose a margin optimization method using recursive elimination for ensemble pruning. Our proposed method explicitly optimizes the average margin/minimum margin via a backward search strategy in the space of subensembles. At the same time, we also study the effect of the margin variance on the pruning process. In our work, the margin distribution is viewed as a performance evaluation metric rather than a diversity measure of ensemble classifiers.
The remainder of this article is organized as follows. Section 2 introduces the background of the margin theory for ensemble classifiers and random forest, and provides the definitions of the margin and the out-of-bag margin in a random forest. Section 3 briefly reviews related studies and introduces two existing pruning methods for random forest. In Section 4, we formulate the optimization objective and employ a backward elimination algorithm, named Recursively Pruning-Random Forest (RP-RF), to search for the optimal solution; the four margin metrics are defined in this section. Section 5 presents an empirical comparison of RF, RP-RF and the other pruning methods. A summary of the results and conclusions are given in Section 6.

2. Background

Without loss of generality, consider a binary classification problem. Given the member classifier function space H, a classifier h \in H is a mapping from the input space X to the label set \{-1, +1\}. The training set of N instances, S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, is generated randomly from an unknown distribution D. Define the ensemble classifier as

f \in C: \; f: x \mapsto \sum_{h \in H} a_h h(x), \quad a_h \geq 0, \; \sum_h a_h = 1.   (1)

2.1. The margin of ensemble classifiers and random forest

Given the voting method and a specific instance x, the margin mg is defined as the difference between the number of correct votes and the number of erroneous votes. If the margin is positive, the sample is correctly classified; the larger the margin, the more confident the classification. Many ensemble algorithms, such as AdaBoost, Arc-gv and MDboost, improve the generalization performance of classifiers by increasing the margin.

Definition 1. Given an ensemble of classifiers \{h_1(x), h_2(x), \ldots, h_N(x)\}, an input vector x and its corresponding output y, the margin function of the sample (x, y) is defined as

mg(x, y) = \mathrm{av}_k\, I(h_k(x) = y) - \max_{j \neq y} \mathrm{av}_k\, I(h_k(x) = j),   (2)

where I(\cdot) is the indicator function and \mathrm{av}_k(\cdot) denotes the mean over the ensemble members. The generalization error of the ensemble is therefore

PE^* = P_{X,Y}\big( mg(x, y) < 0 \big).   (3)

In RF, h_k(x) = h(x, \theta_k). When the number of trees in RF is large enough, the following theorem holds.

Theorem 1 [2]. As the number of trees increases, for almost surely all sequences \theta_1, \theta_2, \theta_3, \ldots, PE^* converges to

P_{X,Y}\big( P_{\theta}(h(x, \theta) = y) - \max_{j \neq y} P_{\theta}(h(x, \theta) = j) < 0 \big),   (4)

where \theta is the random vector corresponding to a single decision tree and h(x, \theta) is the output of the tree. Theorem 1 shows that, as the number of trees in the forest increases, the generalization error of RF converges to a finite limit determined by the margin, according to the Law of Large Numbers.

2.2. Out-of-bag estimates and OOB margin

In a random forest, each tree is constructed using a different bootstrap sample of the original training data. Each bootstrap sample leaves out about 37% of the original examples; these are not used in the construction of the ith tree and are called the out-of-bag data of the ith tree. They can be used to obtain nearly optimal estimates of the generalization error of a random forest [2]. Accordingly, each sample in the training data is not used by about one third of the trees in the forest. We call these trees the out-of-bag trees (OOB trees) of this sample. The margin of the sample can also be calculated over its OOB trees, and the corresponding margin is called the out-of-bag margin (OOB margin) to distinguish it from the margin. The two margin definitions are formulated as follows:

1. Margin: the margin of an instance (x, y) computed over all the trees in the ensemble H, i.e.,

mg_1(x, y, H) = \frac{1}{|H|} \Big( \sum_{i=1}^{|H|} I(h_i(x) = y) - \max_{l \neq y} \sum_{i=1}^{|H|} I(h_i(x) = l) \Big).   (5)

2. Out-of-bag margin (OOB margin): the margin computed over the out-of-bag trees of (x, y) in H (denoted H_{OOB} \subseteq H), i.e.,

mg_2(x, y, H) = \frac{1}{|H_{OOB}|} \Big( \sum_{i=1}^{|H_{OOB}|} I(h_i(x) = y) - \max_{l \neq y} \sum_{i=1}^{|H_{OOB}|} I(h_i(x) = l) \Big),   (6)

where |\cdot| denotes the size of an ensemble.

In a random forest the margin of a training instance is always positive, because the CART trees are un-pruned and, probabilistically, an instance is correctly classified by about 2/3 of the trees, which makes the majority vote definitely equal to its true label. The OOB margin is computed only over the out-of-bag trees, i.e., those trees for which the instance is an OOB sample, so the OOB margin may be negative. For a given instance, it is easy to prove that the value of its OOB margin is smaller than that of the margin.
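To make Eqs. (5) and (6) concrete, a minimal Python sketch for the binary case is given below. The vote-matrix layout, the in-bag mask and all names are illustrative assumptions, not the implementation used in our experiments (which rely on the R randomForest package [22]).

import numpy as np

def margin(votes, y):
    # Eq. (5) in the binary case: every vote is either correct or the single
    # wrong label, so mg1 = av I(correct) - av I(wrong) = 2 * av I(correct) - 1.
    # votes: (n_trees, n_samples) array of per-tree predicted labels
    # y:     (n_samples,) array of true labels
    return 2.0 * (votes == y).mean(axis=0) - 1.0

def oob_margin(votes, y, in_bag):
    # Eq. (6): the same quantity computed only over each instance's OOB trees.
    # in_bag: (n_trees, n_samples) boolean array, True if the sample was in
    # the bootstrap sample used to grow that tree.
    oob = ~in_bag
    n_oob = oob.sum(axis=0)                        # |H_OOB| for each instance
    correct = ((votes == y) & oob).sum(axis=0)
    # after heavy pruning an instance may have no OOB trees left; return 0 then
    return np.where(n_oob > 0, (2.0 * correct - n_oob) / np.maximum(n_oob, 1), 0.0)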

3. Related work

In this section, we briefly review two existing pruning methods for random forest: similarity based pruning and margin distance minimization.

3.1. Similarity based pruning

In the ensemble pruning literature, the term "diversity" is generally considered the most important concept and standard for measuring the quality of an ensemble. Diversity indicates that the classifiers in the ensemble should not be too similar to each other. Several measures that have been proposed to define diversity, such as the disagreement measure, the double fault measure and inter-rater agreement, can equally be seen as measures of similarity [24,26,27]. The underlying assumption is that two classifiers can be viewed as diverse, or not similar, when they give relatively different outputs on the same training data. Zhang and Wang proposed to prune a random forest based on the similarity between tree pairs in the forest [25]. To improve the diversity of the ensemble, a tree is removed if it is similar to some other trees in the current subensemble. The measure of similarity is also defined based on the signature vectors of the member classifiers. Given two trees h_i and h_j, their correlation is defined as cor(c_i, c_j), where the function cor(\cdot) can be specified by the user. In the subensemble H', the overall similarity between a tree h_t and H' is defined as the average of its similarities with all the other trees:

r_{h_t} = \frac{1}{|H'| - 1} \sum_{h_i \in H',\, i \neq t} \mathrm{cor}(h_t, h_i),   (7)

and the tree in the current subensemble with the highest overall similarity to the subensemble is removed from the forest iteratively (a code sketch is given at the end of this subsection). However, similarity based pruning cannot guarantee a good generalization ability of the ensemble classifiers. Tang et al. presented a theoretical analysis of the relations between the minimum margin and six diversity measures. They claimed that only under certain conditions could these measures implicitly enlarge the minimum margin. Their experiments showed that "the minimum margin of an ensemble is not monotonically increasing with respect to diversity" [24].
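Since cor(\cdot, \cdot) is user-specified, the sketch below instantiates it as the fraction of training instances on which two trees agree; this choice, and all names, are illustrative assumptions rather than the method of [25].

import numpy as np

def similarity_prune(votes, n_keep):
    # Iteratively drop the tree with the highest average similarity to the
    # current subensemble, Eq. (7).
    # votes: (n_trees, n_samples) array of per-tree predictions.
    n_trees = votes.shape[0]
    # cor(h_i, h_j): fraction of samples on which the two trees agree
    agree = np.array([[(votes[i] == votes[j]).mean() for j in range(n_trees)]
                      for i in range(n_trees)])
    kept = list(range(n_trees))
    while len(kept) > n_keep:
        sub = agree[np.ix_(kept, kept)]
        # r_h: average similarity of each kept tree to the other kept trees;
        # the diagonal term cor(h, h) = 1 is subtracted before averaging
        r = (sub.sum(axis=1) - 1.0) / (len(kept) - 1)
        kept.pop(int(np.argmax(r)))
    return kept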

3.2. Margin distance minimization

To the best of our knowledge, the margin distance minimization method is the earliest study on pruning ensemble classifiers with a margin-related concept. It searches for the subensemble H' \subseteq H with the minimum distance between its signature vector c and a predefined vector o chosen in the first quadrant of the N-dimensional hyperspace, which corresponds to correctly classifying all N examples [3,4]. The signature vector of a member classifier h is defined as the N-dimensional vector whose ith component is

c_i^h = 2 I(h(x_i) = y_i) - 1.   (8)

The signature vector c of a subensemble H' is defined as the average of the signature vectors of all its member classifiers h_i \in H':

c(H') = \frac{1}{|H'|} \sum_{i=1}^{|H'|} c^{h_i}.   (9)

According to the definition of margin, the ith component of c, denoted c_i, is the margin of the ith example, which is correctly classified if c_i is positive. Clearly, a subensemble H' correctly classifies all N examples if its signature vector c lies in the first quadrant of the N-dimensional hyperspace. Consequently, margin distance minimization tries to select the subensemble whose signature vector is as close as possible to a reference vector o placed somewhere in the first quadrant, i.e.,

H_o = \arg\min_k d\big(o, c(H_k)\big).   (10)

In [4], the reference vector o is set to be sufficiently small (e.g., o_i = 0.075, i = 1, 2, \ldots, N) and the optimal subensemble H_o is selected with an ordered aggregation approach. From this definition we can see that margin distance minimization tries to find a subensemble that correctly classifies all the training samples, i.e., one with positive margins. However, because the components of the reference vector o are identical, the objective function implies that all the training samples should have the same margin, which is not practical and violates the margin theory of ensemble classifiers.
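A sketch of this selection under an ordered-aggregation scheme in the spirit of [4] is given below; the greedy ordering and all names are illustrative assumptions rather than the original implementation.

import numpy as np

def margin_distance_order(votes, y, o_value=0.075):
    # Greedy reading of Eqs. (8)-(10): order the trees so that the running
    # average signature vector stays as close as possible to the reference o.
    n_trees, n_samples = votes.shape
    c = 2.0 * (votes == y) - 1.0          # Eq. (8): per-tree signature vectors
    o = np.full(n_samples, o_value)       # reference vector in the first quadrant
    ordered, remaining = [], list(range(n_trees))
    total = np.zeros(n_samples)           # sum of the selected signature vectors
    while remaining:
        # pick the tree minimizing d(o, c(H_k)) after its inclusion, Eq. (10)
        dists = [np.linalg.norm(o - (total + c[t]) / (len(ordered) + 1))
                 for t in remaining]
        best = remaining.pop(int(np.argmin(dists)))
        total += c[best]
        ordered.append(best)
    return ordered  # prefixes of this ordering are the candidate subensembles H_k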

4. Recursively pruning random forest with margin optimization

Existing pruning methods for a random forest (such as margin distance minimization and similarity based pruning) are mainly concerned with the prediction outcome of each individual tree. In similarity based pruning the margin is not used, and in the margin distance minimization method the margin information is used only implicitly in the pruning process. In [21], a new diversity measure for bagging was designed using the average margin divided by the variance of the margin. This method oversimplifies the relation between the average margin and the variance of the margin, and optimizing this objective function is not equivalent to optimizing the margin distribution, which is a complicated task as indicated in [20]. Further, it is meaningful to compare the effects of the minimum margin and the average margin in the ensemble pruning process.

In this paper, we assume that an optimal forest is the one with the "optimal margin distribution" on the training set. Here the margin distribution is represented by the average margin or the minimum margin over all the training instances, so the objective is to find the subensemble with the maximum average margin or the maximum minimum margin on the training set, i.e.,

H_o = \arg\max_k \mathrm{av}_{(x,y) \in S}\big( mg(x, y, H_k) \big) \quad \text{or} \quad H_o = \arg\max_k \min_{(x,y) \in S}\big( mg(x, y, H_k) \big).

As mentioned before, it is infeasible to find the optimal solution in the space of subensembles by exhaustive search, and heuristics such as Genetic Algorithms and Simulated Annealing could be used instead. In this paper we employ a backward elimination strategy, a type of the well-known hill-climbing search algorithms. This method ranks the contribution and importance of a base tree h in the current ensemble H by observing the decrease of the margin metric when h is removed from the forest, which is equivalent to evaluating the subensemble H \backslash h. At each step, the least important tree h_min, whose removal causes the minimum decrease of the margin metric, is eliminated, and the ensemble H shrinks to its subset H' = H \backslash h_min. The trees in H' are then re-ranked and the above process is repeated. The algorithm, denoted RP-RF, is listed below.

Algorithm 1. Recursively Pruning-Random Forest (RP-RF)

Input: Training set S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, the size of RF ntree
Output: A nested set of subensembles H_k, k = 1, ..., ntree - 1
1 Construct a RF model H_ntree = {h_1, h_2, ..., h_ntree} of size ntree on the training set S
2 For k = ntree down to 2:
  (1) Rank h_i in H_k, i = 1, ..., |H_k|, according to the margin metric f(h_i, H_k, S)
  (2) Find the least important tree h_min = argmin_i f(h_i, H_k, S)
  (3) Eliminate h_min from the ensemble: H_{k-1} = H_k \ h_min
End for
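A compact Python sketch of Algorithm 1 is given below, with the margin metric left pluggable; the vote-matrix layout and all names are illustrative assumptions carried over from the sketch in Section 2.2, not our actual implementation.

import numpy as np

def rp_rf(votes, y, metric, n_min=1):
    # Algorithm 1: backward elimination. votes is the (n_trees, n_samples)
    # matrix of per-tree predictions on the training set; metric(votes_k, y, i)
    # plays the role of f(h_i, H_k, S) for the i-th tree of the current
    # ensemble. Returns the nested subensembles as lists of tree indices.
    kept = list(range(votes.shape[0]))
    nested = [list(kept)]
    while len(kept) > n_min:
        scores = [metric(votes[kept], y, i) for i in range(len(kept))]
        kept.pop(int(np.argmin(scores)))   # eliminate h_min: H_{k-1} = H_k \ h_min
        nested.append(list(kept))
    return nested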

In Algorithm 1, |H_k| stands for the size of ensemble H_k, and the number of candidate variables randomly sampled at each split of a tree is mtry = \sqrt{p}. The decrease in the average margin/OOB margin and the decrease in the minimum margin/OOB margin on the training set are computed as the evaluation metric f(h_i, H_k, S), which gives four margin metrics for the importance of tree h_i in subensemble H_k on training set S:

(i) Mean Decrease in Margin (MeanD-M):

f_1(h_i, H_k, S) = \mathrm{av}_{(x,y) \in S}\big( mg_1(x, y, H_k) - mg_1(x, y, H_k \backslash h_i) \big)   (11)

(ii) Decrease in minimum margin (MinD-M):

f_2(h_i, H_k, S) = \min_{(x,y) \in S} mg_1(x, y, H_k) - \min_{(x,y) \in S} mg_1(x, y, H_k \backslash h_i)   (12)

(iii) Mean decrease in OOB margin (MeanD-OM):

f_3(h_i, H_k, S) = \mathrm{av}_{(x,y) \in S}\big( mg_2(x, y, H_k) - mg_2(x, y, H_k \backslash h_i) \big)   (13)

(iv) Decrease in minimum OOB margin (MinD-OM):

f_4(h_i, H_k, S) = \min_{(x,y) \in S} mg_2(x, y, H_k) - \min_{(x,y) \in S} mg_2(x, y, H_k \backslash h_i)   (14)

Here av(\cdot) stands for the mean operation and min(\cdot) finds the minimum of a vector. Obviously, the time complexity of RP-RF is O(ntree^2 \cdot R(ntree, N)), where R(ntree, N) is the complexity of the ranking process, itself O(|H_k| \cdot N) with |H_k| the size of the current ensemble.
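For concreteness, one way to realize f_1 and f_2 against the same vote-matrix layout is sketched below (f_3 and f_4 follow the same pattern with the OOB margin of Eq. (6)); this naive version recomputes the margins for every candidate tree, matching the O(|H_k| \cdot N) ranking cost above, and is illustrative rather than our actual implementation. Caching the per-instance vote counts of the current ensemble would avoid the recomputation.

import numpy as np

def _margin(votes, y):
    # Eq. (5) in the binary case: mg1 = 2 * av I(correct) - 1
    return 2.0 * (votes == y).mean(axis=0) - 1.0

def mean_decrease_margin(votes_k, y, i):
    # f1, Eq. (11): average drop in margin when tree i is removed
    mask = np.arange(votes_k.shape[0]) != i
    return float((_margin(votes_k, y) - _margin(votes_k[mask], y)).mean())

def min_decrease_margin(votes_k, y, i):
    # f2, Eq. (12): drop in the minimum margin when tree i is removed
    mask = np.arange(votes_k.shape[0]) != i
    return float(_margin(votes_k, y).min() - _margin(votes_k[mask], y).min())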

5. Experiments and discussion

In this section, we first examine the performance of RP-RF in comparison with the original random forest algorithm. Second, we conduct comparison experiments on three pruning methods. Third, the effect of the margin variance and the stopping criterion are discussed. The randomForest package [22] is used for the algorithm implementation in our experiments.

5.1. Comparison with random forest

First, experiments on 10 datasets from the UCI repository [23] were performed to evaluate the performance of the four pruning metrics when applied to a random forest. These datasets come from different fields of real-world applications, and their characteristics are summarized in Table 1. The random forest algorithm achieves stable performance when the ensemble size is larger than 20 on all 10 datasets. For the performance comparison, we first construct a random forest model with ensemble size ntree = 100, then prune it with the RP-RF algorithm to obtain a nested set of subensembles with sizes ranging from 99 to 20. The performance of RP-RF is compared with RF models of the corresponding sizes. The results reported are averaged over 10-fold cross-validation. Specifically, for each dataset, the following steps were carried out (a rough code sketch follows the list):

(i) Generate the training and testing sets by 10-fold cross-validation, generate 81 random forest classifiers with sizes 20 to 100, and estimate the generalization error on the unseen test set.
(ii) For each training set, rank the decision trees in the forest of size 100 using MeanD-M, MinD-M, MeanD-OM and MinD-OM, respectively.
(iii) Remove the least important tree in the ranked list from the ensemble and compute the generalization error of the subensemble on the test set.
(iv) Repeat eliminating trees with RP-RF until the ensemble size decreases to 20.
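For illustration, a rough Python analogue of steps (i)-(iv) using scikit-learn is sketched below (our experiments use the R randomForest package [22]). Here mtry = \sqrt{p} corresponds to max_features='sqrt'; the metric argument and all names are placeholders, and the per-tree predictions assume labels encoded as 0, ..., K-1 so that they align with y.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def cv_pruning_curve(X, y, metric, ntree=100, n_min=20):
    # Grow a 100-tree forest per fold, rank trees on the training set with the
    # given margin metric, eliminate them one by one down to size n_min, and
    # record the test accuracy of each subensemble size.
    # Assumes X, y are numpy arrays with y encoded as 0..K-1.
    curves = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        rf = RandomForestClassifier(n_estimators=ntree, max_features="sqrt")
        rf.fit(X[tr], y[tr])
        votes_tr = np.array([t.predict(X[tr]) for t in rf.estimators_])
        votes_te = np.array([t.predict(X[te]) for t in rf.estimators_])
        kept, curve = list(range(ntree)), []
        while len(kept) >= n_min:
            # majority vote of the current subensemble on the test fold
            pred = np.array([np.bincount(votes_te[kept, j].astype(int)).argmax()
                             for j in range(len(te))])
            curve.append((len(kept), float((pred == y[te]).mean())))
            scores = [metric(votes_tr[kept], y[tr], i) for i in range(len(kept))]
            kept.pop(int(np.argmin(scores)))          # RP-RF elimination step
        curves.append(curve)
    return curves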

Table 1
Characteristics of the datasets.

Datasets      Instances   Attrib.   Classes
Diabetes      768         9         2
Heart         462         10        2
Hearts        267         45        2
Ionosphere    351         35        2
Iris          150         5         3
Monks         432         7         2
Sonar         208         61        2
Steel         1941        34        2
Tic           958         10        2
Wine          178         14        3

Fig. 1 shows the comparison of the five methods with different ensemble sizes on the 10 datasets. RP-RF performs well on most datasets except diabetes. The comparison shows that RP-RF improves the RF model through pruning. On all the datasets except the colon and diabetes datasets, the more redundant trees are pruned, the better the ensemble performs. On all the datasets except diabetes and hearts, all four versions of RP-RF significantly outperform RF at almost every ensemble size. On the diabetes and hearts datasets, the performance of RP-RF is close to that of RF. On the hearts dataset, MeanD-M achieves better accuracies than RF when the ensemble size is smaller than about 40.

Among the four versions, MeanD-M and MeanD-OM perform better on most occasions, and the two methods have similar performance to each other. However, when the ensemble size approaches 20, the performance of MeanD-OM declines rapidly as the ensemble size decreases. According to Eq. (6), the reason mainly lies in the instability of the OOB margin: for each training instance, the number of OOB trees also decreases rapidly in the process of pruning, and can even decrease to zero. Hence the estimate of the OOB margin is likely to be determined by only a small fraction of the training instances. MinD-M and MinD-OM can also improve the performance of RF, but their performance is not as stable as that of the metrics based on the average margin. Unlike the average OOB margin, the minimum OOB margin is relatively stable, so MinD-OM only shows a sharp drop on the hearts and monks datasets when the ensemble size is small.

Table 2 shows the best test accuracy and the corresponding ensemble size of the above five methods on the individual datasets. Except for the diabetes data, all four RP-RF methods achieved better test accuracy than RF with only a small fraction of the trees of the original ensemble. In Table 2, the best result on each dataset is marked with "*"; MeanD-M achieves the best test accuracy on seven datasets, while MeanD-OM performs the best on five datasets. Generally speaking, the average margin is better than the minimum margin in the pruning of RF, and MeanD-M has the most stable performance. Our study indicates that the average margin might play a more important role than the minimum margin in ensemble learning, which is consistent with previous studies [14,20].

[Fig. 1. Average test accuracy with respect to the ensemble size ranging from 99 to 20. One panel per dataset (diabetes, heart, hearts, ionosphere, iris, monks, sonar, steel, tic, wine); each panel plots average test accuracy against ensemble size for MeanD-M, MinD-M, MeanD-OM, MinD-OM and Random Forest.]


Table 2
The best test accuracy (TA) and corresponding ensemble size.

Datasets     MeanD-M         MinD-M          MeanD-OM        MinD-OM         RF
             TA       Size   TA       Size   TA       Size   TA       Size   TA       Size
Diabetes     0.7604   64     0.7643   31     0.7604   74     0.7630   84     0.7734*  97
Heart        0.7123   29     0.7056   73     0.7188*  34     0.7036   93     0.6925   84
Hearts       0.8274*  30     0.8161   71     0.8198   36     0.8162   33     0.8146   80
Ionosphere   0.9458*  21     0.9401   37     0.9401   32     0.9372   49     0.9399   41
Iris         0.9667   40     0.9667   31     0.9733*  28     0.9667   57     0.9560   20
Monks        0.9977*  21     0.9767   27     0.9977*  31     0.9977*  27     0.9625   88
Sonar        0.8705*  41     0.8562   85     0.8562   40     0.8517   97     0.8705*  75
Steel        0.9974*  20     0.9948   49     0.9969   23     0.9964   27     0.9969   65
Tic          0.9583*  51     0.9541   81     0.9562   43     0.9552   81     0.9520   79
Wine         0.9944*  21     0.9886   21     0.9944*  22     0.9886   29     0.9842   38

5.2. Comparison with other pruning methods

Second, empirical comparisons were performed between MeanD-M, similarity based pruning (referred to as Sim-P) and margin distance minimization (referred to as MarDistM). Fig. 2 shows the 10-fold cross-validation results of the three algorithms on the 10 datasets as the ensemble size decreases from 99 to 1. Generally speaking, except for diabetes, our method outperforms the other two algorithms. As shown in Section 5.1, RP-RF cannot prune the random forest effectively on the diabetes dataset; we discuss possible reasons in Section 5.3.

Table 3 shows the best test accuracy and the corresponding ensemble size of the three methods on the individual datasets; the results reported here were also averaged over 10-fold cross-validation. Except for the diabetes and wine datasets, MeanD-M achieved slightly better test accuracy than Sim-P and MarDistM in the pruning process, and the size of the optimum subensemble obtained by MeanD-M is fairly small. The empirical results demonstrate that MeanD-M always outperforms Sim-P and MarDistM with a small ensemble size.

Note that once a random forest is built, the similarities between trees and the signature vectors of the trees can be computed in advance, which simplifies the calculations of Sim-P and MarDistM. In contrast, the average margin/minimum margin of a subensemble must be updated at each iteration, which may cost much time in the pruning process, so RP-RF is much more time-consuming.

[Fig. 2. Average test accuracy with respect to ensemble size decreasing from 99 to 1. One panel per dataset (diabetes, heart, hearts, ionosphere, iris, monks, sonar, steel, tic, wine); each panel plots average test accuracy against ensemble size for MeanD-M, Sim-P and MarDistM.]

5.3. The effect of variance on RP-RF

To illustrate the effectiveness and optimality of RP-RF, we show how the margin metrics change as the ensemble size decreases. Taking the average margin as an example, we conduct a 10-fold cross-validation on the diabetes and monks datasets, pruning the ensemble from size 100 to 1. The test accuracy, the mean and the variance of the margin on both datasets are reported for monks and diabetes in Fig. 3, respectively. It shows that on monks the average margin on the training instances increases monotonically in the pruning process, and the test accuracy increases as the average margin increases before the ensemble size reaches 10. However, on the diabetes dataset, where RP-RF shows no significant improvement, the average margin metric can still be improved by pruning.

To further investigate this, we consider the effect of the variance. On the monks dataset, the variance of the margin under RP-RF decreases monotonically in the pruning process before the forest size reaches 10. But on the diabetes dataset, the variance cannot be reduced by pruning; on the contrary, it grows monotonically as the ensemble size decreases and shows an exponential growth when the size reaches about 20. In contrast to RP-RF, the average margin of RF on both datasets is almost unchanged at every size, while the variance increases monotonically as the ensemble size decreases. When the size approaches about 20, there are also exponential growths of the variance on both datasets. Note that the test accuracy decreases sharply at that size.



Table 3
The best test accuracy and corresponding ensemble size of the three methods.

Datasets     MeanD-M         Sim-P           MarDistM
             TA       Size   TA       Size   TA       Size
Diabetes     0.7734   85     0.7682   94     0.7747*  37
Heart        0.6905*  52     0.6839   73     0.6818   61
Hearts       0.8390*  36     0.8162   14     0.8198   82
Ionosphere   0.9402*  45     0.9373   38     0.9373   32
Iris         0.96*    45     0.9533   5      0.96*    5
Steel        0.9979*  20     0.9923   85     0.9923   87
Monks        1*       9      0.9608   99     0.9608   99
Sonar        0.8457*  26     0.8410   48     0.8410   51
Wine         0.9830   11     0.9830   14     0.9886*  27
Tic          0.9530*  35     0.9446   83     0.9457   83

[Fig. 3. Comparison of test accuracy and margin distribution on monks and diabetes. Panels: average test accuracy, average margin and variance of margin on monks (top) and diabetes (bottom), comparing MeanD-M and Random Forest.]

These behaviors coincide with previous studies [14] and also verify the rationale of using the margin as the performance evaluation measure and optimization objective in ensemble pruning under certain conditions: the pruning should increase the average margin (or the minimum margin) on the training instances while reducing the variance of the margin at the same time. This also provides a heuristic stopping criterion for our pruning method: the pruning process should stop when the variance of the margin begins to increase. The same phenomenon can be observed in the behavior of the margin metrics under Sim-P and MarDistM. Taking the monks dataset as an example, Fig. 4 shows that the performances of Sim-P and MarDistM deteriorate with pruning, because the average margin on the monks data declines as the ensemble size decreases.

Furthermore, the variance of the margin increases in the pruning process when the Sim-P method is used. This result also illustrates that the margin distance minimization method does not necessarily improve the average margin of a random forest, because it only forces the signature vector to be close to a vector of small values in the first quadrant of the N-dimensional hyperspace.

[Fig. 4. Comparison of test accuracy and margin distribution on monks. Panels: average test accuracy, average margin and variance of margin, comparing MeanD-M, Sim-P and MarDistM.]

6. Conclusions and future work

This paper introduces ensemble pruning strategies based on the margin distribution to improve the performance of a random forest. Unlike previous approaches based on diversity and/or classification strength, our proposed method directly formulates ensemble pruning as a margin optimization problem. Four different margin measures were used as the evaluation metric, and a recursive elimination strategy is employed to obtain a near-optimal solution. The experiments on 10 UCI datasets demonstrate that this margin optimization based pruning algorithm outperforms RF on most occasions and can reduce the redundancy effectively: the resulting subensembles, with only 20–40% of the trees of the original ensemble, achieve better test accuracies than RF. Our study also indicates that the margin distribution accounts for the performance of a random forest and can be used explicitly in pruning it. Among the four metrics, the average margin has the most stable and effective performance, which coincides with previous studies. The importance of the variance of the margin is also discussed. We compared our proposed method with two other popular pruning methods, and the empirical experiments demonstrate the importance of the margin.

In our future work, we will explore the possibility of using the margin metric in pruning other ensemble classifiers, such as AdaBoost and MDboost. Instead of the hill-climbing algorithm, other optimization methods like Genetic Algorithms and Semi-Definite Programming can also be investigated.


Acknowledgment

This work was supported by the Fundamental Research Funds for the Central Universities under Grant No. 2010121065, the Natural Science Foundations of Fujian Province of China under Grant No. 2011J01373, and the Natural Science Foundations of China under Grants No. 60975052 and No. 61102136.

References

[1] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2) (2003) 181–207.
[2] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[3] G. Martínez-Muñoz, A. Suárez, Aggregation ordering in bagging, in: Proceedings of the International Conference on Artificial Intelligence and Applications, 2004, pp. 258–263.


[4] G. Martínez-Muñoz, D. Hernández-Lobato, A. Suárez, An analysis of ensemble pruning techniques based on ordered aggregation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 245–259.
[5] Z. Lu, X. Wu, X. Zhu, J. Bongard, Ensemble pruning via individual contribution ordering, in: Proceedings of the 16th ACM SIGKDD, 2010, pp. 871–880.
[6] Z.H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artif. Intell. 137 (1–2) (2002) 239–263.
[7] Y. Zhang, S. Burer, W.N. Street, Ensemble pruning via semi-definite programming, J. Mach. Learn. Res. 7 (2006) 1315–1338.
[8] N. Li, Z.H. Zhou, Selective ensemble under regularization framework, in: Proceedings of the 8th International Workshop on Multiple Classifier Systems, 2009, pp. 293–303.
[9] L. Zhang, W.D. Zhou, Sparse ensembles using weighted combination methods based on linear programming, Pattern Recognit. 44 (2011) 97–106.
[10] G. Tsoumakas, I. Partalas, I. Vlahavas, An ensemble pruning primer, in: Applications of Supervised and Unsupervised Ensemble Methods, Springer, 2009, pp. 1–13.
[11] G. Martínez-Muñoz, D. Hernández-Lobato, A. Suárez, Statistical instance-based ensemble pruning for multi-class problems, in: Proceedings of the 19th International Conference on Artificial Neural Networks, 2009, pp. 90–99.
[12] V. Soto, G. Martínez-Muñoz, D. Hernández-Lobato, A. Suárez, A double pruning algorithm for classification ensembles, in: Proceedings of the 9th International Workshop on Multiple Classifier Systems, 2010, pp. 104–113.
[13] M. Robnik-Šikonja, Improving random forests, in: Proceedings of the 15th European Conference on Machine Learning, 2004, pp. 359–370.
[14] L. Reyzin, R.E. Schapire, How boosting the margin can also boost classifier complexity, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 753–760.
[15] R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee, Boosting the margin: a new explanation for the effectiveness of voting methods, Ann. Stat. 26 (5) (1998) 1651–1686.
[16] G. Rätsch, M.K. Warmuth, Efficient margin maximizing with boosting, J. Mach. Learn. Res. 6 (2005) 2131–2152.
[17] M.K. Warmuth, J. Liao, G. Rätsch, Totally corrective boosting algorithms that maximize the margin, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 1001–1008.
[18] L. Breiman, Prediction games and arcing algorithms, Neural Comput. 11 (7) (1999) 1493–1517.
[19] L. Wang, M. Sugiyama, C. Yang, Z.H. Zhou, J. Feng, On the margin explanation of boosting algorithms, in: Proceedings of the 21st Annual Conference on Learning Theory, 2008, pp. 479–490.
[20] C.H. Shen, H.X. Li, Boosting through optimization of margin distributions, IEEE Trans. Neural Networks 21 (4) (2010) 659–666.
[21] B. Fu, Z.H. Wang, Z.F. Wang, Algorithm of classifier selection for maximizing the margin, J. Comput. Sci. Front. 5 (1) (2011) 59–67.
[22] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (3) (2002) 18–22.
[23] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2010.
[24] E.K. Tang, P.N. Suganthan, X. Yao, An analysis of diversity measures, Mach. Learn. 65 (1) (2006) 247–271.
[25] H.P. Zhang, M.H. Wang, Search for the smallest random forest, Stat. Interface 2 (2009) 381–388.
[26] H.C. Lian, B.L. Lu, An algorithm for pruning redundant modules in min–max modular network, in: Proceedings of the International Joint Conference on Neural Networks, 2005, pp. 1983–1988.
[27] J. Li, B.L. Lu, M. Ichikawa, An algorithm for pruning redundant modules in min–max modular network with GZC function, in: Proceedings of the 1st International Conference on Natural Computation, 2005, pp. 293–302.


Fan Yang received the Ph.D. degree in control theory and control engineering from Xiamen University, Xiamen, China, in 2009. He is currently an assistant professor in the Institute of Pattern Recognition & Intelligent Systems, Department of Automation at Xiamen University. His research interests include machine learning, pattern recognition, data mining and bioinformatics.

Wei-hang Liu received the BE degree in automation from Xiamen University in 2010. He is currently working toward the master's degree, specializing in pattern recognition and intelligent systems, in the Department of Automation at Xiamen University. His current research interests are machine learning, pattern recognition, data mining and their applications.

Lin-Kai Luo received the Ph.D. degree in control theory and control engineering from Xiamen University, Xiamen, China, in 2007. He is currently a professor in the Institute of Pattern Recognition & Intelligent Systems, Department of Automation at Xiamen University. He has been a member of the Automation Institute of Fujian province for many years. His research interests include machine learning, pattern recognition, bioinformatics, and financial data mining.

Tao Li received the Ph.D. degree in computer science from the Department of Computer Science, University of Rochester, Rochester, NY, in 2004. He is currently an Associate Professor with the School of Computing and Information Sciences, Florida International University, Miami. His research interests are data mining, machine learning, and information retrieval. He is a recipient of NSF CAREER Award and multiple IBM Faculty Research Awards.