Decision trees using model ensemble-based nodes

Pattern Recognition 40 (2007) 3540-3551. www.elsevier.com/locate/pr. doi:10.1016/j.patcog.2007.03.023

Hakan Altınçay, Department of Computer Engineering, Eastern Mediterranean University, Mağusa, Northern Cyprus, Turkey

Received 2 August 2006; received in revised form 27 March 2007; accepted 28 March 2007

Abstract

Decision trees recursively partition the instance space by generating nodes that implement a decision function belonging to an a priori specified model class. Each decision may be univariate, linear or nonlinear. Alternatively, in omnivariate decision trees, one of the model types is dynamically selected by taking into account the complexity of the problem defined by the samples reaching that node. The selection is based on statistical tests, where the most appropriate model type is the one providing significantly better accuracy than the others. In this study, we propose the use of model ensemble-based nodes, where a multitude of models are considered for making decisions at each node. The ensemble members are generated by perturbing the model parameters and input attributes. Experiments conducted on several datasets and three model types indicate that the proposed approach achieves better classification accuracies compared to individual nodes, even when only one model class is used in generating the ensemble members.
© 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Decision trees; Ensemble-based decision nodes; Model selection; Omnivariate decision trees; Random subspace method

1. Introduction

Decision tree classifiers utilize a divide-and-conquer strategy to partition the instance space into decision regions by generating internal or test nodes [1,2]. Each internal node m implements a decision or discriminant function denoted by f_m(x). The input x is a feature vector of the form x = [x_1, x_2, ..., x_d]^T, where d is the number of features. The functional form of f_m(x) is one of the principal differences among decision tree algorithms. In univariate decision trees, each internal node uses only one feature to define a decision model of the form

    f_m(x) = x_i + b_m,    (1)

where b_m is a constant. The selection of the best attribute x_i and the corresponding b_m for the instance subset reaching node m are the main tasks in generating the decision function.

In linear multivariate trees, the decision is based on a weighted linear combination of the features:

    f_m(x) = \sum_{i=1}^{d} w_{mi} x_i + b_m.    (2)

A linear multivariate node generates an arbitrary hyperplane, which is more powerful than the univariate case, where the hyperplane is orthogonal to a particular axis [3]. In the nonlinear multivariate case, the model is a weighted linear combination of H nonlinear basis functions \phi_{mh}:

    f_m(x) = \sum_{h=1}^{H} w_{mh} \phi_{mh}(x) + b_m.    (3)

Nonlinear decision nodes can implement decision boundaries of arbitrary complexity, providing maximum discrimination capability. Neural trees belong to this category: in each internal node, a multi-layer perceptron (MLP) is used as the nonlinear decision function [4]. Model estimation can be either iterative or analytical. For instance, the linear machine decision trees (LMDT) approach minimizes the number of misclassifications through the use of an iterative algorithm [5].


In classification and regression trees (CART), parameter adaptation is done by keeping all parameters fixed except one, which is optimized to improve the Gini index [6]. OC1 is an extended form of CART [7]: it uses deterministic heuristic search as in CART, followed by randomization to avoid local minima. Perceptron trees use a linear function neuron [8], whose weights are optimized to minimize the sum of distances of the misclassified objects. In linear discriminant trees, a hyperplane is trained at each node; for instance, the Fletcher–Powell descent algorithm can be used for this purpose [9]. Recently, the use of Fisher's linear discriminant analysis was proposed, where the best split is computed analytically [10].

In designing decision tree classifiers, it is generally assumed that the decision complexity is the same at all nodes. However, this is not true in general. It has been argued that nonlinear models may be more appropriate for nodes close to the root [11]. Since the complexity of the decision decreases as we go down the tree, linear models may generalize better at lower-level nodes; nonlinear models may overfit there, where training data are limited and the classification problems are easier. Omnivariate decision trees have recently been proposed as an alternative framework to avoid this risk of overfitting [11]. In this technique, each node is allowed to select a univariate, linear or nonlinear multivariate decision function. The ultimate goal in model selection is to maximize the generalization ability. The choice of the most appropriate model is based on the performance of each function on the instance subset reaching that node. The 5 × 2 cross-validation F-test [12] is used for this purpose: a more complex model is selected only if it is significantly better than a simpler one. Experiments have shown that the generalization ability of omnivariate trees is superior to that of trees using the same function at all decision nodes. It is also observed that nonlinear nodes are generally closer to the root and univariate nodes closer to the leaves. This is reasonable, since simpler models are expected to generalize better as the instance space shrinks and the problems become simpler further down the tree. Brodley also emphasized that the generalization ability of an algorithm depends on how well the underlying model class suits the classification problem at hand [13]. She proposed a set of heuristic rules for selecting the most appropriate model class at each node by taking into account the number of samples, the linear separability of the instance subspace, the number of selected features, etc. Utgoff proposed a hybrid tree approach where either a linear threshold unit making use of all features or a single-attribute test is applied at each node, depending on the linear separability of the subproblem [8].

Another way of reducing the risk of overfitting is to select the best feature subspace at each node. The main idea is that certain dimensions may not vary in the instance subspace reaching a particular node and can be considered redundant. Avoiding these features may increase the generalization ability and reduce node complexity [3,10]. Feature selection is generally implemented in a sequential manner, where features are added or eliminated one by one. For instance, LMDT initially considers the whole feature space to train a linear model. Irrelevant variables are then eliminated by taking into account the weights assigned to each feature: smaller weights correspond to smaller contributions to the discriminant, and the corresponding variables become candidates for elimination. In the decision tree algorithm of Utgoff and Brodley [14], named PT2, a greedy search procedure eliminates features until further elimination decreases the classification accuracy.

Model and feature subspace selection are two challenging problems in decision tree design. In order to tackle these problems, we propose to use an ensemble of decision functions and feature subspaces at each node. In other words, instead of searching for the best individual model and feature subspace, we aim to achieve the best generalization at each node by allowing cooperation among various models operating in different feature subspaces. More specifically, each internal node implements a decision defined by an ensemble of functions, each using a different prototype, initialization and termination condition for training, design parameters and feature subspace. The MLP, the linear multivariate perceptron and Fisher's linear discriminant are the three models considered [3]. The random subspace method [15] is used to select a feature subspace for each component model. The proposed approach represents the instance subset reaching a node by several feature subspaces, eliminating the need for explicit feature selection. Using different training conditions and models as described above, the input of each node is represented by several formalisms, aiming at effective modeling of decision boundaries of varying complexity at different nodes.

This paper is organized as follows. Section 2 presents the implementation details of decision trees using individual nodes: the models considered and their training, the impurity measure and the stopping criterion. The generation of the proposed model ensemble-based nodes is described in Section 3. The experiments conducted are addressed in Section 4, where the results are discussed and comparisons to decision trees using individual nodes are presented. Section 5 summarizes the main conclusions drawn from this study.

2. Decision trees using individual models

Decision tree classifiers based on individual models utilize the same a priori specified model class at all internal nodes. Having selected a model class for f_m(·), its parameters are estimated so as to achieve the best split of the training set reaching node m. The first node generated is the root node, where the data considered are the whole training set. Child nodes are then generated by taking into account the outputs provided by the model: samples for which f_m(x) < 0 are assigned to the left child and the rest to the right. This procedure is repeated for the child nodes, on the training instance subsets reaching them, until leaf (terminal) nodes are obtained, i.e., until a node is pure, containing training samples of a single class, or a termination criterion is satisfied. Each terminal node carries the label of a pattern class.
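As a minimal sketch of this recursive procedure (Python; fit_node is a hypothetical training routine returning a decision function with signed outputs, and impurity_gain is the entropy-based criterion of Eqs. (4)-(5), sketched later in this section):

    import numpy as np

    def grow_tree(X, y, fit_node, min_gain=0.05):
        """Recursively grow a decision tree. fit_node(X, y) is assumed to
        return a trained decision function f, with f(X) giving signed scores."""
        if len(np.unique(y)) == 1:                 # pure node -> leaf
            return {"leaf": True, "label": y[0]}
        f = fit_node(X, y)                         # train this node's model
        scores = f(X)
        left, right = scores < 0, scores >= 0     # f_m(x) < 0 goes left
        # Prepruning: stop when a branch is empty or the impurity reduction
        # E_m - E*_m falls below the threshold (see Eqs. (4)-(5)).
        if left.sum() == 0 or right.sum() == 0 or impurity_gain(y, scores) < min_gain:
            labels, counts = np.unique(y, return_counts=True)
            return {"leaf": True, "label": labels[np.argmax(counts)]}
        return {"leaf": False, "f": f,
                "left": grow_tree(X[left], y[left], fit_node, min_gain),
                "right": grow_tree(X[right], y[right], fit_node, min_gain)}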


An impurity measure is used to quantify the goodness of the split implemented by the trained model. Entropy, the Gini index and the misclassification error are three popular impurity measures [16]. In this study, entropy is used as the impurity measure due to its acceptable performance, justified by numerous experiments [3,17]. For a given node m, let p_m^i = N_m^i / N_m, where N_m denotes the total number of samples reaching that node and N_m^i is the number of those samples belonging to class w_i. For the two-class case, the node entropy is defined as [1]

    E_m = -\sum_{i=1}^{2} p_m^i \log_2 p_m^i.    (4)

Let p_{m,j}^i = N_{m,j}^i / N_{m,j}, where N_{m,j}^i denotes the number of samples belonging to class w_i that are classified as w_j (i.e., take branch j) and N_{m,j} denotes the total number of samples classified as w_j. In the two-class case, the total entropy after the split becomes

    E_m^* = -\sum_{j=1}^{2} \frac{N_{m,j}}{N_m} \sum_{i=1}^{2} p_{m,j}^i \log_2 p_{m,j}^i.    (5)
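For illustration, Eqs. (4)-(5) and the resulting impurity reduction can be computed as follows (a minimal Python sketch; branch membership is taken from the sign of the node's decision function):

    import numpy as np

    def entropy(y):
        """Node entropy E_m of Eq. (4) for a vector of class labels y."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def impurity_gain(y, scores):
        """Impurity reduction E_m - E*_m for the split induced by the signed
        scores f_m(x): scores < 0 go to the left branch, the rest right."""
        e_total = entropy(y)
        branches = [y[scores < 0], y[scores >= 0]]
        # E*_m of Eq. (5): branch entropies weighted by N_{m,j} / N_m.
        e_split = sum(len(b) / len(y) * entropy(b) for b in branches if len(b) > 0)
        return e_total - e_split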

In order to avoid overfitting and hence improve generalization, pruning is generally applied [18]. In the prepruning approach, the splitting process is terminated if the reduction in impurity is not significant, i.e., less than an a priori selected threshold. The number of training samples hitting the node may also be considered: if it is below a threshold, splitting may be terminated. Alternatively, the tree may be grown fully and postpruning applied. In this approach, all pairs of leaf nodes connected to the same internal node are considered for elimination; in particular, each subtree is replaced by a leaf if the leaf does not perform worse than the subtree on a pruning set. The main advantage of prepruning is its speed. Postpruning is generally argued to provide better results, but it requires a separate validation set that is not used during model estimation. In this study, prepruning is applied: splitting is stopped when E_m − E_m^* < θ, where the threshold θ is empirically set to 0.05.

In the two-class case, we need to find a decision model f_m(x) at node m corresponding to the best way of partitioning the data reaching that node into two classes; that is, samples belonging to different classes should be assigned to different branches. In the case of K > 2 classes, however, we must find the best way of partitioning the K classes into two groups. In order to avoid testing all possible partitions, a heuristic approach called the exchange method has been proposed [4]. Assuming that C = {C_1, ..., C_K} denotes the set of all classes reaching node m, the algorithm can be summarized as follows [10] (a code sketch is given at the end of this section):

1. Partition the classes in C into C_m^L and C_m^R, each containing K/2 classes.
2. Train f_m(x) to separate C_m^L and C_m^R. Compute the entropy E_0.
3. For k = 1 to K, form the partitions C^L(k) and C^R(k) by changing the assignment of C_k in the partitions C_m^L and C_m^R. Train f_m(x) to separate C^L(k) and C^R(k). Compute the decrease in entropy as ΔE_k = E_0 − E_k.
4. If the maximum reduction in entropy, provided by class C_{k*}, is greater than zero, set C_m^L = C^L(k*), C_m^R = C^R(k*) and go to step 2. Otherwise, exit the algorithm.

Instead of the random initial partitioning applied in step 1, the intermean distances may be taken into account. For instance, Yıldız and Alpaydın proposed to initially place the two classes with maximum intermean distance into C_m^L and C_m^R [10]. Then, the class having minimum intermean distance to either C_m^L or C_m^R is placed, and this is repeated until all classes are assigned to one of the two groups. In this study, we also applied this heuristic for initial partitioning.

Three types of decision tree classifiers are generated, each based on individual models of either the MLP, the linear multivariate perceptron (LMP) or Fisher's linear discriminant (FLD). In the first type, named MLPind, a single MLP is trained at each node. Assuming d inputs and two outputs corresponding to the two classes, the number of neurons in the hidden layer is selected as (d + 2)/2, i.e., (number of inputs + number of outputs)/2, as in Ref. [11]. The back-propagation algorithm is used for training and is run for 1000 iterations. In the second tree type, named LMPind, the linear multivariate perceptron algorithm is used; it is run for 500 iterations with the learning parameter set to 0.005. The third tree type, based on Fisher's linear discriminant, is named FLDind; in this case, the optimal hyperplane is computed analytically [3]. For the training of all three models, the PRTOOLS toolbox is used [19]. These decision tree classifiers based on individual models are used to quantify the improvements that can be achieved by model ensembles.
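The exchange heuristic summarized above can be sketched as follows (Python; train_and_entropy is a hypothetical helper that trains f_m to separate the two class groups and returns the split entropy E*_m of Eq. (5)):

    def exchange_method(classes, train_and_entropy):
        """Greedily partition the K classes reaching a node into two groups."""
        K = len(classes)
        CL, CR = set(classes[:K // 2]), set(classes[K // 2:])   # step 1
        E0 = train_and_entropy(CL, CR)                          # step 2
        while True:
            best_gain, best_partition = 0.0, None
            for c in classes:                                   # step 3
                # Move class c to the opposite group and re-evaluate the split.
                L = (CL - {c}) if c in CL else (CL | {c})
                R = (CR | {c}) if c in CL else (CR - {c})
                if not L or not R:                              # keep both groups non-empty
                    continue
                gain = E0 - train_and_entropy(L, R)
                if gain > best_gain:
                    best_gain, best_partition = gain, (L, R)
            if best_partition is None:                          # step 4: no positive gain
                return CL, CR
            CL, CR = best_partition
            E0 = train_and_entropy(CL, CR)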

3. Proposed approach: decision trees using model ensembles

The bias-variance decomposition of error is an important tool for studying different modeling techniques [20]. Simple models generally have high bias (deviation from the best-fitting model) and low variance, whereas complex models have higher variance and lower bias. High bias may lead to underfitting, whereas high variance may lead to overfitting. A model is said to fit the classification problem under concern if it provides a good bias-variance tradeoff. In the case of decision trees, since the classification complexity changes from one node to another, it can be argued that the complexity of the model should be tuned for each node.

In the model ensemble approach, a number of different high-variance methods are used simultaneously [17]. Model ensembles may also be analyzed in terms of the bias-variance decomposition of error, where the strength of an ensembling technique is measured by its ability to reduce bias and/or variance, leading to a smaller mean square error [21]. For instance, voting over models having high variance is shown to lower the variance and hence the error. Voting is also helpful for biased members, since the reduction in variance may offset the bias [17].


In the case of finite ensemble size, an alternative decomposition of the ensemble error is the ambiguity decomposition [22]. It is shown that the combined ensemble error equals the difference between the average error of the individual members and their ambiguity A (or diversity): E_ens = E_avg − A. Ambiguity measures the amount of variability among the ensemble members. The average accuracy of the individual members and their diversity are two objective functions that should be simultaneously maximized in ensemble design. A model with high variance, also referred to as unstable, tends to produce members with high ambiguity when trained under different conditions [23].

The ensembling approach has proven its strength in many classification problems, and various techniques have been developed [24–26]. In particular, instead of selecting the best model class, decision functions belonging to different model classes can be trained in parallel and their outputs combined. The same model class may also be used, with the degree of complexity and/or some parameters of each decision function set differently. For instance, several MLPs may be trained, each using a different number of hidden neurons, a different initial set of weights or a different number of iterations. Similarly, instead of computing the best feature subset, a different feature subspace may be used by each decision function. These differences are expected to provide diversity among the members of the ensemble, which is essential for achieving an improved generalization ability compared to the individual members [24,27–29]. Although there is no universally accepted definition of diversity, it is generally agreed that the models should be different in the sense that their errors lie in different parts of the input space.

In order to achieve a good bias-variance tradeoff for the classification problems of varying complexity that occur in designing a decision tree, ensembling unstable models is considered in this study. The MLP is one of the selected models due to its unstable nature [30]. Similarly, the LMP is an unstable classifier, since its solution can go from best to worst in one iteration [31]. FLD is also considered, since the random subspace technique is shown to improve its performance on critical training set sizes, which may be the case at lower-level nodes [32].

Two approaches are considered in generating model ensembles. First, the same model class, i.e., either MLP, LMP or FLD, is used to generate component models, where the degree of complexity, initialization conditions and parameter settings of each component are different. The decision of an ensemble-based node can be represented as

    \tilde{f}_m(x) = \sum_{i=1}^{E} \mathrm{sign}\{\psi_{mi}(x)\},    (6)

where

    \mathrm{sign}\{a\} = \begin{cases} -1, & a < 0, \\ +1, & \text{otherwise,} \end{cases}    (7)

ψ_{mi}(x) denotes a decision function belonging to one of the model classes mentioned above, and the number of component models, i.e., the ensemble size, is denoted by E. The combination operation in Eq. (6) corresponds to voting over the decisions of the component models. If \tilde{f}_m(x) < 0, then x is assigned to the left branch, and to the right branch otherwise. As an extension to the proposed modeling approach, ensembling decision functions derived from the three different model classes is also studied. More specifically, some of the component models ψ_{mi}(x) are allowed to be linear (FLD or LMP) while others are MLP-based nonlinear models.

The training of MLP ensemble-based nodes is illustrated in Fig. 1. In the given algorithm, the feature subspace size, named subFeatureSize, is randomly selected in the interval [d/3, 2d/3]. This number of features is then arbitrarily selected to form the feature space of the component model. It has been shown that the random subspace method, using different randomly selected feature subsets, may construct better classifiers by avoiding redundant features [33]. Furthermore, the random subspace method is also effective when the number of training samples is comparable to the number of features (the small sample size problem), which may occur at internal nodes close to the leaf nodes. For each MLP-based component model, the initial weights and the number of hidden neurons are randomly selected; the number of neurons in the hidden layer is drawn from the interval [(2/3) × subFeatureSize, (4/3) × subFeatureSize]. Using different initial conditions and model structures, the instance subspace is represented by several different formalisms, which are expected to contribute to accurate representation of different splitting problems [13].

In the random subspace approach, the individual accuracies achieved by some of the component models may be insufficient, since some of the features may be irrelevant to the learning target [34]. In order to avoid this, the model generation process is repeated 3 × E times, and the E models providing the highest individual accuracies on the training set reaching that node are selected. Ideally, the individually best models having large diversity should be selected for combination; for instance, classifier selection techniques can be applied for optimal selection of a model subset that provides the maximum combined accuracy [35]. However, since only a limited number of samples may reach some nodes, this approach may lead to overfitting. Taking into account that the diversity measures considered so far do not always correlate with the combined accuracies, and assuming that the perturbations applied in model generation lead to uncorrelated decisions, only the individual accuracies are considered in model selection. During testing, the selected models are combined using majority voting. The decision tree involving multiple MLP-based models is referred to as MLPens in what follows.

Training of perceptron ensemble-based nodes is presented in Fig. 2. The feature subspace is selected in the same way as in the MLP case. In order to create differences between component models, the number of iterations of the perceptron algorithm is randomly selected in the interval [300, 600], and the learning factor is randomly selected in the interval [0.001, 0.01]. The decision tree involving LMP ensemble-based internal nodes is referred to as LMPens in what follows. Training of Fisher's linear discriminant ensemble-based nodes is presented in Fig. 3.


Neural tree node training
for i = 1 to 3E
    subFeatureSize  = RandNumber(d/3, 2d/3);
    subspace(i)     = RandomSubspace(subFeatureSize);
    hiddenLayerSize = RandNumber((2/3) subFeatureSize, (4/3) subFeatureSize);
    model(i)        = TrainNeuralNetwork(trainData, subspace(i), hiddenLayerSize);
end
ChooseBestModelSubset(E);

Fig. 1. Generation of an MLP ensemble-based node.

Perceptron tree node training
for i = 1 to 3E
    subFeatureSize = RandNumber(d/3, 2d/3);
    subspace(i)    = RandomSubspace(subFeatureSize);
    iterNumber     = RandNumber(300, 600);
    learningRate   = RandNumber(0.001, 0.01);
    model(i)       = TrainPerceptron(trainData, subspace(i), iterNumber, learningRate);
end
ChooseBestModelSubset(E);

Fig. 2. Generation of a perceptron ensemble-based node.

Fisher's linear discriminant tree node training
for i = 1 to 3E
    subFeatureSize = RandNumber(d/3, 2d/3);
    subspace(i)    = RandomSubspace(subFeatureSize);
    model(i)       = TrainFisher(trainData, subspace(i));
end
ChooseBestModelSubset(E);

Fig. 3. Generation of a Fisher's linear discriminant ensemble-based node.

The feature subspace is selected in the same way as before. Since the optimal hyperplane is computed analytically, the only difference among the models is in their feature subspaces. The decision tree involving FLD ensemble-based internal nodes is referred to as FLDens in what follows. In all decision tree implementations involving model ensemble-based nodes, the impurity measure and stopping criterion are the same as in the trees using individual models.

Training of nodes based on ensembles of multiple model classes involves training E MLP-based models, E LMP-based models and E FLD-based models, and then selecting the best subset of size E. When several model classes are considered simultaneously, training-accuracy-based selection may not be appropriate, since the most complex model is expected to win in almost all cases due to overfitting, even when it is not significantly better than a simpler model, thus providing low generalization ability. In order to avoid this, Yıldız and Alpaydın proposed the use of the 5 × 2 cross-validation F-test for selecting either a univariate, multivariate linear or nonlinear model at each node [10]. Alternatively, Li et al. proposed a novel classifiability measure for model selection [36]. The measure is based on the complexity at each node; compared to the 5 × 2 cross-validation F-test, this approach is fast and provides comparable performance.

In our simulation experiments, we studied two different approaches for model selection in the case of multiple model classes. In order to investigate its appropriateness, individual accuracies on the training data are considered first. Cross-validation-based model selection is considered as the second approach, using the three-hold-out method [21]: the training data reaching each node are randomly partitioned into two equal parts, one used for training and the other for testing; this procedure is repeated three times, and the average accuracy is recorded for all 3 × E models. A subset of E models providing the highest accuracies is then selected. In cases where two different models provide the same highest accuracy, our algorithm selects the simplest model, where LMP is assumed to be simpler than FLD.
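The node construction described above and in Figs. 1-3 can be summarized in the following sketch (Python with scikit-learn-style estimators; the helper names and the use of MLPClassifier are illustrative assumptions rather than the paper's PRTOOLS implementation, and the two class groups are assumed to be labeled 0 and 1):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    def train_mlp_ensemble_node(X, y, E=25):
        """Generate 3E candidate MLPs on random subspaces and keep the E
        models with the highest training-set accuracies (ChooseBestModelSubset)."""
        d = X.shape[1]
        candidates = []
        for _ in range(3 * E):
            k = int(rng.integers(max(1, d // 3), max(2, 2 * d // 3 + 1)))  # size in [d/3, 2d/3]
            subspace = rng.choice(d, size=k, replace=False)
            hidden = int(rng.integers(max(1, 2 * k // 3), max(2, 4 * k // 3 + 1)))
            clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=1000)
            clf.fit(X[:, subspace], y)
            candidates.append((clf.score(X[:, subspace], y), subspace, clf))
        candidates.sort(key=lambda t: t[0], reverse=True)
        return candidates[:E]

    def three_hold_out_accuracy(make_model, X, y):
        """ALLcv-style estimate: mean accuracy over three random half/half splits."""
        accs = []
        for _ in range(3):
            idx = rng.permutation(len(y))
            tr, te = idx[:len(y) // 2], idx[len(y) // 2:]
            clf = make_model().fit(X[tr], y[tr])
            accs.append(clf.score(X[te], y[te]))
        return float(np.mean(accs))

    def node_decision(models, x):
        """Eq. (6): sign-vote over the selected members; -1 -> left, +1 -> right."""
        votes = sum(+1 if clf.predict(x[subspace].reshape(1, -1))[0] == 1 else -1
                    for _, subspace, clf in models)
        return -1 if votes < 0 else +1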

Table 1. Description of the datasets.

Dataset        Classes   Instances   Features (d)
credit_g       2         1000        24
ionosphere     2         351         34
pima           2         768         8
sonar          2         208         60
vote           2         435         16
wdbc           2         569         30
phoneme        2         5404        5
cancer         2         683         9
heart          2         303         13
australian     2         690         42
dermatology    6         366         34
waveform21     3         5000        21
waveform40     3         5000        40
wine           3         178         13
image          7         2310        19
iris           3         150         4
ecoli          8         336         7
cmc            3         1473        9
dna            3         2000        180
monkey1        2         556         17

Table 2. Comparison of generalization accuracies for individual and ensemble models of the multi-layer perceptron.

Dataset        MLPind         MLPens, E=5    MLPens, E=15   MLPens, E=25
credit_g       70.18 ± 2.72   74.08 ± 0.82   75.64 ± 1.19   75.86 ± 1.47
ionosphere     89.66 ± 2.37   91.60 ± 1.68   92.63 ± 2.39   92.97 ± 1.89
pima           73.65 ± 2.06   75.70 ± 2.31   76.02 ± 1.82   76.74 ± 1.27
sonar          80.29 ± 2.97   80.97 ± 4.87   83.20 ± 2.55   82.14 ± 2.43
vote           92.30 ± 1.19   93.69 ± 0.97   94.01 ± 1.11   94.24 ± 1.05
wdbc           96.83 ± 0.85   97.32 ± 1.00   96.97 ± 1.33   97.36 ± 1.43
phoneme        81.56 ± 1.47   79.63 ± 1.85   79.75 ± 0.63   79.74 ± 0.45
cancer         96.42 ± 0.80   96.86 ± 0.75   97.10 ± 0.75   97.10 ± 0.89
heart          75.89 ± 3.16   80.00 ± 2.02   81.72 ± 2.74   81.66 ± 2.48
australian     81.08 ± 1.65   84.68 ± 1.93   85.15 ± 0.59   85.15 ± 1.04
dermatology    94.62 ± 1.61   95.60 ± 1.68   96.87 ± 1.07   97.03 ± 1.13
waveform21     83.31 ± 0.72   84.19 ± 0.40   85.19 ± 0.32   85.40 ± 0.44
waveform40     82.70 ± 0.51   83.83 ± 0.52   85.15 ± 0.43   85.07 ± 0.28
wine           97.73 ± 1.86   96.70 ± 1.56   97.61 ± 1.65   97.61 ± 1.46
image          94.84 ± 0.85   96.17 ± 0.53   96.27 ± 0.43   96.40 ± 0.54
iris           95.33 ± 2.11   93.73 ± 1.78   94.40 ± 2.58   94.53 ± 2.03
ecoli          82.71 ± 2.63   83.67 ± 1.51   84.16 ± 2.25   84.34 ± 2.34
cmc            50.16 ± 2.05   50.86 ± 1.95   50.76 ± 2.20   51.63 ± 1.39
dna            87.60 ± 0.91   88.10 ± 0.82   91.16 ± 1.08   91.69 ± 0.88
monkey1        99.14 ± 1.83   97.55 ± 2.38   99.35 ± 1.38   99.10 ± 1.91

4. Experimental results

In order to evaluate the proposed algorithm, experiments are conducted on 20 datasets from the UCI machine learning repository [37]. A description of these datasets, including the number of classes, samples and feature space dimensionality, is presented in Table 1. For each dataset, the experiments are repeated 10 times using randomly generated training and test sets. For the "phoneme", "waveform21", "waveform40" and "image" datasets, 20% of the available data is used for training and the rest for testing; for the other datasets, the training and test sets contain equal numbers of instances.

The experiments are initially performed for each individual model type separately; that is, the ensembles are generated by applying perturbations to only one model class. The average accuracies and standard deviations obtained are given in the second columns of Tables 2-4 for MLPind, LMPind and FLDind, respectively. The following three columns present the results of the ensemble-based nodes for E = 5, 15 and 25. In order to assess the statistical significance of the accuracy improvements provided by the proposed approach, hypothesis tests are performed using the t-test. The null hypothesis is "H0: the mean of the improvement is equal to zero" and the alternative hypothesis is "H1: the mean of the improvement is greater than zero". The tests are performed between the trees utilizing individual models and their model ensemble-based forms for E = 25. The accuracies for which the null hypothesis is rejected at the 95% confidence level are printed in boldface in the tables.

When averaged over all datasets, the performance of MLPind and FLDind are comparable, providing 85.30% and 84.07% accuracy, respectively; LMPind provides 76.79% average classification accuracy.
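The described test can be reproduced along the following lines (a sketch using SciPy 1.6 or later; the accuracy arrays are illustrative placeholders, not values from the paper):

    import numpy as np
    from scipy import stats

    # Ten paired per-run test accuracies for one dataset (illustrative numbers).
    acc_ind = np.array([70.1, 69.5, 71.0, 70.4, 69.8, 70.9, 70.2, 69.9, 70.6, 70.3])
    acc_ens = np.array([75.6, 75.0, 76.2, 75.9, 75.3, 76.0, 75.7, 75.4, 76.1, 75.8])

    # H0: mean improvement is zero; H1: mean improvement is greater than zero.
    t_stat, p_value = stats.ttest_rel(acc_ens, acc_ind, alternative="greater")
    print(p_value < 0.05)   # True -> reject H0 at the 95% confidence level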

Table 3. Comparison of generalization accuracies for individual and ensemble models of the linear multivariate perceptron.

Dataset        LMPind          LMPens, E=5    LMPens, E=15   LMPens, E=25
credit_g       70.36 ± 1.14    70.40 ± 0.85   70.00 ± 0.00   70.00 ± 0.00
ionosphere     83.49 ± 2.73    85.03 ± 1.59   85.14 ± 2.14   87.49 ± 3.39
pima           64.71 ± 1.24    65.10 ± 0.00   65.10 ± 0.00   65.10 ± 0.00
sonar          73.69 ± 6.64    76.50 ± 3.60   76.60 ± 5.89   75.92 ± 5.17
vote           91.24 ± 1.76    93.00 ± 1.23   94.61 ± 1.17   94.75 ± 0.82
wdbc           88.59 ± 4.04    89.54 ± 2.31   89.96 ± 2.53   91.09 ± 1.30
phoneme        72.08 ± 1.93    72.33 ± 2.61   72.16 ± 2.46   71.20 ± 1.71
cancer         96.16 ± 0.58    96.72 ± 0.60   96.80 ± 0.63   96.83 ± 0.60
heart          57.42 ± 3.51    62.12 ± 8.76   69.40 ± 8.86   64.50 ± 7.74
australian     62.06 ± 5.11    77.50 ± 8.29   78.66 ± 3.58   79.65 ± 4.07
dermatology    92.09 ± 4.63    95.99 ± 1.64   96.43 ± 1.74   96.65 ± 1.28
waveform21     80.30 ± 1.29    81.23 ± 1.05   82.04 ± 1.54   82.90 ± 1.04
waveform40     79.95 ± 1.23    81.44 ± 1.35   83.29 ± 0.35   83.00 ± 0.73
wine           64.77 ± 2.98    89.43 ± 8.59   91.36 ± 2.79   92.50 ± 1.53
image          82.47 ± 4.09    89.32 ± 1.93   90.15 ± 1.23   90.97 ± 1.22
iris           96.27 ± 1.64    95.20 ± 2.45   95.07 ± 2.36   94.93 ± 2.16
ecoli          80.90 ± 1.33    81.93 ± 1.68   82.95 ± 1.93   82.71 ± 1.63
cmc            42.72 ± 0.00    42.61 ± 0.34   42.69 ± 0.09   42.87 ± 0.47
dna            89.36 ± 0.76    90.48 ± 2.18   90.18 ± 1.36   90.40 ± 0.59
monkey1        67.16 ± 11.93   74.89 ± 1.65   74.89 ± 1.56   74.89 ± 1.56

As can be seen in the tables, ensemble-based models provide improved accuracies in the majority of cases. MLPens improved the accuracies for 16 datasets, and the improvements are statistically significant for 15 of them. It is not surprising that all ensemble-based models failed to provide any improvement for the "phoneme" and "iris" datasets for E = 25.

Table 4. Comparison of generalization accuracies for individual and ensemble models of Fisher's linear discriminant.

Dataset        FLDind         FLDens, E=5    FLDens, E=15   FLDens, E=25
credit_g       75.28 ± 1.75   74.78 ± 0.82   75.16 ± 1.14   75.00 ± 1.32
ionosphere     87.03 ± 3.49   89.20 ± 2.45   88.74 ± 2.72   88.80 ± 2.50
pima           77.19 ± 1.52   76.64 ± 1.08   77.16 ± 1.28   77.21 ± 1.50
sonar          64.66 ± 5.53   73.20 ± 2.68   74.47 ± 3.43   75.24 ± 3.92
vote           94.79 ± 1.09   95.02 ± 0.86   95.07 ± 1.04   94.70 ± 0.85
wdbc           95.25 ± 1.24   95.46 ± 1.50   95.60 ± 1.27   95.49 ± 1.42
phoneme        77.34 ± 1.79   73.90 ± 2.29   74.77 ± 1.88   74.50 ± 1.61
cancer         96.48 ± 0.24   96.22 ± 0.49   96.16 ± 0.40   96.63 ± 0.32
heart          82.91 ± 2.02   81.85 ± 2.95   82.78 ± 3.14   82.98 ± 2.63
australian     86.42 ± 0.91   86.83 ± 1.01   86.95 ± 1.03   87.24 ± 0.78
dermatology    95.60 ± 0.97   96.43 ± 1.42   96.92 ± 1.49   96.65 ± 1.36
waveform21     82.44 ± 0.55   81.86 ± 0.53   82.32 ± 0.54   82.45 ± 0.31
waveform40     81.17 ± 0.69   81.70 ± 0.59   82.27 ± 0.49   82.28 ± 0.44
wine           97.73 ± 1.20   96.25 ± 1.94   96.82 ± 1.40   96.82 ± 1.17
image          93.70 ± 0.56   94.03 ± 0.68   94.50 ± 0.53   94.43 ± 0.48
iris           96.67 ± 1.30   94.67 ± 1.54   95.47 ± 1.12   95.60 ± 1.26
ecoli          84.58 ± 1.78   84.52 ± 2.75   85.30 ± 2.24   84.88 ± 1.89
cmc            47.10 ± 1.13   47.46 ± 1.15   46.90 ± 1.87   46.64 ± 1.81
dna            90.28 ± 0.93   91.99 ± 0.80   92.60 ± 0.76   92.62 ± 0.82
monkey1        74.82 ± 1.44   74.86 ± 1.53   74.93 ± 1.60   74.93 ± 1.60

Table 5. Comparison of error reduction rates (in %) for the different model types, where E = 25.

Dataset        MLPens   LMPens   FLDens
credit_g       19.05    −1.21    −1.13
ionosphere     32.04    24.22    13.66
pima           11.76    1.11     0.11
sonar          9.36     8.49     29.94
vote           25.15    40.00    −1.77
wdbc           16.67    21.91    5.19
phoneme        −9.90    −3.15    −12.57
cancer         18.85    17.56    4.17
heart          23.90    16.64    0.39
australian     21.51    46.36    6.00
dermatology    44.90    57.64    23.75
waveform21     12.47    13.23    0.04
waveform40     13.71    15.20    5.89
wine           −5.00    78.71    −40.00
image          30.29    48.49    11.67
iris           −17.14   −35.72   −32.00
ecoli          9.41     9.46     1.95
cmc            2.94     0.26     −0.87
dna            33.01    9.78     24.10
monkey1        −4.17    23.55    0.43

The main reason for this is that the number of features in these datasets is only 5 and 4, respectively, which is too small for the random subspace technique: the models obtained are not diverse, due to overlapping feature subspaces, violating one of the requirements for good ensemble design. It can also be seen in Table 2 that the accuracies generally increase as the ensemble size grows from 5 to 15, and further improvement is achieved for the majority of the datasets when E = 25. In a similar way, when E = 25 is considered for LMPens, improved accuracies are obtained for all datasets except "credit_g", "phoneme" and "iris", as can be seen in Table 3; the improvements are statistically significant for 14 datasets. Similarly, FLDens provided improved accuracies for 14 of the 20 datasets (70%), as can be seen in Table 4; for six of these datasets, the improvements are statistically significant.

The improvements achieved by the proposed approach can be seen more easily by analyzing the reduction in classification error presented in Table 5 [34]. The error reduction (or improvement) of a model ensemble-based decision tree is defined as

    improvement% = \frac{error_{ind} − error_{ens}}{error_{ind}} \times 100,    (8)

where error_ens denotes the average error rate achieved by the model ensemble-based decision tree over the 10 independent simulations and error_ind denotes the average error obtained using individual model-based nodes. As seen in the table, MLPens and LMPens provide the largest improvements in general. This is mainly due to the diversity among models achieved by perturbing a wider set of design parameters, such as the number of hidden nodes, the initialization of the weights and the number of iterations.
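As a worked instance of Eq. (8), take the "credit_g" row of Tables 2 and 5: MLPind attains 70.18% accuracy (error 29.82%) and MLPens with E = 25 attains 75.86% (error 24.14%), so the error reduction is (29.82 − 24.14)/29.82 × 100 ≈ 19.05%, the first MLPens entry of Table 5.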

Although the average accuracy provided by FLDind (84.07%) is comparable to that of MLPind (85.30%), smaller improvements are generally achieved by the FLD model ensembles. The main reason is that the only difference among the ensemble members is in the subspaces considered. It is known that the error reduction achieved by ensembles is mainly due to the reduction of model variance [38]; indeed, the diversity of the models is an important factor in designing ensemble-based nodes with high generalization ability. It has already been shown that random subspace ensembles of Fisher's linear discriminant classifier are superior to an individual classifier using the whole feature space when the training set size is comparable to the feature dimensionality [32]. As seen in the last column of Table 5, "ionosphere", "sonar" and "dermatology" are among the datasets for which the maximum reduction in error rate is achieved. These datasets have high-dimensional feature spaces, so the models obtained at lower-level nodes, trained on smaller numbers of samples, may suffer from the curse of dimensionality. The random subspace method provides superior ensemble-based models in such cases, since it relatively increases the number of training samples [32]. Despite their high-dimensional feature spaces, the reduction is comparatively smaller for the "wdbc", "australian" and "waveform40" datasets. This is mainly because the number of training samples is large for these datasets; in such cases, FLD becomes a strong and stable classifier. It should be noted that, although the number of training samples of the "dna" dataset is even larger, a larger error reduction (24.10%) is achieved there. It can be argued that the FLD models obtained at lower-level nodes may still suffer from the curse of dimensionality due to the 180-dimensional feature vectors, a dimensionality much larger than that of the other datasets.

Table 6. Comparison of tree sizes measured by the number of internal nodes (leaf nodes are not included).

Dataset        MLPind   MLPens   LMPind   LMPens   FLDind   FLDens
credit_g       3.4      3.0      1.0      1.0      3.4      3.1
ionosphere     1.6      1.2      2.7      3.4      2.4      2.8
pima           5.6      2.4      1.0      1.0      1.0      1.3
sonar          1.0      1.0      3.7      3.4      2.7      2.9
vote           1.0      1.3      2.4      3.6      2.3      2.4
wdbc           1.4      1.1      1.3      1.0      1.4      1.3
phoneme        7.5      4.2      1.1      1.0      3.6      2.1
cancer         1.7      1.8      2.3      1.5      2.6      2.3
heart          2.2      2.7      1.3      1.2      1.9      1.5
australian     2.4      2.9      1.6      2.4      2.8      2.9
dermatology    5.0      5.0      5.9      5.5      5.5      6.2
waveform21     12.9     8.6      17.0     6.0      3.7      2.0
waveform40     10.8     9.3      18.7     7.4      3.6      2.1
wine           2.0      2.0      2.1      3.3      2.1      2.5
image          9.2      9.2      18.3     14.7     12.4     11.6
iris           2.3      2.2      2.4      2.8      2.0      2.0
ecoli          11.3     11.8     13.7     11.8     9.6      9.6
cmc            19.5     13.4     1.0      1.0      5.7      3.9
dna            2.0      2.0      2.0      5.0      5.1      6.1
monkey1        1.0      1.1      1.1      1.0      1.1      1.0

Average        5.2      4.3      5.0      3.9      3.7      3.5

Yıldız and Alpaydın have shown that as the complexity of the model increases, the size (number of nodes) of the tree decreases [11]. On the other hand, they also argued that more complex nodes may result in more accurate trees [10]. In order to investigate the changes in the number of internal nodes when individual nodes are replaced by ensemble-based ones, consider Table 6, where E = 25 is used for the ensemble-based nodes. For all three model classes, the average number of internal nodes decreases. When the sizes of the trees generated for individual datasets are examined, the tree sizes either decrease or remain the same for the majority of the datasets. However, the size may also increase in some cases. For instance, when LMPens is applied to the "ionosphere" dataset, the average number of nodes increases from 2.7 to 3.4. This can be explained as follows. Since prepruning is applied, node generation is terminated if the reduction in impurity is below the selected threshold. When the classification problem corresponding to a node is more complex than the decision model, a reduction in impurity may not be achieved and the node generation procedure terminates. In such cases, a more complex model can reduce the impurity further, leading to trees with more nodes.

The effectiveness of model ensembles based on multiple model classes is also studied. As mentioned in Section 3, two methods are considered for computing the individual model accuracies. In the first approach, out of 3 × E models, with equal numbers of models from the MLP, LMP and FLD model classes, the subset of E models providing the maximum individual training-data-based accuracies is selected. This system is referred to as ALLperf in what follows. The second approach, involving cross-validation-based accuracy estimation using the three-hold-out method described in Section 3, is referred to as ALLcv. For E = 25, the experimental results obtained are presented in Table 7. The average accuracies and standard deviations provided by ALLperf are presented in the second column of the table, and the average numbers of LMP-, FLD- and MLP-based nodes used in the 10 independent runs are presented in the following three columns. The last four columns show the accuracies achieved by ALLcv and the average numbers of models used. The sums and percentages of the model types selected are given in the last two rows. As seen in the table, cross-validation-based model selection selects fewer MLP models (32.74%) than training-accuracy-based model selection (54.49%). In the case of cross-validation, because half of the available data is held out during model training, simpler models may generalize better and thus attain higher validation accuracies; MLP models are more likely to overfit, leading to their less frequent selection.

The test accuracies that are significantly better than those of MLPind in Table 2 are printed in boldface. When compared to MLPind, ALLperf provides better accuracies for 16 datasets; the improvements are statistically significant for 14 of them. Similarly, ALLcv provides better accuracies for 15 datasets, with statistically significant improvements for 12. For a better visualization of the improvements achieved by ALLperf and ALLcv, consider Fig. 4, where MLPens with E = 25, using only the MLP model, is also included. The datasets "phoneme", "iris" and "monkey1" are excluded, since none of the three methods could improve on MLPind there. As seen in the figure, ALLperf and ALLcv provide the overall best accuracy for five datasets, whereas MLPens achieves the overall best for six datasets. These results show that the overall best accuracy is achieved by using multiple model classes for 10 of these 17 datasets. It should be noted that maximum and equal improvements are achieved by ALLperf and MLPens on the "wdbc" dataset. Out of the 17 datasets, the improvements achieved by MLPens are surpassed by ALLperf for eight. These observations support the point made in Section 3 that diversity is important in the design of ensembles and that different modeling perspectives can increase the diversity among the individuals.

ALLperf provides better accuracies than ALLcv for 11 datasets, whereas ALLcv performs better for five; for the "waveform40" dataset they provide the same average accuracy. It may be surprising that ALLcv performs worse than ALLperf. Its inferior performance can be attributed mainly to the small sample size problem encountered at internal nodes close to the leaves: since the number of samples reaching these nodes is small and the cross-validation approach uses only half of them for training, reliable models cannot be obtained [36]. On the other hand, even if the component models selected by ALLperf are overfitted, a diverse set of such models, obtained by applying various types of perturbations, may provide good generalization ability [39-41]. Since ALLperf performs better than MLPens on eight of the 17 datasets, it can be concluded that models other than MLP are also selected when the accuracy on the training data is taken into account during model selection.

Table 7. The average accuracies achieved with the use of multiple model classes and two different model selection criteria: training-data-based accuracies (ALLperf) and cross-validation-based accuracies (ALLcv). The average numbers of selected models are also provided, where E = 25.

               ALLperf                                     ALLcv
Dataset        Accuracy        LMP     FLD     MLP         Accuracy        LMP    FLD     MLP
credit_g       75.04 ± 1.33    3.8     3.6     55.1        73.92 ± 1.58    9.0    26.7    6.8
ionosphere     93.20 ± 1.71    19.7    1.0     29.3        91.94 ± 1.86    1.4    6.6     22.0
pima           77.11 ± 1.63    1.8     12.6    20.6        76.93 ± 1.20    0.0    12.0    13.0
sonar          82.43 ± 3.31    0.0     0.0     25.0        81.07 ± 2.48    6.0    2.1     16.9
vote           94.19 ± 1.13    20.9    4.7     34.4        95.02 ± 0.78    16.2   29.3    17.0
wdbc           97.36 ± 1.04    0.7     2.4     24.4        96.90 ± 1.22    0.0    9.7     15.3
phoneme        77.85 ± 0.89    5.1     15.1    39.8        76.63 ± 0.62    5.2    20.2    29.6
cancer         97.33 ± 0.68    4.8     1.5     18.7        97.16 ± 0.48    7.3    5.7     12.0
heart          83.05 ± 1.95    12.0    6.4     51.6        83.64 ± 2.63    0.3    17.8    6.9
australian     86.34 ± 1.17    2.5     6.7     40.8        86.80 ± 1.08    10.9   26.8    9.8
dermatology    97.42 ± 1.04    80.6    11.9    32.5        97.64 ± 1.37    48.1   51.5    27.9
waveform21     85.12 ± 0.52    26.6    15.8    105.1       85.03 ± 0.53    0.2    21.8    28.0
waveform40     84.61 ± 0.24    93.5    13.9    162.6       84.61 ± 0.41    2.7    22.4    27.4
wine           97.84 ± 1.46    7.1     23.1    19.8        97.73 ± 2.00    2.8    30.6    16.6
image          95.63 ± 0.77    82.3    55.6    109.6       95.28 ± 0.35    58.1   103.6   75.8
iris           94.67 ± 1.54    30.0    10.9    14.1        95.20 ± 1.80    22.4   19.3    8.3
ecoli          84.22 ± 2.21    139.6   54.5    105.9       85.30 ± 1.45    93.8   82.0    51.7
cmc            47.65 ± 1.53    13.5    31.8    59.7        45.97 ± 1.94    1.1    14.3    14.6
dna            91.41 ± 0.63    16.9    0.0     33.1        90.96 ± 0.71    55.1   33.8    13.6
monkey1        89.06 ± 6.10    16.6    1.6     36.8        85.43 ± 8.15    5.1    16.0    23.9

Σ                              578.0   273.1   1018.9                      345.7  552.2   437.1
%                              30.91   14.60   54.49                       25.90  41.36   32.74

In order to investigate this further, we performed additional experiments. In particular, we implemented the main idea of omnivariate decision trees, where the best individual model is selected: three decisions are generated at each node, each belonging to a different model class, and the model providing the best training accuracy is selected. The design parameters of each model class are the same as when it is used alone (see Figs. 1-3). We observed that, at lower-level nodes, where the number of samples is small and the classification problem is comparatively simpler, the perceptron algorithm or Fisher's linear discriminant classifier may provide the same accuracy as the MLP. Moreover, class imbalance may occur at nodes close to the leaves, where the number of samples belonging to one class may be much smaller than that of the other. In such cases, since the convergence rate of the MLP is very low for minority samples [42], the accuracy provided by the MLP may be equivalent to or less than that of a simpler model. For instance, out of 56 nodes generated in 10 independent simulations on the "pima" dataset, a perceptron-based model is selected 15 times and a Fisher's linear discriminant-based model six times. At two nodes of the "wdbc" and "phoneme" datasets, LMP provided better accuracies than FLD and MLP. Since simpler models are selected when they perform as well as complex ones, the resulting decision trees do not always consist of complex models only.

The best-individual-model selection method is applied to all datasets using two different model selection criteria. In the first approach, accuracies on the training data are used at each node, taking into account all three model classes as described above. The experimental results are presented in the second column of Table 8; the average numbers of LMP-, FLD- and MLP-based nodes in the omnivariate trees are presented in the following three columns. In the second approach, cross-validation is applied, where the three-hold-out accuracy estimate of each model type is used for model selection. The last four columns show the accuracies achieved and the numbers of models used in this case. The sums and percentages of the model types selected are given in the last two rows. As seen in the table, cross-validation-based model selection selects fewer MLP models (36.91%) than training-accuracy-based model selection (60.33%), as in the case when all model classes are considered in ensembles. However, since individual models are used, the overfitting problem caused by training-accuracy-based selection cannot be avoided, leading to trees with lower generalization ability than under cross-validation-based model selection.

The test accuracies are compared to those of MLPind, and statistically better performances are printed in boldface. Out of the 20 datasets, the training-accuracy-based selection approach provided better accuracies than MLPind for eight datasets, with a significantly better accuracy only for the "dna" dataset. Cross-validation-based model selection performs better, providing improved accuracies on 12 datasets, with statistically significant improvements for eight of them. However, comparing Tables 7 and 8, ALLperf performs better than the cross-validation-based omnivariate decision tree for 16 datasets, with statistically significant improvements for 10 of them.
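Reduced to code, this best-individual-model selection amounts to the following sketch (Python; the entries of trainers stand for the LMP, FLD and MLP training routines configured as in Figs. 1-3 and returning scikit-learn-style fitted estimators; ordering them from simplest to most complex is an assumption that implements the tie-breaking in favor of simpler models described earlier):

    def select_best_individual(X, y, trainers):
        """Train one candidate per model class and keep the one with the best
        training-set accuracy. trainers is ordered simplest -> most complex
        (e.g. [train_lmp, train_fld, train_mlp]) so that the strict
        inequality below resolves ties in favor of the simpler model."""
        best_acc, best_clf = -1.0, None
        for train in trainers:
            clf = train(X, y)           # returns a fitted estimator
            acc = clf.score(X, y)       # accuracy on the node's training data
            if acc > best_acc:
                best_acc, best_clf = acc, clf
        return best_clf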

[Fig. 4. The average improvements achieved by MLPens and by using multiple model classes for generating model ensemble-based nodes (ALLperf and ALLcv) for E = 25. Bar chart; horizontal axis: error reduction rate (in %); vertical axis: the 17 datasets.]

A pairwise comparison of all four approaches is presented in Table 9, where the value in the ith row and jth column is the number of datasets on which the method in the ith row is statistically significantly better than the method in the jth column. These results suggest the following observation: individual model selection is risky, and sophisticated selection techniques, as applied in Refs. [11] and [36], may be necessary to achieve significant improvements. The ensembling-based model selection technique proposed in this study does not require such efforts, and good model subsets can be obtained by perturbing the models and input parameters. The additional cost of the proposed approach is the increased computational demand during training and testing of the ensembles. However, since the models operate in parallel and are decoupled from each other, their training and testing can easily be parallelized on a multi-processor system.

5. Conclusions

In this paper, we proposed to use ensembles of models at the internal nodes of decision trees. In generating the model ensembles, multi-layer perceptron (MLP), linear multivariate perceptron and Fisher's linear discriminant models are considered. Diversity among the ensemble members is achieved by perturbing the design parameters of the models and the input attributes. In the proposed approach, instead of selecting the best model as in omnivariate trees, cooperation among models is allowed so as to improve the generalization ability. Instead of eliminating irrelevant features, the random subspace method is used, which has proven efficient in high-dimensional feature spaces. One of the main strengths of the proposed approach is that it uses the small number of training samples reaching nodes close to the leaves in an efficient way: the random subspace method relatively increases the number of training samples, so the modeling problems that occur at lower-level nodes due to the small sample size problem are reduced. Experiments performed on 20 datasets have shown that ensemble-based nodes generalize better than individual nodes. The main drawback of the proposed approach is the increased computational complexity: instead of one model evaluation, E models are evaluated at each node.

Model selection is an important problem in omnivariate decision trees and hence should be done carefully. This is mainly because more complex models may provide better accuracies on their training data.

Table 8. Performance of omnivariate decision trees using training-set and cross-validation-based accuracies as model selection criteria. The average numbers of selected models are also provided.

               Omni (training)                             Omni (cross-validation)
Dataset        Accuracy        LMP     FLD    MLP          Accuracy        LMP    FLD    MLP
credit_g       69.92 ± 2.06    0.3     0.0    3.6          74.68 ± 2.20    0.1    1.1    0.6
ionosphere     86.74 ± 3.23    1.3     0.0    0.6          90.97 ± 2.74    0.3    0.1    1.5
pima           73.93 ± 2.26    0.7     0.2    4.7          76.67 ± 1.15    0.1    0.8    0.7
sonar          80.29 ± 2.75    0.0     0.0    1.0          78.16 ± 4.15    0.3    0.0    1.1
vote           92.44 ± 1.54    0.3     0.0    0.7          92.81 ± 1.92    1.3    1.1    0.6
wdbc           97.04 ± 0.71    0.0     0.0    1.5          96.65 ± 1.36    0.1    0.0    1.0
phoneme        81.15 ± 1.37    0.7     0.1    4.9          80.22 ± 1.38    0.3    0.3    2.2
cancer         95.87 ± 1.14    0.7     0.0    1.6          96.31 ± 1.05    0.4    0.4    0.9
heart          75.56 ± 2.56    0.0     0.0    2.3          81.92 ± 3.56    0.1    1.0    0.3
australian     81.83 ± 1.58    0.1     0.0    2.0          85.90 ± 1.03    0.8    2.3    0.5
dermatology    95.00 ± 2.28    4.1     0.4    0.5          95.38 ± 1.49    2.4    2.2    0.8
waveform21     82.91 ± 0.98    5.5     0.0    7.7          85.09 ± 0.42    0.1    0.9    1.3
waveform40     82.17 ± 1.04    7.3     0.0    5.8          84.08 ± 0.48    0.3    0.5    1.7
wine           96.70 ± 1.89    0.0     1.2    0.8          97.39 ± 1.42    0.0    1.4    0.6
image          94.62 ± 1.04    4.1     0.7    4.9          93.83 ± 0.98    3.2    2.2    3.9
iris           95.73 ± 2.42    1.5     0.1    0.6          96.27 ± 1.51    0.9    0.9    0.2
ecoli          82.05 ± 1.55    5.8     0.4    6.0          82.17 ± 2.31    5.8    4.2    2.7
cmc            49.03 ± 1.59    4.2     3.6    18.7         49.47 ± 2.34    0.0    0.6    1.1
dna            89.44 ± 0.77    2.0     0.0    0.0          89.49 ± 0.59    2.2    0.1    0.0
monkey1        99.96 ± 0.11    0.0     0.0    1.0          99.50 ± 0.70    0.0    0.0    1.0

Σ                              38.60   6.70   68.90                        18.70  20.10  22.70
%                              33.80   5.87   60.33                        30.41  32.68  36.91

Table 9. Pairwise comparison of the different methods in terms of the number of datasets on which the method in the ith row is statistically significantly better than the method in the jth column.

Method         ALLperf   ALLcv   Omni (train)   Omni (cv)
ALLperf        —         6       14             10
ALLcv          2         —       13             9
Omni (train)   3         3       —              1
Omni (cv)      4         4       7              —

However, this may be a consequence of overfitting, and the resulting tree may then have low generalization ability. In order to avoid this, computationally demanding statistical tests may be necessary; this is also verified by our simulations. In the proposed technique, a performance-driven selection is applied, and the selected members, having different strengths and weaknesses due to their dissimilar formalisms, are shown to provide good generalization capability. It is also shown that the use of multiple model classes may provide better accuracies than a single model class: in particular, ALLperf provided better accuracies than MLPens for eight datasets. The model classes considered in this study are among those that are already used individually in decision trees. Taking into account that ensembles of unstable models are more successful, the investigation of other sets of such model classes can be considered a future research direction.

References

[1] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (2) (1986) 81–106.
[2] S.R. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, IEEE Trans. Syst. Man Cybern. 21 (3) (1991) 660–674.
[3] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley, New York, 2001.
[4] H. Guo, S.B. Gelfand, Classification trees with neural-network feature extraction, IEEE Trans. Neural Networks 3 (1992) 923–933.
[5] C.E. Brodley, P.E. Utgoff, Multivariate decision trees, Mach. Learn. 19 (2) (1995) 45–77.
[6] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
[7] S.K. Murthy, S. Kasif, S. Salzberg, A system for induction of oblique decision trees, J. Artif. Intell. Res. 2 (1994) 1–32.
[8] P.E. Utgoff, Perceptron trees: a case study in the hybrid concept representations, Connect. Sci. 1 (1989) 377–391.
[9] K.C. You, K.S. Fu, An approach to the design of linear binary tree classifier, in: Proceedings of the Third Symposium on Machine Processing of Remotely Sensed Data, Purdue University, West Lafayette, 1976.
[10] O.T. Yıldız, E. Alpaydın, Linear discriminant trees, Int. J. Pattern Recognition 19 (3) (2005) 323–353.
[11] O.T. Yıldız, E. Alpaydın, Omnivariate decision trees, IEEE Trans. Neural Networks 12 (6) (2001) 1539–1546.
[12] E. Alpaydın, Combined 5 × 2 cv F test for comparing supervised classification learning algorithms, Neural Comput. 11 (1999) 1885–1892.
[13] C.E. Brodley, Dynamic automatic model selection, Technical Report 9230, University of Massachusetts, February 1992.
[14] P.E. Utgoff, C.E. Brodley, An incremental method for finding multivariate splits for decision trees, in: Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, Morgan Kaufmann, Los Altos, CA, 1990, pp. 58–65.
[15] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 832–844.
[16] L. Rokach, O. Maimon, Top-down induction of decision trees classifiers—a survey, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 35 (4) (2005) 476–487.
[17] E. Alpaydın, Introduction to Machine Learning, MIT Press, Cambridge, MA, 2004.
[18] F. Esposito, D. Malerba, G. Semeraro, A comparative analysis of methods for pruning decision trees, IEEE Trans. Pattern Anal. Mach. Intell. 19 (5) (1997) 476–491.
[19] R.P.W. Duin, PRTOOLS (version 4.0), A Matlab Toolbox for Pattern Recognition, Pattern Recognition Group, Delft University, Netherlands, 2004.
[20] S. Geman, E. Bienenstock, R. Doursat, Neural networks and the bias/variance dilemma, Neural Comput. 4 (1992) 1–58.
[21] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, New York, 2004.
[22] A. Krogh, J. Vedelsby, Neural network ensembles, cross validation, and active learning, in: G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, MIT Press, Cambridge, MA, 1995, pp. 231–238.
[23] G. Brown, Diversity in Neural Network Ensembles, Ph.D. Thesis, School of Computer Science, University of Birmingham, 2004.
[24] D. Opitz, R. Maclin, Popular ensemble methods: an empirical study, J. Artif. Intell. Res. 11 (1999) 169–198.
[25] J. Kittler, et al., On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[26] H. Altınçay, M. Demirekler, An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification, Speech Commun. 30 (4) (2000) 255–272.
[27] C.J. Whitaker, L.I. Kuncheva, Examining the relationship between majority vote accuracy and diversity in bagging and boosting, Technical Report, School of Informatics, University of Wales, Bangor, 2003.
[28] G. Zenobi, P. Cunningham, Using diversity in preparing ensembles of classifiers based on different feature subsets to minimize generalization error, in: L.D. Raedt, P.A. Flach (Eds.), Proceedings of the 12th Conference on Machine Learning, Lecture Notes in Computer Science, vol. 2167, 2001, pp. 576–587.
[29] P. Melville, R.J. Mooney, Constructing diverse classifier ensembles using artificial training examples, in: Proceedings of the IJCAI, 2003, pp. 505–510.
[30] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[31] S.I. Gallant, Perceptron-based learning algorithms, IEEE Trans. Neural Networks 1 (2) (1990) 179–191.
[32] M. Skurichina, R.P.W. Duin, Bagging, boosting and the random subspace method for linear classifiers, Pattern Anal. Appl. 5 (2002) 121–135.
[33] M. Skurichina, R.P.W. Duin, Bagging and the random subspace method for redundant feature subsets, in: J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, Proceedings of the Second International Workshop, MCS 2001, Lecture Notes in Computer Science, Springer, Berlin, 2002, pp. 1–10.
[34] Z. Zhou, Y. Yu, Ensembling local learners through multimodal perturbation, IEEE Trans. Syst. Man Cybern. Part B Cybern. 35 (4) (2005) 725–735.
[35] D. Ruta, B. Gabrys, Classifier selection for majority voting, Inf. Fusion J. 6 (2005) 63–81.
[36] Y. Li, M. Dong, R. Kothari, Classifiability-based omnivariate decision trees, IEEE Trans. Neural Networks 16 (6) (2005) 1547–1560.
[37] C. Blake, C. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/mlearn/mlrepository.html, Department of Information and Computer Sciences, University of California, Irvine, 1998.
[38] G.P. Zhang, Neural networks for classification: a survey, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 30 (4) (2000) 451–462.
[39] G. Langer, U. Parlitz, Modeling parameter dependence from time series, Phys. Rev. E 70 (056217) (2004).
[40] A. Tsymbal, M. Pechenizkiy, P. Cunningham, Sequential genetic search for ensemble feature selection, Technical Report, Trinity College Dublin, Computer Science Department, May 2005.
[41] J.V. Hansen, A. Krogh, A general method for combining predictors tested on protein secondary structure prediction, in: H. Malmgren, M. Borga, L. Niklasson (Eds.), Proceedings of Artificial Neural Networks in Medicine and Biology, Springer, Berlin, 2000, pp. 259–264.
[42] R. Anand, K.G. Mehrotra, C.K. Mohan, S. Ranka, An improved algorithm for neural network classification of imbalanced training sets, IEEE Trans. Neural Networks 4 (6) (1993) 962–969.

About the Author—HAKAN ALTINÇAY was born in Cyprus in 1972. He received his B.S. (with High Honors), M.S. and Ph.D. degrees from the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey. He worked as a research assistant in the Speech Processing Laboratory at Middle East Technical University between February 1996 and February 2000. He is currently an Associate Professor in the Department of Computer Engineering, Eastern Mediterranean University, Northern Cyprus. His main areas of interest include classifier ensembles, pattern recognition and speaker recognition.