Applied Soft Computing 68 (2018) 626–635
Multi-Manifold based Rotation Forest for classification
M. Khamar, M. Eftekhari*
Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
Article info
Article history: Received 13 July 2017; Received in revised form 1 April 2018; Accepted 18 April 2018.
Keywords: Rotation Forest; Ensemble learning; Manifold learning
Abstract
Rotation Forest (RF) is a powerful ensemble classifier which has attracted substantial attention due to its performance. The RF algorithm uses Principal Component Analysis (PCA) to construct the rotation matrix and extract new features. In this paper, three well-known manifold learning techniques are incorporated alongside PCA for feature extraction. The resulting RF algorithm is called Multi-Manifold RF (MMRF), and several experiments are conducted in the present study in order to evaluate its performance. The results reported for nineteen datasets show the high efficiency of MMRF compared to fourteen state-of-the-art ensemble methods in terms of classification accuracy and computational effort. Furthermore, two non-parametric statistical tests (Friedman and Wilcoxon) are carried out to compare the average classification accuracies of MMRF with those of the other methods. The experimental results demonstrate that MMRF outperforms twelve of these methods, while there is no significant difference between MMRF and the other two powerful ensemble-based methods, namely SES-NSGAII and IDES-P. © 2018 Elsevier B.V. All rights reserved.
1. Introduction

Ensemble methods are learning algorithms which leverage a set of classifiers [1] rather than a single one. These classifiers, called base models, are trained to jointly solve one specific classification task [2]. In ensemble methods, the final classification label for a sample is usually determined by applying a voting scheme that aggregates the individual votes of the base models [3]. In practice, ensemble learning is usually more accurate than individual base models. An ensemble-based model comprises three parts: a sample selection strategy, training the base classifiers to compose the Base Classifier Pool (BCP), and combining the classifiers of the BCP [2,4]. Examples of ensemble-based algorithms are bagging, boosting and RF [5]. Generally, Ensemble Learning Systems (ELSs) are divided into two categories: Static Classifier Ensemble (SCE) and Dynamic Classifier Ensemble (DCE). Within the SCE approach, a fixed ensemble scheme, learned during the training phase, is utilized for all the test samples. There are three types of SCE methods: Classifier Fusion (CF), Static Classifier Selection (SCS) [6,7], and Static Ensemble Selection (SES) [4,8]. On the other hand, the basic idea in DCE is to estimate the accuracy of each classifier in a local region of the feature space around a given test sample and to select one (or
more) classifier(s) with the highest local accuracy in order to classify that test sample [9]. The DCE approaches are divided into two strategies: Dynamic Classifier Selection (DCS) [9–12] and Dynamic Ensemble Selection (DES) [9]. One objective behind embedding the data into a low-dimensional space is to preserve the local structure of the projected data. Therefore, the projection may reduce the effects of noise and outliers [13]. In order to project data into a lower-dimensional space, various mapping methods have been devised, such as PCA, LDA, LLE, and LEM (described in Section 3) [13]. The RF algorithm is a method for generating classifier ensembles on the basis of feature extraction. To construct training data for the base classifiers (the base classifier is a decision tree, DT), the feature set is randomly split into K subsets, and PCA is applied to each subset. The training data are projected onto the eigenvectors of the covariance matrix, and the models are then constructed from the new samples [13,15]. The main contribution of this paper is to extend the RF algorithm such that three well-known data transformation techniques, namely Local Linear Embedding (LLE), Laplacian Eigen Maps (LEM) and Linear Discriminant Analysis (LDA), are utilized for feature extraction alongside PCA. Accordingly, ideas from manifold learning are used within RF for projecting the feature subsets into new spaces. The rest of the paper is organized as follows. A brief review of ensemble learning systems and feature extraction methods is given in Sections 2 and 3, respectively. In Section 4, a framework is described for the proposed method, which is based on
different manifold learning approaches. A variety of experiments on nineteen datasets are presented and discussed in Section 5. Finally, Section 6 draws the conclusion.
2. Related works

According to a review of contributions in the field of ELSs, there are five strategies for combining and generating the classifiers: CF [4,8], SCS [5,8,24], SES [7,8], DCS [10,11,25] and DES [8,9,24]. CF, SCS, and SES are static classifier ensembles, while DCS and DES are dynamic. These strategies are briefly explained below.

2.1. Classifier Fusion (CF)

CF utilizes all classifiers in the ensemble, trained in the training phase, for decision making; examples are Majority Voting (MV), Bagging, Boosting, Adaboost, and RF. One can refer to [4,8] for more details about these methods. Since the objective of the present study is to extend the original RF algorithm, some recently introduced extensions of RF are reviewed in the following. Lu et al. [16] proposed a cost-sensitive RF (C-RF) algorithm for gene expression data classification, with emphasis on misclassification cost, test cost, and rejection cost; these costs were embedded into the RF algorithm. This algorithm is explained in more detail in [15]. Using some heterogeneous classifiers, including DT, a heterogeneous RF algorithm has been proposed in the literature [17,18]. The Anticipative Hybrid Extreme RF (AHERF) is a method introduced in [17], constructed from a pool of DT, Extreme Learning Machine (ELM), Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Adaboost, Random Forest and Gaussian Naive Bayes (GNB) classifiers. An improved heterogeneous RF was proposed by Mousavi et al. for MicroRNA target prediction [18]. The self-training method is among the most common schemes of semi-supervised techniques. In this regard, a self-trained RF has been proposed by Fazakis et al. for semi-supervised learning [19] as a combination of the self-training scheme and RF. The steps of this method are as follows:
• The RF classifier is trained as the base classifier for selecting a subset of samples, called XMCP, with the highest confidence prediction.
• XMCP is eliminated from the unlabeled samples (U) and added to the labeled samples (L).
• In each iteration, a few samples per class are selected from U and added to L.
• The RF method is trained again on the L samples.
• At the end of each iteration, the RF trained on L is used for evaluating the test samples.
In order to address noise and outlier problems, a feature-weighted RF algorithm (FWRF) was proposed by Wang et al. [20] for investigating the interactions among proteins. Here, features with low weight values are removed, and the elimination of this extra information facilitates the use of informative features for predicting protein interactions. Xia et al. [21] presented a spectral and spatial Rotation Forest (SSRF) which contains the following steps:
• Pixels are smoothed by multi-scale (MS) spatial weighted mean filtering.
• The spectral–spatial data transformation is employed in the RF.
• Classification results are obtained through majority voting.
Su et al. [22] have improved RF based on the Hellinger Distance (HD) for the classification of highly imbalanced data. More recently, fuzzy-based ideas have been applied for improving RF classification performance on imbalanced datasets [23,24].

2.2. Static Classifier Selection (SCS)

SCS is a method which first selects the best classifier for each region of competence in the feature space during the training phase. After that, each test sample is classified by the classifier associated with its own region. In the same context, Kuncheva [6] has proposed a method for classifier combination based on a selection technique. Also, the Single Best (SB) approach selects the classifier in the ensemble with the lowest training error for all test samples [25].

2.3. Static Ensemble Selection (SES)

SES selects an optimal set of classifiers for all test samples. Here, Yang [8] proposes two methods: classifiers selected based on accuracy (SA), and classifiers selected based on accuracy and diversity (SAD). The first method (SA) selects the 75% of classifiers with the lowest error in the validation phase for all test samples, while the latter (i.e. SAD) first selects the 90% of classifiers with the lowest error in the validation phase; eventually, the 75% most accurate and diverse classifiers are selected.

2.4. Dynamic Classifier Selection (DCS)

The DCS method selects, for each test sample, the single classifier with the highest accuracy estimated on the validation set [26]. Two such methods are DCS-Local Accuracy (DCS-LA) [10] and DCS-Multiple Classifier Behavior (DCS-MCB) [11]. The value of "k" adapts dynamically in these two methods.
• DCS-LA: The accuracy of each classifier is estimated in a local region defined as the k-nearest neighbors of the test instance taken from the validation set. The classifier with the highest value of this local accuracy is selected for classifying the test sample [4,10].
• DCS-MCB: DCS-MCB is a method proposed by Giacinto and Roli [11] based on the concepts of classifier local accuracy (CLA) and MCB. This method first estimates the accuracy of each classifier in a local region of the feature space surrounding a test instance, and then selects the classifier with the highest value of this local accuracy to classify the test instance. The algorithm includes the following parts:
• First part: the k-nearest neighbors of each test instance in the training data are determined using MCBs (the MCB is defined as the vector of class labels assigned to the test instance x by the classifiers in the BCP).
• Second part: this strategy obtains the class labels assigned to the test instance x and to its k-nearest neighbors by all classifiers in the BCP. Next, the similarities between MCBs are computed using the Hamming distance, and the k-nearest neighbors of the test instance whose similarities exceed a similarity threshold are selected [4,11].

2.5. Dynamic Ensemble Selection (DES)

The DES approaches utilize an optimal classifier ensemble for classifying each test instance and achieve better performance than single classifiers. Several DES approaches have recently been proposed [4], including DES-Knora-Eliminate (DES-KE) [4,9], DES-Performance (DES-P), DES-Kullback–Leibler (DES-KL) and
Group Method Data Handling (GMDH)-based DES (GDES) [4,27]. These methods are briefly explained below.
• DES-KE

In this approach, an optimal subset of classifiers is selected for classification. The DES-KE method first finds the k-nearest neighbors of a test instance in the validation set. Then, it selects the subset of classifiers that correctly classify all the k-nearest neighbors of the test instance. The number of neighbors, k, is adaptive for each test instance. Combining the classifiers in this approach is based on majority voting [4,9].

• DES-KL and DES-P

The DES-KL and DES-P methods apply a competence measure C(∅|x) for selecting the classifiers, where ∅ denotes a classifier [25]. The classifiers whose competence values are greater than 0 are selected; in case no such classifiers are available, random classifiers are used. P(∅|x) denotes the accuracy of classifier ∅, estimated from the weighted k-nearest neighbors of each test instance in the validation set. The competence measure used in the DES-P approach is defined as follows [4,25]:

C(∅|x) = P(∅|x) − 1/M,    (1)

where M is the number of class labels and 1/M is the performance of a random classifier. The competence measure applied in DES-KL for a classifier is [26]:

C(∅|x) = Σ_{y∈V} Csrc(∅|y) · exp(−d(x, y)²),    (2)

In Eq. (2), Csrc(∅|y) is computed based on the Kullback–Leibler (KL) divergence for each instance y in the validation set V, and d(x, y) is the distance between the validation instance y and the test instance x, computed as the Euclidean distance [4,25] (a minimal code sketch of these two competence measures is given at the end of this section).

• GDES

The competence measure in the GDES method is based on GMDH in order to select a set of classifiers [4,27]. In this method, each test pattern x is classified using the following steps:
1. Find the k-nearest neighbors of x in the validation set to form a local region of competence Sk.
2. The initial models (fed into the network as the input layer) are given by the classification results over Sk of all classifiers in the BCP.
3. Obtain new candidate models as a new layer by combining pairs of models from the previous layer.
4. Estimate the parameters of each candidate model by least squares.
5. Compute the symmetric regularity factor as the external criterion of the candidate models for choosing the efficient models that are forwarded to the next layer.
6. Repeat steps 3–5 until the optimal-complexity classifier ensemble for x is found by the termination principle (introduced by the theory of optimal complexity): "when the model complexity increases, the value of the external criterion will first reduce and then soar, and the global minimum corresponds to the optimal complexity model" [4,27].

2.6. SES-NSGAII and IDES-P approaches

• SES-NSGAII

This method searches for the classifiers and their combiner over the training samples based on the accuracy and diversity of the ensemble classifiers. NSGAII is an efficient multi-objective evolutionary algorithm which is used for finding a trade-off between two conflicting objectives, namely accuracy and diversity. The SES-NSGAII algorithm includes five steps as follows [4]:
• Train the base classifiers.
• Run the NSGAII algorithm (each chromosome encodes a subset of the trained classifiers together with a combiner, and is evaluated based on the accuracy and diversity of the classifiers).
• Select the best Pareto optimal front among the obtained Pareto optimal fronts.
• Evaluate the solutions of the best Pareto optimal front.
• Return the best solution.

• IDES-P

An improvement of the DES-P approach, called IDES-P, has been introduced which selects the classifiers and their combiner by SES-NSGAII [4]. This algorithm is explained in more detail in [4].
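To make the selection rule in Eqs. (1) and (2) concrete, the following minimal Python sketch computes the two competence measures for a single classifier. The function names, the selection helper, and the `c_src` argument (standing in for the KL-based source competence of [25]) are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of the DES-P and DES-KL competence measures from Eqs. (1)-(2).
# All names are illustrative; the KL-based source competence of [25] is abstracted
# behind the `c_src` argument, which the caller must supply.
import numpy as np

def desp_competence(p_correct, n_classes):
    """Eq. (1): C(phi|x) = P(phi|x) - 1/M.

    p_correct : estimated accuracy of classifier phi around the test instance
                (e.g. weighted accuracy over its k nearest validation neighbours).
    n_classes : M, the number of class labels (1/M = accuracy of random guessing).
    """
    return p_correct - 1.0 / n_classes

def deskl_competence(x, val_X, c_src):
    """Eq. (2): C(phi|x) = sum_{y in V} C_src(phi|y) * exp(-d(x, y)^2).

    x     : test instance, shape (d,)
    val_X : validation instances V, shape (n_val, d)
    c_src : source competence of classifier phi at each validation instance,
            shape (n_val,) (computed from the KL divergence in [25]).
    """
    d2 = np.sum((val_X - x) ** 2, axis=1)   # squared Euclidean distances d(x, y)^2
    return float(np.sum(c_src * np.exp(-d2)))

def select_classifiers(competences):
    """Keep classifiers with positive competence (shared by DES-P and DES-KL)."""
    selected = [i for i, c in enumerate(competences) if c > 0]
    return selected  # if empty, the caller falls back to random classifiers
```

The selected classifiers would then be combined by majority voting, as described for the DES approaches above.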
3. Supervised and unsupervised projection methods

There are several supervised and unsupervised data transformation methods in the literature, four of which are explained below.
3.1. Principal Component Analysis (PCA)

Consider the training input part of a classification dataset D ∈ R^n × R^(d+1) to be X ∈ R^n × R^d, where n is the number of samples, d is the number of features, and the last column of D contains the class labels. PCA is applied to the covariance matrix of X, and the eigenvectors and their corresponding eigenvalues are extracted. New data are obtained by projecting the original data onto the eigenvectors [13,15]. The coefficients of the principal components are the eigenvectors (v1, v2, ..., vd), obtained from the eigen-decomposition of the covariance matrix of X; the size of each eigenvector is d × 1. If all the eigenvectors are appended into a matrix, this matrix can be written as follows [13,15]:

V_PCA = [v1, v2, ..., vd],    (3)

where the size of V_PCA is d × d. In the RF algorithm, after some operations, the matrix X is converted into submatrices X'_j, j = 1, 2, ..., K, each of size m × d' with m ≤ n and d' ≤ d. Then, eigen-decomposition is applied to the covariance matrix of each X'_j to obtain V^j_PCA = [v1, v2, ..., vd'], where each V^j_PCA is of size d' × d'. Afterwards, the V^j_PCA (j = 1, 2, ..., K) matrices are arranged
diagonally in a block matrix called the rotation matrix. The rotation matrix R_PCA is constructed as follows:

R_PCA = diag(V^1_PCA, V^2_PCA, ..., V^K_PCA),

i.e., a block-diagonal matrix whose j-th diagonal block is V^j_PCA and whose off-diagonal blocks are zero. Then, by re-arranging the rows of R_PCA (so that they align with the original order of the features), the matrix R'_PCA is obtained. The size of both R_PCA and R'_PCA is d × d, and the new data fed into the classifiers are obtained as follows:

X_new = X R'_PCA.
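As an illustration of this construction, the following NumPy sketch builds a PCA-based rotation matrix for one base classifier. The random feature split, the function names, and the omission of the class-removal and bootstrap steps of the full RF procedure are simplifying assumptions, not the authors' implementation.

```python
# A minimal NumPy sketch of the PCA-based rotation matrix of Section 3.1:
# split the features into K subsets, run PCA per subset, and assemble the
# per-subset eigenvector matrices into one d x d rotation matrix.
import numpy as np

def pca_rotation_matrix(X, n_subsets, rng):
    n, d = X.shape
    perm = rng.permutation(d)                  # random split of the d features
    subsets = np.array_split(perm, n_subsets)  # K feature subsets (>= 2 features each assumed)
    R = np.zeros((d, d))
    for cols in subsets:
        Xj = X[:, cols]                        # X'_j: samples restricted to subset j
        C = np.cov(Xj, rowvar=False)           # covariance matrix of X'_j
        _, V = np.linalg.eigh(C)               # V^j_PCA: eigenvectors (d' x d')
        # Place V^j_PCA as a diagonal block indexed by its own features, which
        # also keeps each row aligned with its original feature index
        # (the role of the re-arranged matrix R'_PCA).
        R[np.ix_(cols, cols)] = V
    return R                                   # R'_PCA, of size d x d

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
R = pca_rotation_matrix(X, n_subsets=4, rng=rng)
X_new = X @ R                                  # X_new = X R'_PCA
```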
3.2. Linear Discriminant Analysis (LDA)

In the LDA method, the data are projected in such a way that the ratio of between-class variance to within-class variance is maximized; W_opt is the maximizer of the following criterion [28,29]:

W_opt = argmax_W trace(W^T S_b W) / trace(W^T S_w W),    (4)

in Eq. (4), S_w is the within-class covariance matrix and S_b is the between-class covariance matrix; S_w and S_b are explained further in [28,29]. The model below [28–30] is another way of maximizing the between-class variance while minimizing the within-class variance:

W_opt = argmax_W trace(W^T S_b W) − λ trace(W^T S_w W),   λ > 0,    (5)

A benefit of the above model is that no matrix inverse (of S_w) needs to be computed. Nevertheless, choosing the value of the weight λ in this model is a challenging task. This criterion is known as the maximum margin criterion [30] when λ = 1. In Eqs. (4) and (5), the matrix W_opt consists of the eigenvectors of S_b/S_w or S_b − S_w corresponding to the largest eigenvalues; the size of each eigenvector is d × 1. The eigenvectors (v1, v2, ..., vd) are obtained from the eigen-decomposition of S_b/S_w or S_b − S_w. If all eigenvectors are appended into a matrix, this matrix can be represented as follows [13,15,28–30]:

V_LDA = [v1, v2, ..., vd],    (6)

where the size of V_LDA is d × d. Following the description given above for PCA, the corresponding rotation matrix of LDA is constructed block-diagonally as R_LDA = diag(V^1_LDA, V^2_LDA, ..., V^K_LDA). Then, by re-arranging the rows of R_LDA, the matrix R'_LDA is obtained. The size of both R_LDA and R'_LDA is d × d, and the new data fed into the classifiers are obtained as follows:

X_new = X R'_LDA.

3.3. Laplacian Eigen Maps (LEM)

One of the objectives when transforming data from the original space into another space is to preserve local neighborhood information. In the LEM method, the aim is to attain a solution based on the geometric structure of the manifold [14,31]. This method includes three steps:

Step 1: The neighborhood graph is built by finding the k-nearest neighbors of each point, where k is a user-defined parameter (k is set to 5 in our proposed method) and neighbors are selected based on the Euclidean distance [14,31,32].

Step 2: A weight S_ij is assigned to each edge e_ij according to the distance between x_i and x_j (i.e. two rows of matrix X). For nodes close to each other, S_ij is close to one, whereas for nodes that are not connected, S_ij is zero. σ is a user-defined parameter whose value is usually one (σ = 1 in our proposed method). S_ij is defined as follows [14]:

S_ij = exp(−||x_i − x_j||² / σ)   if x_i and x_j are neighbors,
S_ij = 0                          otherwise,    (7)

where S is of dimension n × n, n is the number of samples, and S is a symmetric and sparse matrix [30].

Step 3: The aim of the projection is to compute y_i such that the closeness between nodes (y_i and y_j) is preserved after mapping. The objective function to be minimized for satisfying this closeness is [14,31,32]:

E_L = Σ_{i=1}^{n} Σ_{j=1}^{n} (y_i − y_j)² S_ij = 2 Y^T L Y,    (8)

where D is the diagonal matrix with D_ii = Σ_j S_ij (i = 1, 2, ..., n) and L = D − S is the Laplacian matrix. In this method, the solution is provided by performing the generalized eigen-decomposition of L [14,31]; Y is given by the eigenvectors of L corresponding to the smallest eigenvalues. In Eq. (8), XW is substituted for Y, and Eq. (9) is derived as follows [13,30,32]:

E_L = Σ_{i=1}^{n} Σ_{j=1}^{n} (y_i − y_j)² S_ij = 2 Y^T L Y = 2 (XW)^T L (XW),    (9)

E_L = 2 W^T X^T L X W = 2 W^T L' W,    (10)

In Eq. (10), the eigen-decomposition of the matrix L' yields the matrix W such that Y = XW is the projection of X onto the W vectors. This projection preserves the neighborhoods of the new nodes Y as well as those of X. The matrix L is sparse and positive semi-definite. Similar to the descriptions given above for PCA and LDA, the matrices V_LEM = [v1, v2, ..., vd'], R_LEM and R'_LEM can be prepared for use in RF.
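The sketch below illustrates the linearized LEM projection of Eqs. (7)–(10), assuming a dense implementation and a plain eigen-decomposition of L' = X^T L X; the variable names, the symmetrization step, and the default parameters are illustrative choices, not the authors' code.

```python
# A minimal NumPy sketch of the linearized LEM projection of Eqs. (7)-(10):
# k-NN graph with heat-kernel weights, Laplacian L = D - S, and eigenvectors of
# L' = X^T L X with the smallest eigenvalues as the projection directions W.
import numpy as np

def lem_projection(X, k=5, sigma=1.0, n_components=2):
    n, d = X.shape
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]                      # k nearest neighbours (skip self)
        S[i, nbrs] = np.exp(-D2[i, nbrs] / sigma)              # Eq. (7): heat-kernel weights
    S = np.maximum(S, S.T)                                     # symmetrize the neighbourhood graph
    L = np.diag(S.sum(axis=1)) - S                             # Laplacian L = D - S, Eq. (8)
    Lp = X.T @ L @ X                                           # L' of Eq. (10)
    eigval, eigvec = np.linalg.eigh(Lp)
    W = eigvec[:, :n_components]                               # directions with smallest eigenvalues
    return X @ W                                               # Y = X W

rng = np.random.default_rng(1)
Y = lem_projection(rng.normal(size=(60, 8)), k=5, sigma=1.0, n_components=2)
```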
3.4. Local Linear Embedding (LLE)

Similar to LEM, the aim of LLE is to preserve the local structure of the data after projection [13]. The LLE algorithm includes three steps:

Step 1: Find the k-nearest neighbors (k is set to 5 in our proposed method) for each node x_i, i = 1, 2, ..., n.

Step 2: Compute the matrix S of reconstruction weights from the neighborhood nodes. The objective is to minimize the error E_S as follows:

min E_S = Σ_{i=1}^{n} || x_i − Σ_{j=1}^{n} s_ij x_ij ||²,    (11)

where x_ij denotes the j-th neighbor of x_i and the sum of the s_ij over all neighbors of each node should be equal to one:

Σ_{j=1}^{n} s_ij = 1,    (12)

and the weights are zero for nodes that are not neighbors of each other [13,14].

Step 3: Use the weights obtained from the previous step to find the best projections y_i of the x_i. Therefore, a second error function must be minimized:

E_Y = Σ_{i=1}^{n} || y_i − Σ_{j=1}^{n} s_ij y_ij ||²,    (13)

min E_Y = Σ_{i=1}^{n} || y_i − Σ_{j=1}^{n} s_ij y_ij ||² = 2 Y^T M Y,    (14)

in Eq. (13), y_i is the projection of node x_i. To solve Eq. (14), the matrix M is obtained using the formula below:

M = (I − S)^T (I − S),    (15)

Since Y is a projection of X onto the W vectors (i.e. Y = XW), XW is substituted for Y in Eq. (14) as follows [13,14,32]:

min E_Y = Σ_{i=1}^{n} || y_i − Σ_{j=1}^{n} s_ij y_ij ||² = 2 Y^T M Y = 2 (XW)^T M (XW),    (16)

min E_W = 2 W^T X^T M X W = 2 W^T M' W,    (17)

In Eq. (17), the eigen-decomposition of the matrix M' leads to the matrix W (i.e. the matrix of eigenvectors). The matrix M' is sparse and positive semi-definite. As described above for PCA, LDA, and LEM, the matrices V_LLE = [v1, v2, ..., vd'], R_LLE and R'_LLE can be prepared for utilization in RF.
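Analogously, the sketch below illustrates the linearized LLE projection of Eqs. (11)–(17). The small regularization term added to the local Gram matrix is a common numerical safeguard and, like the function name and defaults, is an assumption rather than part of the original method.

```python
# A minimal NumPy sketch of the linearized LLE projection of Eqs. (11)-(17):
# reconstruction weights S with rows summing to one, M = (I - S)^T (I - S), and
# eigenvectors of M' = X^T M X with the smallest eigenvalues as the directions W.
import numpy as np

def lle_projection(X, k=5, n_components=2, reg=1e-3):
    n, d = X.shape
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]           # k nearest neighbours of x_i
        Z = X[nbrs] - X[i]                          # neighbours centred on x_i
        C = Z @ Z.T                                 # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)          # regularize the Eq. (11) least squares
        w = np.linalg.solve(C, np.ones(k))
        S[i, nbrs] = w / w.sum()                    # enforce Eq. (12): sum_j s_ij = 1
    M = (np.eye(n) - S).T @ (np.eye(n) - S)         # Eq. (15)
    Mp = X.T @ M @ X                                # M' of Eq. (17)
    eigval, eigvec = np.linalg.eigh(Mp)
    W = eigvec[:, :n_components]                    # smallest-eigenvalue directions
    return X @ W                                    # Y = X W

rng = np.random.default_rng(2)
Y = lle_projection(rng.normal(size=(60, 8)), k=5, n_components=2)
```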
Fig. 1. The framework of the proposed MMRF algorithm.
Table 1
Datasets specifications.

Row | Datasets | Number of Samples | Number of Features | Number of Classes | Source
D1 | Wine | 178 | 13 | 3 | UCI
D2 | Iris | 150 | 4 | 3 | UCI
D3 | Glass | 214 | 9 | 6 | UCI
D4 | Breast cancer | 699 | 9 | 2 | UCI
D5 | Thyroid | 215 | 5 | 3 | LKC
D6 | Dermatology | 366 | 34 | 6 | UCI
D7 | Ecoli | 336 | 7 | 8 | UCI
D8 | Yeast | 1484 | 8 | 10 | UCI
D9 | Vowel | 990 | 10 | 11 | UCI
D10 | Segmentation | 2310 | 19 | 7 | UCI
D11 | OptDigits | 3823 | 64 | 10 | UCI
D12 | Page blocks | 5473 | 10 | 5 | UCI
D13 | Ionosphere | 351 | 34 | 2 | UCI
D14 | Laryngeal3 | 353 | 16 | 3 | LKC
D15 | Hill valley | 1212 | 100 | 2 | UCI
D16 | Smartphone | 5744 | 561 | 6 | UCI
D17 | MUSK | 6598 | 166 | 2 | UCI
D18 | DLBCL | 47 | 4026 | 2 | [35]
D19 | SRBCT | 83 | 2308 | 4 | [36]
4. The proposed method

Consider the input part of a classification dataset D_{n×(d+1)} to be X_{n×d}, where n is the number of samples, d is the number of features, and the last column of D contains the class labels. Suppose X = {x1, x2, ..., xn}, where each row xi of X is a vector of size 1 × d. Let Y = {y1, y2, ..., yn} be the last column of D (the class labels), so that yi is the class label of xi, and let CL = {CL1, CL2, CL3, ..., CLh}, where CLi is the i-th classifier and h is the number of classifiers. In this paper, a new RF-based ensemble learning method is proposed. An algorithmic framework of the proposed method is provided in Fig. 1, and its steps are described as follows (a minimal code sketch of these steps is given after the list):
• The input part of the dataset (i.e. X) is divided into two parts: training and testing data (Xtrain and Xtest).
• The features (F) are divided into K subsets; the length of each subset is d' = d/K.
• Xij is the part of the training data (Xtrain) containing the j-th subset of features for classifier CLi. For each Xij, a subset of samples is eliminated randomly. More specifically, if the number of classes in Xij is 5 or fewer, then 1/3 of the classes and their related samples are eliminated; if the number of classes is above 5, then 1/2 of the classes and their related samples are eliminated. Afterwards, bootstrap sampling is applied to the remaining training data (the bootstrap percentage is adaptive), and the new data set is named X'ij.
• Eigenvectors are obtained by eigen-decomposition of each matrix X'ij:
  For each X'ij (j = 1, ..., K), apply PCA on X'ij and obtain the matrix R'PCA.
  For each X'ij (j = 1, ..., K), apply LDA on X'ij and obtain the matrix R'LDA.
  For each X'ij (j = 1, ..., K), apply LEM on X'ij and obtain the matrix R'LEM.
  For each X'ij (j = 1, ..., K), apply LLE on X'ij and obtain the matrix R'LLE.
• The new training data are constructed as Xij_new(PCA) = Xtrain R'PCA, Xij_new(LDA) = Xtrain R'LDA, Xij_new(LEM) = Xtrain R'LEM, and Xij_new(LLE) = Xtrain R'LLE.
• Each classifier CLi is trained four times using Xi_new(PCA), Xi_new(LDA), Xi_new(LEM) and Xi_new(LLE). Therefore, 4h classifiers are trained, where h is the number of classifiers.
• The new testing data are constructed based on the R'PCA, R'LDA, R'LEM and R'LLE matrices, respectively.
• The classifiers are evaluated, and the overall accuracy is obtained by means of majority voting.
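The following sketch illustrates the training loop above using only the PCA-based rotation; each of the LDA, LEM, and LLE rotations would be plugged in analogously (see the sketches in Section 3), adding its own copy of every tree and giving the 4h classifiers described above. The helper names, the fixed bootstrap size, the simplified class-removal rule, and the assumption of non-negative integer class labels are illustrative choices, not the authors' implementation.

```python
# A minimal sketch of the MMRF training loop (PCA rotation only, for brevity).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_rotation(X_train, y_train, n_subsets, rng):
    """Build one rotation matrix R' (here: PCA-based) for one base classifier."""
    n, d = X_train.shape
    perm = rng.permutation(d)
    R = np.zeros((d, d))
    for cols in np.array_split(perm, n_subsets):  # each subset assumed to have >= 2 features
        # randomly drop a fraction of the classes, then bootstrap the remaining rows
        classes = np.unique(y_train)
        drop = max(1, len(classes) // 3 if len(classes) <= 5 else len(classes) // 2)
        kept = ~np.isin(y_train, rng.choice(classes, size=drop, replace=False))
        rows = rng.choice(np.where(kept)[0], size=max(int(kept.sum()), 2), replace=True)
        Xj = X_train[np.ix_(rows, cols)]
        _, V = np.linalg.eigh(np.cov(Xj, rowvar=False))
        R[np.ix_(cols, cols)] = V
    return R

def train_mmrf(X_train, y_train, n_trees=10, n_subsets=3, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_trees):
        R = build_rotation(X_train, y_train, n_subsets, rng)
        clf = DecisionTreeClassifier(random_state=0).fit(X_train @ R, y_train)
        ensemble.append((R, clf))
    return ensemble

def predict_mmrf(ensemble, X_test):
    # class labels assumed to be non-negative integers
    votes = np.array([clf.predict(X_test @ R) for R, clf in ensemble])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)  # majority vote
```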
Table 2
Used parameters in the experiments.

Method | σ | k | L
RF | – | – | 10
LLE | – | 5 | –
LEM | 1 | 5 | –
The base classifier is a decision tree. The framework of the proposed algorithm is depicted in Fig. 1.

5. Experimental results

5.1. Datasets and parameters

Nineteen different datasets are investigated in this paper. These benchmark datasets are selected from the UCI Machine Learning Repository (UCI) [33], the Ludmila Kuncheva Collection (LKC) [34] and microarray datasets [35,36]. The specifications of these datasets are presented in Table 1. All datasets are divided into training and testing sets by 5 × 2 cross-validation for the comparison of the proposed method (i.e. the MMRF) with the other methods [4]. The employed parameters of the proposed method are listed in Table 2.

5.2. Ensemble learning approaches used for comparison

In this study, the comparison is performed using PR-Tools, developed by Duin et al. in MATLAB [37]. Fourteen methods are chosen for comparison: two CF approaches (MV, RF [15,25]), one SCS approach (SB [25]), three SES approaches (SA, SAD [8], and SES-NSGAII [4]), three DCS approaches (DCS-LA [10], DCS-MCB [11], and DCS-MLA [12]), and five DES approaches (DES-KE [9], DES-KL [25], DES-P [25], IDES-P [4], and GDES [26,27]).

5.3. Comparing the proposed method with other methods

In this section, the accuracies of the MMRF method and the other methods are obtained on nineteen datasets. The accuracy reported for each dataset with MMRF is the average accuracy over 5 different runs obtained by 5 × 2 CV. The results are presented in Table 3. The average accuracy obtained by the MMRF algorithm is higher than those of most of the other methods, but as shown in Table 3, the average accuracies of SES-NSGAII and IDES-P over all datasets are almost similar to that of the MMRF method. In the SES-NSGAII and IDES-P, multi-objective evolutionary algorithms (e.g. NSGAII) are utilized for finding the best
Table 3
Comparison of classification accuracies for all ensemble approaches using 5 × 2 CV (rows: methods; columns: datasets D1–D19).

Method | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | D11 | D12 | D13 | D14 | D15 | D16 | D17 | D18 | D19
MMRF | 97 | 96.5 | 71.8 | 97.2 | 95.25 | 97.02 | 86.23 | 60.62 | 92.51 | 89.21 | 98 | 97.22 | 94.56 | 74.06 | 91.65 | 92.43 | 98.02 | 90.4 | 94.6
IDES-P | 99.11 | 97.33 | 71.9 | 97 | 97.22 | 96.71 | 88.98 | 60.11 | 95.6 | 96.9 | 97.83 | 97.2 | 94.59 | 75.35 | 86.02 | 91.24 | 97.48 | 91.38 | 92.48
SES-NSGAII | 99.11 | 97.33 | 72.34 | 96.77 | 97.22 | 97.43 | 88.43 | 59.98 | 94.83 | 95.56 | 97.59 | 97.03 | 92.88 | 75.07 | 86.42 | 89.24 | 97.92 | 91.3 | 92.4
SA | 97.02 | 96.67 | 64.72 | 96.42 | 95.35 | 96.03 | 85.61 | 58.75 | 78.85 | 91.06 | 87.32 | 95.95 | 86.32 | 72.25 | 56.07 | 73.37 | 96.97 | 87.75 | 34.94
SAD | 97.2 | 96.67 | 64.88 | 96.45 | 95.35 | 95.91 | 85.7 | 58.87 | 78.22 | 91.18 | 87.43 | 95.87 | 86.72 | 72.18 | 60.59 | 88.47 | 96.52 | 87.63 | 66.73
SB | 95.86 | 96.8 | 66.95 | 96.25 | 94.52 | 95.85 | 84.05 | 58.02 | 87.82 | 93.7 | 96.43 | 96.09 | 85.3 | 70.88 | 58.76 | 91.5 | 95.31 | 73.19 | 75.82
MV | 96.64 | 94.94 | 67.7 | 96.25 | 91.36 | 95.68 | 84.35 | 57.45 | 87.94 | 94.89 | 97.23 | 96.06 | 84.67 | 73.09 | 58.17 | 86.35 | 96.72 | 70.22 | 41.66
DCS-LA | 95.63 | 94.93 | 67.7 | 94.79 | 94.42 | 93.55 | 78.87 | 50.58 | 90.53 | 94.12 | 95.99 | 95.97 | 83.7 | 65.21 | 63.77 | 53.06 | 96.73 | 84.55 | 82.61
DCS-MCB | 95.39 | 95.07 | 68.71 | 94.74 | 94.42 | 93.6 | 78.93 | 50.78 | 88.99 | 94.36 | 95.76 | 96 | 84.22 | 65.1 | 64.12 | 54.02 | 97.02 | 83.22 | 84.05
DCS-MLA | 95.41 | 95.33 | 68.56 | 94.82 | 94.42 | 93.44 | 78.34 | 50.32 | 91.19 | 94.2 | 95.9 | 95.93 | 83.87 | 65.27 | 64.19 | 53.75 | 96.67 | 82.83 | 87.52
DES-KE | 96.85 | 95.87 | 67.91 | 94.94 | 94.7 | 94.65 | 79.29 | 50.21 | 92.2 | 95.22 | 97.31 | 96.02 | 83.99 | 65.1 | 54.23 | 89.04 | 96.42 | 70.98 | 54.23
DES-P | 97.64 | 95.33 | 69.11 | 96.25 | 95.08 | 95.68 | 84.17 | 55.53 | 92.81 | 95.6 | 97.51 | 96.04 | 86.72 | 70.48 | 56.22 | 90.01 | 96.52 | 73.64 | 57.12
DES-KL | 96.97 | 96.27 | 69.01 | 95.85 | 95.63 | 94.97 | 83.27 | 52.43 | 93.17 | 92.25 | 97.08 | 96.47 | 86.1 | 68.84 | 57.12 | 89.92 | 96.02 | 72 | 56.22
GDES | 95.02 | 96 | 66.01 | 95.75 | 96 | 92.89 | 85.96 | 54.25 | 87.47 | 83.12 | 96.86 | 96.03 | 83.25 | 67.1 | 50.84 | 78.11 | 89.10 | 57.46 | 42.61
RF | 96.2 | 95.15 | 70.00 | 96.2 | 93.97 | 96.21 | 84.19 | 58.8 | 88.74 | 87.54 | 97.03 | 96.52 | 93.03 | 72.02 | 90.03 | 88.6 | 96.75 | 82.56 | 86.62
Fig. 2. Comparison of the MMRF method with other methods based on average accuracy.
classifiers in ensemble design. The time complexity and running time of the SES-NSGAII and IDES-P methods are higher than those of the MMRF algorithm, since only one type of base classifier (decision tree C4.5) is used in MMRF, whereas the other two methods (i.e. SES-NSGAII and IDES-P) employ a combination of classifiers for the final decision (e.g. majority voting and other combination methods). Therefore, compared to SES-NSGAII and IDES-P, the MMRF is not a computationally prohibitive approach. Fig. 2 depicts the comparison of our proposed method with the other methods based on the average accuracies given in Table 3. The average accuracies of IDES-P and SES-NSGAII are slightly better than that of MMRF. Non-parametric statistical tests are considered in Section 5.5 below; accordingly, there is no significant difference between the MMRF approach and these two methods (i.e. SES-NSGAII and IDES-P).

5.4. Computational complexity comparisons

The SES-NSGAII is a multi-objective ensemble learning system. Obviously, evolutionary algorithms have higher complexity than algorithms without an optimization phase. The running time of any evolutionary algorithm is G × P × (complexity of the fitness function), where G and P are the number of generations and the population size, respectively.

• Comparison of the running time of SES-NSGAII with MMRF
In the SES-NSGAII method, the length of each chromosome is set to 49; in other words, 49 classifiers are trained for evaluating the fitness of each chromosome. The order of training several different classifiers on n data samples with d features is O(n × d × t + d³) in the worst case, where t = max(n, d) [4]. Therefore, the time complexity of evaluating each chromosome (i.e. the fitness function) is O(n × d × t + d³), and thereby O(G × P × (n × d × t + d³)) for SES-NSGAII. Meanwhile, the time complexity of the original RF method is O(n × d × t1 + d³), where t1 = min(n, d) [38,39]. Since the maximum number of features in Table 1 is 4026, the worst-case running time of SES-NSGAII is O(d^1.4 × (n × d × t + d³)), because G × P × (number of classifiers) is approximately equal to d^1.4 (where G and P are both 50). The time complexities of the LEM and LLE methods are O(d × t² + d³) [14,40]. The MMRF method uses PCA as well as three manifold learning methods, namely LDA, LEM, and LLE; consequently, the time complexity of MMRF is O(d × (d² + t²)).

• Comparison of the running time of IDES-P with MMRF

Generally, the time complexity of IDES-P is higher than that of SES-NSGAII since, in the IDES-P method, the best classifiers are first selected according to the best Pareto optimal set and are then utilized in the DES-P algorithm. Also, in the dynamic methods, the accuracy for each test sample is obtained according to the validation
Table 4
Comparison of running time for MMRF, SES-NSGAII, and IDES-P.

Method | Running time
MMRF | O(d³ + d·t²)
SES-NSGAII | O(d^1.4 × (d³ + d·t²))
IDES-P | O(d^1.4 × (d³ + d·t² + (ωn)²))
Table 5
Average ranking of the algorithms (Friedman).

Algorithm | Ranking
IDES-P | 1.8684
SES-NSGAII | 2.4474
MMRF | 3.2632
DES-P | 7.3158
RF | 7.4737
SAD | 8.1579
DES-KL | 8.4737
SB | 8.6842
SA | 8.7895
MV | 9.3684
DES-KE | 10.2368
DCS-MCB | 10.4474
DCS-MLA | 10.7632
DCS-LA | 11.1316
GDES | 11.5789
set, which uses the k-nearest neighbors (the time complexity of finding the nearest neighbors is n²). The competence measure, which has running time ω, is utilized in the DES-P method described in Section 2.5 [4]. The running times of the MMRF, IDES-P, and SES-NSGAII methods are shown in Table 4. In fact, the running time of SES-NSGAII is d^1.4 times worse than that of the MMRF method.

5.5. Non-parametric statistical tests

To evaluate whether there exists a significant difference between the MMRF approach and the other studied methods, the non-parametric statistical tests implemented in the KEEL software tool are used for analyzing the results [41]. In the first step, the Friedman test is utilized for performing the comparison [42]. Li [43] and Holm [44] are suitable post hoc procedures for the Friedman test. Table 5 shows the average ranking of each approach over all the datasets, obtained by the Friedman procedure. The p-values obtained by the Holm and Li procedures are provided in Table 6, where the control method is IDES-P. Holm's procedure rejects those hypotheses that have P_Holm ≤ 0.025, and α = 0.05 is considered as the level of significance in the experiments. Hence, the classification performance of the MMRF method is significantly better than those of SA, SB, MV,
Table 6
The results of the Holm and Li post hoc procedures (α = 0.05).

Algorithm | P | P_Holm | P_Li | Hypothesis
DCS-LA | 0 | 0 | 0 | Reject
DCS-MLA | 0 | 0 | 0 | Reject
DCS-MCB | 0 | 0 | 0 | Reject
GDES | 0 | 0 | 0 | Reject
DES-KE | 0 | 0 | 0 | Reject
MV | 0 | 0.000002 | 0.000001 | Reject
SB | 0.000002 | 0.000015 | 0.000006 | Reject
SA | 0.000003 | 0.000018 | 0.000008 | Reject
SAD | 0.000005 | 0.000032 | 0.000017 | Reject
DES-KL | 0.000015 | 0.000073 | 0.000047 | Reject
DES-P | 0.000112 | 0.000448 | 0.000361 | Reject
RF | 0.000174 | 0.000521 | 0.00056 | Reject
MMRF | 0.336423 | 0.672847 | 0.520345 | Not Reject
SES-NSGAII | 0.689884 | 0.684318 | 0.684318 | Not Reject
DES-KE, GDES, DCS-MCB, DCS-MLA, DCS-LA, SAD, DES-KL, DES-P and RF. Meanwhile, there are no statistically significant differences in classification performance between the IDES-P approach and the other two approaches (i.e. SES-NSGAII and MMRF), as highlighted in Table 6; however, the time complexity of MMRF is much less than those of SES-NSGAII and IDES-P, since the MMRF method employs only one type of classifier whereas the other two methods utilize a combination of classifiers. Also, Li's procedure rejects those hypotheses that have P_Li ≤ 0.016322, as shown in Table 6. In the second step, the Wilcoxon test is utilized for a pairwise comparison of the methods based on the accuracies presented in Table 3. The ranks of the Wilcoxon test among all methods are reported in Table 7 [45]. The values in each row of Table 7 show the positive ranks of the method in that row against the methods given in the columns, while the values in each column show the negative ranks of the method in that column against the methods given in the rows [46]. The higher the positive ranks in a row, the better the method in that row; likewise, the higher the negative ranks in a column, the worse the method in that column [46,47]. Furthermore, Table 8 summarizes the results of the pairwise comparisons of all methods: a plus indicates that the method in the row improves on the method in the column, and a star indicates that the method in the column improves on the method in the row. According to the positive and negative ranks, the Wilcoxon test yields a p-value for each comparison, as provided in Table 9. Those hypotheses that have p-value < 0.2 are rejected by the MMRF approach. According to the positive and negative ranks shown in Fig. 3, the MMRF rejects 12 methods, and there are no statistically significant differences between the MMRF and the other two top-ranked approaches (i.e. IDES-P and SES-NSGAII).
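For reproducing this kind of analysis outside KEEL, the following sketch shows how the Friedman test and the pairwise Wilcoxon signed-rank tests could be run with SciPy on an accuracy matrix shaped like Table 3; the placeholder data and method names are assumptions for illustration only, and the post hoc (Holm/Li) corrections are not included.

```python
# A minimal sketch of the statistical comparison of this section using SciPy:
# a Friedman test over per-dataset accuracies of all methods, then pairwise
# Wilcoxon signed-rank tests of the first method (e.g. MMRF) against the others.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon, rankdata

acc = np.random.default_rng(3).uniform(0.5, 1.0, size=(19, 15))  # placeholder accuracy matrix
methods = [f"M{i}" for i in range(acc.shape[1])]                  # placeholder method names

# Friedman test: do the methods differ overall across datasets?
stat, p = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
avg_rank = rankdata(-acc, axis=1).mean(axis=0)                    # average Friedman ranks (1 = best)
print(f"Friedman statistic={stat:.3f}, p-value={p:.4f}")
print("Average ranks:", np.round(avg_rank, 2))

# Pairwise Wilcoxon signed-rank tests of method 0 against each other method.
for j in range(1, acc.shape[1]):
    w_stat, w_p = wilcoxon(acc[:, 0], acc[:, j])
    print(f"{methods[0]} vs {methods[j]}: p={w_p:.4f}")
```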
Table 7
Wilcoxon test ranks for 15 methods; the values in the rows and in the columns show the positive and negative ranks, respectively.

Method | MMRF | IDES-P | SES-NSGAII | SA | SAD | SB | MV | DCS-LA | DCS-MCB | DCS-MLA | DES-KE | DES-P | DES-KL | GDES | RF
MMRF | – | 78 | 77 | 174 | 173 | 176 | 176 | 179 | 179 | 179 | 178 | 169 | 170 | 187 | 190
IDES-P | 112 | – | 122.5 | 190 | 190 | 185 | 190 | 190 | 190 | 190 | 190 | 190 | 190 | 190 | 176
SES-NSGAII | 113 | 48 | – | 190 | 190 | 182 | 190 | 190 | 190 | 190 | 190 | 185 | 187 | 190 | 175
SA | 16 | 0 | 0 | – | 61.5 | 80 | 81 | 104 | 101 | 105 | 101 | 78 | 83 | 118 | 69
SAD | 17 | 0 | 0 | 128.5 | – | 91 | 120.5 | 107 | 106 | 107 | 129 | 104.5 | 116 | 149 | 72
SB | 14 | 5 | 0 | 110 | 99 | – | 101 | 114 | 110 | 110 | 142 | 78 | 121 | 170 | 47
MV | 14 | 0 | 0 | 109 | 69.5 | 70 | – | 101 | 110 | 108 | 99 | 50 | 75 | 158 | 40
DCS-LA | 11 | 0 | 0 | 86 | 83 | 76 | 70 | – | 59.5 | 67.5 | 68 | 53 | 64 | 111 | 34
DCS-MCB | 11 | 0 | 0 | 89 | 84 | 80 | 80 | 111.5 | – | 104.5 | 72 | 55 | 66 | 111 | 29
DCS-MLA | 11 | 0 | 0 | 85 | 83 | 80 | 82 | 103.5 | 66.5 | – | 73 | 50 | 64 | 111 | 41
DES-KE | 12 | 0 | 0 | 89 | 61 | 48 | 91 | 122 | 99 | 117 | – | 6 | 24 | 140 | 47
DES-P | 21 | 0 | 5 | 112 | 85.5 | 93 | 139.5 | 137 | 135 | 121 | 184 | – | 146 | 174 | 71
DES-KL | 20 | 0 | 3 | 107 | 74 | 69 | 115 | 126 | 124 | 126 | 166 | 43.5 | – | 169 | 62
GDES | 3 | 0 | 0 | 72 | 41 | 20 | 32 | 79 | 79 | 79 | 50 | 16 | 21 | – | 19
RF | 0 | 14 | 15 | 121 | 118 | 143 | 150 | 156 | 161 | 149 | 143 | 119 | 127.5 | 171 | –
Table 8
Summary of the Wilcoxon test: the MMRF method rejected 12 methods. The upper triangular part reports results at significance level α = 0.9 and the lower triangular part at α = 0.95; a plus sign indicates that the method in the row improves on the method in the column, and a star indicates that the method in the column improves on the method in the row.
Fig. 3. Comparison of the MMRF method and the other methods according to Table 9.
Table 9
Comparison of MMRF with the other methods by the Wilcoxon test.

Methods | R+ | R− | Exact P-value | Asymptotic P-value
IDES-P | 78 | 112 | ≥ 0.2 | 1
SES-NSGAII | 77 | 113 | ≥ 0.2 | 1
SA | 174 | 16 | 6.446E−4 | 0.001378
SAD | 173 | 17 | 8.796E−4 | 0.001583
SB | 176 | 14 | 4.196E−4 | 0.001039
MV | 176 | 14 | 4.196E−4 | 0.001039
DCS-LA | 179 | 11 | 2.098E−4 | 0.000673
DCS-MCB | 179 | 11 | 2.098E−4 | 0.000673
DCS-MLA | 179 | 11 | 2.098E−4 | 0.000673
DES-KE | 178 | 12 | 2.67E−4 | 0.000779
DES-P | 169 | 21 | 1.6938E−3 | 0.002717
DES-KL | 170 | 20 | 1.4114E−3 | 0.002379
GDES | 187 | 3 | 1.9074E−5 | 0.000197
RF | 190 | 0 | 3.814E−6 | 0.000112
6. Concluding remarks

In this paper, the MMRF algorithm was developed and implemented based on manifold learning techniques as a supplement to PCA. Our method takes advantage of LLE, LEM, and LDA as manifold learning methods which project the original data into a new space in which the locality of the data is preserved. The proposed algorithm utilizes the idea of projection in LLE, LEM, and LDA, and different rotation matrices are constructed accordingly. The input data to the classifiers were prepared using these rotation matrices. The following advantages can be listed for the MMRF:
1. It can be adapted to different data types since it uses four projection techniques and consequently four rotation matrices. In other words, different types of data have different manifold structures, and these structures need different unfolding methods, which are provided by the four different rotation matrices.
2. The MMRF is a static ensemble approach whose performance is comparable to dynamic ensemble approaches. The MMRF was compared with some recent powerful ensemble methods (static and dynamic approaches, such as SES-NSGAII and IDES-P). The comparison results obtained through non-parametric statistical tests revealed that the MMRF outperformed twelve powerful methods from the literature in terms of classification accuracy. Moreover, although the average accuracies of SES-NSGAII and IDES-P (ranked first and second, above the MMRF, in the Friedman ranking) were slightly better than that of MMRF, the differences were not statistically significant. It can be inferred that the classification performance of MMRF is on par with the most efficient methods.
3. An analysis of time complexity was performed for the MMRF as well as for the other two powerful methods in the literature (i.e. SES-NSGAII and IDES-P). It is worth noting that the time complexity of MMRF was less than those of the above-mentioned methods, and approximately equal to that of the RF algorithm considered for comparison in this work.
For future studies, the MMRF can be further upgraded into a dynamic ensemble version in which the classifiers compete, in the test phase, for their presence in the classifier pool. It is conjectured that such a dynamic version will show outstanding performance. Future studies can also leverage other manifold learning methods and different classifiers so as to achieve better performance than the present version. Besides, combining different rotation matrices might be considered as another line of research for determining the final rotation matrix. Another suggestion is to exploit feature-partitioning methods before employing the manifold learning techniques. The last implication is to develop an algorithm for predicting a suitable manifold technique to be applied on each feature subset.

Acknowledgments

The authors would like to highly appreciate Dr. F. Saberi Movahed, Dr. M. Hosseinzadeh, and Dr. Z. Azimifar for language checking.

References

[1] T.G. Dietterich, Ensemble methods in machine learning, Mult. Classif. Syst. 1857 (2000) 1–15.
[2] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.
[3] R. Polikar, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (2006) 21–45.
[4] R. Mousavi, M. Eftekhari, A new ensemble learning methodology based on hybridization of classifier ensemble selection approaches, Appl. Soft Comput. 37 (2015) 652–666.
[5] C. Zhang, Y. Ma, Ensemble Machine Learning: Methods and Applications, Springer, 2012.
[6] L.I. Kuncheva, Clustering-and-selection model for classifier combination, in: Knowledge-Based Intell. Eng. Syst. Allied Technol. 2000. Proceedings. Fourth Int. Conf., IEEE, 2000, pp. 185–188.
[7] R. Liu, B. Yuan, Multiple classifiers combination by clustering and selection, Inf. Fusion 2 (2001) 163–168.
[8] L. Yang, Classifiers selection for ensemble learning based on accuracy and diversity, Procedia Eng. 15 (2011) 4266–4270.
[9] A.H.R. Ko, R. Sabourin, A.S. Britto Jr., From dynamic classifier selection to dynamic ensemble selection, Pattern Recognit. 41 (2008) 1718–1731.
[10] K. Woods, W.P. Kegelmeyer, K. Bowyer, Combination of multiple classifiers using local accuracy estimates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 405–410.
[11] G. Giacinto, F. Roli, Dynamic classifier selection based on multiple classifier behaviour, Pattern Recognit. 34 (2001) 1879–1881.
[12] P.C. Smits, Multiple classifier systems for supervised remote sensing image classification based on dynamic classifier selection, IEEE Trans. Geosci. Remote Sens. 40 (2002) 801–813.
[13] N. García-Pedrajas, J. Maudes-Raedo, C. García-Osorio, J.J. Rodríguez-Díez, Supervised subspace projections for constructing ensembles of classifiers, Inf. Sci. (Ny) 193 (2012) 1–21.
[14] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Fourth edition, 2009.
[15] J.J. Rodriguez, L.I. Kuncheva, C.J. Alonso, Rotation forest: a new classifier ensemble method, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 1619–1630.
[16] H. Lu, L. Yang, K. Yan, Y. Xue, Z. Gao, A cost-sensitive rotation forest algorithm for gene expression data classification, Neurocomputing 228 (2017) 270–276, http://dx.doi.org/10.1016/j.neucom.2016.09.077.
[17] B. Ayerdi, M. Graña, Anticipative hybrid extreme rotation forest, Procedia Comput. Sci. 80 (2016) 1671–1681.
[18] R. Mousavi, M. Eftekhari, M.G. Haghighi, A new approach to human microRNA target prediction using ensemble pruning and rotation forest, J. Bioinform. Comput. Biol. 13 (2015) 1550017.
[19] N. Fazakis, S. Karlos, S. Kotsiantis, K. Sgarbas, Self-trained rotation forest for semi-supervised learning, J. Intell. Fuzzy Syst. 32 (2017) 711–722.
[20] L. Wang, Z.-H. You, S.-X. Xia, X. Chen, X. Yan, Y. Zhou, F. Liu, An improved efficient rotation forest algorithm to predict the interactions among proteins, Soft Comput. (2017) 1–9.
[21] J. Xia, L. Bombrun, Y. Berthoumieu, C. Germain, P. Du, Spectral-spatial Rotation Forest for hyperspectral image classification, in: Geosci. Remote Sens. Symp. (IGARSS), 2016 IEEE Int., IEEE, 2016, pp. 5126–5129.
[22] C. Su, S. Ju, Y. Liu, Z. Yu, Improving random forest and rotation forest for highly imbalanced datasets, Intell. Data Anal. 19 (2015) 1409–1432.
[23] M. Hosseinzadeh, M. Eftekhari, Using fuzzy undersampling and fuzzy PCA to improve imbalanced classification through Rotation Forest algorithm, in: Comput. Sci. Softw. Eng. (CSSE), 2015 Int. Symp., IEEE, 2015, pp. 1–7.
[24] M. Hosseinzadeh, M. Eftekhari, Improving rotation forest performance for imbalanced data classification through fuzzy clustering, in: Artif. Intell. Signal Process. (AISP), 2015 Int. Symp., IEEE, 2015, pp. 35–40.
[25] T. Woloszynski, M. Kurzynski, P. Podsiadlo, G.W. Stachowiak, A measure of competence based on random classification for dynamic ensemble selection, Inf. Fusion 13 (2012) 207–213.
[26] J. Xiao, C. He, X. Jiang, D. Liu, A dynamic classifier ensemble selection approach for noise data, Inf. Sci. (Ny) 180 (2010) 3402–3421.
[27] J. Xiao, C. He, Dynamic classifier ensemble selection based on GMDH, in: Comput. Sci. Optim. 2009. CSO 2009. Int. Jt. Conf., IEEE, 2009, pp. 731–734.
[28] R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Wiley Online Library, 1936.
[29] A.M. Martínez, A.C. Kak, PCA versus LDA, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 228–233.
[30] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, Adv. Neural Inf. Process. Syst. (2004) 97–104.
[31] S. Theodoridis, A. Pikrakis, K. Koutroumbas, D. Cavouras, Introduction to Pattern Recognition: A Matlab Approach, Academic Press, 2010.
[32] A. Ghodsi, Dimensionality Reduction: A Short Tutorial, Dep. Stat. Actuar. Sci., Univ. of Waterloo, Ontario, Canada, 2006.
[33] A. Frank, UCI Machine Learning Repository, 2010, http://archive.ics.uci.edu/ml.
[34] Ludmila Kuncheva's Home Page, http://pages.bangor.ac.uk/~mas00a/welcome.html (accessed 27 October 2017).
[35] Kent Ridge Bio-Medical Gene Expression Datasets, https://www.researchgate.net/figure/221501601_tbl1_Table-1-Kent-Ridge-Bio-medical-gene-expression-datasets (accessed 27 October 2017).
[36] Microarray Project, https://research.nhgri.nih.gov/microarray/Supplement/ (accessed 27 October 2017).
[37] R.P.W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D.M.J. Tax, S. Verzakov, PRTools 4 - A MATLAB Toolbox for Pattern Recognition, Version 4.1, Delft Univ. Technol., 2007.
[38] T. Elgamal, M. Hefeeda, Analysis of PCA Algorithms in Distributed Environments, arXiv preprint arXiv:1503.05214, 2015.
[39] D. Cai, X. He, J. Han, Training linear discriminant analysis in linear time, in: Data Eng. 2008. ICDE 2008. IEEE 24th Int. Conf., IEEE, 2008, pp. 209–217.
[40] M. Suzuki, Manifold Learning and Convergence of Laplacian Eigenmaps, 2015.
[41] J. Alcalá-Fdez, L. Sanchez, S. Garcia, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. Fusion Found. Methodol. Appl. 13 (2009) 307–318.
[42] J.H. Zar, Biostatistical Analysis, Pearson Education, India, 1999.
[43] J.D. Li, A two-step rejection procedure for testing multiple hypotheses, J. Stat. Plan. Inference 138 (2008) 1521–1527.
[44] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. (Ny) 180 (2010) 2044–2064.
[45] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[46] S. García, D. Molina, M. Lozano, F. Herrera, A study on the use of non-parametric tests for analyzing the evolutionary algorithms' behaviour: a case study on the CEC'2005 special session on real parameter optimization, J. Heuristics 15 (2009) 617–644.
[47] M.K. Ebrahimpour, M. Eftekhari, Ensemble of feature selection methods: a hesitant fuzzy sets approach, Appl. Soft Comput. 50 (2017) 300–312.