G Model
ARTICLE IN PRESS
ARTMED-1504; No. of Pages 9
Artificial Intelligence in Medicine xxx (2017) xxx–xxx
Contents lists available at ScienceDirect
Artificial Intelligence in Medicine journal homepage: www.elsevier.com/locate/aiim
A novel hierarchical selective ensemble classifier with bioinformatics application
Leyi Wei a, Shixiang Wan a, Jiasheng Guo b, Kelvin KL Wong c,∗
a School of Computer Science and Technology, Tianjin University, Tianjin, China
b School of Information Science and Technology, Xiamen University, Xiamen, China
c School of Medicine, Western Sydney University, Sydney, Australia
Article info
Article history: Received 23 December 2016; Received in revised form 9 February 2017; Accepted 10 February 2017
Keywords: Selective ensemble learning; Parallel optimization; Divide and conquer; Multi-class classification; Bioinformatics
Abstract
Selective ensemble learning is a technique that selects a subset of diverse and accurate basic models in order to achieve stronger generalization ability. In this paper, we propose a novel learning algorithm based on parallel optimization and hierarchical selection (PTHS). We also propose a feature selection method, based on maximizing the sum of relevance and distance (MSRD), for solving the problem of high dimensionality. Specifically, the PTHS algorithm employs parallel optimization, candidate model pruning based on k-means, and a hierarchical selection framework. We combine the prediction results of the basic models by majority voting, which employs a divide-and-conquer strategy to save computing time. In addition, the problem transformation (PT) method is capable of transforming a multi-class problem into binary classification problems, thereby allowing our ensemble model to address multi-class problems. An empirical study shows that MSRD is efficient in solving the high-dimensionality problem and that PTHS exhibits better performance than other existing classification algorithms. Most importantly, our classifier achieved high-level performance on several bioinformatics problems (e.g. tRNA identification and protein-protein interaction prediction), demonstrating efficiency and robustness. © 2017 Elsevier B.V. All rights reserved.
1. Introduction
Generalization performance is one of the central concerns of machine learning. An empirical study has demonstrated that ensemble learning methods offer significant improvements in generalization ability [1]. In recent years, ensemble learning methods have become a hot topic and have been widely applied in various fields, including multi-label classification, bioinformatics, and image processing [2–6]. In general, an ensemble learning model is an ensemble of basic models (e.g. decision trees, neural networks, and support vector machines [7–9]) that makes predictions by fusing the prediction results of the basic models with some strategy, such as weighted or unweighted voting [1]. The construction of an ensemble method can be roughly divided into two main steps. The first step is the generation and selection of the basic models. The second step is the combination of the individual basic models. Bagging and Boosting are currently two well-known ensemble learning algorithms. The
∗ Corresponding author at: School of Medicine, Western Sydney University, Locked Bag 1797, Penrith, NSW, 2751, Australia. E-mail address:
[email protected] (K.K. Wong).
bagging algorithm, which was originally proposed by Breiman et al. [10], constructs a set of diverse models by the bootstrap method, which repeatedly draws samples from the original data with replacement. Boosting is another type of ensemble method, and AdaBoost is one of its most popular algorithms [11]. AdaBoost is an iterative algorithm that minimizes the weighted error on the training set when training basic models. It increases the weights of misclassified instances and reduces the weights of correctly classified instances, which leads the model to focus more on the misclassified instances in the next iteration. Intuitively, a growing number of basic models can improve the prediction performance. However, this approach requires more time and computing power, and it may even degrade the performance if poor basic models are involved. Diversity of basic models is one of the most fundamental factors that impact the generalization performance of an ensemble model [12]. How to construct a set of diverse basic models is therefore an important issue. Zhou et al. proposed the concept of selective ensemble and proved that it may be better to ensemble many of the basic models rather than all of them [12]. They proposed an algorithm named genetic algorithm-based selective ensemble (GASEN), which selects appropriate models for ensembles. The results showed that this method could obtain good
http://dx.doi.org/10.1016/j.artmed.2017.02.005 0933-3657/© 2017 Elsevier B.V. All rights reserved.
Please cite this article in press as: Wei L, et al. A novel hierarchical selective ensemble classifier with bioinformatics application. Artif Intell Med (2017), http://dx.doi.org/10.1016/j.artmed.2017.02.005
performance with smaller ensembles as well as stronger generalization performance. It is essential to choose a set of basic models that have the largest diversity instead of an ensemble of all the basic models. Selective ensemble learning is a method that identifies the optimal subset for creating an ensemble from a given pool of basic models by using a form of evaluation [13]. This is a combinatorial optimization problem that is recognized as NP-complete (nondeterministic polynomial-time complete), so it is not easy to select the optimal subset of combined models. Heuristic algorithms are usually employed to find an approximately optimal solution, including simulated annealing, genetic algorithms, and particle swarm optimization [13]. Kosztyán et al. proposed a novel algorithm with which an optimal resource allocation with minimal total cost can be determined for any arbitrary project [14]. Mátrai et al. tested how the characteristic searching routes and navigation methods on a web page differ between normal users and those with intellectual disabilities [15]. Dosa et al. proved that there does not exist any polynomial-time algorithm with a worst-case ratio smaller than 2 unless P = NP, even if all jobs have unit processing time [16]. In the real world, most practical classification tasks are to learn an appropriate function that assigns given input feature values to one of a finite set of classes [17]. The features are important for training a well-performing model. The question that follows is: should we obtain more features to build the model? Our answer: absolutely not. In fact, many features are irrelevant or redundant, which means that they contribute little to determining the target class. Only some of the features drive the decision. Additionally, higher-dimensional features create higher computational costs, and the performance of a model can sometimes decrease when a feature set includes irrelevant, redundant or otherwise poor features.
In many fields, a learning algorithm may not work if the dataset is too large [18]. Removing irrelevant/redundant features is therefore necessary. A number of feature ranking methods have been proposed, and some of them find the optimal subset of features based on cost functions. They might use exhaustive algorithms to search for the best feature subset among the $2^N$ competing candidate subsets when the number of features is N. This approach can find the best subset in theory, but it costs too much time. Other methods use heuristic algorithms with a stopping condition to search for a compromise subset and thereby avoid exhaustive searching. Multi-class classification problems have attracted a lot of attention in the machine learning field. The most typical application of multi-class classification is text categorization. Most traditional machine learning algorithms are unable to address multi-class classification problems directly, because these problems are more complex than traditional binary classification. In a binary classification problem, each labelled sample belongs to one certain target class, while in a multi-class classification problem, each labelled sample belongs to one or more classes simultaneously [19]. Two kinds of methods have been employed to address multi-class problems: algorithm adaptation (AA) and problem transformation (PT) [20]. AA addresses the multi-class problem by extending single-label algorithms, such as K-Nearest Neighbour, decision trees, Naïve Bayes, and Boosting, directly to multi-class classification. PT transforms the multi-class problem into multiple binary problems and then uses traditional classification algorithms to address these problems. The PT method is more flexible and scalable than the AA method because it makes full use of single-label algorithms for classification. Ensemble methods can improve prediction by combining diverse basic models.
However, with a large increase in the number of data sets, it becomes especially challenging to load all of the data sets into memory to train basic models. Feature selection, which aims to select the most important subset of a larger number of features, is frequently employed to address problems with large- or
Fig. 1. Parallel optimization of a basic model.
high-dimensional data sets [21–23]. In this paper, we propose MSRD (maximize the sum of relevance and distance), a feature selection method for solving problems with high-dimensional data sets. Drawing from ensemble theory, we also design a selective ensemble algorithm called PTHS (Parallel Optimization and Hierarchical Selection). The combination voting task of PTHS is based on a divide-and-conquer strategy for saving time, especially when the number of prediction instances or ensemble models is large. We combine the two methods to solve higher-dimensional data problems and improve the generalization performance. Finally, we use the PT method to transform the multi-class problem into a single-class problem, which gives our ensemble model the ability to solve multi-class classification problems.
2. Methodology design of selective ensemble
This section introduces the proposed PTHS algorithm in detail. The algorithm consists of three steps: parallel optimization of the basic models' parameters, ensemble pruning based on k-means clustering and hierarchical selection, and voting for the final prediction result with the divide-and-conquer strategy. Additionally, we employ the problem transformation (PT) method to extend our ensemble model to multi-class problems. Each step is described in a corresponding subsection.
2.1. Parallel optimization of the basic model's parameters
In this section, each basic model is trained independently, and the parameters are optimized in parallel. Constructing multiple basic models is the precondition of the ensemble. Here, we use 22 models, h = {h1, ..., hi, ..., h22}, to generate a set of basic models that includes Logistic Regression, Random Forest, K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Naive Bayes (NB), decision trees, and other learning models [24–26]. For convenience, we consider only binary class problems here.
Let $\Omega = \{0, 1\}$ be a set of binary class labels; let $\vec{x} \in \mathbb{R}^n$ be a vector that has n features and that will be labelled in $\Omega$; and let $D = \{(\vec{x}_1, l_1), (\vec{x}_2, l_2), \ldots, (\vec{x}_n, l_n)\}$ be a training set where $l_i \in \Omega$.
After randomly shuffling the data set D, we use 80% of D as the training set and 20% as the validation set. Construction of the basic models occurs in two stages. First, we build the basic structure of each basic model by training on the training set. Second, while training the basic models, we optimize their parameters in parallel using the validation set. Because of the limited resources of the computer, we save each basic model with its optimal parameters to the hard disk when its parameter optimization finishes. Fig. 1 is a flowchart of the basic model's parameter optimization method. Each step is described in detail as follows:
Step 1: Randomly shuffle the data set, and split it into a training set and a validation set.
Step 2: Build the base structure of model hi using the training set. Here, hi is implemented using a data mining tool, Waikato Environment for Knowledge Analysis (WEKA), which is a collection of machine learning algorithms for data mining.
Step 3: Optimize the parameters of the models hi in parallel. For example, if the model for a given hi is KNN, then that model has a parameter K, which is the number of neighbours to use. We define the parameter specification as "K, 1, 15, 5". This specification means that the K values range from 1 to 15 in 5 steps, so we create 5 parameter optimization tasks. Each task uses its corresponding value to train the model and evaluates the performance on the validation set. The 5 tasks are then submitted to a multithreaded thread pool. The number of threads can be set automatically or manually, based on the system environment, to achieve the desired effect. The thread pool runs the submitted tasks in parallel with the number of threads that the user sets and saves the model with the best parameters to the hard disk when all of the tasks have finished. Here, we use 5 threads to optimize the parameters of each model hi. The detailed algorithm is defined as follows:
Algorithm 1 Parallel parameter optimization
Input: model hi, and parameter ϕ ("K, 1, 15, 5")
1: split ϕ and get the value ki of each step
2: for i = 1 to step do
3:   add the task for step value ki to the ThreadPool
4: end for
5: train and save hi in parallel
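The sweep in Algorithm 1 can be sketched in Python as follows. This is an illustration only, not the authors' WEKA-based implementation: a tiny hand-written KNN stands in for the WEKA model, the names `parse_spec`, `knn_predict`, `validation_accuracy`, and `optimize_k` are hypothetical, reading "K, 1, 15, 5" as five evenly spaced integer values of K is an assumption, and the step that saves the best model to disk is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_spec(spec):
    """Parse a spec such as "K, 1, 15, 5" into (name, candidate values).

    Assumption: the range [low, high] is covered by `steps` evenly spaced integers.
    """
    name, low, high, steps = [part.strip() for part in spec.split(",")]
    low, high, steps = int(low), int(high), int(steps)
    if steps == 1:
        return name, [low]
    return name, [low + i * (high - low) // (steps - 1) for i in range(steps)]


def knn_predict(train, k, x):
    """Majority vote among the k training rows closest to x (toy stand-in model)."""
    nearest = sorted(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[:k]
    return 1 if 2 * sum(label for _, label in nearest) >= len(nearest) else 0


def validation_accuracy(train, valid, k):
    """Fraction of validation rows that the model with parameter k labels correctly."""
    return sum(knn_predict(train, k, x) == label for x, label in valid) / len(valid)


def optimize_k(train, valid, spec="K, 1, 15, 5", threads=5):
    """Evaluate every candidate K as a separate task in a thread pool; keep the best."""
    _, candidates = parse_spec(spec)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        scores = list(pool.map(lambda k: validation_accuracy(train, valid, k), candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)  # first best on ties
    return candidates[best], scores[best]


# Hypothetical toy data: one feature, label 1 for values near 1.0.
train = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0), ((1.0,), 1), ((1.1,), 1), ((1.2,), 1)]
valid = [((0.05,), 0), ((1.05,), 1)]
best_k, best_acc = optimize_k(train, valid)
```

In CPython, threads mainly help when training releases the interpreter lock or waits on I/O; for CPU-bound training, a process pool may be the closer analogue of the thread pool described above.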
2.2. Evaluation measures for hierarchical selection
Choosing the method that is to be used to construct a diverse combination of basic models has always been a key decision in ensemble research [27]. There are many statistics that can measure diversity among the models. Here, we simply consider a measurement of inter-rater agreement κ, which is considered to be a measurement of the similarity. The diversity value κ is defined as follows:

$$\kappa = 1 - \frac{1}{2\bar{p}(1-\bar{p})}\,Dis_{av}, \qquad (1)$$

where $\bar{p}$ and $Dis_{av}$ are defined as follows:

$$\bar{p} = \frac{1}{nt}\sum_{j=1}^{n}\sum_{i=1}^{t} h_i(x_j, l_j), \qquad (2)$$

$$Dis_{av} = \frac{1}{t(t-1)}\sum_{i=1}^{t}\sum_{k=1,\,k\neq i}^{t} Dis_{i,k}, \qquad (3)$$

$$Dis_{i,k} = \frac{N_{ik}(01)+N_{ik}(10)}{N_{ik}(11)+N_{ik}(10)+N_{ik}(01)+N_{ik}(00)}. \qquad (4)$$

Table 1 Definition of Nik(xy).
                  Model(hk)
Model(hi)         Predict(0)    Predict(1)
Predict(0)        N(00)         N(01)
Predict(1)        N(10)         N(11)

Table 1 is the definition of Nik(xy) for Disav, and Nik(xy) represents the relationship between the predictions of models hi and hk on the validation set instances. This presented research did not show a strong correlation between the diversity and ensemble accuracy [2]. The accuracy of the basic model is also an important element for improving the ensemble performance. In this paper, we proposed an evaluation characteristic called Accuracy-Diversity (AD), which considers both the inter-rater agreement κ and the accuracy (ACC) of the ensemble model. Then, AD is defined as

$$AD = \beta \times \kappa + (1-\beta) \times ACC, \quad 0 \le \beta \le 1. \qquad (5)$$

We use the parameter β to adjust the weight between the accuracy and diversity of the basic models. We use the same weight by default for these two evaluation characteristics, which means that β equals 0.5. We can manually set the value of β to achieve the optimal performance.
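The measures in Eqs. (1) to (5) can be computed directly from the 0/1 predictions of the basic models on the validation set. The following Python sketch is a minimal illustration under two assumptions that the text does not state explicitly: h_i(x_j, l_j) in Eq. (2) is read as 1 when model i classifies instance j correctly and 0 otherwise, and ACC is supplied by the caller. All function names and the toy predictions are hypothetical.

```python
def disagreement(pred_i, pred_k):
    """Eq. (4): fraction of instances on which two models predict different classes."""
    return sum(a != b for a, b in zip(pred_i, pred_k)) / len(pred_i)


def average_disagreement(preds):
    """Eq. (3): mean pairwise disagreement over all ordered pairs of distinct models."""
    t = len(preds)
    total = sum(disagreement(preds[i], preds[k])
                for i in range(t) for k in range(t) if i != k)
    return total / (t * (t - 1))


def mean_accuracy(preds, labels):
    """Eq. (2), assuming h_i(x_j, l_j) = 1 when model i labels instance j correctly."""
    hits = sum(p == l for row in preds for p, l in zip(row, labels))
    return hits / (len(preds) * len(labels))


def kappa(preds, labels):
    """Eq. (1); undefined (division by zero) when the mean accuracy is exactly 0 or 1."""
    p_bar = mean_accuracy(preds, labels)
    return 1 - average_disagreement(preds) / (2 * p_bar * (1 - p_bar))


def accuracy_diversity(kappa_value, acc, beta=0.5):
    """Eq. (5): weighted combination of kappa and the ensemble accuracy."""
    return beta * kappa_value + (1 - beta) * acc


# Hypothetical validation labels and the predictions of three basic models.
labels = [1, 0, 1, 0]
preds = [[1, 0, 1, 1],
         [1, 0, 0, 0],
         [1, 1, 1, 0]]
k_value = kappa(preds, labels)               # each model is 75% accurate here
ad_value = accuracy_diversity(k_value, 1.0)  # a majority vote of the three is 100% accurate
```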
2.3. Hierarchical selection design
In this section, we introduce an approach called hierarchical selection. The experimental results show that our proposed method is flexible and competitive compared with other model selection methods. Hierarchical selection is a heuristic algorithm that uses a stopping condition to avoid exhaustive search. We use the values 0 and 1 to represent whether a basic model is selected for ensemble membership. The vector $\vec{x} = \{1, 0, \ldots, 1\}$ then represents the chosen combination of the basic models. Before hierarchical selection, we prune the candidate basic models with the k-means algorithm, which partitions all of the models into subsets that contain similar models. The data were randomly divided into training data and validation data in Section 2.1. The predictions of each model on the validation data are the input of the k-means algorithm. Suppose that the validation data contain m instances, and that the predicted classes produced by model hi are yi = {y1, y2, ..., ym}. Then, k-means uses those output values to calculate the Euclidean distance between each pair of models and divide
Fig. 2. Hierarchical selection framework.
them into k subsets. We define the distance between models hi and hj on the x-th instance as $d_{ijx}$. Because each prediction takes only two values, 0 or 1, $d_{ijx} = 0$ if hi and hj make the same prediction for the x-th instance, and $d_{ijx} = 1$ otherwise. It follows that the distance between hi and hj is $d_{ij} = \sum_{x=1}^{m} d_{ijx}$, which for 0/1 predictions equals the squared Euclidean distance between the two prediction vectors. Here, we set k = 12 to partition the basic models into 12 clusters. Finally, we selected the best-performing model from each subset as a candidate model for hierarchical selection. In this unsupervised way, models with poor accuracy and models that are similar to one another can be eliminated.
The ensemble framework acts as a base for hierarchical selection. Fig. 2 is the flowchart of hierarchical selection. The solution is a 0/1 vector that represents the chosen combination of the basic models. At each layer, the algorithm decides whether to accept a new solution and updates the selection probabilities dynamically. Finally, the algorithm selects the best combination of models as evaluated by the characteristic AD. Each hierarchy is a combinatorial optimization problem that is evaluated by the characteristic AD, which was proposed in Section 2.2. Initially, the selection probability of each basic model is equal to 0.5. We first generate a solution, which is a vector composed of 0s and 1s, based on the initial selection probability. If the AD value of this solution reaches the anticipated value, we end the procedure of hierarchical selection, and this solution is the final combination of basic models. Otherwise, we go into the next layer. In the second layer, we generate a new solution in the same way as in the former layer. Comparing these two solutions, we decide whether to accept the new solution by using the Metropolis criterion, as used in the simulated annealing algorithm, at each hierarchy [28]. We then increase the selection probability of each model in the new solution and decrease the selection probability of those that are not in the solution. Model selection moves to the next layer until the performance of the combined model reaches our anticipated value. Through hierarchical selection, our algorithm updates the probability values of the basic models dynamically.
The algorithm of hierarchical selection is summarized as follows:
Algorithm 2 Hierarchical selection
Input: parameter ϕ, T, diversity K, threshold, available classifiers H = {h1, h2, ..., hn}, probability p(xr) (r ∈ 1, ..., n), and the rate of classifiers rate = {rate1, rate2, ..., raten}
Initialization: p(xr) = rate
1: generate an initial solution Hx based on p(xr);
2: while the stopping condition is not met do
3:   randomly generate a new solution Hr based on p(xr);
4:   J(Hr) = ϕ ∗ p(xr) + (1 − ϕ) ∗ kr
5:   ΔJ = J(Hr) − J(Hx)
6:   if Hr is accepted by the Metropolis criterion min{1, exp(ΔJ/T)} then
7:     accept Hr as Hx
8:   end if
9:   update p(xr) in Hr and T
10: end while
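One possible reading of Algorithm 2 is sketched below in Python. Several details are assumptions rather than the authors' specification: the score J is passed in as a caller-supplied function (for example AD from Section 2.2), a candidate is accepted with probability min(1, exp(ΔJ/T)), the selection probabilities move by a fixed `step` and are clipped to [0.05, 0.95], and the temperature decays geometrically. All names, default values, and the toy score are hypothetical.

```python
import math
import random


def hierarchical_selection(n_models, score, target, temperature=1.0, cooling=0.9,
                           step=0.05, max_layers=200, seed=0):
    """Search for a 0/1 selection vector whose score reaches `target`."""
    rng = random.Random(seed)
    probs = [0.5] * n_models                    # initial selection probabilities

    def sample():
        mask = [1 if rng.random() < p else 0 for p in probs]
        if not any(mask):                       # avoid an empty ensemble
            mask[rng.randrange(n_models)] = 1
        return mask

    current = sample()
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(max_layers):                 # one iteration per layer
        if best_score >= target:                # stopping condition
            break
        candidate = sample()
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Metropolis rule: keep improvements; keep a worse solution
        # with probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
        # Raise the probability of models in the current solution, lower the rest.
        probs = [min(0.95, p + step) if bit else max(0.05, p - step)
                 for p, bit in zip(probs, current)]
        temperature = max(temperature * cooling, 1e-9)
    return best, best_score


# Hypothetical score: mean quality of the selected models.
weights = [0.9, 0.2, 0.8, 0.1]


def mean_quality(mask):
    return sum(w * b for w, b in zip(weights, mask)) / sum(mask)


best, best_score = hierarchical_selection(4, mean_quality, target=0.85)
```

Tracking the best solution seen so far is a convenience added here; the pseudocode itself only keeps the currently accepted solution.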
In Algorithm 2, K represents the diversity value, and rate is the set of the basic models' accuracies. In this way, better-performing basic models have greater chances of becoming candidate models for the ensemble, while poorly performing basic models still have the opportunity to become candidates. This arrangement helps to avoid falling into a locally optimal solution.
2.4. Ensemble combination method
Because the ensemble method attempts to combine the prediction results of the basic models to achieve a stronger generalization ability, the choice of the combination method plays an important role [29]. Generally, a combination vote can be divided into
Fig. 3. Ensemble vote based on divide-and-conquer.
weighted and unweighted votes. The majority vote is probably the most popular unweighted combination method. It identifies $c_{opt}$, the class that is most often chosen by the different models. Suppose that the predicted class of instance $x_j$ is $c_{opt}$, with $c_{opt} \in \{0, 1\}$; here, we consider only the binary case. The ensemble's output class label $c_{opt}$ is defined as follows:

$$c_{opt} = \begin{cases} 1, & \text{if } \sum_{i=1}^{m} h_i(x_j) \ge \frac{1}{2}m, \\ 0, & \text{otherwise}, \end{cases} \qquad (6)$$

where m is the number of basic models. Using the weights of the votes, the ensemble's output class label $c_{opt}$ is defined as follows:

$$c_{opt} = \begin{cases} 1, & \text{if } \sum_{i=1}^{m} w_i h_i(x_j) \ge \frac{1}{2}, \\ 0, & \text{otherwise}, \end{cases} \qquad w_i \ge 0,\ \sum_{i=1}^{m} w_i = 1. \qquad (7)$$
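Eqs. (6) and (7) translate into a few lines of Python. The sketch below is illustrative only; the function names are hypothetical, and the weighted rule assumes normalized weights compared against a threshold of 1/2.

```python
def majority_vote(predictions):
    """Eq. (6): predictions holds one 0/1 output per basic model; ties go to class 1."""
    return 1 if sum(predictions) >= len(predictions) / 2 else 0


def weighted_vote(predictions, weights):
    """Eq. (7) for weights w_i >= 0 that sum to 1 (threshold 1/2 is an assumption)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    total = sum(w * p for w, p in zip(weights, predictions))
    return 1 if total >= 0.5 else 0


# Three hypothetical basic models voting on one instance.
unweighted = majority_vote([1, 0, 1])
weighted = weighted_vote([1, 0, 0], [0.6, 0.2, 0.2])
```

With equal weights 1/m, the weighted rule reduces to the unweighted majority vote.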
We also propose an ensemble vote method that is based on divide-and-conquer. Fig. 3 is the flowchart of the proposed ensemble combination method. Our method recursively divides a large voting task into two smaller tasks when the voting task is too large. Suppose that the number of basic models is T and the number of instances to be predicted is n; then, the time complexity of the ensemble vote combination is O(nT). Using the traditional single-thread approach for ensemble voting would be very time-consuming when the number of instances to be predicted is large. To save ensemble voting time, we employed a combination vote strategy based on divide-and-conquer. We first set a threshold for each majority vote task. If the number of prediction instances is larger than the threshold value, we recursively divide that large ensemble voting task into two smaller tasks. In this way, we divide the larger tasks into many small tasks. We submit those small tasks to a thread pool, run them in parallel, and finally merge the prediction results of all the small tasks. This approach decreases the time complexity of the ensemble vote to O(T log2 n). This method can be used regardless of whether the prediction comes from a single model or an ensemble model. The more models are combined, the more time can be saved.
2.5. Method for multi-class problems
In the real world, most problems are supervised classification problems, which learn a function from a given data set that is associated
with some finite set of classes. We use the function to predict the class of a new instance. The multi-class problem is also a supervised classification problem; however, an instance can belong to many classes instead of a binary class. Multi-class problems have gained a large amount of attention in the machine learning field. Generally speaking, PT can be divided into four categories, including binary relevance, label power set, pairwise and ranking transformation. Each method works well in some domain field(s). Here, we use only the binary relevance method, which predicts the relevance of each label: we transform the multi-class problem into single-class problems and then use our ensemble model PTHS to predict the relevance of each class. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ be a set of class labels; let $\vec{x} \in \mathbb{R}^n$ be a vector that has n features and that will be labelled with a subset of $\Omega$; and let $D = \{(\vec{x}_1, \{\omega_1\}), (\vec{x}_2, \{\omega_2\}), \ldots, (\vec{x}_n, \{\omega_n\})\}$ be a training set, where each $\omega_i$ belongs to $\Omega$. For example, there are three labels (a, b, c) in $\Omega$. X1 and X2 are instances of the training data, with X1 ∈ {a, b} and X2 ∈ {b, c}. Table 2 shows the original multi-class dataset, and Table 3 shows the dataset transformed by the binary relevance method.

Table 2 Instance of multi-class dataset.
Instance    Class
            a    b    c
X1          1    1    0
X2          0    1    1

Table 3 Transformed dataset.
Instance    Class
            a    b    c
X1          1    0    1
X1          0    1    1
X1          0    0    1
X2          1    0    0
X2          0    1    1
X2          0    0    1

We transform an instance $\vec{x}$ with label vector {1, 1, 0} into the matrix Y, as follows:

$$Y = \begin{bmatrix} \vec{x} & 1 & 0 & 0 & 1 \\ \vec{x} & 0 & 1 & 0 & 1 \\ \vec{x} & 0 & 0 & 1 & 0 \end{bmatrix}$$

The first column is the original feature vector. Columns 2–4 form a unit matrix, and each column corresponds to one class. The last column represents the final class, which only belongs to {0, 1}. The first and second rows of matrix Y indicate that the original vector x belongs to classes a and b, respectively. The third row indicates that x does not belong to class c. After transforming the vector x into the matrix Y, the multi-class problem becomes a set of ordinary binary class problems. Then, we use the proposed ensemble model (PTHS) for the training and prediction of the relevance of each label. The performance metrics of the multi-class problem are more complex than those of the single-class problem. Many performance metrics have been proposed. Here, we introduce one performance metric called the average precision. Suppose that we have learned a prediction function $f_l(x)$; we define a ranking function $rank_f(x, l)$ according to $f_l(x)$, such that if $f_l(x) > f_k(x)$, then $rank_f(x, l) < rank_f(x, k)$. The average precision is defined as follows:

$$AP(f) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|y_i|}\sum_{l \in y_i}\frac{|\{k \mid rank_f(x_i, k) \le rank_f(x_i, l),\ k \in y_i\}|}{rank_f(x_i, l)}. \qquad (8)$$

The average precision measures, for each relevant class of an instance, the fraction of classes ranked at or above it that also belong to the relevant class set of that instance.

3. Feature selection
It is necessary to find a minimal-sized feature subset that excludes irrelevant or redundant features. Feature selection is aimed at improving the performance by removing some of the redundant features or decreasing the size of the feature set when the feature dimension is too large. Feature selection is frequently used to manage high-dimensional data. Here, we propose a feature selection method called MSRD, which is based on the theory of feature selection. Our experiment shows that our method works well. Most previous studies on feature selection methods focused on achieving the highest relevance between the features and the target class. Only taking relevance into account would cause redundancy in the selected feature subsets. As a result, a distance function is utilized to measure the independence of each feature. In this paper, Pearson's correlation coefficient (PCC) is utilized to measure the relevance between each feature and the target class, and the Euclidean distance (ED) is utilized to calculate the redundancy among the features in a subset. To facilitate the introduction of MSRD, some notation is provided first. We are given the input data $D = \{(x_1, l_1), (x_2, l_2), \ldots, (x_N, l_N)\}$, where the target class is $l_i$ and there are M features $F = \{f_i,\ i = 1, \ldots, M\}$. Our aim is to find a subspace of m features, $R^m$, which is selected from the original M-dimensional space $R^M$ and makes the greatest contribution to classifying the target class $l_i$. PCC is used to reflect the linear correlation of two variables. Thus, PCC is selected to measure the relevance between a feature and the target class $l_i$. Given a feature vector $\vec{F}$ and a target class label vector $\vec{l}$, their PCC is calculated as follows:

$$PCC(\vec{F}, \vec{l}) = \frac{\frac{1}{N-1}\sum_{k=1}^{N}(f_k-\bar{f})(l_k-\bar{l})}{\sqrt{\frac{1}{N-1}\sum_{k=1}^{N}(f_k-\bar{f})^2}\,\sqrt{\frac{1}{N-1}\sum_{k=1}^{N}(l_k-\bar{l})^2}}, \qquad (9)$$

$$\bar{f} = \frac{1}{N}\sum_{k=1}^{N} f_k, \qquad (10)$$

$$\bar{l} = \frac{1}{N}\sum_{k=1}^{N} l_k, \qquad (11)$$

where $f_k$ and $l_k$ are the kth elements of $\vec{F}$ and $\vec{l}$. The relevance value of feature i, $R_i$, is defined as follows:

$$R_i = |PCC(\vec{F}_i, \vec{l})| \quad (1 \le i \le M). \qquad (12)$$
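As a small worked example of Eqs. (9) to (12), take a hypothetical feature vector F = (1, 2, 3, 4) and label vector l = (0, 0, 1, 1) with N = 4. These values are chosen only for illustration and are not taken from the paper's data.

```latex
\begin{align*}
\bar{f} &= \tfrac{1}{4}(1+2+3+4) = 2.5, \qquad \bar{l} = \tfrac{1}{4}(0+0+1+1) = 0.5 \\
\tfrac{1}{N-1}\sum_{k=1}^{4}(f_k-\bar{f})(l_k-\bar{l})
  &= \tfrac{1}{3}(0.75 + 0.25 + 0.25 + 0.75) = \tfrac{2}{3} \\
\sqrt{\tfrac{1}{N-1}\sum_{k=1}^{4}(f_k-\bar{f})^2}
  &= \sqrt{\tfrac{1}{3}(2.25 + 0.25 + 0.25 + 2.25)} = \sqrt{\tfrac{5}{3}} \\
\sqrt{\tfrac{1}{N-1}\sum_{k=1}^{4}(l_k-\bar{l})^2}
  &= \sqrt{\tfrac{1}{3}(4 \times 0.25)} = \sqrt{\tfrac{1}{3}} \\
PCC(\vec{F}, \vec{l}) &= \frac{2/3}{\sqrt{5/3}\,\sqrt{1/3}} = \frac{2/3}{\sqrt{5}/3}
  = \frac{2}{\sqrt{5}} \approx 0.894, \qquad R = |PCC(\vec{F}, \vec{l})| \approx 0.894
\end{align*}
```

Because the feature grows with the label, the relevance is close to 1; reversing the order of the labels would flip the sign of the PCC but leave R unchanged.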
Minimal redundancy will contribute to better classification performance. We next focus on measuring the level of similarity among the features by using a distance metric. ED is selected because it is well known, clear to understand and easily implemented. The ED between two features is calculated as follows:
$$ED(\vec{F}_i, \vec{F}_j) = \sqrt{\sum_{k=1}^{N}(f_{ik}-f_{jk})^2}, \quad (1 \le i, j \le M,\ i \ne j). \qquad (13)$$
Ultimately, $D_i$, the ED value of feature i, is defined as follows:

$$D_i = \frac{1}{M-1}\sum_{k=1,\,k\neq i}^{M} ED(\vec{F}_i, \vec{F}_k). \qquad (14)$$
The criterion that combines $R_i$ and $D_i$ is called "Maximize the sum of relevance and distance" (MSRD). Assume that we have selected m − 1 features and will select the mth feature from the remaining features. The selection condition is as follows:

$$\max_i\,(R_i + D_i). \qquad (15)$$

Here, Eq. (15) gives the relevance and the distance equal weight. However, different problems have different focuses. Therefore, the feature selection criterion can be generalized as follows:

$$\max_i\,(w_r R_i + w_d D_i), \qquad (16)$$

where the variables $w_r$ and $w_d$ are the weights for the relevance and distance, and

$$w_r + w_d = 1,\quad w_r \ge 0,\quad w_d \ge 0. \qquad (17)$$
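A minimal Python sketch of the MSRD criterion follows. It assumes that R_i and D_i are computed once over all M features, as Eqs. (12) and (14) are written, so that selecting m features reduces to ranking by w_r R_i + w_d D_i; it also assumes that a constant feature gets relevance 0, and it leaves R_i and D_i unscaled. The function names and the toy data are hypothetical.

```python
import math


def pcc(feature, labels):
    """Pearson's correlation coefficient, Eq. (9); returns 0.0 for a constant input."""
    n = len(feature)
    f_mean, l_mean = sum(feature) / n, sum(labels) / n
    cov = sum((f - f_mean) * (l - l_mean) for f, l in zip(feature, labels)) / (n - 1)
    f_std = math.sqrt(sum((f - f_mean) ** 2 for f in feature) / (n - 1))
    l_std = math.sqrt(sum((l - l_mean) ** 2 for l in labels) / (n - 1))
    if f_std == 0 or l_std == 0:
        return 0.0
    return cov / (f_std * l_std)


def euclidean(f_i, f_j):
    """Euclidean distance between two feature columns, Eq. (13)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def msrd_select(features, labels, m, w_r=0.5, w_d=0.5):
    """Rank features by w_r * R_i + w_d * D_i (Eq. (16)) and return the top m indices.

    `features` is a list of M columns, each holding one value per instance.
    """
    count = len(features)
    scores = []
    for i, f_i in enumerate(features):
        r_i = abs(pcc(f_i, labels))                                   # Eq. (12)
        d_i = sum(euclidean(f_i, f_k)
                  for k, f_k in enumerate(features) if k != i) / (count - 1)  # Eq. (14)
        scores.append(w_r * r_i + w_d * d_i)
    ranked = sorted(range(count), key=lambda i: scores[i], reverse=True)
    return ranked[:m], scores


# Hypothetical data: features 0 and 1 are duplicates that match the labels;
# feature 2 is uncorrelated with the labels but far from the other two.
features = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 0, 1, 0]]
labels = [0, 0, 1, 1]
top_two, scores = msrd_select(features, labels, 2)
distance_only, _ = msrd_select(features, labels, 1, w_r=0.0, w_d=1.0)
```

In this toy data the two duplicated columns tie for the top score under equal weights, while a pure-distance weighting prefers the third column, which shows how the choice of weights shifts the balance between relevance and redundancy.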
Table 4 Performance of various methods on SQF.
Model          ACC (%)    RMSE     ROC area (%)
PTHS           89.37      0.317    95.0
Bagging-J48    86.99      0.310    94.2
Boost-J48      88.26      0.327    94.2
RF             89.53      0.291    96.2

Table 5 Performance of various methods on SSF.
Model          ACC (%)    RMSE     ROC area (%)
PTHS           94.74      0.203    98.4
Bagging-J48    93.85      0.236    97.8
Boost-J48      93.13      0.236    96.9
RF             94.40      0.248    98.6
Ensemble methods can offer significant improvements in prediction accuracy over single models by combining the predictions of basic models. It is challenging to train and ensemble the basic models on higher-dimensional data. Feature selection can be used to reduce the dimensionality by selecting a feature subset. In addition, the accuracy of the basic models could increase after removing the irrelevant/redundant features. For higher-dimensional data, PTHS therefore performs feature selection first; otherwise, we simply use PTHS to train and ensemble the basic models.
4. Results and discussion
To examine how well our method performs, we carried out extensive empirical studies on a broad range of data sets. In this paper, we conducted our experiments on tRNA prediction, protein-protein interaction prediction and data sets from UCI [30] using a PC with a 3.10 GHz CPU and 8 GB RAM. Additionally, our 20 basic models are implemented with the data mining tool Waikato Environment for Knowledge Analysis (WEKA) [31]. We compared our proposed method PTHS with three well-known ensemble methods, namely, Bagging, Boosting, and Random Forest (RF). It is worth noting that the Bagging and Boosting methods used in this study are ensembles of decision trees (J48). Each method was evaluated with 5-fold cross-validation. Three metrics, namely the accuracy (ACC), root mean squared error (RMSE), and ROC area (AUC), were used for performance evaluation. We also compared the running time of the voting task and the parameter optimization under single-thread and multi-thread strategies. Testing on datasets from various fields shows that our method works well and that our combination strategy based on divide and conquer can significantly reduce the running time.
4.1. Experiments on tRNA predictions
tRNAScan-SE is a software tool that has been widely used for tRNA annotation. We used tRNAScan-SE to obtain positive and negative sequences as training data sets [32].
The training data of tRNA is a set of nucleotide sequences, which cannot be used directly as input to the training model. Researchers have proposed many feature extraction methods for sequences. We used two feature extraction methods to obtain two fixed-length feature sets: sequence features (SQFs) and sequence-structure features (SSFs) [33]. SQF was extracted by applying an n-gram model from natural language processing, which assumes that the appearance of a word is related only to the preceding n−1 words [34]. We consider tRNA sequences as a natural language and each nucleotide as a word. Here, we considered only n = 3. The tRNA sequences were run in triplicate for each sample. tRNA sequences contain four nucleotides, {A, U, C, G}; we used a 3-gram model to count the occurrences of each nucleotide triplet and normalized the counts to obtain our final tRNA features. Many experiments have shown that a sequence’s secondary structure can help to distinguish real tRNAs from pseudo-tRNAs. We employed RNAalifold to obtain the secondary structure of each tRNA sequence, in which every nucleotide has one of two statuses, paired or unpaired [35]. In contrast to the local contiguous structure-sequence features proposed by Xue et al., our SSF focuses on n continuous nucleotides and their secondary structure status [36]. As a result, we obtained 4^3 = 64 possible 3-tuple sequence combinations for the SQF and 2^1 × 4^3 = 128 possible 4-tuple sequence and structure combinations for the SSF. There are 1806 samples, including 623 positive and 1183 negative instances. Table 4 shows the performance of the four ensemble methods on SQF, and Table 5 shows their performance on SSF. As shown in Fig. 4, RF achieved the best performance among the ensemble models on the SQF features. The performance of PTHS is equal to that of RF and better than that of Bagging-J48 and Boost-J48. Although the training data are not balanced, all of the ensemble methods perform well on the ROC Area metric (ROC), and PTHS and RF still achieve better results than the other two methods. Fig. 4 also shows that the prediction accuracy of the four models increases by almost 5% when the SSF is employed as the training feature. These experimental results show the importance of the features for the model performance. Extracting a good and representative feature set is the most important step in machine learning.

Fig. 4. Accuracy comparison between SQF and SSF among the different models.

4.2. Experiments on protein-protein interaction predictions
Protein-protein interaction (PPI) prediction is a binary classification problem. For the positive set (interactions that exist), we used interactions from the DIP database (Salwinski et al., 2004), which has collected more than 78,000 PPIs and includes more than 27,000 proteins. For the negative set, we used an off-the-shelf negative data set called the Negatome, which mainly collects protein pairs that do not interact in a direct, physical manner. A total of 6445 non-interacting pairs obtained from the Negatome are used as the gold-standard negative data set. To keep the training set balanced, we randomly selected 6445 PPIs from the DIP as the positive set. We employed a method called 2-gram-k-skip (k-skip) to extract features from the original protein sequences. 2-gram-k-skip simply counts the 2-grams formed by two symbols at an interval of y amino acids, where y is an integer not greater than k. We used two feature selection methods in this experiment: the proposed MSRD and correlation analysis (CA) [37]. The dashed lines show the accuracy on the test data after the feature selection. Fig. 5 compares the accuracy of the ensemble model PTHS under the MSRD and CA feature selection methods. Fig. 6 shows the accuracy of PTHS and the other three traditional ensemble methods at different feature dimensions selected by MSRD. As Fig. 5 shows, our feature selection method MSRD performs better than CA: it finds a better feature subset than CA at almost every dimension. The accuracy is maximized (85.57%) at a dimension of 900 features. This result demonstrates that higher-dimensional feature sets can contain redundant data. It is therefore necessary to perform feature selection before training the model, which can improve the performance of the model. Fig. 6 compares the results of the various models on the PPI data after MSRD feature selection. PTHS and Boost-J48 perform better than the other two methods. In particular, when the feature dimension is 900, the performance of PTHS increases noticeably. On the whole, our feature selection method MSRD and ensemble model PTHS exhibit competitive performance compared to the other methods.

Fig. 5. Results of PPIs. MSRD accuracy vs. CA in PTHS. (Panel title: Feature selection; x-axis: feature dimension (×10^2); y-axis: accuracy (%); legend: MSRD, CA.)
Fig. 6. Accuracy between various methods using MSRD feature selection. (Panel title: Models comparison; x-axis: feature dimension (×10^2); y-axis: accuracy (%); legend: PTHS, Bag-J48, Boost-J48, RandomForest.)

4.3. Experiments on UCI data sets
We also explored our method’s performance using four problems from the UCI Repository: DBWorld e-mails, Spambase, Letter, and ThoraricSurgery [30]. The DBWorld e-mails data set is a binary classification problem with a dimension of 4703. Letter is a 26-category problem. For convenience, we transformed Letter into a binary problem by converting letters 1–13 into class 0 and letters 14–26 into class 1. The feature dimension of these four sets is relatively low, and therefore, we simply tested our method’s performance without using feature selection. Tables 6–9 list the performance of the proposed PTHS method and the existing ensemble methods on the four UCI data sets: DBWorld e-mails, Spambase, Letter, and ThoraricSurgery, respectively. Each model is evaluated with 5-fold cross-validation.

Table 6 Performance of various methods on DBWorld e-mails.
Model       ACC (%)   RMSE (%)   ROC (%)
PTHS        89.06     30.32      96.6
Bagg-J48    89.06     36.32      89.0
Boost-J48   84.38     37.12      85.8
RF          81.25     38.01      89.6

Table 7 Performance of various methods on Spambase.
Model       ACC (%)   RMSE (%)   ROC (%)
PTHS        95.00     22.09      98.40
Bagg-J48    94.18     21.43      98.00
Boost-J48   90.07     27.44      95.90
RF          94.48     21.09      98.10

Table 8 Performance of various methods on Letter.
Model       ACC (%)   RMSE (%)   ROC (%)
PTHS        96.09     23.94      99.30
Bagg-J48    95.52     18.87      99.20
Boost-J48   96.64     17.08      99.40
RF          96.09     18.65      99.40

Table 9 Performance of various methods on ThoraricSurgery.
Model       ACC (%)   RMSE (%)   ROC (%)
PTHS        84.04     35.91      63.50
Bagg-J48    83.83     36.42      61.80
Boost-J48   78.93     44.13      56.20
RF          81.28     37.76      59.50
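The relabeling of Letter described in this subsection (letters 1–13 to class 0, letters 14–26 to class 1) can be sketched as follows. This is only an illustration under stated assumptions: the function name `binarize_letter` is hypothetical, the class labels are assumed to be the capital letters A–Z, and "letters 1–13" is assumed to mean A–M in alphabetical order.

```python
import string


def binarize_letter(label: str) -> int:
    """Map a letter label to a binary class.

    Assumption: letters are numbered alphabetically (A = 1, ..., Z = 26);
    letters 1-13 (A-M) become class 0 and letters 14-26 (N-Z) become class 1.
    """
    position = string.ascii_uppercase.index(label.upper()) + 1
    return 0 if position <= 13 else 1


# Hypothetical usage on a few labels.
labels = ["A", "M", "N", "Z"]
print([binarize_letter(x) for x in labels])
```

Under this mapping each binary class contains 13 of the 26 letters, so the two classes are balanced only to the extent that the letters themselves are.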
Fig. 7. Running time comparison of the parameter optimization based on a single-thread and multi-thread approach.
Fig. 8. Running time comparison between the single-thread and divide-and-conquer strategy with different ensemble model numbers.
As the tables show, the proposed PTHS method achieved the best performance among the ensemble models on three of the four datasets (DBWorld e-mails, Spambase, and ThoraricSurgery). The accuracies of PTHS on these three datasets are 89.06%, 95.00%, and 84.04%, respectively. On the Letter dataset, the performance of PTHS is slightly worse than that of Boost-J48. These results indicate that our PTHS method is more effective and robust than the other existing ensemble methods.

4.4. Experiments on computing time comparison
As is well known, appropriate parameters can effectively improve the performance of a model. In Section 2.1, we proposed a method for searching for the optimal parameters of each model. To reduce the search time, we run the algorithm with a multi-thread strategy, which is implemented by a thread pool. Fig. 7 compares the running time of the single-thread and multi-thread strategies. Two datasets (Letter and Adult) obtained from the UCI Repository were used for this experiment. As shown in Fig. 7, parameter optimization with the multi-thread strategy significantly reduces the running time compared with the single-thread strategy.

4.5. Experiments on the combined method based on divide-and-conquer
In our paper, we combined the predictions of the basic models using the divide-and-conquer strategy. We first set an initial threshold value. Then, we compared the threshold with the number of data instances in the voting task and, when the task was too large, divided it into two smaller voting tasks. After recursively dividing the large combination task into a number of small combination tasks, we submitted the small tasks to a thread pool and ran them with the multi-thread strategy. Finally, we obtained the results for all of the tasks.
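The divide-and-conquer voting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the default threshold, the number of workers, and the tie-breaking rule (the label seen first wins) are all hypothetical choices, and Python's `concurrent.futures` thread pool stands in for whatever thread pool was actually used.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def split_tasks(start, end, threshold):
    """Recursively halve the instance range [start, end) until every piece
    holds at most `threshold` instances."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if end - start <= threshold:
        return [(start, end)]
    mid = (start + end) // 2
    return split_tasks(start, mid, threshold) + split_tasks(mid, end, threshold)


def vote_chunk(predictions, start, end):
    """Majority-vote the instances in [start, end).
    predictions[m][i] is the label predicted by model m for instance i."""
    return [majority_vote([model[i] for model in predictions])
            for i in range(start, end)]


def divide_and_conquer_vote(predictions, threshold=2, workers=4):
    """Split the voting task, run the pieces in a thread pool, and
    concatenate the partial results in instance order."""
    n_instances = len(predictions[0])
    chunks = split_tasks(0, n_instances, threshold)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(vote_chunk, predictions, s, e) for s, e in chunks]
        combined = []
        for future in futures:  # iterate in submission order to keep instance order
            combined.extend(future.result())
    return combined


# Toy example (hypothetical): three models, five instances.
predictions = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
print(divide_and_conquer_vote(predictions, threshold=2))
```

The sketch shows only the task structure. In CPython, pure-Python threads do not run in parallel because of the global interpreter lock, so it should not be expected to reproduce the running-time savings reported in this section.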
We used the Adult dataset to compare the performance of the voting task based on divide-and-conquer with that of the traditional single-thread approach. Fig. 8 compares the combination voting time for different numbers of ensemble models. Fig. 9 compares the running time when the number of prediction instances varies but the number of basic models is the same. As seen from Fig. 8, the voting task based on divide-and-conquer saves time compared with the single-thread strategy. As the number of ensemble classifiers increases, the improvement becomes more pronounced. Similar to the trend of the dashed line, the running time increases almost linearly. The result in Fig. 9 is similar to that in Fig. 8: as the number of instances increases, more running time is saved.
Fig. 9. Running time comparison between the single-thread and divide-and-conquer strategy for different prediction instances.
5. Conclusions
In this paper, we proposed a selective ensemble method called PTHS. We independently built basic models and optimized each model’s parameters in parallel using a thread pool. We first employed the k-means algorithm to prune similar models, and then chose a set of diverse basic models from the pruned models using a hierarchical selection method. The PT method makes our ensemble model suitable for multi-class problems. In addition, we proposed a feature selection method called MSRD to address high-dimensional problems. Experiments on the PPI datasets verify the feasibility of our MSRD method. In our empirical studies on the UCI and tRNA datasets, PTHS performs competitively against other traditional ensemble methods. Especially when the dataset is large, the proposed combination method based on the divide-and-conquer strategy can efficiently save computing time. In general, our proposed ensemble method is efficient, effective, and robust on most of the datasets. Unfortunately, it does not work well on imbalanced data, which is an interesting issue for our future research. Extending this method to “big data” is also a future direction of our research.

Conflict of interests
The authors declare no conflict of interests.

Acknowledgements
This paper is supported by the Natural Science Foundation of China (61370010) and the Natural Science Foundation of Fujian Province of China (2014J01253).
References
[1] Pérez-Ortiz M, Gutiérrez PA, Hervás-Martínez C. Projection-based ensemble learning for ordinal regression. IEEE Trans Cybern 2014;44:681.
[2] Bhatnagar V, Bhardwaj M, Sharma S, Haroon S. Accuracy–diversity based pruning of classifier ensembles. Prog Artif Intell 2014;2:97–111.
[3] Lin C, Chen W, Qiu C, Wu Y, Krishnan S, Zou Q. LibD3C: ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing 2014;123:424–35.
[4] Wei L, Liao M, Gao X, Zou Q. Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Trans Nanobiosci 2015;14:649–59.
[5] Zhang ML, Zhou ZH. A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 2014;26:1819–37.
[6] Lin C, Zou Y, Qin J, Liu X, Jiang Y, Ke C, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One 2013;8:e56499.
[7] Chen W, Ding H, Feng P, Lin H, Chou KC. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016;7:16895–09.
[8] Chen W, Tang H, Ye J, Lin H, Chou KC. iRNA-PseU: Identifying RNA pseudouridine sites. In: Molecular Therapy - Nucleic Acids; 2016.
[9] Wei L, Liao M, Gao Y, Ji R, He Z, Zou Q. Improved and promising identification of human MicroRNAs by incorporating a high-quality negative set. IEEE/ACM Trans Comput Biol Bioinform 2014;11:192–201.
[10] Breiman L. Bagging predictors. Machine Learning 1996;24:123–40.
[11] Freund Y. Experiments with a new boosting algorithm. Thirteenth International Conference on Machine Learning 1996:148–56.
[12] Li N, Yu Y, Zhou ZH. Diversity regularized ensemble pruning. European Conference on Machine Learning and Knowledge Discovery in Databases 2012:330–45.
[13] Zhou ZH, Wu J, Tang W. Ensembling neural networks: many could be better than all. Artificial Intelligence 2002;137:239–63.
[14] Kosztyán ZT, Bencsik A, Póta S. Resource Allocation And Its Distributed Implementation. Netherlands: Springer; 2007.
[15] Trai R, Koszty Tibor N, Zsolt Sik L, Nyi C. Navigation methods of special needs users in multimedia systems. Comput Human Behav 2008;24:1418–33.
[16] Wang Y, Benko A, Chen X, Sa G, Guo H, Han X, et al. Online scheduling with one rearrangement at the end: revisited. Inf Process Lett 2012;112:641–5.
[17] Liu B, Long R, Chou KC. iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics 2016:32, btw186.
[18] Song Q, Ni J, Wang G. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans Knowl Data Eng 2013;25:1–14.
[19] Wei L, Liao M, Gao X, Wang J, Lin W. mGOF-loc: a novel ensemble learning method for human protein subcellular localization prediction. Neurocomputing 2016;217:73–82.
[20] Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multi-label classification. Mach Learn 2011;85:254–69.
[21] Yang H, Tang H, Chen XX, Zhang CJ, Zhu PP, Ding H, et al. Identification of secretory proteins in mycobacterium tuberculosis using pseudo amino acid composition. Biomed Res Int 2016;2016:5413903.
[22] Tang H, Chen W, Lin H. Identification of immunoglobulins using Chou’s pseudo amino acid composition with feature selection technique. Mol Biosyst 2016;12:1269–75.
[23] Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng 2014;40:16–28.
[24] Su R, Zhang C, Pham TD, Davey R, Bischof L, Vallotton P, et al. Detection of tubule boundaries based on circular shortest path and polar-transformation of arbitrary shapes. J Microsc 2016;264:127–42.
[25] Wei L, Zou Q. Recent progress in machine learning-based methods for protein fold recognition. Int J Mol Sci 2016;17:2118.
[26] Wei L, Zhang B, Chen Z, Xing G, Liao M. Exploring local discriminative information from evolutionary profiles for cytokine-receptor interaction prediction. Neurocomputing 2016;217:37–45.
[27] Yin XC, Huang K, Hao HW, Iqbal K, Wang ZB. A novel classifier ensemble method with sparsity and diversity. Neurocomputing 2014;134:214–21.
[28] Dowsland KA, Thompson JM. Simulated Annealing. John Wiley & Sons, Inc.; 1993.
[29] Kuncheva LI, Rodríguez JJ. A weighted voting framework for classifiers ensembles. Knowl Info Syst 2014;38:259–75.
[30] Bache K, Lichman M. UCI Machine Learning Repository; 2013.
[31] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsl 2008;11:10–8.
[32] Lowe TM. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997;25:955–64.
[33] Wei L, Liao M, Gao X, Zou Q. An improved protein structural prediction method by incorporating both sequence and structure information. IEEE Trans Nanobiosci 2015;14:339–49.
[34] Song L, Li D, Zeng X, Wu Y, Guo L, Zou Q. nDNA-prot: identification of DNA-binding proteins based on unbalanced classification. BMC Bioinf 2014;15:298.
[35] Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinf 2008;9:1–13.
[36] Xue C, Fei L, Tao H, Liu GP, Li Y, Zhang X. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinf 2005;6:310.
[37] Yu L, Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. Machine Learning, Proceedings of the Twentieth International Conference 2003:856–63.