Journal Pre-proof
Hybrid Classification Algorithms Based on Instance Filtering
Tzu-Tsung Wong, Nai-Yu Yang, Guo-Hong Chen

PII: S0020-0255(20)30096-7
DOI: https://doi.org/10.1016/j.ins.2020.02.021
Reference: INS 15205

To appear in: Information Sciences

Received date: 2 August 2019
Revised date: 15 January 2020
Accepted date: 8 February 2020

Please cite this article as: Tzu-Tsung Wong, Nai-Yu Yang, Guo-Hong Chen, Hybrid Classification Algorithms Based on Instance Filtering, Information Sciences (2020), doi: https://doi.org/10.1016/j.ins.2020.02.021
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Inc.
Hybrid Classification Algorithms Based on Instance Filtering
Tzu-Tsung Wong*, Nai-Yu Yang, Guo-Hong Chen
Institute of Information Management, National Cheng Kung University
1, Ta-Sheuh Road, Tainan City 701, Taiwan, ROC
Tel: +886-6-2757575 ext 53722
E-mail:
[email protected]
Abstract. Basic classification algorithms induce a single model from training data. The interpretation of a single model is relatively easy, but basic algorithms have limitations in achieving high accuracy. An instance misclassified by one model may be correctly predicted by another. Hybrid classification is a concept that employs basic classification algorithms both for data preprocessing and for model induction. Misclassified instances are usually considered to be noise, yet they may still carry useful information for identifying the class values of some other instances. This study proposes hybrid classification algorithms in which training instances are filtered to build three models for prediction, and each testing instance is classified by exactly one of them. The basic algorithms involved in the proposed hybrid classification algorithms are decision tree induction and the naïve Bayesian classifier. The testing results on twenty data sets demonstrate that our hybrid classification algorithms can significantly outperform the basic ones as well as the hybrid algorithm proposed in a previous study. The hybrid classification algorithms based on instance filtering achieve relatively high accuracy while maintaining the easy interpretation of learning results.

Keywords: Decision tree induction, hybrid classification, instance filtering, naïve Bayesian classifier.
1 Introduction
Basic classification algorithms induce a model from data, which will be used to classify every new instance.
Data preprocessing techniques, such as feature
selection and instance filtering, can enhance the performance of basic classification algorithms.
The selective naïve Bayesian classifier has been shown to be a successful wrapper for improving the performance of the naïve Bayesian classifier [15]. Instance filtering helps reduce data size without sacrificing classification performance [1, 14].
A deep
belief network is used to select features for support vector machine in processing data sets with a large amount of class values [28].
Genetic algorithm can also be
employed for feature selection in training classification algorithms such as support vector machine, naïve Bayesian classifier, and decision tree induction [9, 12, 20]. For predicting rear-end crashes, attributes are divided into disjoints subsets when training decision tree and naive Bayesian classifier [4].
Chen and Howard [5]
applied random forests to screen attributes for improving the prediction accuracy of decision trees.
Logistic regression and naïve Bayesian classifier were employed to
remove misclassified instances for decision tree induction [10, 24].
De Caigny et al.
[7] introduced a hybrid classification algorithm that builds a logistic regression model in each leaf node of a decision tree for the purpose of customer churn prediction.
However, the single model induced from a basic algorithm tends to perform worse than the models induced from an ensemble algorithm [22, 25]. Ensemble algorithms find a set of models to determine the class value of a new instance.
These models are homogeneous when they are induced by the same basic
algorithm, or they can be nonhomogeneous models induced by different basic algorithms.
A random forest has multiple decision trees, which take a majority vote
to determine the class value of a new instance [3].
Jaison et al. [13] proposed a
hybrid classification method composed of nearest neighbor, support vector machine, and naïve Bayesian classifier for analyzing microarray data, and Zhang and Mahadevan [30] built a hybrid model consisting of support vector machine and neural network to predict the risk of aviation events.
Noor et al. [17] designed an ensemble
method composed of k-nearest neighbors, support vector machine, and decision tree induction to analyze network data for identifying key players.
The weighted
majority vote of six classification methods was used to improve slope stability predictions in [17].
Ensemble algorithms are also popular in performing credit risk
analysis [6, 11, 18, 23].
Aburomman and Reaz [2] conducted a survey on ensemble
and hybrid methods employed in intrusion detection systems. Although ensemble algorithms generally achieve a higher accuracy than basic algorithms [29], the learning of ensemble algorithms is relatively complex.
The models found by an ensemble algorithm must be diverse to ensure the stability of their decisions, and hence the training cost is relatively high.
When a new instance is
classified by a single model, it should be easier to determine the reason for the class assignment.
Decision tree induction is popular because it provides descriptive
models that are easier to interpret.
However, it is difficult to interpret a prediction
made by multiple decision trees [8].
This difficulty will affect the adoption of
ensemble methods for realistic cases [11]. Consider the data set ‘weather’ provided by ‘Weka’ software.
This data set has
14 instances and four discrete attributes: 'outlook,' 'temperature,' 'humidity,' and 'windy,' which are used to predict whether the class 'play' has the value 'yes' or 'no.' Figure 1 shows three possible decision trees grown from subsets of this data set. Let the actual class value of a new instance x = <sunny, mild, normal, false> be 'no'. The predictions of the three decision trees on x are 'yes,' 'yes,' and 'no,' respectively. The prediction made by the decision tree given in Figure 1(a) is wrong because of the attributes 'outlook' and 'humidity.' Similarly, the prediction made by the decision tree given in Figure 1(c) is correct because of the attribute 'outlook.' Suppose that an ensemble algorithm generates the three decision trees for a majority vote for classification.
The resulting prediction will be wrong, and it is difficult to interpret
the role of 'outlook.' These observations show that the prediction made by a single model is easy to interpret.
If a hybrid classification algorithm generates the three decision trees
displayed in Figure 1, and chooses the one in Figure 1(c) for classifying x, then the prediction for this instance will be correct and will in turn be easy to interpret.
Figure 1. Three decision trees resulting from subsets of the data set 'weather.'
Misclassified instances are generally considered as noise, and hence are excluded from training data [10, 21]. However, such instances may carry useful information in identifying the class value of an instance. For example, let the triangles and circles in Figure 2 represent the instances belonging to two different class values for training. The predictions on the three solid triangles and two solid circles will be wrong. If these misclassified instances are removed, as shown in Figure 3(a), then the predictions on any new triangle located in the left half and any new circle located in the right half will be wrong.
However, if the misclassified
training instances are grouped as shown in Figure 3(b), then they can be used to correctly classify any new triangle located in the left half or any new circle located in the right half.
This demonstrates that misclassified training instances can be useful
in classifying some instances. In this study, we propose hybrid classification algorithms based on instance filtering.
Their advantage is that they achieve relatively high accuracy with respect
to basic algorithms without sacrificing the interpretability of the learning results, as ensemble algorithms do.
They are called hybrid because a basic
classification algorithm plays the role of instance filtering, and another is used for model induction.
The two basic classification algorithms need not be the same.
This kind of hybrid classification generates three models, and every new instance is classified by only one of them.
The interpretation of their learning results is
therefore as easy as that obtained using basic algorithms.
Figure 2. The instances in a two-dimensional space.
Figure 3. Groups (a) and (b) show respectively the wrong and correct predictions on the instances given in Figure 2.
The remainder of this paper is organized as follows.
Section 2 briefly
distinguishes hybrid classification from ensemble classification.
Section 3
introduces our hybrid classification algorithms composed of decision tree induction and naïve Bayesian classifier.
The experimental results of our hybrid algorithms on
twenty data sets are given in Section 4 to show that they can significantly outperform basic algorithms and the hybrid algorithm proposed in a previous study.
Conclusions and directions for future work are addressed in Section 5.
2 Ensemble classification and hybrid classification
Every basic classification algorithm has its own learning procedure to induce a model from a training set in order to predict the class value of a new instance.
A model
may be able to assign the correct class value to an instance that cannot be classified correctly by another model.
The concept of ensemble classification is to collect a set of models for making group decisions. There are generally two ways to collect a set of models. One is to collect the models induced from a single training set by different classification algorithms. Another is to train the same algorithm with different data sets that are derived from one particular data set. Let D be a data set, and let Lj for j = 1, 2, …, r be basic classification algorithms. Then MD,Lj represents the model induced from set D by algorithm Lj, and MD,Lj(x) denotes the class value assigned by model MD,Lj for instance x.
The first way for
ensemble classification is to collect models MD,Lj for j = 1, 2, …, r, and the predicted class value of instance x is determined by the (weighted) majority vote from MD,Lj(x) for j = 1, 2, …, r.
An alternative is to generate sets Di for i = 1, 2, …, q obtained by
sampling instances or features from D.
Then models MDi,L for i = 1, 2, …, q are
induced by algorithm L, and again the predicted class value of instance x is determined by the (weighted) majority vote from MDi,L(x) for i = 1, 2, …, q. Hybrid classification first applies a basic algorithm to filter instances or features. Then another basic algorithm finds models from the revised data set(s) to achieve higher accuracy than that induced from the original data set. It is called hybrid classification because basic algorithms are involved in both data preprocessing and model induction.
Let L1 and L2 be two basic algorithms used to compose a hybrid
one, and let SD,L2 be the set of models obtained by applying L2 on the sets derived from data set D filtered by algorithm L1.
Then one model in SD,L2 will be chosen to
predict the class value of a new instance.
Note that L1 and L2 can be the same basic
classification algorithm. Ensemble classification will use one or more basic algorithms to find a set of models that make a group decision for prediction, and no requirements exist for data preprocessing.
A basic algorithm must be assigned for data preprocessing in hybrid
classification, and every new instance will be classified by only one model.
This
implies that the interpretation of the learning results of hybrid classification remains easy.
3 Hybrid classification algorithms
Misclassified instances are generally considered to be noise in hybrid classification, and hence they are removed from the training set.
As discussed in Section 1, the
instances that are misclassified may still carry useful information for predicting the class value of some instances, and hence they should be moved to another set for further usage.
This section introduces the learning procedure of our hybrid
classification algorithms based on this concept. The hybrid classification process based on instance filtering is depicted in Figure 4. testing sets.
Suppose that a data set is divided into two disjoint subsets: training and There are three steps in hybrid classification based on instance filtering.
The first step is to apply the basic classification algorithm L1 on the training set evaluated by k-fold cross validation.
The correctly classified training instances are
placed into a primary set, and a secondary set stores those misclassified, as shown in the upper-right broken rectangle in Figure 4.
Then the models induced from the full
training set, the primary set, and the secondary set by classification algorithm L2 are called the full model, the primary model, and the secondary model, respectively, in the second step (see the lower-right broken rectangle in Figure 4). The left broken rectangle in Figure 4 shows the third step, in which one of the three models will be chosen to predict the class value of each instance from the testing set.
Figure 4. The mechanism for hybrid classification.
The approach used for model selection is the core of the third step. Let a data set D be divided into disjoint training set DR and testing set DT. After applying algorithm L1 on set DR evaluated by k-fold cross validation, we have the primary set DRP and the secondary set DRS such that DRP ∩ DRS = ∅ and DRP ∪ DRS = DR. The similarity between a testing instance x = <x1, x2, …, xn> ∈ DT and the set DRm is measured by the probability p(DRm|x) = p(x|DRm)p(DRm)/p(x) for m = P or S, where n is the number of attributes in D. Assume that the attributes are all independent for any given set DRm. Then p(DRm|x) can be simplified as

    p(DRm|x) ∝ p(DRm) ∏_{i=1}^{n} p(xi|DRm)

because p(x) is the same for both DRP and DRS. Let |D| denote the number of instances in set D. Then p(DRm) is estimated as |DRm|/|DR|, and p(xi|DRm) is calculated as yim/|DRm|, where yim is the number of instances with attribute value xi in DRm. If p(DRP|x) ≥ p(DRS|x), then x is more similar to the instances in DRP. In this case, instance x should be classified by model MDRP,L2. Otherwise, model MDRS,L2 should be adopted for classifying x. However, when p(DRP|x) and p(DRS|x) are close, model MDR,L2 may be a better choice for predicting the class value of x, because DR is bigger than either DRP or DRS. We therefore introduce a threshold δ > 0 to determine which model should be chosen for classifying a testing instance. If log2 p(DRP|x) − log2 p(DRS|x) > δ, then x is more similar to the instances in DRP, and hence it is classified by MDRP,L2. If log2 p(DRS|x) − log2 p(DRP|x) > δ, MDRS,L2 is the model for classifying x. If |log2 p(DRP|x) − log2 p(DRS|x)| ≤ δ, model MDR,L2 is chosen to predict the class value of x. The set SD,L2 thus has three models MDRP,L2, MDRS,L2, and MDR,L2.
Example 1. Let the data set 'weather' provided by 'Weka' software, introduced in Section 1, be represented as DR = {ei, i = 1, 2, …, 14}. Let algorithm L1 applied on DR be decision tree induction evaluated by 3-fold cross validation. Then the primary and secondary sets obtained in step 1 would be DRP = {ei, i = 2, 3, 4, 6, 7, 9, 11, 13, 14} and DRS = {ei, i = 1, 5, 8, 10, 12}, respectively. In step 2, the models derived from DR, DRP, and DRS by algorithm L2, which is also decision tree induction, are the three decision trees given in Figure 1(a), 1(b), and 1(c), respectively. Let x = <sunny, mild, normal, false> be a new instance with actual class value 'no'. Then

    p(DRP|x) ∝ p(DRP)p(sunny|DRP)p(mild|DRP)p(normal|DRP)p(false|DRP) = (9/11)(3/9)(3/9)(5/9)(4/9) ≈ 0.0224

and

    p(DRS|x) ∝ p(DRS)p(sunny|DRS)p(mild|DRS)p(normal|DRS)p(false|DRS) = (5/11)(2/5)(3/5)(2/5)(4/5) ≈ 0.0349,

and hence, after normalizing by p(x), log2 p(DRP|x) = −1.3534 and log2 p(DRS|x) = −0.7163. When threshold δ = 0.4, since log2 p(DRS|x) − log2 p(DRP|x) = 0.6371 > δ, the secondary model given in Figure 1(c) will be used to make a correct prediction on x in step 3.
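The arithmetic of Example 1 can be checked directly. The sketch below copies the count ratios quoted in the example rather than recomputing them from the raw 'weather' data:

```python
# Verifying Example 1's model-selection computation.
import math

p_P = (9/11) * (3/9) * (3/9) * (5/9) * (4/9)   # unnormalized p(DRP|x), ~0.0224
p_S = (5/11) * (2/5) * (3/5) * (2/5) * (4/5)   # unnormalized p(DRS|x), ~0.0349

# Normalize by p(x) = p_P + p_S before taking base-2 logarithms.
log_P = math.log2(p_P / (p_P + p_S))           # ~ -1.3534
log_S = math.log2(p_S / (p_P + p_S))           # ~ -0.7163

delta = 0.4
if log_S - log_P > delta:                      # 0.6371 > 0.4
    chosen = "secondary"                       # the tree of Figure 1(c)
elif log_P - log_S > delta:
    chosen = "primary"
else:
    chosen = "full"
print(chosen)                                  # secondary
```

Note that normalizing by p(x) changes both logarithms but not their difference, which is why the decision rule can equivalently be applied to the unnormalized products.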
If the instance x given in Example 1 is classified by either the full model or the primary model, the prediction will be wrong. This demonstrates that setting a proper threshold for model selection can improve prediction accuracy.
When δ is zero, none of the instances in DT will be predicted by MDR,L2. If δ is very large, then every testing instance will be classified by model MDR,L2. This indicates that the value of threshold δ can control the usage of the models MDRP,L2, MDRS,L2, and MDR,L2. Since log2 p(DRm|x) = log2 p(DRm) + Σ_{i=1}^{n} log2 p(xi|DRm), the choice of the threshold should consider the number of attributes in data set D. The threshold is set to be δ = 0.1nw for w = 0, 1, 2, … to search for the best value that achieves the highest classification accuracy. This search stops when all instances in DT are classified by model MDR,L2. The pseudo code of the hybrid classification algorithm based on instance filtering is given in Figure 5.
Line 5 describes the first step of partitioning training
set DR into DRP and DRS by algorithm L1.
Then algorithm L2 is used to induce the
three models for classification in line 6 for step 2. Lines 8 through 14 determine the model for classifying each testing instance in step 3.
Note that when the value of δ is changed, only step 3 has to be re-executed.

Decision tree induction and the naïve Bayesian classifier are two popular algorithms, denoted as DT and NB, respectively. They will be used to construct hybrid classification algorithms in this study. A hybrid classification algorithm with basic algorithm L1 in the first step and L2 in the second step is represented as L1-L2. For example, NB-DT indicates that the algorithms for instance filtering and model induction are the naïve Bayesian classifier and decision tree induction, respectively. Note that NB-NB is not the same as NB, because NB-NB has three classification models, while NB has only one.
1   Input data set D and threshold δ
2   Perform k-fold cross validation to divide D into folds F1, F2, …, Fk
3   For each testing fold Fj
4       Set DT = Fj and DR = D\Fj
5       Perform algorithm L1 on DR to obtain DRP and DRS
6       Perform algorithm L2 on DR, DRP, and DRS to derive models MDRP,L2, MDRS,L2, and MDR,L2
7       For each instance x ∈ DT
8           Calculate p(DRP|x) and p(DRS|x)
9           If log2 p(DRP|x) − log2 p(DRS|x) > δ then
10              Use model MDRP,L2 to classify x
11          Else if log2 p(DRS|x) − log2 p(DRP|x) > δ then
12              Use model MDRS,L2 to classify x
13          Else
14              Use model MDR,L2 to classify x

Figure 5. The pseudo code of the hybrid classification algorithm based on instance filtering.
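The procedure of Figure 5 can be turned into runnable code. The sketch below is not the authors' Weka-based implementation: a deliberately trivial "one-rule" learner (hypothetical, chosen only to keep the sketch dependency-free) stands in for decision tree induction and the naïve Bayesian classifier, the inner fold assignment is deterministic rather than random, zero counts are floored at 0.5 to keep logarithms finite (the paper does not specify a smoothing scheme), and the search over δ = 0.1nw is omitted.

```python
# A self-contained sketch of the three-step hybrid classification procedure.
import math
from collections import Counter, defaultdict

def one_rule_fit(rows, labels):
    """Stand-in basic learner (NOT the paper's DT/NB): predict the
    majority class per value of attribute 0, else the global majority."""
    default = Counter(labels).most_common(1)[0][0]
    table = defaultdict(Counter)
    for r, y in zip(rows, labels):
        table[r[0]][y] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
    return lambda r: rule.get(r[0], default)

def split_by_filter(rows, labels, fit=one_rule_fit, k=3):
    """Step 1: k-fold cross validation with L1; correctly classified
    instances go to the primary set, the rest to the secondary set."""
    primary, secondary = [], []
    for j in range(k):
        train = [i for i in range(len(rows)) if i % k != j]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i in range(len(rows)):
            if i % k == j:
                (primary if model(rows[i]) == labels[i] else secondary).append(i)
    return primary, secondary

def log_similarity(x, idx, rows, n_train):
    """log2 p(DRm|x) up to the common p(x) term; zero counts floored at 0.5."""
    s = math.log2(max(len(idx), 1) / n_train)
    for a, v in enumerate(x):
        y = sum(1 for i in idx if rows[i][a] == v)
        s += math.log2(max(y, 0.5) / max(len(idx), 1))
    return s

def hybrid_fit(rows, labels, delta=0.4, fit=one_rule_fit):
    """Steps 2-3: induce full/primary/secondary models with L2, then
    route each new instance to exactly one of them via the threshold."""
    P, S = split_by_filter(rows, labels, fit)
    full = fit(rows, labels)
    models = {
        "full": full,
        "primary": fit([rows[i] for i in P], [labels[i] for i in P]) if P else full,
        "secondary": fit([rows[i] for i in S], [labels[i] for i in S]) if S else full,
    }
    def predict(x):
        d = (log_similarity(x, P, rows, len(rows))
             - log_similarity(x, S, rows, len(rows)))
        which = "primary" if d > delta else "secondary" if -d > delta else "full"
        return models[which](x)
    return predict

# Toy usage with a single discrete attribute.
rows = [("a",), ("a",), ("b",), ("b",), ("a",), ("b",)]
labels = ["yes", "yes", "no", "no", "no", "yes"]
clf = hybrid_fit(rows, labels)
print(clf(("a",)), clf(("b",)))
```

Because only `predict` depends on the threshold, re-running the δ search (step 3) does not require re-inducing the three models, matching the remark about Figure 5 above.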
4 Experimental study
The characteristics of the 20 data sets randomly chosen from the UCI data repository [16] for evaluating the performance of our hybrid classification algorithms are summarized in Table 1. Their numbers of instances, attributes, and class values were diversely chosen to avoid the proposed algorithms being applicable only to certain kinds of data sets.
The smallest number of instances in a data set is less than 200, and
hence the evaluation method is k-fold cross validation with k = 5 to ensure that the experimental results of every data set can satisfy the large-sample conditions; i.e., the numbers of correct and wrong predictions in each fold are both larger than or equal to five [26].
In performing k-fold cross validation for dividing a training set into
primary and secondary sets, the number of folds is set to be ten because the model for determining whether an instance will be classified correctly or not should be induced from a large set. The hybrid classification algorithm proposed by Farid et al. [10] played the role of benchmark.
Their algorithm, denoted as FDT, is similar to NB-DT, but only the primary model is used for classification. The prediction accuracy of a hybrid algorithm is determined by L2. Algorithms DT-NB and NB-NB were therefore compared with NB, and algorithms DT-DT and NB-DT were compared with DT, RF, and FDT. Since the decision trees in algorithm FDT are unpruned, all classification algorithms were implemented using 'Weka' software with default settings except that the decision trees were unpruned.
The significance level for statistical comparisons was
set to 0.05.
Table 1. The characteristics of the 20 experimental data sets.

Data set        Number of instances   Number of attributes   Number of class values
Blood           748                   4                      2
Car             1,728                 6                      4
Cleve           303                   13                     2
Cmc             1,473                 9                      3
Crx             690                   15                     2
Ecoli           336                   8                      8
Flags           194                   28                     8
German          1,000                 20                     2
Haberman        306                   3                      2
Letter          20,000                16                     26
Liver           345                   6                      2
Mammographic    959                   5                      6
Page            5,473                 10                     5
Parkinson       195                   22                     2
Pendigits       10,992                16                     10
Robot           5,456                 24                     4
Sick            3,163                 7                      2
Vehicle         846                   18                     4
Waveform        5,000                 21                     3
Yeast           1,484                 8                      10
The experimental results for the 20 data sets are shown in Table 2 for algorithms NB, NB-NB, and DT-NB and in Table 3 for algorithms DT, NB-DT, DT-DT, and FDT. As argued in Section 3, a hybrid algorithm is guaranteed not to have lower classification accuracy than its corresponding basic algorithm. Table 2 shows that NB-NB and DT-NB are not inferior to NB on any data set, which is also the case for NB-DT and DT-DT with respect to DT in Table 3.
No parametric methods have been proposed
to compare the performance of multiple classification algorithms.
Since the training
and testing sets in each iteration of the k-fold cross validation were kept the same for all algorithms, the matched-sample approach proposed by Wong [26] was employed to compare the performance of every pair of algorithms on single data set.
Table 2. The classification accuracies of NB, NB-NB, and DT-NB.

Data set          NB        NB-NB       DT-NB
Blood             0.7689    0.7689      0.7689
Car               0.8541    0.8559      0.8541
Cleve             0.8383    0.8383      0.8416
Cmc               0.5044    0.5044      0.5091
Crx               0.8420    0.8507      0.8522
Ecoli             0.8183    0.8184      0.8183
Flags             0.6082    0.6082      0.6082
German            0.7460    0.7480      0.7460
Haberman          0.7420    0.7420      0.7420
Letter            0.7114    0.7135      0.7314^{1,2}
Liver             0.6145    0.6290      0.6261
Mammographic      0.7654    0.7664      0.7695
Page              0.9196    0.9234      0.9249^{1,3}
Parkinson         0.7179    0.8154      0.7231
Pendigits         0.8687    0.8938^1    0.8898^1
Robot             0.7912    0.8002^1    0.7949
Sick              0.9418    0.9431      0.9418
Vehicle           0.6147    0.6230      0.6147
Waveform          0.8062    0.8074      0.8108
Yeast             0.5822    0.5829      0.5822
Average           0.7528    0.7616      0.7575

Superscripts 1, 2, and 3 represent algorithms NB, NB-NB, and DT-NB, respectively.
Every bold value in Tables 2 and 3 indicates significantly higher accuracy than
those of the algorithms noted in its superscripts.
For example, algorithm DT-NB has
significantly higher accuracy on data set ‘Letter’ with respect to both NB and NB-NB given in the superscript of 0.7314.
Since every pair of algorithms is compared using
the matched-sample approach, it is possible that algorithm A is significantly better than C and that algorithm B achieves a higher accuracy than A, while algorithm B is not significantly better than C.
Consider the prediction accuracies of algorithms DT,
NB-DT, and FDT on data set ‘Yeast,’ for which their fold accuracies are given in Table 4.
The test statistic for comparing algorithms DT and NB-DT is t =
- 0.0067/ 0.000028/5 = -2.8313 with 4 degrees of freedom, and its corresponding p-value is 0.0473.
Since the p-value is less than the significance level = 0.05, and
the t-value is negative, NB-DT is significantly more accurate than DT.
Similarly, the
p-value corresponding to the test statistic t = - 0.0398/ 0.003208/5 = -1.5713 for comparing algorithms DT and FDT is 0.1912.
Although algorithm FDT has greater
mean accuracy than NB-DT, FDT is not significantly better than DT on this data set.
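The two t statistics can be recomputed from the fold accuracies in Table 4. The sketch below works from the unrounded fold differences, which is why the first value comes out near −2.82 rather than the reported −2.8313 (obtained from the rounded mean −0.0067 and variance 0.000028):

```python
# Matched-sample (paired) t statistics from Table 4's fold accuracies.
import math
from statistics import mean, variance

dt    = [0.4579, 0.5589, 0.5286, 0.5051, 0.5270]
nb_dt = [0.4714, 0.5589, 0.5387, 0.5084, 0.5338]
fdt   = [0.5589, 0.5455, 0.5118, 0.5387, 0.6216]

def paired_t(a, b):
    """t = mean(d) / sqrt(var(d)/k), with k-1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / math.sqrt(variance(d) / len(d))

print(paired_t(dt, nb_dt))   # ~ -2.82  (DT vs NB-DT)
print(paired_t(dt, fdt))     # ~ -1.57  (DT vs FDT)
```

A negative t means the second algorithm of the pair achieved the higher mean fold accuracy, matching the sign convention used in the text.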
Table 3. The classification accuracies of DT, NB-DT, DT-DT, and FDT.
Data set
DT
Blood Car Cleve Cmc Crx
0.7808 0.9427
4
0.7757
NB-DT
DT-DT
FDT
0.7835
0.7808
0.7835
0.9444
4
0.8021 1
0.9427
4
0.8929
0.7857
0.7593
0.4596
0.5071
0.4576
0.4739
0.8275
0.8493
0.8420
0.8464
Ecoli Flags German Haberman Letter
0.6966
0.7413
0.7293
0.75021
0.5565
0.6440
0.6028
0.6127
0.7090
0.7180
Liver Mammographic Page Parkinson Pendigits
1
0.6970
0.7270
0.6994
0.7487
0.7388
0.75841
0.80114
0.80114
0.80114
0.7268
0.5797
0.5826
0.5826
0.6406
0.7580
0.78311,2,3
0.9350
0.9298
0.7497
0.7768
0.9344
0.9344
0.7692
1
0.7897
0.9104
4
Robot Sick Vehicle Waveform Yeast
0.9278
4
Average
0.9491
0.7897
0.9105
4
0.9283
4
0.9491
0.9104
0.8739
0.9282
4
0.8625
0.9491
0.9456
0.6821
4
0.6738
4
0.7312
1
0.7288
1
0.5155
0.5222
1
0.5169
0.5553
0.7481
0.7661
0.7582
0.7538
0.6738 0.7176
4
0.7641 4
0.6337 0.7320
Superscripts 1 through 4 denote algorithms DT, NB-DT, DT-DT, and FDT, respectively.
Table 4. The fold accuracies of data set 'Yeast' for algorithms DT, NB-DT, and FDT.

                DT          NB-DT                     FDT
Fold        Accuracy    Accuracy    Difference    Accuracy    Difference
1           0.4579      0.4714      -0.0135       0.5589      -0.1010
2           0.5589      0.5589       0.0000       0.5455       0.0134
3           0.5286      0.5387      -0.0101       0.5118       0.0168
4           0.5051      0.5084      -0.0033       0.5387      -0.0336
5           0.5270      0.5338      -0.0068       0.6216      -0.0946
Mean        0.5155      0.5222      -0.0067       0.5553      -0.0398
Variance                             0.000028                  0.003208
Table 3 shows that algorithm FDT achieves the highest accuracy in some data sets, such as ‘Ecoli’ and ‘Mammographic’.
This suggests that applying naïve
Bayesian classifier to remove instances for inducing decision trees can be helpful for some data sets. In contrast, FDT has the lowest accuracy in several data sets, such as ‘Pendigits’ and ‘Robot.’ Removing instances from learning therefore seems to be risky.
The matched-sample approach proposed by Wong [27] was thus employed to
compare the performance of every pair of the algorithms in Tables 2 or 3 over the 20 data sets, for which the results are summarized in Table 5.
Table 5. Hypothesis testing for every pair of classification algorithms over the 20 data sets.

Algorithm 1   Algorithm 2   t value    Degrees of freedom   p-value
NB            NB-NB         -4.5639    6                    0.0038
NB            DT-NB         -2.7633    11                   0.0184
NB-NB         DT-NB          2.2539    13                   0.0421
DT            NB-DT         -6.3261    17                   0.0000
DT            DT-DT         -4.7728    21                   0.0001
DT            FDT           -1.2212    30                   0.2315
NB-DT         DT-DT          3.0531    25                   0.0053
NB-DT         FDT            2.6569    22                   0.0144
DT-DT         FDT            1.0357    32                   0.3081
Every bold value in the last column of Table 5 indicates that the p-value is less
than the significance level of 0.05. The better algorithm in the same row is also marked bold.
For example, the p-value in the first row is 0.0038 < 0.05, and hence
NB-NB has significantly higher mean accuracy than NB over the 20 data sets.
The
mean accuracies of algorithms NB, NB-NB, and DT-NB over the 20 data sets are all significantly different, and NB-NB is the best. Among the algorithms with decision trees for class prediction, NB-DT significantly outperforms the other three algorithms. These findings demonstrate that the hybrid classification algorithms proposed in this study are superior to basic algorithms NB and DT and the previous hybrid algorithm FDT.
Naïve Bayesian classifier should be a better basic algorithm for instance
filtering, because NB-NB is the best in the first algorithm group, and NB-DT is the best in the second algorithm group.

There are two extra models, MDRP,L2 and MDRS,L2, in our proposed hybrid algorithms. The performance of these two models is presented in Table 6 to investigate whether they are really helpful in classifying instances. Consider data set 'Crx' analyzed by algorithm NB-NB as an example. This data set has 690 instances, and the numbers of instances classified by models MDR,L2, MDRP,L2, and MDRS,L2 are 31, 652, and 7, respectively. Since there are 562 and 3 correct predictions made by MDRP,L2 and MDRS,L2, respectively, the number of correct predictions on the 659 instances classified by MDRP,L2 or MDRS,L2 is 562+3 = 565, as shown in the fourth column of Table 6. When the 659 instances are all classified by model MDR,L2 induced by the naïve Bayesian classifier, 559 of them are predicted correctly, as given in the fifth column of Table 6. In this case, the improvement achieved by introducing models MDRP,L2 and MDRS,L2 is calculated to be (565−559)/659 = 0.91%.
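The improvement computation amounts to one line; a sketch with the 'Crx' counts quoted above:

```python
# The 'Crx' improvement under NB-NB: 562+3 correct predictions from the
# primary/secondary models on 659 routed instances, versus 559 correct
# when the full model handles those same instances.
def improvement(correct_routed, correct_full, n_routed):
    """Percentage gain from routing instances to the two extra models."""
    return 100 * (correct_routed - correct_full) / n_routed

print(round(improvement(562 + 3, 559, 659), 2))  # 0.91
```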
Table 6.
The improvement when introducing instance filtering in hybrid algorithms
(a) NB-NB, (b) DT-NB, (c) NB-DT, and (d) DT-DT. (a)
No. correct/No. of predictions
Improvement
M DRP ,L 2
M DRS ,L 2
Total
M DR ,L 2
Blood (748)
0/0
0/0
0/0
0/0
0.00
Car (1,728) Cleve (303) Cmc (1,473) Crx (690) Ecoli (336)
734/739
1/2
735/741
732/741
0.40
13/15
0/0
13/15
13/15
0.00
31/36
0/1
31/37
31/37
0.00
562/652
3/7
565/659
559/659
0.91
206/239
20/30
226/269
226/269
0.00
5/5
0/1
5/6
5/6
0.00
272/302
3/4
275/306
273/306
0.65
132/163
2/3
134/166
134/166
0.00
7,144/7,858
915/2,622
8,059/10,480
8,018/10,480
0.39
4/7
5/5
9/12
4/12
41.67
399/467
0/0
399/467
398/467
0.21
4,704/4,899
78/126
4,782/5,025
4,761/5,025
0.42
106/117
28/40
134/157
115/157
12.10
6,908/7,265
488/683
7,396/7,948
7,120/7,948
3.47
2,155/2,410
236/375
2,391/2,785
2,342/2,785
1.76
2,306/2,373
26/34
2,332/2,407
2,328/2,407
0.17
122/132
25/48
147/180
140/180
3.89
2,041/2,246
58/113
2,099/2,359
2,093/2,359
0.25
3/6
0/5
3/11
2/11
9.09
Data set
Flags (194) German (1,000) Haberman (306) Letter (20,000) Liver (345) Mammographic (959) Page (5,473) Parkinson (195) Pendigits (10,992) Robot (5,456) Sick (3,163) Vehicle (846) Waveform (5,000) Yeast (1,484) Average
(%)
3.77 23
(b)
No. correct/No. of predictions
Improvement
M DRP ,L 2
M DRS ,L 2
Total
M DR ,L 2
0/0
0/0
0/0
0/0
0.00
672/672
0/0
672/672
672/672
0.00
34/41
1/1
35/42
34/42
2.38
Cmc (1,473) Crx (690) Ecoli (336) Flags (194) German (1,000)
151/210
217/470
368/680
361/680
1.03
553/637
5/11
558/648
551/648
1.08
102/106
6/8
108/114
108/114
0.00
12/14
1/3
13/17
13/17
0.00
17/17
0/0
17/17
17/17
0.00
Haberman (306)
153/193
7/12
160/205
160/205
0.00
9,157/10,189
1,557/3,672
10,714/13,861
10,314/13,861
2.89
44/64
10/14
54/78
50/78
5.13
585/747
5/13
590/760
586/760
0.53
4,972/5,319
90/154
5,062/5,473
5,033/5,473
0.53
139/191
2/4
141/195
140/195
0.51
8,597/9,495
1,184/1,497
9,781/10,992
9,579/10,992
2.11
3,026/2,578
121/189
3,147/3,767
3,127/3,767
0.53
44/44
0/0
44/44
44/44
0.00
27/28
0/0
27/28
27/28
0.00
3,497/4,114
557/886
4,054/5,000
4,031/5,000
0.46
0/3
0/0
0/3
0/3
0.00
Data set Blood (748) Car (1,728) Cleve (303)
Letter (20,000) Liver (345) Mammographic (959) Page (5,473) Parkinson (195) Pendigits (10,992) Robot (5,456) Sick (3,163) Vehicle (846) Waveform (5,000) Yeast (1,484)
(%)
Average
0.86
(c) No. correct/No. of predictions
Data set | M_DRP,L2 | M_DRS,L2 | Total | M_DR,L2 | Improvement (%)
Blood (748) | 258/289 | 9/12 | 267/301 | 265/301 | 0.66
Car (1,728) | 736/739 | 2/2 | 738/741 | 735/741 | 0.40
Cleve (303) | 104/125 | 0/1 | 104/126 | 96/126 | 6.35
Cmc (1,473) | 121/165 | 50/128 | 171/293 | 147/293 | 8.19
Crx (690) | 495/554 | 3/6 | 498/560 | 483/560 | 2.68
Ecoli (336) | 212/260 | 25/45 | 237/305 | 222/305 | 4.92
Flags (194) | 104/129 | 21/65 | 125/194 | 108/194 | 8.76
German (1,000) | 487/602 | 11/20 | 498/622 | 468/622 | 4.82
Haberman (306) | 200/255 | 9/15 | 209/270 | 194/270 | 5.56
Letter (20,000) | 327/327 | 0/0 | 327/327 | 327/327 | 0.00
Liver (345) | 2/2 | 1/1 | 3/3 | 2/3 | 33.33
Mammographic (959) | 628/793 | 1/7 | 629/800 | 603/800 | 3.25
Page (5,473) | 0/0 | 0/0 | 0/0 | 0/0 | 0.00
Parkinson (195) | 106/117 | 32/40 | 138/157 | 134/157 | 2.55
Pendigits (10,992) | 210/211 | 0/0 | 210/211 | 209/211 | 0.47
Robot (5,456) | 232/245 | 11/13 | 243/258 | 240/258 | 1.16
Sick (3,163) | 60/60 | 0/0 | 60/60 | 60/60 | 0.00
Vehicle (846) | 121/132 | 24/48 | 145/180 | 138/180 | 3.89
Waveform (5,000) | 1,764/2,246 | 60/113 | 1,824/2,359 | 1,756/2,359 | 2.88
Yeast (1,484) | 132/192 | 19/56 | 151/248 | 141/248 | 4.03
Average | | | | | 4.70
(d) No. correct/No. of predictions
Data set | M_DRP,L2 | M_DRS,L2 | Total | M_DR,L2 | Improvement (%)
Blood (748) | 182/205 | 5/5 | 187/210 | 187/210 | 0.00
Car (1,728) | 672/672 | 0/0 | 672/672 | 672/672 | 0.00
Cleve (303) | 30/41 | 1/1 | 31/42 | 28/42 | 7.14
Cmc (1,473) | 20/25 | 3/6 | 23/31 | 20/31 | 9.68
Crx (690) | 428/480 | 4/6 | 432/486 | 422/486 | 2.06
Ecoli (336) | 157/175 | 21/46 | 178/221 | 167/221 | 4.98
Flags (194) | 67/80 | 6/33 | 73/113 | 64/113 | 7.96
German (1,000) | 399/483 | 16/24 | 415/507 | 403/507 | 2.37
Haberman (306) | 177/233 | 13/19 | 190/252 | 178/252 | 4.76
Letter (20,000) | 278/278 | 0/0 | 278/278 | 278/278 | 0.00
Liver (345) | 7/9 | 3/3 | 10/12 | 9/12 | 8.33
Mammographic (959) | 480/602 | 2/8 | 482/610 | 474/610 | 1.31
Page (5,473) | 33/33 | 10/18 | 43/51 | 40/51 | 5.88
Parkinson (195) | 90/101 | 0/0 | 90/101 | 86/101 | 3.96
Pendigits (10,992) | 86/87 | 0/0 | 86/87 | 86/87 | 0.00
Robot (5,456) | 504/522 | 5/5 | 509/527 | 507/527 | 0.38
Sick (3,163) | 44/44 | 0/0 | 44/44 | 44/44 | 0.00
Vehicle (846) | 80/85 | 0/0 | 80/85 | 80/85 | 0.00
Waveform (5,000) | 1,749/2,172 | 103/196 | 1,852/2,368 | 1,796/2,368 | 2.36
Yeast (1,484) | 8/15 | 9/37 | 17/52 | 15/52 | 3.85
Average | | | | | 3.25
The improvements shown in Table 6 indicate how useful the hybrid algorithms based on instance filtering are on each data set. For example, all four hybrid algorithms achieve over 5% improvement on data set ‘Liver,’ while none of them exhibits more than 1% improvement on data sets ‘Blood,’ ‘Car,’ and ‘Sick.’ This suggests that the improvement depends on the characteristics of the data set. No improvement can be negative, because the proposed hybrid classification algorithms achieve at least the same performance as the basic ones. The extra computational effort required by the hybrid algorithms is the k-fold cross validation performed on the training instances to obtain models M_DRP,L2 and M_DRS,L2.
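The filtering step just described can be sketched with scikit-learn. This is a schematic reading of the approach, not the paper's exact procedure: a filtering classifier (naïve Bayes here) marks each training instance as properly classified or misclassified under k-fold cross validation, one model is induced on each subset, and a new instance is classified by exactly one of the induced models. The routing rule below, a third model trained to predict which subset an instance resembles, is an assumption added purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Filtering: k-fold cross validation with the filtering algorithm marks
# each training instance as properly classified or misclassified.
filter_pred = cross_val_predict(GaussianNB(), X, y, cv=10)
properly = filter_pred == y

# Induce one model on each subset of the training instances.
model_P = DecisionTreeClassifier(random_state=0).fit(X[properly], y[properly])
model_S = DecisionTreeClassifier(random_state=0).fit(X[~properly], y[~properly])

# Assumed router: predicts which subset a new instance resembles, so that
# exactly one of the two models classifies it.
router = DecisionTreeClassifier(random_state=0).fit(X, properly)

def classify(x):
    """Classify one instance with the single model the router selects."""
    x = np.asarray(x).reshape(1, -1)
    model = model_P if router.predict(x)[0] else model_S
    return model.predict(x)[0]
```

Under this sketch, the only overhead relative to inducing a single model is the cross validation used for filtering, which is the extra computational effort noted above.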
5. Conclusions
Basic classification algorithms induce a single model from data to classify new instances, and the interpretation of any prediction made by such a model is relatively easy. This study proposed an approach for constructing hybrid classification algorithms based on instance filtering, where each hybrid algorithm induces three models and a new instance is classified by only one of them. A prediction made by the hybrid algorithms is thus as easy to interpret as one made by a basic algorithm. Decision tree induction and the naïve Bayesian classifier are employed to compose four hybrid classification algorithms. The experimental results on twenty data sets demonstrate that the hybrid algorithms are significantly superior to the basic algorithms and to the hybrid algorithm proposed in a previous study. This indicates that hybrid classification algorithms based on instance filtering can not only achieve higher prediction accuracy but also maintain easy interpretation of the learning results. The naïve Bayesian classifier appears to be a better algorithm than decision tree induction for filtering instances. The characteristics of a data set can affect the performance improvement achieved by the proposed hybrid classification algorithms.
The classification algorithms involved in this study for hybrid classification are decision tree induction and the naïve Bayesian classifier. More algorithms, such as classification rule induction and Bayesian networks, can be considered in hybrid classification to investigate which algorithm is more suitable for instance filtering or model induction. Feature selection is also a critical technique for data preprocessing and can generally help achieve higher accuracy and enhance model interpretability. It would thus be desirable to develop hybrid classification algorithms based on feature selection.
CRediT Author Statement
Tzu-Tsung Wong: Conceptualization, Validation, Writing – Original Draft, Funding Acquisition
Nai-Yu Yang: Methodology, Software, Formal Analysis, Writing – Review & Editing
Guo-Hong Chen: Conceptualization, Methodology, Software, Resources
AUTHOR DECLARATION
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
Acknowledgement
This research was supported by the Ministry of Science and Technology in Taiwan under Grant No. 107-2410-H-006-045-MY3.