Hybrid Classification Algorithms Based on Instance Filtering

Tzu-Tsung Wong*, Nai-Yu Yang, Guo-Hong Chen
Institute of Information Management, National Cheng Kung University
1, Ta-Sheuh Road, Tainan City 701, Taiwan, ROC
Tel: +886-6-2757575 ext 53722; E-mail: [email protected]

Information Sciences (2020), https://doi.org/10.1016/j.ins.2020.02.021
Received 2 August 2019; revised 15 January 2020; accepted 8 February 2020.

Abstract. Basic classification algorithms induce a single model from training data. The interpretation of such a model is relatively easy, but basic algorithms have limitations in achieving high accuracy, and an instance misclassified by one model may be correctly predicted by another. Hybrid classification is a concept that employs basic classification algorithms both for data preprocessing and for model induction. Misclassified instances are usually considered to be noise, yet they may still carry useful information for identifying the class values of some other instances. This study proposes hybrid classification algorithms in which training instances are filtered to build three models for prediction, and each testing instance is classified by exactly one of them. The algorithms involved in the proposed hybrid classification algorithms are decision tree induction and the naïve Bayesian classifier. The testing results on twenty data sets demonstrate that our hybrid classification algorithms can significantly outperform the basic algorithms as well as the hybrid algorithm proposed in a previous study. The hybrid classification algorithms based on instance filtering achieve relatively high accuracy while maintaining the easy interpretation of learning results.

Keywords: Decision tree induction, hybrid classification, instance filtering, naïve Bayesian classifier.

1 Introduction

Basic classification algorithms induce a model from data, which is then used to classify every new instance. Data preprocessing techniques, such as feature selection and instance filtering, can enhance the performance of basic classification algorithms. The selective naïve Bayesian classifier has been shown to be a successful wrapper for improving the performance of the naïve Bayesian classifier [15]. Instance filtering helps reduce data size without sacrificing classification performance [1, 14]. A deep belief network has been used to select features for support vector machines when processing data sets with a large number of class values [28]. Genetic algorithms can also be employed for feature selection when training classification algorithms such as support vector machines, the naïve Bayesian classifier, and decision tree induction [9, 12, 20]. For predicting rear-end crashes, attributes have been divided into disjoint subsets when training decision trees and the naïve Bayesian classifier [4]. Chen and Howard [5] applied random forests to screen attributes for improving the prediction accuracy of decision trees. Logistic regression and the naïve Bayesian classifier have been employed to remove misclassified instances for decision tree induction [10, 24]. De Caigny et al. [7] introduced a hybrid classification algorithm that builds a logistic regression model in each leaf node of a decision tree for customer churn prediction.


However, the single model induced by a basic algorithm tends to perform worse than the models induced by an ensemble algorithm [22, 25]. Ensemble algorithms find a set of models to determine the class value of a new instance. These models are homogeneous when they are induced by the same basic algorithm, or they can be heterogeneous models induced by different basic algorithms. A random forest comprises multiple decision trees, which take a majority vote to determine the class value of a new instance [3]. Jaison et al. [13] proposed a hybrid classification method composed of the nearest neighbor, support vector machine, and naïve Bayesian classifiers for analyzing microarray data, and Zhang and Mahadevan [30] built a hybrid model consisting of a support vector machine and a neural network to predict the risk of aviation events. Noor et al. [17] designed an ensemble method composed of k-nearest neighbors, support vector machine, and decision tree induction to analyze network data for identifying key players. The weighted majority vote of six classification methods was used to improve slope stability predictions in [17]. Ensemble algorithms are also popular in performing credit risk analysis [6, 11, 18, 23]. Aburomman and Reaz [2] conducted a survey on ensemble and hybrid methods employed in intrusion detection systems. Although ensemble algorithms generally achieve higher accuracy than basic algorithms [29], the learning of ensemble algorithms is relatively complex.


The models found by an ensemble algorithm must be diverse to ensure the stability of their decisions, and hence the training cost is relatively high. When a new instance is classified by a single model, it is easier to determine the reason for the class assignment. Decision tree induction is popular because it provides descriptive models that are easy to interpret. However, it is difficult to interpret a prediction made by multiple decision trees [8]. This difficulty will affect the adoption of ensemble methods in realistic cases [11].

Consider the data set 'weather' provided by the 'Weka' software. This data set has 14 instances and four discrete attributes, 'outlook,' 'temperature,' 'humidity,' and 'windy,' which are used to predict whether the class 'play' has the value 'yes' or 'no.' Figure 1 shows three possible decision trees grown from subsets of this data set. Let the actual class value of a new instance x = <sunny, mild, normal, false> be 'no'. The predictions of the three decision trees on x are 'yes,' 'yes,' and 'no,' respectively. The prediction made by the decision tree in Figure 1(a) is wrong because of the attributes 'outlook' and 'humidity.' Similarly, the prediction made by the decision tree in Figure 1(c) is correct because of the attribute 'outlook.' Suppose that an ensemble algorithm generates these three decision trees for a majority vote for classification.

The resulting prediction will be wrong, and it is difficult to interpret the role of 'outlook.' This shows that a prediction made by a single model is easier to interpret. If a hybrid classification algorithm instead generates the three decision trees displayed in Figure 1 and chooses the one in Figure 1(c) for classifying x, then the prediction for this instance will be correct and, in turn, easy to interpret.

Figure 1. Three decision trees resulting from subsets of the data set 'weather.'

Misclassified instances are generally considered to be noise, and hence are excluded from training data [10, 21]. However, such instances may carry useful information for identifying the class value of an instance. For example, let the triangles and circles in Figure 2 represent the training instances belonging to two different class values. The predictions on the three solid triangles and two solid circles will be wrong. If these misclassified instances are removed, as shown in Figure 3(a), then the predictions on any new triangle located in the left half and any new circle located in the right half will be wrong. However, if the misclassified training instances are grouped as shown in Figure 3(b), then they can be used to correctly classify any new triangle located in the left half or any new circle located in the right half. This demonstrates that misclassified training instances can be useful in classifying some instances.

In this study, we propose hybrid classification algorithms based on instance filtering.

Their advantage is that they achieve relatively high accuracy with respect to basic algorithms without sacrificing the interpretability of the learning results, as ensemble algorithms do. They are called hybrid because one basic classification algorithm plays the role of instance filtering and another is used for model induction; the two basic classification algorithms need not be the same. This kind of hybrid classification generates three models, and every new instance is classified by only one of them. The interpretation of the learning results is therefore as easy as that obtained using basic algorithms.


Figure 2. The instances in a two-dimensional space.

Figure 3. Groups (a) and (b) show respectively the wrong and correct predictions on the instances given in Figure 2.

The remainder of this paper is organized as follows. Section 2 briefly distinguishes hybrid classification from ensemble classification. Section 3 introduces our hybrid classification algorithms composed of decision tree induction and the naïve Bayesian classifier. The experimental results of our hybrid algorithms on twenty data sets are given in Section 4 to show that they can significantly outperform basic algorithms and the hybrid algorithm proposed in a previous study. Conclusions and directions for future work are addressed in Section 5.

2 Ensemble classification and hybrid classification

Every basic classification algorithm has its own learning procedure to induce a model from a training set in order to predict the class value of a new instance. A model may be able to assign the correct class value to an instance that cannot be classified correctly by another model. The concept of ensemble classification is to collect a set of models for making group decisions. There are generally two ways to collect a set of models. One is to collect the models induced from a single training set by different classification algorithms. The other is to train the same algorithm on different data sets that are derived from one particular data set.

Let D be a data set, and let L_j for j = 1, 2, …, r be basic classification algorithms. Then M_{D,Lj} represents the model induced from set D by algorithm L_j, and M_{D,Lj}(x) denotes the class value assigned by model M_{D,Lj} to instance x. The first way to perform ensemble classification is to collect models M_{D,Lj} for j = 1, 2, …, r; the predicted class value of instance x is then determined by the (weighted) majority vote over M_{D,Lj}(x) for j = 1, 2, …, r. An alternative is to generate sets D_i for i = 1, 2, …, q by sampling instances or features from D. Models M_{Di,L} for i = 1, 2, …, q are then induced by algorithm L, and again the predicted class value of instance x is determined by the (weighted) majority vote over M_{Di,L}(x) for i = 1, 2, …, q.

Hybrid classification first applies a basic algorithm to filter instances or features. Another basic algorithm then builds models from the revised data set(s) to achieve higher accuracy than the model induced from the original data set. It is called hybrid classification because basic algorithms are involved in both data preprocessing and model induction.

Let L1 and L2 be two basic algorithms used to compose a hybrid one, and let S_{D,L2} be the set of models obtained by applying L2 to the sets derived from data set D filtered by algorithm L1. One model in S_{D,L2} will then be chosen to predict the class value of a new instance. Note that L1 and L2 can be the same basic classification algorithm.

Ensemble classification uses one or more basic algorithms to find a set of models that make a group decision for prediction, and it imposes no requirements on data preprocessing. In hybrid classification, a basic algorithm must be assigned for data preprocessing, and every new instance is classified by only one model. This implies that the interpretation of the learning results of hybrid classification remains easy.
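The contrast between the two schemes can be sketched in a few lines of Python. The toy models, the instance encoding, and the selector below are our own illustration (mimicking the three trees of Figure 1), not code from the paper:

```python
# Illustrative sketch: an ensemble majority vote versus hybrid selection
# of a single model. The models and instance encoding are hypothetical.
from collections import Counter

def majority_vote(models, x):
    """Ensemble decision: every model votes; the plurality label wins."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def hybrid_predict(models, selector, x):
    """Hybrid decision: a selector picks exactly one model for x."""
    return models[selector(x)](x)

# Three toy models echoing Figure 1: the first two predict 'yes' for the
# example instance, the third predicts 'no' (the actual class value).
m_a = lambda x: "yes"
m_b = lambda x: "yes"
m_c = lambda x: "no"
models = [m_a, m_b, m_c]

x = ("sunny", "mild", "normal", False)
print(majority_vote(models, x))                 # 'yes': the vote hides m_c
print(hybrid_predict(models, lambda _: 2, x))   # 'no': selecting m_c is correct
```

The point of the sketch is that the hybrid decision remains attributable to one model, whereas the vote obscures which model, and why, determined the outcome.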


3 Hybrid classification algorithms

Misclassified instances are generally considered to be noise in hybrid classification, and hence they are removed from the training set. As discussed in Section 1, misclassified instances may still carry useful information for predicting the class values of some instances, and hence they should be moved to another set for further use. This section introduces the learning procedure of our hybrid classification algorithms based on this concept.

The hybrid classification process based on instance filtering is depicted in Figure 4. Suppose that a data set is divided into two disjoint subsets: a training set and a testing set. There are three steps in hybrid classification based on instance filtering. The first step applies the basic classification algorithm L1 to the training set, evaluated by k-fold cross validation. The correctly classified training instances are placed into a primary set, and a secondary set stores the misclassified ones, as shown in the upper-right broken rectangle in Figure 4. In the second step, the models induced from the full training set, the primary set, and the secondary set by classification algorithm L2 are called the full model, the primary model, and the secondary model, respectively (see the lower-right broken rectangle in Figure 4). The left broken rectangle in Figure 4 shows the third step, in which one of the three models is chosen to predict the class value of each instance from the testing set.

Figure 4. The mechanism for hybrid classification.

The approach used for model selection is the core of the third step. Let a data set D be divided into a disjoint training set DR and testing set DT. After applying algorithm L1 to set DR, evaluated by k-fold cross validation, we have the primary set DRP and the secondary set DRS such that DRP ∩ DRS = ∅ and DRP ∪ DRS = DR. The similarity between a testing instance x = <x1, x2, …, xn> ∈ DT and the set DRm is measured by the probability p(DRm|x) = p(x|DRm)p(DRm)/p(x) for m = P or S, where n is the number of attributes in D. Assume that the attributes are all independent given any set DRm. Then p(DRm|x) can be simplified as

  p(DRm|x) ∝ p(DRm) ∏_{i=1}^{n} p(xi|DRm),

because p(x) is the same for both DRP and DRS. Let |D| denote the number of instances in set D. Then p(DRm) is estimated as |DRm|/|DR|, and p(xi|DRm) is calculated as yim/|DRm|, where yim is the number of instances with attribute value xi in DRm.

If p(DRP|x) ≥ p(DRS|x), then x is more similar to the instances in DRP. In this case, instance x should be classified by model M_{DRP,L2}. Otherwise, model M_{DRS,L2} should be adopted for classifying x. However, when p(DRP|x) and p(DRS|x) are close, model M_{DR,L2} may be a better choice for predicting the class value of x, because DR is bigger than either DRP or DRS. We therefore introduce a threshold δ > 0 to determine which model should be chosen for classifying a testing instance. If log2 p(DRP|x) − log2 p(DRS|x) > δ, then x is more similar to the instances in DRP, and hence it is classified by M_{DRP,L2}. If log2 p(DRS|x) − log2 p(DRP|x) > δ, M_{DRS,L2} is the model for classifying x. If |log2 p(DRP|x) − log2 p(DRS|x)| ≤ δ, model M_{DR,L2} is chosen to predict the class value of x. The set S_{D,L2} thus has three models: M_{DRP,L2}, M_{DRS,L2}, and M_{DR,L2}.
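The selection rule above can be sketched as follows, assuming the attribute-value counts for the primary and secondary sets are available; the function names are ours, not the paper's. The counts plugged in at the end are those of Example 1:

```python
# A minimal sketch of the model-selection rule, using the count-based
# estimates p(DRm) = |DRm|/|DR| and p(xi|DRm) = yim/|DRm|.
import math

def log2_posterior(prior_num, prior_den, counts, set_size):
    """log2 of p(DRm) * prod_i p(xi|DRm) for one candidate set DRm."""
    logp = math.log2(prior_num / prior_den)
    for y in counts:
        logp += math.log2(y / set_size)
    return logp

def choose_model(lp_primary, lp_secondary, delta):
    """Pick which of the three models should classify the instance."""
    if lp_primary - lp_secondary > delta:
        return "primary"
    if lp_secondary - lp_primary > delta:
        return "secondary"
    return "full"

# Counts from Example 1 ('weather', x = <sunny, mild, normal, false>):
lp_p = log2_posterior(9, 11, [3, 3, 5, 4], 9)   # log2 of ~0.0224
lp_s = log2_posterior(5, 11, [2, 3, 2, 4], 5)   # log2 of ~0.0349
print(round(lp_s - lp_p, 4))                     # ~0.6371, as in Example 1
print(choose_model(lp_p, lp_s, delta=0.4))       # 'secondary'
```

Note that the log-probability difference is unaffected by normalizing the two posteriors, so the threshold test can be applied directly to the unnormalized products.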

Example 1. Let the data set 'weather' provided by the 'Weka' software, introduced in Section 1, be represented as DR = {ei, i = 1, 2, …, 14}. Let the algorithm L1 applied to DR be decision tree induction evaluated by 3-fold cross validation. Then the primary and secondary sets obtained in step 1 would be DRP = {ei, i = 2, 3, 4, 6, 7, 9, 11, 13, 14} and DRS = {ei, i = 1, 5, 8, 10, 12}, respectively. In step 2, the models derived from DR, DRP, and DRS by algorithm L2, which is also decision tree induction, are the three decision trees given in Figures 1(a), 1(b), and 1(c), respectively. Let x = <sunny, mild, normal, false> be a new instance with actual class value 'no'. Then

  p(DRP|x) = p(DRP)p(sunny|DRP)p(mild|DRP)p(normal|DRP)p(false|DRP) = (9/11) × (3/9) × (3/9) × (5/9) × (4/9) ≈ 0.0224

and

  p(DRS|x) = (5/11) × (2/5) × (3/5) × (2/5) × (4/5) ≈ 0.0349;

hence, after normalizing the two probabilities to sum to one, log2 p(DRP|x) = −1.3534 and log2 p(DRS|x) = −0.7163. When threshold δ = 0.4, since log2 p(DRS|x) − log2 p(DRP|x) = 0.6371 > δ, the secondary model given in Figure 1(c) will be used to make a correct prediction on x in step 3.

If the instance x given in Example 1 is classified by either the full model or the primary model, the prediction will be wrong. This demonstrates that setting a proper threshold for determining the classification model provides a chance to improve prediction accuracy. When δ is zero, none of the instances in DT will be predicted by M_{DR,L2}. If δ is very large, then every testing instance will be classified by model M_{DR,L2}. This indicates that the value of threshold δ controls the usage of the models M_{DRP,L2}, M_{DRS,L2}, and M_{DR,L2}. Since log2 p(DRm|x) = log2 p(DRm) + Σ_{i=1}^{n} log2 p(xi|DRm), the choice of the threshold δ should consider the number of attributes n in data set D. The threshold is set to be δ = 0.1nw for w = 0, 1, 2, … to search for the value that achieves the highest classification accuracy. This search stops when all instances in DT are classified by model M_{DR,L2}.

The pseudo code of the hybrid classification algorithm based on instance filtering is given in Figure 5.

Line 5 describes the first step of partitioning training set DR into DRP and DRS by algorithm L1. Algorithm L2 is then used in line 6 to induce the three models for classification in step 2. Lines 8 through 14 determine the model for classifying each testing instance in step 3. Note that when the value of δ is changed, only step 3 has to be re-executed.

Decision tree induction and the naïve Bayesian classifier are two popular algorithms, denoted as DT and NB, respectively. They will be used to construct hybrid classification algorithms in this study. A hybrid classification algorithm with basic algorithm L1 in the first step and L2 in the second step is represented as L1-L2. For example, NB-DT indicates that the algorithms for instance filtering and model induction are the naïve Bayesian classifier and decision tree induction, respectively. Note that NB-NB is not the same as NB, because NB-NB has three classification models, while NB has only one.


1   Input data set D and threshold δ
2   Perform k-fold cross validation to divide D into folds F1, F2, …, Fk
3   For each testing fold Fj
4       Set DT = Fj and DR = D\Fj
5       Perform algorithm L1 on DR to obtain DRP and DRS
6       Perform algorithm L2 on DR, DRP, and DRS to derive models M_{DRP,L2}, M_{DRS,L2}, and M_{DR,L2}
7       For each instance x ∈ DT
8           Calculate p(DRP|x) and p(DRS|x)
9           If log2 p(DRP|x) − log2 p(DRS|x) > δ then
10              Use model M_{DRP,L2} to classify x
11          Else if log2 p(DRS|x) − log2 p(DRP|x) > δ then
12              Use model M_{DRS,L2} to classify x
13          Else
14              Use model M_{DR,L2} to classify x

Figure 5. The pseudo code of the hybrid classification algorithm based on instance filtering.
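The filtering step in line 5 can be sketched in Python as follows. The MajorityClass learner is a trivial stand-in of our own (not one of the paper's algorithms), and the fold assignment is a simplified round-robin split; any learner exposing fit/predict could be substituted for L1:

```python
# Sketch of step 1 (line 5 of Figure 5): split a training set into a
# primary set (correctly classified) and a secondary set (misclassified)
# via k-fold cross validation with learner L1.
from collections import Counter

class MajorityClass:
    """Trivial stand-in for a basic classifier L1: predicts the
    most frequent training label for every instance."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, X):
        return [self.label for _ in X]

def filter_instances(make_learner, X, y, k=10):
    """Return (primary, secondary) index lists for the training set."""
    folds = [list(range(i, len(X), k)) for i in range(k)]  # round-robin folds
    primary, secondary = [], []
    for fold in folds:
        train = [i for i in range(len(X)) if i not in fold]
        model = make_learner().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in fold])
        for i, p in zip(fold, preds):
            (primary if p == y[i] else secondary).append(i)
    return primary, secondary

# 14 toy instances with 9 'yes' / 5 'no' labels, echoing 'weather':
X = [[i] for i in range(14)]
y = ["yes"] * 9 + ["no"] * 5
prim, sec = filter_instances(MajorityClass, X, y, k=7)
print(len(prim), len(sec))  # 9 and 5: the minority class is always misclassified
```

With this degenerate learner every minority-class instance lands in the secondary set, which makes the point of the paper concrete: the secondary set is not random noise but a coherent group that a dedicated secondary model can exploit.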

4 Experimental study

The characteristics of the 20 data sets randomly chosen from the UCI data repository [16] for evaluating the performance of our hybrid classification algorithms are summarized in Table 1. Their numbers of instances, attributes, and class values were chosen diversely to avoid the proposed algorithms being applicable only to certain kinds of data sets. The smallest number of instances in a data set is less than 200, and hence the evaluation method is k-fold cross validation with k = 5 to ensure that the experimental results of every data set can satisfy the large-sample conditions; i.e., the numbers of correct and wrong predictions in each fold are both larger than or equal to five [26]. In performing the k-fold cross validation that divides a training set into primary and secondary sets, the number of folds is set to ten, because the model for determining whether an instance will be classified correctly should be induced from a large set.

The hybrid classification algorithm proposed by Farid et al. [10] played the role of benchmark. Their algorithm, denoted as FDT, is similar to NB-DT, but only the primary model is used for classification. The prediction accuracy of a hybrid algorithm is determined by L2. Algorithms DT-NB and NB-NB were compared with NB, and algorithms DT-DT and NB-DT were compared with DT, RF, and FDT. Since the decision trees in algorithm FDT are unpruned, all classification algorithms were implemented using the 'Weka' software with default settings, except that decision trees were unpruned. The significance level for statistical comparisons was set to 0.05.


Table 1. The characteristics of the 20 experimental data sets.

Data set        Number of instances   Number of attributes   Number of class values
Blood                    748                   4                      2
Car                    1,728                   6                      4
Cleve                    303                  13                      2
Cmc                    1,473                   9                      3
Crx                      690                  15                      2
Ecoli                    336                   8                      8
Flags                    194                  28                      8
German                 1,000                  20                      2
Haberman                 306                   3                      2
Letter                20,000                  16                     26
Liver                    345                   6                      2
Mammographic             959                   5                      6
Page                   5,473                  10                      5
Parkinson                195                  22                      2
Pendigits             10,992                  16                     10
Robot                  5,456                  24                      4
Sick                   3,163                   7                      2
Vehicle                  846                  18                      4
Waveform               5,000                  21                      3
Yeast                  1,484                   8                     10

The experimental results for the 20 data sets are shown in Table 2 for algorithms NB, NB-NB, and DT-NB, and in Table 3 for algorithms DT, NB-DT, DT-DT, and FDT. As argued in Section 3, a hybrid algorithm is guaranteed not to have lower classification accuracy than its corresponding basic algorithm. Table 2 shows that neither NB-NB nor DT-NB is inferior to NB on any data set, which is also the case for NB-DT and DT-DT with respect to DT in Table 3. No parametric methods have been proposed to compare the performance of multiple classification algorithms. Since the training and testing sets in each iteration of the k-fold cross validation were kept the same for all algorithms, the matched-sample approach proposed by Wong [26] was employed to compare the performance of every pair of algorithms on a single data set.

Table 2. The classification accuracies of NB, NB-NB, and DT-NB.

Data set        NB        NB-NB      DT-NB
Blood           0.7689    0.7689     0.7689
Car             0.8541    0.8559     0.8541
Cleve           0.8383    0.8383     0.8416
Cmc             0.5044    0.5044     0.5091
Crx             0.8420    0.8507     0.8522
Ecoli           0.8183    0.8184     0.8183
Flags           0.6082    0.6082     0.6082
German          0.7460    0.7480     0.7460
Haberman        0.7420    0.7420     0.7420
Letter          0.7114    0.7135     0.7314^{1,2}
Liver           0.6145    0.6290     0.6261
Mammographic    0.7654    0.7664     0.7695
Page            0.9196    0.9234     0.9249^{1,3}
Parkinson       0.7179    0.8154     0.7231
Pendigits       0.8687    0.8938^1   0.8898^1
Robot           0.7912    0.8002^1   0.7949
Sick            0.9418    0.9431     0.9418
Vehicle         0.6147    0.6230     0.6147
Waveform        0.8062    0.8074     0.8108
Yeast           0.5822    0.5829     0.5822
Average         0.7528    0.7616     0.7575

Superscripts 1, 2, and 3 represent algorithms NB, NB-NB, and DT-NB, respectively.

Every bold value in Tables 2 and 3 indicates significantly higher accuracy than those of the algorithms noted in its superscripts. For example, algorithm DT-NB has significantly higher accuracy on data set 'Letter' with respect to both NB and NB-NB, as given in the superscripts of 0.7314. Since every pair of algorithms is compared using the matched-sample approach, it is possible that algorithm A is significantly better than C and that algorithm B achieves a higher accuracy than A, while algorithm B is not significantly better than C. Consider the prediction accuracies of algorithms DT, NB-DT, and FDT on data set 'Yeast,' whose fold accuracies are given in Table 4. The test statistic for comparing algorithms DT and NB-DT is t = −0.0067/√(0.000028/5) = −2.8313 with 4 degrees of freedom, and its corresponding p-value is 0.0473. Since the p-value is less than the significance level α = 0.05 and the t value is negative, NB-DT is significantly more accurate than DT. Similarly, the p-value corresponding to the test statistic t = −0.0398/√(0.003208/5) = −1.5713 for comparing algorithms DT and FDT is 0.1912. Although algorithm FDT has greater mean accuracy than NB-DT, FDT is not significantly better than DT on this data set.
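The paired test statistic can be reproduced from the per-fold differences of Table 4 with the standard library alone (the slight deviation from the published −2.8313 comes from the paper rounding the mean difference to −0.0067 before dividing):

```python
# Matched-sample (paired) t statistic for DT versus NB-DT on 'Yeast',
# computed from the per-fold accuracy differences of Table 4.
import math
from statistics import mean, variance

diffs = [-0.0135, 0.0000, -0.0101, -0.0033, -0.0068]  # DT minus NB-DT, per fold
n = len(diffs)
t = mean(diffs) / math.sqrt(variance(diffs) / n)  # paired t with n-1 = 4 d.f.
print(round(t, 2))  # ~ -2.82; the paper reports -2.8313 from rounded inputs
```

The same computation with the FDT differences of Table 4 yields t ≈ −1.57, matching the paper's −1.5713.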


Table 3. The classification accuracies of DT, NB-DT, DT-DT, and FDT.

Data set

DT

Blood Car Cleve Cmc Crx

0.7808 0.9427

4

0.7757

NB-DT

DT-DT

FDT

0.7835

0.7808

0.7835

0.9444

4

0.8021 1

0.9427

4

0.8929

0.7857

0.7593

0.4596

0.5071

0.4576

0.4739

0.8275

0.8493

0.8420

0.8464

Ecoli Flags German Haberman Letter

0.6966

0.7413

0.7293

0.75021

0.5565

0.6440

0.6028

0.6127

0.7090

0.7180

Liver Mammographic Page Parkinson Pendigits

1

0.6970

0.7270

0.6994

0.7487

0.7388

0.75841

0.80114

0.80114

0.80114

0.7268

0.5797

0.5826

0.5826

0.6406

0.7580

0.78311,2,3

0.9350

0.9298

0.7497

0.7768

0.9344

0.9344

0.7692

1

0.7897

0.9104

4

Robot Sick Vehicle Waveform Yeast

0.9278

4

Average

0.9491

0.7897

0.9105

4

0.9283

4

0.9491

0.9104

0.8739

0.9282

4

0.8625

0.9491

0.9456

0.6821

4

0.6738

4

0.7312

1

0.7288

1

0.5155

0.5222

1

0.5169

0.5553

0.7481

0.7661

0.7582

0.7538

0.6738 0.7176

4

0.7641 4

0.6337 0.7320

Superscripts 1 through 4 denote algorithms DT, NB-DT, DT-DT, and FDT, respectively.

Table 4. The fold accuracies of data set 'Yeast' for algorithms DT, NB-DT, and FDT.

Fold       DT Accuracy   NB-DT Accuracy   Difference   FDT Accuracy   Difference
1            0.4579         0.4714         -0.0135        0.5589       -0.1010
2            0.5589         0.5589          0.0000        0.5455        0.0134
3            0.5286         0.5387         -0.0101        0.5118        0.0168
4            0.5051         0.5084         -0.0033        0.5387       -0.0336
5            0.5270         0.5338         -0.0068        0.6216       -0.0946
Mean         0.5155         0.5222         -0.0067        0.5553       -0.0398
Variance                                    0.000028                    0.003208

Table 3 shows that algorithm FDT achieves the highest accuracy in some data sets, such as 'Ecoli' and 'Mammographic.' This suggests that applying the naïve Bayesian classifier to remove instances before inducing decision trees can be helpful for some data sets. In contrast, FDT has the lowest accuracy in several data sets, such as 'Pendigits' and 'Robot.' Removing instances from learning therefore seems to be risky. The matched-sample approach proposed by Wong [27] was thus employed to compare the performance of every pair of the algorithms in Tables 2 and 3 over the 20 data sets, and the results are summarized in Table 5.

Table 5. Hypothesis testing for every pair of classification algorithms over the 20 data sets.

Algorithm 1   Algorithm 2   t value    Degrees of freedom   p-value
NB            NB-NB         -4.5639            6            0.0038
NB            DT-NB         -2.7633           11            0.0184
NB-NB         DT-NB          2.2539           13            0.0421
DT            NB-DT         -6.3261           17            0.0000
DT            DT-DT         -4.7728           21            0.0001
DT            FDT           -1.2212           30            0.2315
NB-DT         DT-DT          3.0531           25            0.0053
NB-DT         FDT            2.6569           22            0.0144
DT-DT         FDT            1.0357           32            0.3081

Every bold value in the last column of Table 5 indicates that the p-value is less than the significance level of 0.05, and the better algorithm in the same row is also marked in bold. For example, the p-value in the first row is 0.0038 < 0.05, and hence NB-NB has significantly higher mean accuracy than NB over the 20 data sets. The mean accuracies of algorithms NB, NB-NB, and DT-NB over the 20 data sets are all significantly different, and NB-NB is the best. Among the algorithms with decision trees for class prediction, NB-DT significantly outperforms the other three algorithms. These findings demonstrate that the hybrid classification algorithms proposed in this study are superior to the basic algorithms NB and DT and to the previous hybrid algorithm FDT. The naïve Bayesian classifier should be the better basic algorithm for instance filtering, because NB-NB is the best in the first algorithm group and NB-DT is the best in the second.

There are two extra models, M_{DRP,L2} and M_{DRS,L2}, in our proposed hybrid algorithms.

The performance of these two models is presented in Table 6 to investigate whether they are really helpful in classifying instances. Consider data set 'Crx' analyzed by algorithm NB-NB as an example. This data set has 690 instances, and the numbers of instances classified by models M_{DR,L2}, M_{DRP,L2}, and M_{DRS,L2} are 31, 652, and 7, respectively. Since there are 562 and 3 correct predictions made by M_{DRP,L2} and M_{DRS,L2}, respectively, the number of correct predictions on the 659 instances classified by M_{DRP,L2} or M_{DRS,L2} is 562 + 3 = 565, as shown in the fourth column of Table 6. When these 659 instances are all classified by the model M_{DR,L2} induced by the naïve Bayesian classifier, 559 of them are predicted correctly, as given in the fifth column of Table 6. In this case, the improvement achieved by introducing models M_{DRP,L2} and M_{DRS,L2} is calculated to be (565 − 559)/659 = 0.91%.
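The improvement measure used throughout Table 6 is a one-line computation; the function name below is ours:

```python
# Improvement measure of Table 6: correct predictions made by the primary
# and secondary models versus the same instances classified by the full
# model alone, on the instances NOT routed to the full model.
def improvement(correct_hybrid, correct_full, n_routed):
    """Percentage gain on the instances handled by the extra models."""
    return 100.0 * (correct_hybrid - correct_full) / n_routed

# Data set 'Crx' under NB-NB: 562 + 3 correct hybrid predictions,
# 559 correct full-model predictions, on 659 routed instances.
print(round(improvement(562 + 3, 559, 659), 2))  # 0.91
```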

Table 6. The improvement when introducing instance filtering in hybrid algorithms (a) NB-NB, (b) DT-NB, (c) NB-DT, and (d) DT-DT.

(a) NB-NB                 No. correct/No. of predictions
Data set              M_{DRP,L2}      M_{DRS,L2}    Total            M_{DR,L2}        Improvement (%)
Blood (748)           0/0             0/0           0/0              0/0               0.00
Car (1,728)           734/739         1/2           735/741          732/741           0.40
Cleve (303)           13/15           0/0           13/15            13/15             0.00
Cmc (1,473)           31/36           0/1           31/37            31/37             0.00
Crx (690)             562/652         3/7           565/659          559/659           0.91
Ecoli (336)           206/239         20/30         226/269          226/269           0.00
Flags (194)           5/5             0/1           5/6              5/6               0.00
German (1,000)        272/302         3/4           275/306          273/306           0.65
Haberman (306)        132/163         2/3           134/166          134/166           0.00
Letter (20,000)       7,144/7,858     915/2,622     8,059/10,480     8,018/10,480      0.39
Liver (345)           4/7             5/5           9/12             4/12             41.67
Mammographic (959)    399/467         0/0           399/467          398/467           0.21
Page (5,473)          4,704/4,899     78/126        4,782/5,025      4,761/5,025       0.42
Parkinson (195)       106/117         28/40         134/157          115/157          12.10
Pendigits (10,992)    6,908/7,265     488/683       7,396/7,948      7,120/7,948       3.47
Robot (5,456)         2,155/2,410     236/375       2,391/2,785      2,342/2,785       1.76
Sick (3,163)          2,306/2,373     26/34         2,332/2,407      2,328/2,407       0.17
Vehicle (846)         122/132         25/48         147/180          140/180           3.89
Waveform (5,000)      2,041/2,246     58/113        2,099/2,359      2,093/2,359       0.25
Yeast (1,484)         3/6             0/5           3/11             2/11              9.09
Average                                                                                3.77

(b) DT-NB                 No. correct/No. of predictions
Data set              M_{DRP,L2}      M_{DRS,L2}    Total            M_{DR,L2}        Improvement (%)
Blood (748)           0/0             0/0           0/0              0/0               0.00
Car (1,728)           672/672         0/0           672/672          672/672           0.00
Cleve (303)           34/41           1/1           35/42            34/42             2.38
Cmc (1,473)           151/210         217/470       368/680          361/680           1.03
Crx (690)             553/637         5/11          558/648          551/648           1.08
Ecoli (336)           102/106         6/8           108/114          108/114           0.00
Flags (194)           12/14           1/3           13/17            13/17             0.00
German (1,000)        17/17           0/0           17/17            17/17             0.00
Haberman (306)        153/193         7/12          160/205          160/205           0.00
Letter (20,000)       9,157/10,189    1,557/3,672   10,714/13,861    10,314/13,861     2.89
Liver (345)           44/64           10/14         54/78            50/78             5.13
Mammographic (959)    585/747         5/13          590/760          586/760           0.53
Page (5,473)          4,972/5,319     90/154        5,062/5,473      5,033/5,473       0.53
Parkinson (195)       139/191         2/4           141/195          140/195           0.51
Pendigits (10,992)    8,597/9,495     1,184/1,497   9,781/10,992     9,579/10,992      2.11
Robot (5,456)         3,026/3,578     121/189       3,147/3,767      3,127/3,767       0.53
Sick (3,163)          44/44           0/0           44/44            44/44             0.00
Vehicle (846)         27/28           0/0           27/28            27/28             0.00
Waveform (5,000)      3,497/4,114     557/886       4,054/5,000      4,031/5,000       0.46
Yeast (1,484)         0/3             0/0           0/3              0/3               0.00
Average                                                                                0.86

(c) NB-DT

Data set | M_DRP,L2 | M_DRS,L2 | Total | M_DR,L2 | Improvement (%)
Blood (748) | 258/289 | 9/12 | 267/301 | 265/301 | 0.66
Car (1,728) | 736/739 | 2/2 | 738/741 | 735/741 | 0.40
Cleve (303) | 104/125 | 0/1 | 104/126 | 96/126 | 6.35
Cmc (1,473) | 121/165 | 50/128 | 171/293 | 147/293 | 8.19
Crx (690) | 495/554 | 3/6 | 498/560 | 483/560 | 2.68
Ecoli (336) | 212/260 | 25/45 | 237/305 | 222/305 | 4.92
Flags (194) | 104/129 | 21/65 | 125/194 | 108/194 | 8.76
German (1,000) | 487/602 | 11/20 | 498/622 | 468/622 | 4.82
Haberman (306) | 200/255 | 9/15 | 209/270 | 194/270 | 5.56
Letter (20,000) | 327/327 | 0/0 | 327/327 | 327/327 | 0.00
Liver (345) | 2/2 | 1/1 | 3/3 | 2/3 | 33.33
Mammographic (959) | 628/793 | 1/7 | 629/800 | 603/800 | 3.25
Page (5,473) | 0/0 | 0/0 | 0/0 | 0/0 | 0.00
Parkinson (195) | 106/117 | 32/40 | 138/157 | 134/157 | 2.55
Pendigits (10,992) | 210/211 | 0/0 | 210/211 | 209/211 | 0.47
Robot (5,456) | 232/245 | 11/13 | 243/258 | 240/258 | 1.16
Sick (3,163) | 60/60 | 0/0 | 60/60 | 60/60 | 0.00
Vehicle (846) | 121/132 | 24/48 | 145/180 | 138/180 | 3.89
Waveform (5,000) | 1,764/2,246 | 60/113 | 1,824/2,359 | 1,756/2,359 | 2.88
Yeast (1,484) | 132/192 | 19/56 | 151/248 | 141/248 | 4.03
Average | | | | | 4.70

(d) DT-DT

Data set | M_DRP,L2 | M_DRS,L2 | Total | M_DR,L2 | Improvement (%)
Blood (748) | 182/205 | 5/5 | 187/210 | 187/210 | 0.00
Car (1,728) | 672/672 | 0/0 | 672/672 | 672/672 | 0.00
Cleve (303) | 30/41 | 1/1 | 31/42 | 28/42 | 7.14
Cmc (1,473) | 20/25 | 3/6 | 23/31 | 20/31 | 9.68
Crx (690) | 428/480 | 4/6 | 432/486 | 422/486 | 2.06
Ecoli (336) | 157/175 | 21/46 | 178/221 | 167/221 | 4.98
Flags (194) | 67/80 | 6/33 | 73/113 | 64/113 | 7.96
German (1,000) | 399/483 | 16/24 | 415/507 | 403/507 | 2.37
Haberman (306) | 177/233 | 13/19 | 190/252 | 178/252 | 4.76
Letter (20,000) | 278/278 | 0/0 | 278/278 | 278/278 | 0.00
Liver (345) | 7/9 | 3/3 | 10/12 | 9/12 | 8.33
Mammographic (959) | 480/602 | 2/8 | 482/610 | 474/610 | 1.31
Page (5,473) | 33/33 | 10/18 | 43/51 | 40/51 | 5.88
Parkinson (195) | 90/101 | 0/0 | 90/101 | 86/101 | 3.96
Pendigits (10,992) | 86/87 | 0/0 | 86/87 | 86/87 | 0.00
Robot (5,456) | 504/522 | 5/5 | 509/527 | 507/527 | 0.38
Sick (3,163) | 44/44 | 0/0 | 44/44 | 44/44 | 0.00
Vehicle (846) | 80/85 | 0/0 | 80/85 | 80/85 | 0.00
Waveform (5,000) | 1,749/2,172 | 103/196 | 1,852/2,368 | 1,796/2,368 | 2.36
Yeast (1,484) | 8/15 | 9/37 | 17/52 | 15/52 | 3.85
Average | | | | | 3.25

The improvements shown in Table 6 provide information about the usefulness of hybrid algorithms based on instance filtering on individual data sets. For example, all four hybrid algorithms achieve an improvement of over 5% on data set 'Liver', while none of them exhibits more than 1% improvement on data sets 'Blood', 'Car', and 'Sick'. This suggests that the improvement depends on the characteristics of the data sets. No improvement can be negative, because the proposed hybrid classification algorithms achieve at least the same performance as the basic ones. The extra computational effort required by the hybrid algorithms is that of performing k-fold cross validation on the training instances to obtain models M_DRP,L2 and M_DRS,L2.
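The filtering step itself can be sketched as follows. This is only an illustration under stated assumptions: `majority_class` is a trivial stand-in for the naïve Bayesian classifier or decision tree induction actually used in the study, and the function simply splits the training instances into those a k-fold cross-validation model predicts correctly (the pool for inducing M_DRP) and those it mispredicts (the pool for inducing M_DRS):

```python
import random

def majority_class(labels):
    # Placeholder base learner: always predicts the most frequent
    # class. Stands in for NB or DT induction in the actual method.
    return max(set(labels), key=labels.count)

def filter_instances(data, k=5, seed=0):
    """Split (feature, label) training instances by whether a model
    induced on the other k-1 folds predicts them correctly."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct, wrong = [], []
    for fold in folds:
        # Induce the base model on all instances outside this fold.
        train_labels = [data[i][1] for i in idx if i not in fold]
        pred = majority_class(train_labels)
        # Route each held-out instance by whether it was predicted right.
        for i in fold:
            (correct if data[i][1] == pred else wrong).append(data[i])
    return correct, wrong
```

With a toy data set of eight class-1 and two class-0 instances, the majority-class stand-in always predicts class 1, so the class-1 instances land in the correctly-predicted pool and the class-0 instances in the mispredicted pool.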

5. Conclusions

Basic classification algorithms induce a single model from data to classify new instances, and the interpretation of any prediction made by the model is relatively easy. This study proposed an approach for constructing hybrid classification algorithms based on instance filtering, where each hybrid algorithm induces three models and a new instance is classified by only one of them. The interpretation of a prediction made by the hybrid algorithms is thus as easy as that of a prediction made by the basic algorithms. Decision tree induction and the naïve Bayesian classifier were employed to compose four hybrid classification algorithms.
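The resulting prediction scheme can be illustrated by a small routing sketch. The gate shown here is a hypothetical placeholder (the actual routing rule is defined in the body of the paper); the point is only that each new instance is answered by exactly one induced model, so every prediction stays as interpretable as that of a basic algorithm:

```python
def hybrid_predict(x, gate, m_drp, m_drs):
    # Route the instance to exactly one of the filtered models,
    # so the prediction is traceable to a single induced model.
    return m_drp(x) if gate(x) else m_drs(x)

# Toy usage with hypothetical models and a hypothetical gate:
gate = lambda x: x >= 0          # assumed routing criterion
m_drp = lambda x: "positive"     # stand-in for model M_DRP,L2
m_drs = lambda x: "negative"     # stand-in for model M_DRS,L2
print(hybrid_predict(3, gate, m_drp, m_drs))   # positive
```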

The experimental results on twenty data sets demonstrate that the hybrid algorithms are significantly superior to the basic algorithms and to the hybrid algorithm proposed in a previous study. This indicates that hybrid classification algorithms based on instance filtering can not only achieve higher prediction accuracy but also maintain the easy interpretation of the learning results. The naïve Bayesian classifier appears to be a better algorithm than decision tree induction for filtering instances. The characteristics of the data sets can affect the performance improvement achieved by the proposed hybrid classification algorithms. The classification algorithms involved in this study for hybrid classification are decision tree induction and the naïve Bayesian classifier.

More algorithms, such as classification rule induction and Bayesian networks, can be considered in hybrid classification to investigate which algorithm is more suitable for instance filtering or model induction. Feature selection is also a critical technique for data preprocessing and can generally help to achieve higher accuracy and enhance model interpretability. It would thus be desirable to develop hybrid classification algorithms based on feature selection.

CRediT Author Statement

Tzu-Tsung Wong: Conceptualization, Validation, Writing – Original Draft, Funding Acquisition
Nai-Yu Yang: Methodology, Software, Formal Analysis, Writing – Review & Editing
Guo-Hong Chen: Conceptualization, Methodology, Software, Resources

AUTHOR DECLARATION

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

Acknowledgement

This research was supported by the Ministry of Science and Technology in Taiwan under Grant No. 107-2410-H-006-045-MY3.
