Hesitant fuzzy decision tree approach for highly imbalanced data classification


Accepted Manuscript

Title: Hesitant Fuzzy Decision Tree Approach for Highly Imbalanced Data Classification
Authors: Sahar Sardari, Mahdi Eftekhari, Fatemeh Afsari
PII: S1568-4946(17)30534-3
DOI: http://dx.doi.org/10.1016/j.asoc.2017.08.052
Reference: ASOC 4442
To appear in: Applied Soft Computing
Received date: 24-12-2016
Revised date: 18-8-2017
Accepted date: 23-8-2017


Hesitant Fuzzy Decision Tree Approach for Highly Imbalanced Data Classification

Sahar Sardari a, Mahdi Eftekhari a,*, Fatemeh Afsari a

a Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran.
* Corresponding author. Email: [email protected]

Graphical abstract

Highlights

- Using the K-means clustering algorithm to divide the majority class samples of imbalanced data sets into several clusters and labeling the samples of each cluster with a new synthetic class label in order to balance the data sets.
- Using hesitant fuzzy sets to construct five different fuzzy decision trees for performing the imbalanced data classification task.
- Defining a new attribute selection criterion for expanding nodes during the construction of fuzzy decision trees, based on the information energy of five fuzzy information gains (obtained by employing five discretization methods).
- Aggregating the results of the five different fuzzy decision trees via three different methods for predicting the class label of new data.
- Alternatively, not aggregating the results of the five fuzzy decision trees and considering the result of each fuzzy decision tree separately for predicting the class label of new data.

Abstract

Fuzzy decision tree algorithms provide some of the most powerful classifiers applicable to any kind of data. In this paper, new Fuzzy Decision Tree (FDT) approaches based on Hesitant Fuzzy Sets (HFSs) are introduced to classify highly imbalanced data sets. Our proposed classifiers employ the k-means clustering algorithm to divide the majority class samples into several clusters, and each cluster's samples are then labeled with a new synthetic class label. After that, five discretization methods (Fayyad, Fusinter, Fixed Frequency, Proportional, and Uniform Frequency) are used to generate the Membership Functions (MFs) of each attribute, and five FDTs are constructed based on these five discretization methods. Hesitant Fuzzy Information Gain (HFIG) is proposed as a new attribute selection criterion that can be used instead of Fuzzy Information Gain (FIG); HFIG is calculated by aggregating, via information energy, the FIGs obtained from the different discretization methods. For predicting the class label of new samples, three aggregation methods are utilized. The combinations of splitting criterion (HFIG or FIG), five different discretization methods (for generating MFs) and three aggregation methods (for predicting the class label of new samples) yield distinct classifiers for addressing imbalanced classification. To illustrate the differences between the proposed methods, a taxonomy is presented that groups them into three general categories. The experimental results show that our proposed methods outperform other fuzzy rule-based approaches over 20 highly imbalanced data sets from KEEL in terms of AUC.

Keywords: Imbalanced Data Classification; Fuzzy Decision Tree; K-means Clustering; Hesitant Fuzzy Set.

1. Introduction

Recently, imbalanced data classification has become a problem of interest in both machine learning and bioinformatics [1]. Contrary to the traditional machine learning assumption, in imbalanced data sets the distribution of samples over the classes is not the same; therefore, the learning algorithm might ignore the minority samples. Indeed, the samples of one class (the majority class) significantly outnumber those of the other class (the minority class). In this paper, imbalanced classification addresses the classification of just two classes. The aim of this paper is to construct a hesitant fuzzy decision tree for the classification of imbalanced data sets.

Dealing with imbalanced classification problems is different from dealing with balanced data [1]. Several methods have been presented to address this problem; they fall into three general approaches [2]: 1) data level approaches, 2) cost-sensitive approaches, and 3) algorithm level approaches. Data level approaches try to balance data sets by increasing the number of minority class samples (over-sampling) or decreasing the number of majority class samples (under-sampling). Both of these approaches have disadvantages. For example, over-sampling increases the size of the training data, which increases the run time and in some cases may lead to overfitting. The disadvantage of under-sampling is that it can remove useful data. In the second approach, misclassification costs are assigned to the minority class data, and in most cases these costs are expressed as a cost matrix; the main disadvantage of this approach is that there is usually not enough information to determine the actual costs in the cost matrix. The last approach tries to adapt classification algorithms to imbalanced data sets by making changes in the algorithms that bias them toward the minority class samples.

Many traditional classification algorithms have shown poor performance on class imbalance problems [3]. The classifiers built with these algorithms for such problems usually ignore the minority class, as they tend to maximize the overall classification accuracy [3]. Occasionally, special sampling methods, including over-sampling and under-sampling, are combined with a classifier to alleviate this problem. In recent years, hesitant fuzzy sets have been used extensively in machine learning [4-6], including with decision trees. Decision trees are among the most popular classification algorithms used in data mining. ID3 produces highly unstable classifiers with respect to minor perturbations in the training data: the structure of the decision tree may be entirely different if the dataset changes slightly [7]. Fuzzy logic improves this behavior due to the elasticity of the fuzzy set formalism [7]. Therefore, some scholars have proposed the Fuzzy Decision Tree (FDT) by combining fuzzy set theory with machine learning concepts [7].

In the following, we review methods that apply fuzzy and non-fuzzy approaches to classify imbalanced datasets. Mehdizadeh and Eftekhari [8] proposed a fuzzy rule-based classifier for highly imbalanced datasets, built from the combination of subtractive clustering, differential evolution and multi-gene genetic programming for generating fuzzy rules.
First, they used the differential evolution algorithm to find the best neighborhood radii values for the subtractive clustering method. Then, subtractive clustering was used to generate a Sugeno-type fuzzy inference system. After that, multi-gene genetic programming was employed to create the rule consequents. The FH-GBML method [9], proposed by Ishibuchi et al., combines the Pittsburgh and Michigan approaches to build a fuzzy rule-based classifier: first, the Pittsburgh approach is performed to generate a population of rule sets, and then, with a predetermined probability, one Michigan iteration is executed for each rule set. Nakashima et al. [10] proposed the WFC method, a weighted fuzzy rule-based classification approach. Mansoori et al. [11] proposed the SGERD method, a genetic algorithm based method in which the number of iterations depends on the problem dimension; the fitness function of SGERD is determined by a rule evaluation criterion. In GFS-LogitBoost [12] and GFS-Max LogitBoost [13], each weak learner is a fuzzy rule extracted from the data by a genetic algorithm.

When a new simple classifier is added to the compound one, the examples in the training set are re-weighted. Berlanga et al. [14] proposed the GP-COACH method, which applies genetic programming to obtain fuzzy rules; the initial population of the genetic programming is generated by rules in a context-free grammar, and the method is designed so that rules compete and cooperate to generate the final collection of fuzzy rules. Villar et al. [15] proposed GA-FS+GL, a genetic algorithm for feature selection and granularity learning in fuzzy rule-based classification systems; the aim of this approach is to extract a linguistic fuzzy rule-based classification system (FRBCS) with high accuracy for highly imbalanced data sets. Lopez et al. [16] presented the GP-COACH-H method, a hierarchical fuzzy rule-based classifier that uses information granulation and genetic programming and employs GP-COACH as the base of the hierarchical model. Hong-Liang Dai [17] proposed a fuzzy total margin based support vector machine. In this approach, the total margin is considered instead of the soft margin: extra surplus variables are added to the soft margin formulation so that they measure the distance between the correctly classified data points and the hyperplane, and the total margin algorithm computes the distance of all data points from the separating hyperplane. By defining these extra surplus variables, the total margin algorithm extends both the soft margin algorithm and the SVM, so its error bound performs better than that of the common SVM; the weights of the misclassified samples and the correctly classified samples are fuzzy values. Edward Hinojosa C. et al. [18] proposed the IRL-ID-MOEA approach, in which a multi-objective evolutionary algorithm is used to learn fuzzy classification rules from imbalanced data sets; one rule is learned in each run of the multi-objective evolutionary algorithm. This approach first uses the SMOTE+TL preprocessing approach to balance the imbalanced data sets, then employs the iterative multi-objective evolutionary algorithm NSGA-II to learn fuzzy rules, using accuracy and the number of premise variables of each fuzzy rule as the two objectives.

Kim et al. [19] proposed the GMBoost method, based on the AdaBoost algorithm, to overcome imbalance problems; it pays attention to both minority and majority class samples by using the geometric mean of both classes in the error rate and accuracy calculations. Kim et al. [20] presented a method to balance data based on a clustering technique and a genetic algorithm (GA). First, the majority class samples are divided into multiple clusters using K-means clustering, and the Euclidean distance between each sample and the centroid of its cluster is calculated; then a threshold indicating the allowed distance from each cluster center is found by the GA, and samples farther from the cluster center than this threshold are identified and removed. An artificial neural network (ANN) is applied as the classifier, and its weights are optimized by the GA. Sun et al. [3] presented a novel ensemble method to classify imbalanced data.
In this method, multiple balanced data sets are created from the imbalanced data set by random splitting and clustering, and several classifiers are constructed on these balanced data sets. Finally, to predict the label of a new sample, five new ensemble rules are proposed that take into account the relationship between the new sample and the training data distribution. Tran and Liatsis [21] presented the RABOC approach, which merges the Real AdaBoost algorithm with one-class classification. Błaszczyński and Stefanowski [22] presented a new bagging method obtained by modifying the bootstrap sampling technique; they focused the bootstrap sampling on those minority samples that are hard to learn, where the hard minority class samples are determined by the k-nearest neighbors. Zhao et al. [23] proposed a method for imbalanced classification that learns the hidden structure of the majority class using an unsupervised learning algorithm. They first partitioned the majority class into several sub-clusters and then constructed a base classifier on each combination of the minority class and a subset of the majority class; thus, the original classification problem was transformed into several sub-problems. The ensemble was tuned to increase its sensitivity toward the minority class. They also provided a metric for selecting the clustering algorithm by comparing estimates of the stability of the decomposition, and trained a support vector machine (SVM) classifier with a linear kernel on each sub-problem.

In this paper, we use a combination of a data level approach and an algorithm level approach to classify imbalanced datasets. First, the K-means clustering algorithm is applied to the majority class samples as a data level approach to balance the datasets without changing the number of samples in each class. Then, FDT approaches based on HFSs are proposed as algorithm level approaches to classify the highly imbalanced data sets. This paper also presents a new splitting criterion (i.e., HFIG) to select the best attribute for expanding the FDT. The proposed methods apply five discretization methods to construct five different FDTs in terms of fuzzy attribute values. In some methods we aggregate the results of the five FDTs using three aggregation approaches, namely fuzzy majority voting, hesitant fuzzy information energy voting, and fuzzy weighted voting; other proposed classifiers apply the FDTs to highly imbalanced data sets without aggregation.

The main innovation of this paper is the fusion of different sources of information to construct efficient FDTs via Hesitant Fuzzy Sets (HFSs). First, five discretization methods are utilized to generate different cut points for each attribute. Then, corresponding Membership Functions (MFs) are defined over the obtained cut points. Afterwards, these MFs are considered as different sources of information that can be combined for FDT construction. Through the fusion of these information sources, a new splitting criterion named HFIG is proposed to select the best attribute for expanding the FDT.

The value of the HFIG criterion is the information energy of a hesitant element whose members are the five FIG values calculated from the five different MF definitions of each attribute. The other contribution of this research is to construct five different FDTs using the five discretization methods separately: each of the five FDTs is constructed via its own discretization and, consequently, its own MF definition method. The outcomes of these FDTs are then combined through three aggregation methods (one of them is the hesitant information energy of the fuzzy compatibility degrees of patterns). These aggregators are employed to fuse the final results and predict the class label of new data.

The rest of this paper is organized as follows. Section 2 introduces some basic concepts that help in better understanding the proposed methods. The proposed methods are described in Section 3 and, for better illustration, are categorized into three groups. Section 4 presents the experimental results over 20 highly imbalanced data sets. Eventually, Section 5 concludes the paper.

2. Basic Concepts

In this section, the basics of hesitant fuzzy sets are given first, followed by an introduction to the basics of the FDT. Finally, the discretization methods used in the developed approach are briefly presented.

2.1. Hesitant Fuzzy Sets

HFSs are a recent extension of fuzzy sets that models the uncertainty provoked by the hesitation that might appear when it is necessary to assign the membership degree of an element to a fuzzy set [24]. A HFS is defined in terms of a function that returns a set of membership values for each element in the domain [24].

Definition 1 [25]. Let X be a reference set. A HFS D on X is defined in terms of a function h_D(x) that, when applied to X, returns a finite subset of [0, 1]:

D = { <x, h_D(x)> | x ∈ X }    (1)

where h_D(x) is a set of some different values in [0, 1], representing the possible membership degrees of the element x ∈ X to D. For convenience, we call h_D(x) a hesitant fuzzy element (HFE) [26].

Example 1. Let X = {x1, x2, x3} be a reference set, and let h_D(x1) = {0.2, 0.5, 0.7}, h_D(x2) = {0.4, 0.5} and h_D(x3) = {0.2, 0.5, 0.7} be the HFEs of x_i (i = 1, 2, 3) to a hesitant fuzzy set D. Then D can be written as the HFS
D = { <x1, {0.2, 0.5, 0.7}>, <x2, {0.4, 0.5}>, <x3, {0.2, 0.5, 0.7}> }.

Definition 2 [27]. For an HFS D = { <x_i, h_D(x_i)> | x_i ∈ X, i = 1, 2, ..., n }, the information energy of D is defined as:

E_{HFS}(D) = \sum_{i=1}^{n} \left( \frac{1}{l_i} \sum_{j=1}^{l_i} h_{D\sigma(j)}^{2}(x_i) \right)    (2)

where n is the cardinality of the universe of discourse, l_i is the number of membership values in h_D(x_i), and h_{Dσ(j)}(x_i) is the jth value of the HFE of the ith element of the universe of discourse.
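As a small illustration of Eq. (2), the Python sketch below (our own code, not the authors') computes the information energy of the HFS of Example 1; the function name and data layout are illustrative only.

```python
# A minimal sketch of Eq. (2): the information energy of a hesitant fuzzy set,
# given as a list of hesitant fuzzy elements (HFEs).

def hfs_information_energy(hfes):
    """hfes: list of HFEs, each HFE a list of membership values in [0, 1]."""
    return sum(sum(v * v for v in hfe) / len(hfe) for hfe in hfes)

# HFS D from Example 1
D = [[0.2, 0.5, 0.7], [0.4, 0.5], [0.2, 0.5, 0.7]]
print(hfs_information_energy(D))  # (0.04+0.25+0.49)/3 + (0.16+0.25)/2 + (0.04+0.25+0.49)/3 = 0.725
```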

2.2. Fuzzy Decision Tree

Before we define the fuzzy decision tree construction and inference procedures, let us introduce some notation [28].

- The set of attributes of a dataset is denoted by {A_1, A_2, ..., A_k, Y}, where A_i is the ith attribute, k is the number of input attributes, Y is the target attribute (each attribute is a column in the dataset), and A = {A_1, A_2, ..., A_k} is called the set of input attributes.
- The set of values of the target variable (class labels) is Y ∈ {c_1, c_2, ..., c_m}, where m is the number of classes.
- The examples in the fuzzy dataset S are denoted by S = {(X_1, μ_S(X_1)), (X_2, μ_S(X_2)), ..., (X_n, μ_S(X_n))}, where X_i is the ith example, μ_S(X_i) is the membership degree of X_i in S, and n is the number of examples. X_i is written in bold face because it is a vector containing the input attributes and the target attribute.
- The ith example is denoted by X_i = [x_i^(1), x_i^(2), ..., x_i^(k), y_i]^T, where x_i^(j) is the value of the jth attribute and y_i is the value of the target attribute.
- The fuzzy terms defined on the ith attribute, A_i, are denoted by {F_i^(1), F_i^(2), ..., F_i^(r_i)}, where F_i^(j) is the jth fuzzy term and r_i is the number of fuzzy terms defined on attribute A_i.
- The membership function corresponding to the fuzzy term F_i^(j) is denoted by μ_{F_i^(j)}.
- The number of examples in the fuzzy dataset S is denoted by |S| and is defined as:

|S| = \sum_{i=1}^{n} \mu_S(X_i)

- The examples of the fuzzy dataset S which belong to class i are denoted by S_{y=c_i}.
- If S denotes the fuzzy dataset of a parent node, then the fuzzy dataset of the child node corresponding to the fuzzy term F_i^(j) is denoted by S[F_i^(j)]. For example, consider the child nodes in Figure 1: the branching attribute is A_i and the name of the fuzzy dataset of each node is written on it. In this paper, the jth child means the child node corresponding to the fuzzy term F_i^(j).

2.2.1 Fuzzy Decision Tree Construction

FDTs allow instances to follow down multiple branches simultaneously, with different satisfaction degrees in [0, 1]. To implement this, FDTs use fuzzy linguistic terms to specify the branching condition of nodes. In an FDT, an instance may fall into many leaves with different satisfaction degrees, because at any level it can fall into several child nodes of a parent node [28]. This is actually advantageous, as it provides more graceful behavior, especially when dealing with noisy or incomplete information. However, from a computational point of view, FDT induction is slower than crisp decision tree induction; this is the price paid for having a more accurate but still interpretable classifier [29-31].

Generally speaking, fuzzy decision tree induction has two major components: a procedure for fuzzy decision tree construction, and an inference procedure for decision making (i.e., class assignment for new instances). One of the FDT building procedures is a fuzzy extension of the well-known ID3 algorithm [32], named fuzzy ID3 (FID3). FID3 employs predefined fuzzy linguistic terms by which the attribute values of the training data are fuzzified [33]. This method extends the information gain measure to determine the branching attribute at each node expansion. Moreover, FID3 uses a fuzzy dataset in which, for each example, a degree of membership in the dataset is included as well as the crisp values of the attributes (input attributes and target attribute). The fuzzy dataset of a child node contains all the examples of the parent node, with the branching attribute eliminated. Another difference concerns the membership degrees of the examples. Suppose that S is the fuzzy dataset of the parent node, A_i is the branching attribute with fuzzy terms {F_i^(1), F_i^(2), ..., F_i^(r_i)}, and the fuzzy dataset of the child node corresponding to the fuzzy term F_i^(j) is S[F_i^(j)].

The membership degree of the hth example, X_h = [x_h^(1), x_h^(2), ..., x_h^(k), y_h]^T, in S[F_i^(j)] is defined as [28]:

\mu_{S[F_i^{(j)}]}(X_h) = \mu_S(X_h) \times \mu_{F_i^{(j)}}(x_h^{(i)})    (3)

where μ_S(X_h) is the membership degree of X_h in S, and μ_{F_i^(j)}(x_h^(i)) is the membership degree of x_h^(i), with respect to the fuzzy term F_i^(j), in the MF μ_{F_i^(j)}. In the generalized case, the multiplication operator can be replaced with a t-norm operator.

FID3 selects the attribute with the maximum FIG as the branching attribute. FIG utilizes the Fuzzy Entropy (FE), which is defined as [28]:

FE(S) = \sum_{i=1}^{m} -\frac{|S_{y=c_i}|}{|S|} \log_2 \frac{|S_{y=c_i}|}{|S|}    (4)

The FIG of attribute A_i relative to a fuzzy dataset S is defined as [28]:

FIG(S, A_i) = FE(S) - \sum_{j=1}^{r_i} w_j \times FE(S[F_i^{(j)}])    (5)

where FE(S) is the fuzzy entropy of the fuzzy dataset S, FE(S[F_i^(j)]) is the fuzzy entropy of the jth child node, and w_j is the fraction of examples that belong to the jth child node, defined as [28]:

w_j = \frac{|S[F_i^{(j)}]|}{\sum_{k=1}^{r_i} |S[F_i^{(k)}]|}    (6)

The first term in Eq. (5) is just the fuzzy entropy of S, and the second term is the expected value of the fuzzy entropy after S is partitioned using attribute A_i. FIG(S, A_i) is therefore the expected reduction in fuzzy entropy caused by knowing the value of attribute A_i. There are some other methods for selecting the branching attribute [34,35].
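As a concrete illustration of Eqs. (3)-(6), the following Python sketch (our own illustration, not the authors' Matlab code) computes the fuzzy entropy of a fuzzy dataset and the FIG of a single attribute, given the example membership degrees, the class labels and the evaluated term MFs; the product t-norm of Eq. (3) is assumed.

```python
import numpy as np

def fuzzy_entropy(mu, y, classes):
    """Eq. (4): fuzzy entropy of a fuzzy dataset with example memberships mu and labels y."""
    total = mu.sum()
    if total == 0:
        return 0.0
    fe = 0.0
    for c in classes:
        p = mu[y == c].sum() / total
        if p > 0:
            fe -= p * np.log2(p)
    return fe

def fuzzy_information_gain(mu, y, term_memberships, classes):
    """Eq. (5): FIG of one attribute; term_memberships[:, j] holds mu_{F_i^(j)} of each example."""
    child_mu = mu[:, None] * term_memberships          # Eq. (3), product t-norm
    sizes = child_mu.sum(axis=0)                       # |S[F_i^(j)]|
    weights = sizes / sizes.sum()                      # Eq. (6)
    expected = sum(w * fuzzy_entropy(child_mu[:, j], y, classes)
                   for j, w in enumerate(weights))
    return fuzzy_entropy(mu, y, classes) - expected
```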

Figure 2 summarizes the FDT construction procedure. As mentioned in steps 3 and 5 of the algorithm, a stopping criterion is used as an early stopping technique for constructing the FDT. Following the conclusions of ref. [28], the Normalized Maximum FIG multiplied by the Number of Instances (NMGNI) is utilized in this study as the stopping criterion. Based on the above algorithm, the decision tree construction can be performed according to different discretization methods. A HFS B over the attributes can be defined as follows:

B = { <A_i, h_B(A_i)> | A_i ∈ A }    (7)

where h_B(A_i) = { FIG_Fayyad(S, A_i), FIG_Fusinter(S, A_i), FIG_FixedFrequency(S, A_i), FIG_Proportional(S, A_i), FIG_UniformFrequency(S, A_i) }. Thus, the HFIG of each attribute A_i relative to a fuzzy dataset S is calculated as:

HFIG(S, A_i) = E_{HFS}(B) = \frac{1}{5} \left( FIG_{Fayyad}(S, A_i)^2 + FIG_{Fusinter}(S, A_i)^2 + FIG_{FixedFrequency}(S, A_i)^2 + FIG_{Proportional}(S, A_i)^2 + FIG_{UniformFrequency}(S, A_i)^2 \right)    (8)

The above HFIG is the information energy of the five FIG values calculated with the five discretization approaches, and it equals E_HFS(B) as defined in Definition 2 of Section 2.1. HFIG can be used instead of FIG in step 6 of the above algorithm.
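As a small numeric illustration of Eq. (8) (the FIG values below are made up for the example and are not taken from the paper):

```python
# Hypothetical FIG values of one attribute under the five discretization methods.
figs = {"Fayyad": 0.31, "Fusinter": 0.28, "FixedFrequency": 0.25,
        "Proportional": 0.27, "UniformFrequency": 0.30}
hfig = sum(v * v for v in figs.values()) / len(figs)   # information energy, Eq. (8)
print(round(hfig, 4))
```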

2.2.2 Inference of FDT

A classical decision tree can be converted to a set of rules [36]. One can think of each leaf as one rule: the conditions leading to the leaf generate the conjunctive antecedent, and the classification of the examples in the leaf produces the consequent. In this case, a consistent set of rules is generated only when the examples of every leaf have a unique classification, which may happen only when a sufficient set of attributes is used and the training data is consistent. Since in a fuzzy representation a value can have nonzero membership in more than one fuzzy set, the inconsistency problem dramatically increases. To cope with this problem, approximate reasoning methods have been used for decision assignment [37].

Like classical decision trees, FDTs can also be converted to a set of fuzzy if-then rules. Figure 3 depicts a typical fuzzy decision tree. The fuzzy if-then rules corresponding to each leaf in this FDT are:

Leaf 1: if [A_1 is F_1^(1)] and [A_2 is F_2^(1)] then C1: 0.2 and C2: 0.8
Leaf 2: if [A_1 is F_1^(1)] and [A_2 is F_2^(2)] then C1: 0.6 and C2: 0.4
Leaf 3: if [A_1 is F_1^(2)] then C1: 0.3 and C2: 0.7

In this example, C1 and C2 represent classes one and two, respectively. We can apply approximate reasoning to derive conclusions from this set of fuzzy if-then rules and known facts. Succinctly, approximate reasoning can be divided into four steps [38]:

1) Degrees of compatibility: compare the known facts with the antecedents of the fuzzy rules to find the degrees of compatibility with respect to each antecedent MF.
2) Firing strength: combine the degrees of compatibility with respect to the antecedent MFs of a rule into a firing strength that indicates the degree to which the antecedent part of the rule is satisfied. In a fuzzy decision tree, the antecedent MFs are connected with AND operators, so one t-norm operator can be used for this step; we adopt the multiplication operator among the many alternatives.
3) Certainty degree of each class in each rule: the firing strength of each rule is combined with the certainty degrees of the classes attached to the leaf node. A t-norm operator can be employed for this step; the multiplication operator is used in this research. Therefore, for the example above, each rule yields two certainty degrees, relating to classes C1 and C2.
4) Overall output: aggregate all the certainty degrees from all fuzzy if-then rules relating to the same class. One s-norm operator should be used for this aggregation; we adopt the sum from the several alternatives. The total certainty degrees of the different classes may exceed unity, so we normalize them.

Figure 4 shows the compatibility degree values of the example X = (x^(1), x^(2)) for each antecedent MF of the FDT given in Figure 3; the compatibility degrees are shown on the edges of the tree. Performing steps 2, 3 and 4 described above, the total certainty degree of each class is calculated by aggregating the certainty degree values over all the rules.
The aggregation is performed by the sum as follows:

Total certainty degree for C1 = 0.65×0.45×0.2 + 0.65×0.55×0.6 + 0.25×0.3 = 0.348 = μ(C1)    (9)
Total certainty degree for C2 = 0.65×0.45×0.8 + 0.65×0.55×0.4 + 0.25×0.7 = 0.552 = μ(C2)    (10)

These results show that X belongs not only to class C1 with total certainty degree 0.39, but also to class C2 with total certainty degree 0.61 (the final certainty degrees have been normalized). Therefore, the inference procedure labels the example X as C2.
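The four reasoning steps applied to the toy FDT of Figures 3 and 4 can be written in a few lines of Python; the sketch below (our own illustration) reproduces the numbers of Eqs. (9) and (10).

```python
# Product t-norm for firing strengths and certainty degrees, sum for aggregation,
# then normalization. Each rule: (compatibility degrees on its path, leaf certainties).
rules = [
    ([0.65, 0.45], {"C1": 0.2, "C2": 0.8}),   # Leaf 1
    ([0.65, 0.55], {"C1": 0.6, "C2": 0.4}),   # Leaf 2
    ([0.25],       {"C1": 0.3, "C2": 0.7}),   # Leaf 3
]

totals = {"C1": 0.0, "C2": 0.0}
for compat, leaf in rules:
    firing = 1.0
    for c in compat:                 # step 2: product over the antecedent
        firing *= c
    for cls, cert in leaf.items():   # steps 3-4: product, then sum aggregation
        totals[cls] += firing * cert

s = sum(totals.values())
normalized = {cls: v / s for cls, v in totals.items()}
print(totals)       # {'C1': 0.348, 'C2': 0.552}
print(normalized)   # about {'C1': 0.39, 'C2': 0.61} -> predict C2
```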

2.3. Discretization methods

Discretization is the process of partitioning continuous attributes into discrete ones. Since some machine learning algorithms are designed to operate on discretized attributes, discretization is an important step in machine learning applications in which continuous attributes must be handled [39]. There are many discretization methods, and they can be categorized in several ways: splitting vs. merging, supervised vs. unsupervised, dynamic vs. static, and global vs. local [40]. Discretization measures include binning, entropy and dependency. Here, we briefly review the five discretization methods applied in the proposed approach, namely Fayyad, Fusinter, fixed frequency, proportional and uniform frequency.

Fayyad [41] is a splitting discretization method based on an entropy measure; it divides an interval into sub-intervals using a cut-point that minimizes the weighted entropy of the generated sub-intervals, and uses MDLP to stop the discretization process. Fusinter [42] is a merging (bottom-up) discretization method that starts with the complete list of all continuous values of the attribute as cut-points and then merges intervals by eliminating some of the cut-points to obtain the final discretization. The fixed-frequency discretization method [43] generates intervals containing k examples each, where k is a user-defined parameter. The proportional discretization method [43] is an unsupervised discretization method in which the interval frequency and the number of intervals are set equal and proportional to the amount of training data, in order to secure both low bias and low variance. The uniform-frequency discretization method [44] takes a parameter k and generates k intervals with an equal number of examples, as a kind of supervised discretization method.
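To make the simplest of these discretizers concrete, the sketch below (our own illustration, not from the paper) generates cut-points for uniform-frequency and fixed-frequency discretization; Fayyad, Fusinter and proportional discretization are omitted.

```python
import numpy as np

def uniform_frequency_cutpoints(values, k):
    """k equal-frequency intervals -> k-1 cut points."""
    v = np.sort(np.asarray(values, dtype=float))
    return [float(v[int(len(v) * i / k)]) for i in range(1, k)]

def fixed_frequency_cutpoints(values, k):
    """Intervals of roughly k examples each (the last interval may be smaller)."""
    v = np.sort(np.asarray(values, dtype=float))
    return [float(v[i]) for i in range(k, len(v), k)]

x = np.random.RandomState(0).rand(100)
print(uniform_frequency_cutpoints(x, 4))   # 3 cut points -> 4 intervals
print(fixed_frequency_cutpoints(x, 25))    # one cut point every 25 examples
```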

3. Proposed Method

Our proposed FDT method addresses the class imbalance problem by converting the majority class samples into several classes. It includes three general components: data balancing, FDT construction, and the aggregation of FDTs. In our method, the majority class samples are divided into several classes using the K-means clustering algorithm. Afterward, five discretization methods (Fayyad, Fusinter, fixed frequency, proportional and uniform frequency) are used to obtain cut-points, and the MFs of each attribute are generated based on each discretization method. Thereafter, five FDTs are constructed on the obtained fuzzy datasets. Finally, these classifiers are aggregated into a single classifier to classify new data. In the following subsections, the steps of the proposed method are explained in more detail.

3.1. Data Balancing

Over-sampling and under-sampling are the two conventional methods used to balance imbalanced datasets. Under-sampling may discard useful information by removing majority class samples, and over-sampling is likely to lead to overfitting by increasing the number of minority class samples. Therefore, it is reasonable to convert the majority class samples into several classes without changing the number of data points. To this end, clustering can be used. First, K-means clustering is applied to the majority class samples and multiple clusters are obtained. Then, each sample of these clusters is labeled with its cluster number, so that a new balanced dataset with multiple pseudo classes is constructed. This idea is borrowed from the work of Yin et al., who used it for feature selection [45]. Clustering of the majority class samples was also used in previous works [20],[3],[23]: it was employed for finding data points far from the cluster centers and removing them [20], for constructing several classifiers on the generated clusters in ensemble methods [3], and for imbalanced classification with a support vector machine (SVM) classifier [23].

3.2 Generating the Membership Functions

Each discretization method produces its own splitting cut-points based on its criterion [46]. Each partition generated by a discretization method is bounded by two cut-points, a lower cut-point and an upper cut-point [46]. Generating MFs on the domain of an attribute can be considered as either fuzzy discretization or fuzzy partitioning, in which each fuzzy partition is specified by an MF [46]. In our proposed methods, the five discretization methods are applied to generate cut-points, and MFs are then defined over the cut-points (e.g., two cut-points ct1 and ct2) with the standard deviation based method proposed in [46]. This approach utilizes the mean and standard deviation of the samples located in each partition. A user-defined parameter named stdCoefficient, a real positive number, controls the fuzziness of the generated MFs. With the standard deviation based method, the membership grades of the lower and upper cut-points are not necessarily equal to 0.5. The standard deviation based method defines triangular MFs using the relations presented in Table I. In the middle MFs, the second parameter of the triangular MF is set to the mean value of all examples inside the partition, and the first and last parameters are set to the points at a distance of 2*stdVal to the left and right of the mean value, respectively. stdVal is the standard deviation (std) of all examples inside the partition, normalized by multiplying it by the user-defined parameter stdCoefficient.
The standard deviation based MF definition method changes the type of the leftmost and rightmost MFs to trapezoidal. In the leftmost trapezoidal MF, the membership grades of all examples appearing before the mean value are equal to one; in the rightmost trapezoidal MF, the membership grades of all examples appearing after the mean value are equal to one [46]. MFs are generated for the five discretization methods and five FDTs are constructed based on them. Table I shows the parameters of the triangular and trapezoidal MFs based on the standard deviation method: a, b, c and d in the first and last columns are the trapezoidal MF parameters, and a, b and c in the middle column are the triangular MF parameters, where ct1 and ct2 denote two cut points.
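A minimal sketch of the data-balancing step of Section 3.1 is given below (our own illustration using scikit-learn, not the authors' Matlab code): the majority class is split into a chosen number of clusters and each of its samples is relabeled with a synthetic pseudo-class, while the minority class and the total number of samples are left unchanged.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_balance(X, y, majority_label, n_clusters, seed=0):
    """Relabel majority-class samples with their K-means cluster id; no samples added or removed."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    new_y = y.astype(object).copy()
    maj = (y == majority_label)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X[maj])
    # each majority sample gets a synthetic class label such as "maj_0", "maj_1", ...
    new_y[maj] = [f"maj_{c}" for c in km.labels_]
    return new_y
```

The number of clusters is a per-dataset choice (cf. #Cluster in Table III), obtained empirically in the paper.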

3.3. FDT Construction

The FDT construction algorithm was described in Section 2.2.1 and Figure 2. Two approaches are utilized in the FDT construction process; they are detailed in Section 3.5. The first approach uses FIG as the node selection criterion, so five FDTs are constructed based on the five discretization methods. The second approach uses HFIG as the splitting criterion and also generates five FDTs; since the nodes are selected based on HFIG, the node structure is the same for all five trees. The only difference among the five FDTs in the second approach is the fuzzy partitioning of the nodes, which is based on the specific discretization method employed by each tree.

3.4. The aggregation of FDTs

Suppose that the class labels of each data point are C1 and C2, and let AM1 and AM2 denote the final classification results obtained by an aggregation method for classes C1 and C2, respectively. For the ith classifier, μ^(i)(C1) and μ^(i)(C2) are the membership degrees of the new data point in classes C1 and C2, respectively (calculated as in relations (9) and (10)). In some of our proposed methods, the results of the five FDTs are aggregated using three aggregation approaches, namely fuzzy majority voting (FMV), fuzzy weighted voting (FWV) and hesitant fuzzy information energy voting (HFIEV). Table II summarizes the detailed aggregation strategies and descriptions of these three methods. Finally, given the classification results AM1 and AM2 obtained by the aggregation methods of Table II, the new data point is classified as C1 if AM1 ≥ AM2, and otherwise as C2.

3.5. Naming different FDT classifiers

For naming the FDT classifiers used in this paper, the notation "FDT construction measure / fuzzy partitioning method / FDT aggregation method" is utilized. For example, HFIG/Fayyad/NA means an FDT constructed based on HFIG, with the fuzzy partitioning of attributes based on the Fayyad discretization and no aggregation method employed.

3.5.1. Single FDT approaches (Category No.1): The models of Category No.1 are single FDTs, constructed on the basis of the algorithm given in Figure 2. All classifiers in this category use HFIG in the FDT construction process, and the fuzzy partitioning of attributes is performed by only one discretization method. In HFIG/Fayyad/NA, HFIG is used in the construction of the FDT classifier, the Fayyad discretization technique is used for generating MFs, and no FDT aggregation method is used. The descriptions of HFIG/Fixed Frequency/NA, HFIG/Fusinter/NA, HFIG/Proportional/NA and HFIG/Uniform Frequency/NA are the same as that of HFIG/Fayyad/NA; the only difference is the discretization method employed for fuzzy partitioning. Five FDTs are generated whose nodes at the different levels are the same but whose fuzzy partitions differ (i.e., the partitioning of the nodes of each tree is based on its specific discretization method). Figure 5 shows the flowchart of the Category No.1 methods.

3.5.2. Ensemble FDT approaches based on the FIG criterion (Category No.2): FDTs are constructed based on the algorithm given in Figure 2 and are aggregated by one of the aggregation methods described in Section 3.4. This category includes methods which use FIG in the FDT construction process and aggregate the results of the FDTs for predicting the class labels of new samples. These methods include the following. FIG/Five Partitioning/FMV: in this method, five FDTs are first constructed using the FIG measure and the five different discretization techniques.
Finally, for predicting the class labels of new samples, this method uses the FMV aggregation method to combine the results of the five FDTs. The descriptions of FIG/Five Partitioning/FWV and FIG/Five Partitioning/HFIEV are the same as that of FIG/Five Partitioning/FMV, except for the aggregation method employed. Figure 6 shows the flowchart of the Category No.2 methods.
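Table II is not reproduced in this text, so the sketch below should be read as one plausible interpretation of the three aggregators of Section 3.4 rather than the authors' exact formulas: FMV counts per-tree winners, FWV sums the certainty degrees, and HFIEV takes the information energy (Eq. (2)) of the five certainty degrees of each class.

```python
# mu[i][c] is the normalized certainty degree of class c produced by the i-th FDT
# (relations (9) and (10)). The concrete formulas below are our assumption.

def aggregate(mu, method="FMV"):
    classes = mu[0].keys()
    if method == "FMV":    # fuzzy majority voting: count the per-tree winners
        return {c: sum(1 for m in mu if max(m, key=m.get) == c) for c in classes}
    if method == "FWV":    # fuzzy weighted voting: sum the certainty degrees
        return {c: sum(m[c] for m in mu) for c in classes}
    if method == "HFIEV":  # hesitant fuzzy information energy of the five degrees
        return {c: sum(m[c] ** 2 for m in mu) / len(mu) for c in classes}
    raise ValueError(method)

mu = [{"C1": 0.39, "C2": 0.61}, {"C1": 0.55, "C2": 0.45}, {"C1": 0.30, "C2": 0.70},
      {"C1": 0.48, "C2": 0.52}, {"C1": 0.62, "C2": 0.38}]
scores = aggregate(mu, "HFIEV")
print(max(scores, key=scores.get))   # predicted label: class with the larger AM value
```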

3.5.3. Ensemble FDT approaches based on the HFIG criterion (Category No.3): In this category, the FDTs are again constructed in the same way as in the previous category, i.e., by using the algorithm of Figure 2 together with the aggregation methods of Section 3.4. This category includes methods which use HFIG in the FDT construction process and aggregate the results of the FDTs for predicting the class labels of new samples. These methods are HFIG/Five Partitioning/FMV, HFIG/Five Partitioning/FWV and HFIG/Five Partitioning/HFIEV. Figure 7 shows the flowchart of the Category No.3 methods. Similar to the FDTs in Category No.1, the FDT classifiers employed in this category have the same nodes, since they employ HFIG as the node selection criterion; the difference among the FDTs lies in the fuzzy partitions of the nodes, which are generated based on the specific discretization methods.

4. Experimental Results

4.1. Setup

To evaluate the FDT classifiers described in Section 3, several experiments were designed and implemented in Matlab. The results are compared with some traditional classification methods and fuzzy rule based classifiers. The traditional classifiers are the crisp decision tree (C4.5) and the K-Nearest Neighbour (KNN) classifier. The fuzzy rule based classifiers are GFS-LogitBoost (LB) [12], GFS-Max LogitBoost (MLB) [13], GP-COACH-9 [14], GP-COACH-H [16], GA-FS+GL [15] and IRL-ID-MOEA [18]. The FDT construction process stops when the accuracy of a node is greater than 0.95. The user-defined parameter stdCoefficient is set to 1.5 in the standard deviation based MF definition. In our proposed method, the k-means clustering algorithm is applied to the majority class samples to balance the datasets. The number of clusters (#Cluster) used to divide the majority class samples into multiple clusters is obtained empirically by trial and error and is given for each dataset in Table III.

4.2. Datasets

Twenty highly imbalanced two-class data sets are taken from the KEEL data set repository [47] and used to evaluate the performance of the proposed methods. The detailed description of the data sets, namely the number of attributes (#Attribute), the number of samples (#Sample), and the imbalance ratio (IR, the number of majority class samples over the number of minority class samples), is given in Table III.

4.3. Results and discussion

To evaluate the proposed methods, thirty independent runs were performed for every algorithm, and in each run 5-Fold Cross Validation (5-CV) was conducted to measure the performance. The average of the AUC values on the test data over the various runs of each algorithm, called AAUCtest in the following, is used as the criterion for comparing the proposed methods with the other classification approaches. The AAUCtest results for the traditional classification methods and the fuzzy rule based classifiers are given in Table IV; the results of columns 2-8 are extracted from [8] and the results of IRL-ID-MOEA are extracted from its original paper [18]. Since our proposed methods would require eleven additional columns that cannot be placed in Table IV, Table V presents the AAUCtest values obtained by the proposed methods.

As mentioned in Section 3, our proposed methods are grouped into three categories. The methods in Category No.1 are single FDT classifiers whose nodes are selected by HFIG (based on the five discretization methods) but whose fuzzy partitioning is performed by only one discretization approach; the first five result columns of Table V (columns 2-6) correspond to these FDTs. The results of the Category No.2 methods are presented in columns 7-9 of Table V; according to the descriptions given in Section 3, the methods in Category No.2 are ensemble approaches in which five FDTs are aggregated by the three aggregation methods, with FIG used for node selection and the five discretization methods used for the fuzzy partitioning of nodes. The results of the Category No.3 methods are given in columns 10-12 of Table V.

The bolded values in each row of Tables IV and V are the best AAUCtest values obtained on the corresponding dataset; more than one bolded value may appear in a single row, which means that more than one method is best on that dataset. It is notable that most of the bolded values are placed in Table V, meaning that our proposed approaches outperform the methods in the literature; the results of Table V are better than those given in Table IV. For further illustration and comparison, the cells of Tables IV and V that contain bolded values are highlighted with a shading color, and the number of shaded cells in Table V is clearly larger than in Table IV. There are 27 bolded values in Tables IV and V in total, of which 8 are placed in Table IV (about 30% of the total best values) and 18 in Table V (about 70% of the total bold values).

Moreover, within Table V, the portions of bolded values belonging to the methods of Categories No.1, No.2 and No.3 are about 11%, 44% and 15%, respectively. Even though the performance of the methods in Categories No.1 and No.3 is approximately equal, the methods in Category No.2 appear to outperform the other categories of our proposed approaches. In the following, we present a more precise analysis of the results of Tables IV and V via non-parametric statistical tests, which are employed to reveal the significant differences among the methods.
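A minimal sketch of the evaluation protocol is given below (our own illustration): repeated stratified 5-fold cross-validation with the test AUC averaged over folds and runs. It assumes 0/1 class labels and that the AUC is computed from crisp predictions, in which case it equals (TPrate + TNrate)/2; `clf` stands for any classifier with scikit-learn-style fit/predict methods.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def average_auc(clf, X, y, runs=30, folds=5):
    """Average test AUC over `runs` repetitions of stratified `folds`-fold CV (AAUCtest)."""
    aucs = []
    for run in range(runs):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        for tr, te in skf.split(X, y):
            clf.fit(X[tr], y[tr])
            aucs.append(roc_auc_score(y[te], clf.predict(X[te])))
    return float(np.mean(aucs))   # 30 runs x 5 folds = 150 evaluations per dataset
```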

4.4. Analyzing the results via a non-parametric statistical test

Statistical analysis should be conducted in order to find out whether there is a significant difference among the different algorithms [48]. The Aligned Friedman non-parametric test [49] is employed for analyzing the experimental results. This test first assigns ranking scores to the methods so as to determine whether significant differences exist among them; furthermore, the method with the lowest rank is considered the control method. Figure 8 gives the ranking values of the different methods, obtained from the results of Tables IV and V by the Aligned Friedman ranking algorithm. The Aligned Friedman statistic (distributed according to chi-square with 18 degrees of freedom) is 19.20, and the P-value computed from it is 0.3794. The null hypothesis is that all the methods have the same performance, and its rejection means that there exist significant differences among the methods. According to the first step of the Aligned Friedman test, the null hypothesis is rejected (which means the proposed methods are not equivalent to the other methods). FIG/Five Partitioning/FMV is the control method; this classifier of Category No.2 has the best rank value compared with the other classifiers. According to Figure 8, the other methods of Category No.2, namely FIG/Five Partitioning/HFIEV and FIG/Five Partitioning/FWV, are placed at the second and third ranks, respectively. These ranks confirm the superiority of the Category No.2 approaches compared with the other methods. Furthermore, the fourth and fifth places in the ranking list are taken by HFIG/Five Partitioning/FMV and HFIG/Proportional/NA, methods of Category No.3 and Category No.1, respectively. Moreover, HFIG/Fixed Frequency/NA, one of the methods of Category No.1, has a rank score close to successful rule based approaches in the literature such as GA-FS+GL.

In the second step, a post hoc procedure is usually applied to reveal the significant differences between the control method and the other methods according to the ranking list obtained in the first step. In this research, the Finner test is utilized as the post hoc procedure, since it is easy to understand and offers better results than the remaining tests [50, 51]. Therefore, the Finner test [51] is employed to compare the best method (namely, the control method) with the other methods. Table VI shows the results of applying the Finner procedure, which rejects those hypotheses whose PFinner ≤ 0.03086. The first column of Table VI is the rank of the different methods; the methods are sorted in ascending order according to the z values given in the third column. As mentioned earlier, FIG/Five Partitioning/FMV is the control method and therefore does not appear in Table VI; the other 18 methods are compared with the control method. The methods with ranks 1-7 are better than the others according to both the first-step ranking and the Finner procedure ranking. Since the ranking scores of the first 7 methods are close to each other and also close to that of the control method, the Finner test indicates that these methods do not differ significantly from the control method; as can be seen in the table, ranks 1-7 are not rejected by the Finner test because their P-values are greater than 0.03086. The results of Table VI confirm that the methods of Category No.2 perform well enough for highly imbalanced classification and that there are no significant differences among them in terms of classification efficiency. Also, according to the first column of Table VI, the methods with ranks 3, 4 and 6 belong to Categories No.2, No.1 and No.1 of our proposed approaches, respectively. Finally, since the methods in rows 1-7 are not significantly different, our proposed approaches in all three categories are efficient for imbalanced classification.
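For reproducibility, a rough sketch of the first step of the statistical comparison is given below. SciPy provides only the standard Friedman test, not the Aligned Friedman ranking or the Finner post hoc procedure used here, so this should be read as an approximation; the AAUCtest matrix is a random placeholder for the values of Tables IV and V.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# auc is assumed to be a (datasets x methods) matrix of AAUCtest values.
auc = np.random.RandomState(1).rand(20, 19)           # placeholder data
stat, p = friedmanchisquare(*[auc[:, j] for j in range(auc.shape[1])])
print(f"Friedman statistic = {stat:.2f}, p-value = {p:.4f}")
```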

5. Conclusion

In this paper, we presented a new approach to constructing FDTs based on hesitant fuzzy sets for highly imbalanced classification. Our main intention was to use different discretization, aggregation and node selection approaches in FDTs. In recent years, HFSs have become known as tools for representing and aggregating the opinions of several experts. Owing to this point, the main contribution of this research was to treat several discretization algorithms and node selection measures as different artificial experts that contribute, via some aggregation methods, to constructing efficient FDT based classifiers. We explored several ideas for combining node selection measures, discretization methods and aggregation approaches in FDT construction; these ideas were implemented and grouped into three categories. Experimental results were obtained on twenty highly imbalanced datasets and analyzed with non-parametric statistical tests. The outcomes of the statistical analysis clearly justified the efficiency of our proposed approaches. As mentioned in the results and discussion section, our approaches in all three categories (especially Category No.2) were efficient, while they were not significantly different from each other. From the obtained experimental results and their analysis, the following aspects are concluded to be the advantages of our proposed methods:

- The introduced approaches utilize FDTs, which are easy to implement and understand and, thanks to their fuzzy nature, have the ability to adapt to different data intervals.
- The algorithm is capable of turning an imbalanced dataset into a balanced one through pseudo-labels obtained by k-means.
- The various FDTs constructed with different partitioning and node selection measures (employing different discretization approaches) form a set of diverse classifiers that can be adapted efficiently to different classification problems; the aggregation of diverse classifiers usually produces an efficient final classifier.
- Our proposed approaches exploit the advantages of several information granulation approaches (different discretization and fuzzy partitioning methods), so that, for instance, the weakness of one method can be compensated by the capability of another.

Several efficient algorithms have been developed in recent years for solving imbalanced classification problems, such as under-sampling, over-sampling, border filtering and cost-sensitive approaches. Our proposed approaches belong to the algorithm level category. As a future extension of our algorithms, the membership degrees obtained from the different fuzzy partitioning methods could be aggregated and used for calculating the FIG, instead of generating the HFIG by aggregating the FIGs achieved from the different discretization methods. At the algorithm level, there are still further ideas for combining different existing efficient classifiers via HFSs; for example, one could combine C4.5 algorithms adapted for imbalanced classification via HFSs. Another proposed line of research for future work is to consider the data level approach and develop hybrid sampling approaches via hesitant fuzzy concepts: various sampling approaches can be considered as different experts that contribute to new under-sampling or over-sampling algorithms through HFSs. Also, at the cost-sensitive level, future work can focus on developing new cost measures based on the effective ones existing in the literature. Eventually, in this research and other studies recently performed by the authors, we aim to establish a new landscape in machine learning in which each algorithm and concept can be considered as an expert. In other words, different supervised and unsupervised machine learning algorithms are artificial experts and can be combined for solving a problem through a multi-criteria decision making tool such as HFSs. Therefore, researchers who have developed multi-criteria concepts such as HFSs are encouraged to apply their novel definitions and theories to combining machine learning algorithms.

Acknowledgment

The authors would like to thank Dr. M. Najafzadeh and Mr. M.K. Ebrahimpour for language checking.

References

[1] M. Hosseinzadeh and M. Eftekhari, "Improving rotation forest performance for imbalanced data classification through fuzzy clustering," in Artificial Intelligence and Signal Processing (AISP), 2015 International Symposium, pp. 35-40, 2015.
[2] A. Rajan, C. Chand, and G. Kulkarni, "Survey on classification algorithms for imbalanced dataset," International Institution for Technological Research and Development, vol. 1, issue 2, 2015.
[3] Z. Sun, et al., "A novel ensemble method for classifying imbalanced data," Pattern Recognition, vol. 48, pp. 1623-1637, 2015.
[4] M.K. Ebrahimpour and M. Eftekhari, "Feature subset selection using Information Energy and correlation coefficients of hesitant fuzzy sets," in Information and Knowledge Technology (IKT), 2015 7th Conference on, IEEE, 2015.
[5] M.K. Ebrahimpour and M. Eftekhari, "Proposing a novel feature selection algorithm based on Hesitant Fuzzy Sets and correlation concepts," in Artificial Intelligence and Signal Processing (AISP), 2015 International Symposium on, IEEE, 2015.
[6] M.K. Ebrahimpour and M. Eftekhari, "Ensemble of feature selection methods: A hesitant fuzzy sets approach," Applied Soft Computing, vol. 50, pp. 300-312, 2017.
[7] G. Liang, J. van den Berg, and U. Kaymak, "A comparative study of three Decision Tree algorithms: ID3, Fuzzy ID3 and Probabilistic Fuzzy ID3," Bachelor Thesis, Informatics & Economics, Erasmus University Rotterdam, Rotterdam, Netherlands, 2005.
[8] M. Mehdizadeh and M. Eftekhari, "Generating fuzzy rule base classifier for highly imbalanced datasets using a hybrid of evolutionary algorithms and subtractive clustering," Journal of Intelligent & Fuzzy Systems, vol. 27, pp. 3033-3046, 2014.
[9] H. Ishibuchi, T. Yamamoto, and T. Nakashima, "Hybridization of fuzzy GBML approaches for pattern classification problems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 35, pp. 359-365, 2005.
[10] T. Nakashima, et al., "A weighted fuzzy classifier and its application to image processing tasks," Fuzzy Sets and Systems, vol. 158, pp. 284-294, 2007.
[11] E. G. Mansoori, M. J. Zolghadri, and S. D. Katebi, "SGERD: A steady-state genetic algorithm for extracting fuzzy classification rules from data," IEEE Transactions on Fuzzy Systems, vol. 16, pp. 1061-1071, 2008.
[12] J. Otero and L. Sanchez, "Induction of descriptive fuzzy classifiers with the Logitboost algorithm," Soft Computing, vol. 10, pp. 825-835, 2006.
[13] L. Sanchez and J. Otero, "Boosting fuzzy rules in classification problems under single-winner inference," International Journal of Intelligent Systems, vol. 22, pp. 1021-1034, 2007.
[14] F. J. Berlanga, et al., "GP-COACH: Genetic Programming-based learning of COmpact and ACcurate fuzzy rule-based classification systems for High-dimensional problems," Information Sciences, vol. 180, pp. 1183-1200, 2010.
[15] P. Villar, et al., "Feature selection and granularity learning in genetic fuzzy rule-based classification systems for highly imbalanced data-sets," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 20, pp. 369-397, 2012.
[16] V. Lopez, et al., "A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets," Knowledge-Based Systems, vol. 38, pp. 85-104, 2013.
[17] H.-L. Dai, "Class imbalance learning via a fuzzy total margin based support vector machine," Applied Soft Computing, vol. 31, pp. 172-184, 2015.
[18] C. E. Hinojosa, H. A. Camargo, and V. Yv, "Learning fuzzy classification rules from imbalanced datasets using multi-objective evolutionary algorithm," in 2015 Latin America Congress on Computational Intelligence (LA-CCI), 2015, pp. 1-6.
[19] M.-J. Kim, D.-K. Kang, and H. B. Kim, "Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction," Expert Systems with Applications, vol. 42, pp. 1074-1082, 2015.
[20] H.-J. Kim, N.-O. Jo, and K.-S. Shin, "Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction," Expert Systems with Applications, vol. 59, pp. 226-234, 2016.
[21] Q. D. Tran and P. Liatsis, "RABOC: An approach to handle class imbalance in multimodal biometric authentication," Neurocomputing, vol. 188, pp. 167-177, 2016.
[22] J. Błaszczyński and J. Stefanowski, "Neighbourhood sampling in bagging for imbalanced data," Neurocomputing, vol. 150, pp. 529-542, 2015.
[23] Y. Zhao, et al., "Imbalanced classification by learning hidden data structure," IIE Transactions, pp. 1-15, 2016.
[24] R. M. Rodriguez, et al., "A position and perspective analysis of hesitant fuzzy sets on information fusion in decision making. Towards high quality progress," Information Fusion, vol. 29, pp. 89-97, 2016.
[25] V. Torra, "Hesitant fuzzy sets," International Journal of Intelligent Systems, vol. 25, pp. 529-539, 2010.
[26] R. M. Rodriguez, et al., "Hesitant fuzzy sets: state of the art and future directions," International Journal of Intelligent Systems, vol. 29, pp. 495-524, 2014.
[27] N. Chen, Z. Xu, and M. Xia, "Correlation coefficients of hesitant fuzzy sets and their applications to clustering analysis," Applied Mathematical Modelling, vol. 37, pp. 2197-2211, 2013.
[28] M. Zeinalkhani and M. Eftekhari, "Comparing different stopping criteria for fuzzy decision tree induction through IDFID3," Iranian Journal of Fuzzy Systems, vol. 11, pp. 27-48, 2014.
[29] Y.L. Chen, et al., "A survey of fuzzy decision tree classifier," Fuzzy Information and Engineering, vol. 1, no. 2, pp. 149-159, 2009.
[30] T. Wang, et al., "A survey of fuzzy decision tree classifier methodology," in Fuzzy Information and Engineering, B.Y. Cao (Ed.), Springer Berlin/Heidelberg, 2007, pp. 959-968.
[31] F. Afsari, M. Eftekhari, E. Eslami, and P.-Y. Woo, "Interpretability-based fuzzy decision tree classifier: a hybrid of the subtractive clustering and the multi-objective evolutionary algorithm," Soft Computing, vol. 17, pp. 1673-1686, 2013.
[32] T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[33] M. Umano, et al., "Fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis systems," in Proceedings of Third IEEE Conference on Fuzzy Systems, 1994, pp. 2113-2118.
[34] Y. Yuan and M.J. Shaw, "Induction of fuzzy decision trees," Fuzzy Sets and Systems, vol. 69, pp. 125-139, 1995.
[35] X. Wang and C. Borgelt, "Information measures in fuzzy decision trees," in Proceedings of 13th IEEE International Conference on Fuzzy Systems, 2004, pp. 85-90.
[36] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[37] C.Z. Janikow, "Fuzzy decision trees: issues and methods," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 28, no. 1, pp. 1-14, 1998.
[38] J.S.R. Jang, C.T. Sun and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, 1997.
[39] J. Alcala-Fdez, et al., "KEEL: a software tool to assess evolutionary algorithms for data mining problems," Soft Computing, vol. 13, pp. 307-318, 2009.
[40] A. Gupta, K. G. Mehrotra, and C. Mohan, "A clustering-based discretization for supervised learning," Statistics & Probability Letters, vol. 80, pp. 816-824, 2010.
[41] K. B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proceedings of the International Joint Conference on Uncertainty in AI, Chambery, France, pp. 1022-1027, 1993.
[42] D. A. Zighed, S. Rabaseda, and R. Rakotomalala, "FUSINTER: a method for discretization of continuous attributes," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, pp. 307-326, 1998.
[43] Y. Yang and G. I. Webb, "Discretization for naive-Bayes learning: managing discretization bias and variance," Machine Learning, vol. 74, pp. 39-74, 2009.
[44] H. Liu, et al., "Discretization: An enabling technique," Data Mining and Knowledge Discovery, vol. 6, pp. 393-423, 2002.
[45] L. Yin, Y. Ge, K. Xiao, X. Wang, and X. Quan, "Feature selection for high-dimensional imbalanced data," Neurocomputing, vol. 105, pp. 3-11, 2013.
[46] M. Zeinalkhani and M. Eftekhari, "Fuzzy partitioning of continuous attributes through discretization methods to construct fuzzy decision tree classifiers," Information Sciences, vol. 278, pp. 715-735, 2014.
[47] J. Alcala-Fdez, et al., "KEEL data-mining software tool: Dataset repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255-287, 2011.
[48] S. Garcia, et al., "A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability," Soft Computing, vol. 13, pp. 959-977, 2009.
[49] J. Hodges and E. L. Lehmann, "Rank methods for combination of independent experiments in the analysis of variance," The Annals of Mathematical Statistics, vol. 33, pp. 482-497, 1962.
[50] S. Garcia, et al., "Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power," Information Sciences, vol. 180, pp. 2044-2064, 2010.
[51] H. Finner, "On a monotonicity problem in step-down multiple test procedures," Journal of the American Statistical Association, vol. 88, pp. 920-923, 1993.

[Figure 1 depicts a parent node with fuzzy dataset S that is split on attribute A_i: the jth branch carries the condition "A_i is F_i^(j)" and leads to Child j with fuzzy dataset S[F_i^(j)], for j = 1, ..., r_i.]

Figure 1. A typical parent node with child nodes in the fuzzy decision tree.

Inputs: Crisp training data, predefined membership functions on each attribute, the splitting criterion, the stopping criterion, and the threshold value of the stopping criterion.
1) Generate the root node with a fuzzy dataset containing all crisp training data, where every membership degree is set to one.
2) For each new node N with fuzzy dataset S, perform steps 3 to 7.
3) If the stopping criterion is satisfied, then
4) Make N a leaf and assign to each class, as its label, the fraction of examples of N belonging to that class.
5) If the stopping criterion is not satisfied, then
6) Calculate FIG for each attribute and select the attribute A_i that maximizes FIG as the branching attribute.
7) Generate new child nodes, where the jth child corresponds to the fuzzy term F_i^(j) and contains the fuzzy dataset S[F_i^(j)] with all attributes of S except A_i. The membership degree of each example in S[F_i^(j)] is calculated by Eq. (3).

Figure 2. Fuzzy decision tree induction algorithm (adopted from [28]).
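As a concrete illustration of the steps in Figure 2, the following is a minimal Python sketch of a fuzzy ID3-style induction loop; it is not the authors' implementation. It assumes precomputed membership degrees per fuzzy term, a purity threshold as the stopping criterion, and the product of membership degrees when propagating examples to child nodes; the helper names (fig, build, etc.) are ours.

```python
# Minimal fuzzy ID3-style induction sketch (illustrative only, not the paper's code).
import math

def fuzzy_entropy(weights_per_class):
    total = sum(weights_per_class.values())
    if total == 0:
        return 0.0
    ent = 0.0
    for w in weights_per_class.values():
        if w > 0:
            p = w / total
            ent -= p * math.log2(p)
    return ent

def class_weights(mu, labels):
    # Membership-weighted class frequencies of the node's fuzzy dataset.
    w = {}
    for m, y in zip(mu, labels):
        w[y] = w.get(y, 0.0) + m
    return w

def fig(mu, labels, mf_attr):
    """Fuzzy information gain of one attribute for the node's fuzzy dataset mu."""
    base = fuzzy_entropy(class_weights(mu, labels))
    total = sum(mu)
    gain = base
    for term_mu in mf_attr:                       # one membership vector per fuzzy term
        child_mu = [m * t for m, t in zip(mu, term_mu)]
        share = sum(child_mu) / total if total else 0.0
        gain -= share * fuzzy_entropy(class_weights(child_mu, labels))
    return gain

def build(mu, labels, mfs, threshold=0.9):
    """mfs: dict attribute -> list of membership vectors (one per fuzzy term)."""
    w = class_weights(mu, labels)
    total = sum(w.values())
    # Assumed stopping criterion: node is pure enough, or no attributes remain.
    if not mfs or (total and max(w.values()) / total >= threshold):
        return {"leaf": {c: (v / total if total else 0.0) for c, v in w.items()}}
    best = max(mfs, key=lambda a: fig(mu, labels, mfs[a]))
    rest = {a: v for a, v in mfs.items() if a != best}
    children = []
    for term_mu in mfs[best]:
        child_mu = [m * t for m, t in zip(mu, term_mu)]   # propagate membership degrees
        children.append(build(child_mu, labels, rest, threshold))
    return {"attribute": best, "children": children}
```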

[Figure 3 shows a small tree: the root splits on A1; the branch "A1 is F1^(1)" leads to a node that splits on A2, whose branches "A2 is F2^(1)" and "A2 is F2^(2)" lead to Leaf 1 (C1: 0.2, C2: 0.8) and Leaf 2 (C1: 0.6, C2: 0.4); the branch "A1 is F1^(2)" leads to Leaf 3 (C1: 0.3, C2: 0.7).]

Figure 3. A typical fuzzy decision tree for binary classification.

[Figure 4 repeats the tree of Figure 3 and annotates the branches with the membership degrees of a test sample: μ_F1^(1)(x^(1)) = 0.65 and μ_F1^(2)(x^(1)) = 0.25 at the root node A1, and μ_F2^(1)(x^(2)) = 0.45 and μ_F2^(2)(x^(2)) = 0.55 at the node A2; the leaves carry the same class degrees as in Figure 3.]

Figure 4. Fuzzy reasoning in a fuzzy decision tree.
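The reasoning annotated in Figure 4 can be reproduced with a few lines of Python. The snippet below is an illustrative sketch only: it assumes a product t-norm along each root-to-leaf path and a weighted sum of the leaf class degrees, which may differ from the exact inference operators used in the paper.

```python
# Fuzzy reasoning sketch for the sample of Figure 4 (assumed product t-norm + weighted sum).
leaves = [
    # (membership degrees along the path, class degrees at the leaf)
    ([0.65, 0.45], {"C1": 0.2, "C2": 0.8}),   # Leaf 1: A1 is F1^(1), A2 is F2^(1)
    ([0.65, 0.55], {"C1": 0.6, "C2": 0.4}),   # Leaf 2: A1 is F1^(1), A2 is F2^(2)
    ([0.25],       {"C1": 0.3, "C2": 0.7}),   # Leaf 3: A1 is F1^(2)
]

scores = {"C1": 0.0, "C2": 0.0}
for path, label in leaves:
    firing = 1.0
    for mu in path:
        firing *= mu                          # firing strength of the leaf
    for c, degree in label.items():
        scores[c] += firing * degree          # each leaf votes with its class degrees

print(scores)   # C2 obtains the larger degree here, so the sample is assigned to C2
```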

[Figure 5 is a flowchart: starting from the imbalanced data, the data are balanced, cut-points are obtained by one of the discretization approaches, MFs are generated from the cut-points for each attribute, and the root node containing the training data is generated; then, while the stopping condition is not satisfied, HFIG is calculated for each attribute, the attribute with maximum HFIG is selected for expansion, and new child nodes are generated; once the stopping condition holds, labels are assigned to the leaf nodes.]

Figure 5. Flowchart of constructing FDTs of the category No.1 methods.
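Since HFIG aggregates the FIG values of an attribute obtained under the five discretization-based partitions through information energy, a plausible minimal sketch is given below. The information-energy form used here (the mean of the squared FIG values) and the numeric FIG values are assumptions made for illustration, not the paper's exact formula.

```python
# Sketch of an information-energy aggregation of FIG values (assumed form, illustrative only).
def hfig(fig_values):
    """Aggregate the FIGs of one attribute obtained from the five partitions
    (Fayyad, Fusinter, Fixed Frequency, Proportional, Uniform Frequency)."""
    return sum(g * g for g in fig_values) / len(fig_values)

print(hfig([0.42, 0.38, 0.45, 0.40, 0.36]))   # HFIG of one candidate attribute (made-up values)
```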

[Figure 6 is a flowchart: after balancing the imbalanced data, cut-points are obtained by the Fayyad, Fusinter, Fixed Frequency, Proportional, and Uniform Frequency discretization approaches; MFs are generated from each set of cut-points for each attribute; FDT1 to FDT5 are constructed using FIG as the splitting criterion; the FDTs are aggregated; and new data are classified to produce the classification results.]

Figure 6. Flowchart of the proposed category No.2 methods.

[Figure 7 is the same flowchart as Figure 6, except that FDT1 to FDT5 are constructed using HFIG as the splitting criterion.]

Figure 7. Flowchart of the proposed category No.3 methods.

[Figure 8 is a bar chart of the average rankings (y-axis: Ranks; x-axis: Methods) of the compared classifiers; the average ranks range from 108.65 for the best-ranked method to 273.575 for the worst-ranked one.]

Figure 8. Average rankings of the classifiers based on the Aligned Friedman test and AAUCtest values.

TABLE I. Parameters of the triangular and trapezoidal MFs defined based on the standard deviation (adopted from [46]).

Parameter   Leftmost MF        Middle MFs           Rightmost MF
a           ct1                mean - 2*stdval      mean - 2*stdval
b           ct1                mean                 mean
c           mean               mean + 2*stdval      ct2
d           mean + 2*stdval    -                    ct2

where $mean = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $stdval = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - mean)^2}$.
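The following sketch shows how Table I can be turned into code. Beyond what the table states, it assumes that mean and stdval are computed from the attribute values falling inside each interval produced by the discretizer and that ct1/ct2 denote the interval's boundary cut-points; the function name mfs_from_cutpoints is hypothetical.

```python
# Sketch of MF generation from discretization cut-points per Table I (illustrative only).
import statistics

def mfs_from_cutpoints(values, cut_points):
    bounds = [min(values)] + sorted(cut_points) + [max(values)]
    params = []
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        inside = [v for v in values if lo <= v <= hi] or [lo, hi]   # values in this interval
        m = statistics.mean(inside)
        s = statistics.stdev(inside) if len(inside) > 1 else 0.0
        if i == 0:                                # leftmost MF: trapezoid (ct1, ct1, mean, mean+2s)
            params.append(("trap", lo, lo, m, m + 2 * s))
        elif i == len(bounds) - 2:                # rightmost MF: trapezoid (mean-2s, mean, ct2, ct2)
            params.append(("trap", m - 2 * s, m, hi, hi))
        else:                                     # middle MFs: triangle (mean-2s, mean, mean+2s)
            params.append(("tri", m - 2 * s, m, m + 2 * s))
    return params

print(mfs_from_cutpoints([1.0, 1.5, 2.2, 3.1, 3.3, 4.8, 5.0], [2.0, 4.0]))
```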

TABLE II. The strategies and descriptions of the aggregation methods.

Fuzzy Majority-Voting (FMV)
  Strategy: $AM_1 = \sum_{i=1}^{5} f(\mu(C_1^{(i)}), \mu(C_2^{(i)}))$, $AM_2 = \sum_{i=1}^{5} f(\mu(C_2^{(i)}), \mu(C_1^{(i)}))$
  Description: For the ith classifier, if the degree of class C1 is at least that of class C2, class C1 gets a vote; if the degree of class C2 is at least that of class C1, class C2 gets a vote. If AM1 ≥ AM2, predict class C1; otherwise predict class C2.

Fuzzy Weighted-Voting (FWV)
  Strategy: $AM_1 = \sum_{i=1}^{5} \mu(C_1^{(i)})$, $AM_2 = \sum_{i=1}^{5} \mu(C_2^{(i)})$
  Description: Use the summation of the classification degrees of the five classifiers for each class label. If AM1 ≥ AM2, predict class C1; otherwise predict class C2.

Hesitant Fuzzy Information Energy Voting (HFIEV)
  Strategy: $AM_1 = \frac{1}{5}\sum_{i=1}^{5} \mu(C_1^{(i)})^2$, $AM_2 = \frac{1}{5}\sum_{i=1}^{5} \mu(C_2^{(i)})^2$
  Description: Use the information energy of the classification degrees of the five classifiers for each class label. If AM1 ≥ AM2, predict class C1; otherwise predict class C2.

* The function $f(x, y)$ is defined as $f(x, y) = 1$ if $x \ge y$ and $0$ if $x < y$.
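The three voting rules of Table II translate directly into code. The sketch below is illustrative only and assumes each of the five FDTs returns a pair of classification degrees (mu_C1, mu_C2) for a test sample.

```python
# Sketch of the FMV, FWV and HFIEV aggregation rules of Table II (illustrative only).
def aggregate(outputs, method="FWV"):
    if method == "FMV":       # fuzzy majority voting: each FDT casts a vote per class
        am1 = sum(1 for c1, c2 in outputs if c1 >= c2)
        am2 = sum(1 for c1, c2 in outputs if c2 >= c1)
    elif method == "FWV":     # fuzzy weighted voting: sum of classification degrees
        am1 = sum(c1 for c1, _ in outputs)
        am2 = sum(c2 for _, c2 in outputs)
    else:                     # HFIEV: information energy of classification degrees
        am1 = sum(c1 ** 2 for c1, _ in outputs) / len(outputs)
        am2 = sum(c2 ** 2 for _, c2 in outputs) / len(outputs)
    return "C1" if am1 >= am2 else "C2"

# Example: five FDT outputs (mu_C1, mu_C2) for one test sample (made-up values).
print(aggregate([(0.7, 0.3), (0.4, 0.6), (0.8, 0.2), (0.55, 0.45), (0.3, 0.7)], "HFIEV"))
```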

TABLE III. Characteristics of the highly imbalanced data sets used in the experiments.

Code  Dataset name            #Attributes  #Samples  IR     #Clusters
D1    Glass-0-1-6_vs_2        9            192       10.29  3
D2    Yeast-1_vs_7            7            459       13.87  10
D3    Yeast-1-2-8-9_vs_7      8            947       30.56  10
D4    Yeast-14-5-8_vs_7       8            693       22.10  10
D5    Yeast-2_vs_4            8            514       9.08   10
D6    Yeast4                  8            1484      28.41  10
D7    Vowel0                  13           988       9.98   3
D8    Yeast-2_vs_8            8            482       23.10  3
D9    glass2                  9            214       10.39  10
D10   ecoli4                  7            336       15.8   15
D11   Shuttle-c0-vs-c4        9            1829      13.87  15
D12   glass4                  9            214       15.47  15
D13   abalone9-18             8            731       16.68  10
D14   Glass-0-1-6_vs_5        9            184       19.44  3
D15   Shuttle-c2-vs-4         9            129       20.50  10
D16   yeast5                  8            1484      32.78  15
D17   Ecoli-0-1-3-7_vs_2-6    7            281       39.14  3
D18   yeast6                  8            1484      41.4   15
D19   Yeast-0-5-6-7-9_vs_4    8            527       9.35   3
D20   page-blocks-1-3_vs_4    10           472       15.86  15

TABLE IV. AAUCtest values of the compared methods over the highly imbalanced data sets.

Dataset  C4.5    3-NN    LB      MLB     GP-COACH-9  GP-COACH-H  GA-FS+GL  IRL-ID-MOEA
D1       0.5938  0.6357  0.5161  0.4828  0.5685      0.6366      0.6262    0.633
D2       0.6275  0.6019  0.6321  0.5666  0.6717      0.6806      0.7544    0.632
D3       0.6156  0.5484  0.6133  0.5628  0.6355      0.6943      0.7499    0.717
D4       0.5     0.5144  0.6136  0.5     0.5774      0.5828      0.6406    0.627
D5       0.8307  0.7368  0.8273  0.8105  0.8965      0.8919      0.884     0.905
D6       0.6135  0.5947  0.6535  0.5087  0.7906      0.8133      0.8346    0.851
D7       0.9706  0.9399  0.9077  0.9027  0.9463      0.9513      0.9282    0.948
D8       0.525   0.7329  0.6978  0.7196  0.7991      0.7305      0.7104    0.786
D9       0.7194  0.5302  0.5071  0.4974  0.5662      0.5936      0.7285    0.67
D10      0.8437  0.7734  0.8452  0.9202  0.8715      0.9262      0.8595    0.927
D11      0.9997  0.969   1       1       0.9953      0.9889      0.9994    0.992
D12      0.7542  0.8425  0.8683  0.7849  0.8988      0.788       0.853     0.913
D13      0.5859  0.6332  0.6804  0.5562  0.6846      0.76        0.6275    0.714
D14      0.8943  0.8971  0.7414  0.7357  0.9281      0.8713      0.8671    0.946
D15      0.95    0.95    0.95    0.95    0.9976      1           0.992     0.892
D16      0.8833  0.8128  0.8243  0.5236  0.9038      0.9281      0.9354    0.951
D17      0.7481  0.7982  0.7463  0.7963  0.8448      0.8457      0.8698    0.849
D18      0.7115  0.7527  0.7394  0.4993  0.865       0.8746      0.79      0.887
D19      0.6802  0.6257  0.7288  0.592   0.7744      0.7702      0.7928    0.812
D20      0.9978  0.9433  0.9766  0.9044  0.8556      0.8798      0.9369    0.945

TABLE V. AAUCtest values of the proposed classifiers over the highly imbalanced data sets. The rows are the data sets D1-D20 and the columns are the eleven proposed classifiers: Category No.1 (HFIG/Fayyad/NA, HFIG/Fusinter/NA, HFIG/Fixed Frequency/NA, HFIG/proportional/NA, HFIG/Uniform Frequency/NA), Category No.2 (FIG/Five Partitioning/FMV, FIG/Five Partitioning/FWV, FIG/Five Partitioning/HFIEV), and Category No.3 (HFIG/Five Partitioning/FMV, HFIG/Five Partitioning/FWV, HFIG/Five Partitioning/HFIEV).

D1

0.5971

0.55

0.4971

0.5971

0.6471

0.55

0.6805

0.55

0.5

0.5

0.5971

D2

0.693

0.7441

0.6318

0.9648

0.986

0.9895

0.9872

0.9895

0.9707

0.9718

0.8895

D3

0.5589

0.56

0.5589

0.5082

0.5589

0.6

0.6589

0.6

0.56

0.56

0.5589

D4

0.8879

0.9819

0.8909

0.9068

0.577

0.9947

0.974

0.994

0.9932

0.9924

0.9724

D5

0.8913

0.9848

0.8924

0.987

0.9935

0.987

0.9724

0.987

0.9870

0.987

0.9913

D6

0.9607

0.9736

0.9725

0.9788

0.975

0.9887

0.9847

0.9766

0.9805

0.7882 0.6322 0.5783 0.9444 0.9732 0.6375 0.7103 0.6943 0.9 0.6906

0.9737 0.5784 0.6399 0.9492 0.8731 0.5875 0.9891 0.8943 0.9 0.8743

0.8941 0.6656 0.6782 0.627 0.9678 0.645 0.7978 0.6943 0.9 0.6066

0.9555 0.6312 0.8141 0.7937 0.893 0.5975 0.9802 0.7943 0.9 0.8871

0.5176 0.6645 0.6872 0.7889 0.9795 0.64 0.992 0.7943 0.9 0.6205

0.9748 0.5333 0.5833 0.9873 0.9974 0.6 0.9935 0.6 0.992 0.9816

0.9529 0.5656 0.8424 0.9841 0.9968 0.6474 0.9942 0.8886 0.992 0.8927

0.9887 0.9742 0.5333 0.6333 0.9873 0.9977 0.6 0.9935 0.6 0.992 0.983

0.9762

D7 D8 D9 D10 D11 D12 D13 D14 D15 D16

0.9507 0.6 0.5833 0.6952 0.8959 0.6 0.9935 0.6943 0.9 0.7934

0.9448 0.6 0.6167 0.6952 0.8953 0.6 0.9935 0.6943 0.992 0.7927

0.8941 0.6322 0.6757 0.9794 0.9874 0.64 0.9949 0.7943 0.9 0.6927

D17

0.8964

0.8927

0.9964

0.989

0.9964

0.8

0.8927

0.8

0.9

0.9

0.9964

D18 D19

0.7907 0.537

0.6962 0.9049

0.6931 0.8689

0.791 0.9351

0.5309 0.927

0.9924 0.9012

0.8955 0.8536

0.9934 0.9012

0.5986 0.9614

0.5986 0.9614

0.8084 0.9260

D20

0.8254

0.9572

0.617

0.9425

0.9516

0.9775

0.9775

0.9786

0.8763

0.8763

0.9583

TABLE VI. Results of applying the Finner post-hoc procedure, where the threshold p-value = 0.03086.

Rank  Algorithm                      z = (R0 - Ri)/SE  p         P_Finner  Hypothesis
1     FIG/Five Partitioning/HFIEV    0.921269          0.35691   0.05      Not Reject
2     FIG/Five Partitioning/FWV      1.017714          0.308814  0.047289  Not Reject
3     HFIG/Five Partitioning/FMV     1.183254          0.236708  0.04457   Not Reject
4     HFIG/proportional/NA           1.25091           0.210967  0.041844  Not Reject
5     IRL-ID-MOEA                    1.55608           0.119689  0.039109  Not Reject
6     HFIG/Fixed Frequency/NA        1.802232          0.071509  0.036367  Not Reject
7     GA-FS+GL                       1.828143          0.067528  0.033617  Not Reject
8     GP-COACH-H                     2.091568          0.036477  0.03068   Reject
9     HFIG/Uniform Frequency/NA      2.382343          0.017203  0.028094  Reject
10    GP-COACH-9                     2.411133          0.015903  0.025321  Reject
11    HFIG/Five Partitioning/FWV     2.422649          0.015408  0.022539  Reject
12    HFIG/Five Partitioning/HFIEV   2.645049          0.008168  0.01975   Reject
13    C4.5                           3.492904          0.000478  0.016952  Reject
14    HFIG/Fusinter/NA               3.58791           0.000333  0.014147  Reject
15    HFIG/Fayyad/NA                 3.746253          0.000179  0.011334  Reject
16    3-NN                           3.793756          0.000148  0.008512  Reject
17    LB                             3.890921          0.0001    0.005683  Reject
18    MLB                            4.748132          0.000002  0.002846  Reject