A novel test-cost-sensitive attribute reduction approach using the binary bat algorithm

Knowledge-Based Systems xxx (xxxx) xxx

journal homepage: www.elsevier.com/locate/knosys
Xiaojun Xie a,b, Xiaolin Qin a,∗, Qian Zhou c,d, Yanghao Zhou a, Tong Zhang a, Ryszard Janicki b, Wei Zhao b

a College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
b Department of Computing and Software, McMaster University, Hamilton, L8S 4K1, Canada
c School of Modern Posts, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China
d State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China

Article info

Article history: Received 17 December 2017; Received in revised form 28 June 2019; Accepted 11 August 2019; Available online xxxx.
Keywords: Rough set; Attribute reduction; Test-cost-sensitive; Binary bat algorithm

Abstract: Attribute reduction is an essential pre-processing step in data mining, machine learning, pattern recognition and many other fields, and test-cost-sensitive attribute reduction is used when one has to deal with cost-sensitive data. The main result of this paper is a new metaheuristic optimization method for finding an optimal test-cost-sensitive attribute reduction; it is based on the binary bat algorithm, which was originally designed to model the echolocation behavior of bats searching for prey. First, we provide a 0-1 integer programming algorithm that can calculate an optimal reduct but is inefficient for large data sets; we use it to evaluate other algorithms. Next, a new fitness function that utilizes pairs of inconsistent objects and contains no uncertain parameter is designed, and an efficient algorithm for counting inconsistent pairs is provided. Then, an efficient test-cost-sensitive attribute reduction technique based on the binary bat algorithm is presented. Finally, a new evaluation model with four different evaluation metrics is proposed and used to evaluate algorithms that provide only sub-optimal solutions. Several experiments were carried out on broadly used benchmark data sets, and the results show the superiority of the new algorithm in terms of various metrics, computational time, and classification accuracy, especially for high-dimensional data sets. © 2019 Elsevier B.V. All rights reserved.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.104938.
∗ Corresponding author.
E-mail addresses: [email protected] (X. Xie), [email protected] (X. Qin), [email protected] (Q. Zhou), [email protected] (Y. Zhou), [email protected] (T. Zhang), [email protected] (R. Janicki), [email protected] (W. Zhao).

1. Introduction

Rough sets, originally proposed by Pawlak in the 1980s [1,2] and substantially developed and extended later [3–8], provide mathematical foundations for many widely used data analysis and knowledge processing techniques. The main advantage of rough sets is that they are completely data-driven and require no additional information. In recent years, rough sets have been successfully used in a variety of fields such as machine learning, artificial intelligence, pattern recognition, data mining, decision-making support systems and abstract approximation [9–19]. With attribute reduction being a research hotspot in rough set theory,

many heuristic attribute reduction methods have been developed; they can be classified into six categories: positive region [20,21], discernibility matrix [22–24], information entropy [25–28], knowledge granularity [29,30], inconsistent object pairs [31,32] and meta-heuristic optimization algorithms [33–35]. Test-cost-sensitive rough set theory, proposed in [36], is an extension of traditional rough sets in which the attribute reduction problem is called test-cost-sensitive attribute reduction, and the ultimate goal is to find the minimal test cost attribute reduction (MTRP). The MTRP deals with cost-sensitive data, which differs from the traditional approach to attribute reduction. In real-world applications, data are usually not cost-free: to each attribute we assign some cost, which could be money, time, or other resources. For example, in medical diagnosis, since different tests may have different costs, obtaining the correct diagnosis with the minimal test cost is an MTRP. In most cases the MTRP provides a more precise analysis than the traditional attribute reduction problem. When the test costs of all attributes are equal, the MTRP reduces to the classical minimal attribute reduction problem (MRP). The MRP has attracted many researchers in the

https://doi.org/10.1016/j.knosys.2019.104938 0950-7051/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: X. Xie, X. Qin, Q. Zhou et al., A novel test-cost-sensitive attribute reduction approach using the binary bat algorithm, Knowledge-Based Systems (2019) 104938, https://doi.org/10.1016/j.knosys.2019.104938.


last decade. Ding et al. [37] proposed an algorithm based on particle swarm optimization for the MRP. Ye et al. [38] proposed a population-based stochastic optimization approach for the MRP, which transformed the MRP into a fitness maximization problem over a multi-dimensional Boolean space; in [39], they applied the discrete artificial bee colony algorithm to the MRP and presented an algorithm to find a minimal reduct. Jiang et al. [40] provided a new approach for the MRP based on a new concept called a compactness discernibility information tree. Zhao et al. [41] proposed a complete minimum attribute reduction algorithm based on the fusion of rough equivalence classes and tabu search. Since its introduction in [36], the MTRP has also been substantially developed. A heuristic algorithm with three metrics to evaluate the performance of MTRP algorithms and a backtracking algorithm with three pruning techniques for finding a minimal cost reduct were presented by Min et al. [42]. Many others have contributed as well to the problem of finding the minimal test cost attribute reduction. Pan et al. [43] designed a genetic algorithm for the MTRP. Yang et al. [44] introduced the test-cost-sensitive multigranulation rough set model and then proposed a backtracking algorithm for granular structure selection with minimal test cost. Fan et al. [45] introduced an adaptive neighborhood model for heterogeneous attributes and proposed heuristic reduction algorithms for the MTRP. Min et al. [46] proposed the feature selection with test cost constraints problem and developed a backtracking algorithm as well as a heuristic algorithm. Dai et al. [47] applied particle swarm optimization to the MTRP and provided four relative evaluation metrics. Tan et al. [48] employed methods for the set cover problem to deal with the MTRP and introduced a set cover-based heuristic algorithm to solve it. Min et al.
[49] designed an ant colony optimization algorithm with a partial-complete searching method for three attribute reduction problems, i.e., minimal attribute reduction, minimal cost attribute reduction, and minimal time-cost attribute reduction. Xie et al. [50] provided a test-cost-sensitive rough set-based algorithm for the minimum weight vertex cover problem, which can also be used to solve the MTRP. The test-cost-sensitive attribute reduction discussed in this paper is only one of many possible cost-sensitive models. Many other models are based on different paradigms or consider different types of cost; one of the leading approaches is based on the three-way decision paradigm proposed by Y. Yao in [51]. This paradigm, a version of natural three-valued logic formulated within the rough set model, has recently been applied to cost-sensitive attribute reduction. Ma and Zhao [52] proposed the concept of class-specific minimum cost reducts by combining the test cost with the result cost of three-way decisions. They also developed two algorithms for finding a family of class-specific minimum cost reducts, one based on an addition–deletion strategy and another on a deletion-only strategy. Fang and Min [53] presented a framework based on three-way decisions and the discernibility matrix to handle the cost-sensitive approximate attribute reduction problem; their model allows decision makers to leverage the advantages of knowledge discovery together with their own preferences. Under this framework, they designed deletion-based and addition-based cost-sensitive approximate reduction algorithms. As for other paradigms, Zhao and Yu [54] used the ℓ2,1-norm and proposed a cost-sensitive embedded feature selection algorithm that minimizes the total cost rather than maximizes the accuracy. Their algorithm takes both test costs and misclassification costs into account simultaneously. Liao et al. [55] assumed that the test cost of a feature is usually a variable with an error range, and that the variability of the

misclassification cost is related to the object. Based on these assumptions, they proposed a multi-granularity feature selection approach that considers measurement errors and variable costs in terms of feature-value granularities. Liu et al. [56] also dealt with misclassification costs and presented an effective feature selection algorithm that addresses the class imbalance issue by optimizing F-measures.

Since the MTRP is an NP-hard problem, finding optimal solutions faces all the obstacles typical of NP-hard problems. Fortunately, over the past few decades, many meta-heuristic optimization algorithms have been developed to solve this kind of problem. The ones that can potentially be useful for the MTRP include genetic algorithms [57], ant colony optimization [58,59], particle swarm optimization [60,61], bee colony optimization [62,63], and the artificial fish swarm algorithm [64,65]. The bat algorithm [66–70], which also belongs to the class of meta-heuristic optimization algorithms, was introduced recently and has successfully been applied to numerous optimization problems in diverse fields. Among the many versions of the bat algorithm, its binary version is closest to other population-based meta-heuristic optimization algorithms [71–74], and we will use it in this paper. So far, most of the existing algorithms for the MTRP are not very efficient and their evaluation models are not comprehensive, which were the main motivations for this paper. In this paper, we apply the binary version of the bat algorithm to the MTRP and propose a comprehensive evaluation model for assessing the performance of algorithms when the optimal solution is unknown. The main contributions of the paper are the following:

1. A method for obtaining an optimal reduct, based on 0-1 integer programming, is proposed, and the results achieved by this method are used to assess algorithms under the existing evaluation models.
2. An efficient method for test-cost-sensitive attribute reduction, based on the binary bat algorithm, is proposed, together with a new fitness function; this function is a key reason for the algorithm's efficiency.
3. A comprehensive evaluation model with four different evaluation metrics is established and used to assess the performance of multiple sub-optimal algorithms.
4. Finally, a series of comparative experiments on broadly used benchmark data sets is conducted to illustrate the efficiency and effectiveness of the proposed algorithms.

The paper is organized as follows. In Section 2, basic concepts related to test-cost-sensitive rough set theory are introduced. In Section 3, a 0-1 integer programming-based approach for obtaining an optimal reduct is presented. In Section 4, a binary bat algorithm for test-cost-sensitive attribute reduction is proposed. The experimental results are shown in Section 5. Conclusions and plans for further research are drawn in Section 6.

2. Preliminaries

We first recall some concepts and results that will be used throughout the paper (cf. [1,5,6,11,20,36]). A decision system is a quintuple S = (U, A, D, V, f), where U is a finite nonempty set of objects, A is the set of conditional attributes, D is the set of decision attributes, V is the domain of the attributes from A ∪ D, and f : U × (A ∪ D) → V is an information function. In the decision system S = (U, A, D, V, f), if each conditional attribute is associated with a test cost, then the decision system is called a test-cost-sensitive decision system (TCDS).


The TCDS is represented as a sextuple S = (U, A, D, V, f, c), where U, A, D, V and f have the same meaning as for a general decision system, and c gives the test costs of the conditional attributes. When the test costs of the conditional attributes are independent, the test cost c can be expressed as a vector c = [c(a1), c(a2), ..., c(a_|A|)], while the test cost of a subset of conditional attributes B ⊆ A is given by c(B) = ∑_{ai∈B} c(ai). Table 1 gives a simple example of a TCDS.

Definition 1 (Indiscernibility Relation). Given a decision system S = (U, A, D, V, f), for B ⊆ A, the indiscernibility relation is:

IND(B) = {(x, y) ∈ U × U | ∀a ∈ B, f(x, a) = f(y, a)}   (1)

Clearly, IND(B) is an equivalence relation on U and U/B = {[x]_B | x ∈ U} is the classification induced by B, where [x]_B = {y | ∀a ∈ B, f(x, a) = f(y, a)} denotes the equivalence class of x with respect to B. The positive region of D with respect to B is POS_B(D) = ⋃_{X∈U/D} B_−(X) and the boundary region of D with respect to B is BND_B(D) = ⋃_{X∈U/D} (B^−(X) − B_−(X)), where B_−(X) and B^−(X) denote the B-lower and B-upper approximations of X, i.e., B_−(X) = {x ∈ U | [x]_B ⊆ X} and B^−(X) = {x ∈ U | [x]_B ∩ X ≠ ∅}, and U/D = {X1, X2, ..., X_|U/D|} is the classification induced by the decision attribute set D. The definition of attribute reduction based on the positive region is given below.

Definition 2 (Attribute Reduction). Given a decision system S = (U, A, D, V, f), B ⊆ A is called an attribute reduct of S based on the positive region if and only if:

POS_B(D) = POS_A(D) ∧ ∀a ∈ B, POS_{B−{a}}(D) ≠ POS_B(D)   (2)
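The positive region above can be computed mechanically: group the objects by their B-values, and an equivalence class lies in the positive region exactly when all of its members agree on the decision. A minimal sketch in Python on the rows of Table 1 (attribute order a1, ..., a6 with the decision last; function names are ours):

```python
from collections import defaultdict

# Rows of Table 1: (a1 Headache, a2 Temperature, a3 Lymphocyte,
# a4 Leukocyte, a5 Eosinophil, a6 Heartbeat, decision Flu).
ROWS = [
    ("Yes", "High",   "High",   "High",   "High",   "Normal",   "Yes"),
    ("Yes", "High",   "Normal", "High",   "High",   "Abnormal", "Yes"),
    ("Yes", "High",   "High",   "High",   "Normal", "Abnormal", "Yes"),
    ("No",  "High",   "Normal", "Normal", "Normal", "Normal",   "No"),
    ("Yes", "Normal", "Normal", "Low",    "High",   "Abnormal", "No"),
    ("Yes", "Normal", "Low",    "High",   "Normal", "Abnormal", "No"),
    ("Yes", "Low",    "Low",    "High",   "Normal", "Normal",   "Yes"),
]

def positive_region(rows, B):
    """POS_B(D): indices of objects whose B-equivalence class is decision-pure."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in B)].append(i)      # build [x]_B
    pos = []
    for members in classes.values():
        if len({rows[i][-1] for i in members}) == 1:     # [x]_B inside one decision class
            pos.extend(members)
    return sorted(pos)

print(positive_region(ROWS, [0, 1, 2, 3, 4, 5]))  # [0, 1, 2, 3, 4, 5, 6]
print(positive_region(ROWS, [0, 1]))              # [0, 1, 2, 3, 4, 5, 6]
print(positive_region(ROWS, [1]))                 # [4, 5, 6]
```

With the full attribute set A the positive region is all seven objects (the table is consistent), and B = {a1, a2} already preserves it; with {a2} alone the positive region shrinks to {x5, x6, x7}, which is why {a2} is not a reduct.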

From Definition 2, it follows that if B ⊆ A is an attribute reduct based on the positive region, then two conditions must be satisfied. First, the attribute set B must keep the positive region of D unchanged. Second, if any attribute is removed from B, then the positive region of D must change. This leads to the following definition of the MTRP. Let S = (U, A, D, V, f, c) be a test-cost-sensitive decision system and let Z = {B | B ⊆ A ∧ POS_B(D) = POS_A(D)}. We can now define the MTRP in a TCDS as follows:



min_{B∈Z} { ∑_{ai∈B} c(ai) }   (3)

From Eq. (3), it follows that if the test costs of all attributes are equal to 1, then the MTRP reduces to an MRP.

3. 0-1 integer programming for test-cost-sensitive attribute reduction

As opposed to, for example, [75], we assume that the data are static, so we can use 0-1 integer programming techniques to find optimal reducts. In this section we show how it can be done for test-cost-sensitive decision systems. Let S = (U, A, D, V, f, c) be a test-cost-sensitive decision system and let U′ be the set derived from BND_A(D) by deleting all repetitive objects. We can now define the discernibility matrix [75], D[mij], in the following way:

mij =
  {a ∈ A | f(xi, a) ≠ f(xj, a) ∧ f(xi, D) ≠ f(xj, D)}   if xi, xj ∈ POS_A(D),
  {a ∈ A | f(xi, a) ≠ f(xj, a)}                          if xi ∈ POS_A(D) ∧ xj ∈ U′,
  ∅                                                      otherwise.   (4)


Let RED(S) = {B1, B2, ..., Bv}, where Bi ⊆ A, be some set of potential attribute reducts in S. We can now define the minimal test cost reduct as

min_{B∈RED(S)} { ∑_{ai∈B} c(ai) },

where c(a) is the test cost of attribute a. Assume that |A| = k, A = {a1, ..., ak}, |{mij | mij ≠ ∅}| = n and {mij | mij ≠ ∅} = {m′1, ..., m′n}. We can now define the induced matrix M[Mij] (with n rows and k columns) as follows:

Mij = 1 if aj ∈ m′i,  and  Mij = 0 if aj ∉ m′i.   (5)

We can now formulate a simple test for the property POS_B(D) = POS_A(D).

Proposition 1. Let S = (U, A, D, V, f, c) be a test-cost-sensitive decision system, A = {a1, ..., ak}, and let M[Mij] be the induced matrix of S. Then for each conditional attribute set B ⊆ A, the property POS_B(D) = POS_A(D) is equivalent to:

∀i = 1, ..., n,  ∑_{aj∈B} Mij ≥ 1.   (6)

Proof. (⇒) Suppose there is l ∈ {1, ..., n} such that ∑_{aj∈B} Mlj = 0. This means that B ∩ m′l = ∅, so the objects discerned by m′l are not discerned by B, i.e., POS_B(D) ≠ POS_A(D). (⇐) Suppose POS_B(D) ≠ POS_A(D). Then for some i, j we have mij ∩ B = ∅. Assume mij = m′l for some l. Then ∑_{aj∈B} Mlj = 0. □
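Proposition 1 reduces the positive-region test to a row-coverage check on the induced matrix: B preserves the positive region iff every row of M contains a 1 in some column selected by B. A small sketch with a hypothetical 3 × 4 induced matrix (all names ours):

```python
def preserves_positive_region(M, B):
    """Proposition 1: POS_B(D) = POS_A(D) iff every row of the
    induced matrix M has at least one 1 in a column from B."""
    return all(sum(row[j] for j in B) >= 1 for row in M)

# A hypothetical induced matrix: three non-empty discernibility
# entries over attributes a1..a4 (columns 0..3).
M = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]

print(preserves_positive_region(M, [0, 1]))  # True: every row is covered
print(preserves_positive_region(M, [0, 3]))  # False: row 2 is uncovered
```

This coverage view is exactly what lets the search for a minimal-cost reduct be phrased as a 0-1 integer program in the next step.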

The simple test given by Eq. (6) allows us to use 0-1 integer programming techniques for finding a minimal test cost attribute reduction, which results in Algorithm 1. In Algorithm 1, step 1 computes the positive region; its time complexity is O(|A||U|). Step 2 computes the discernibility matrix; its time complexity is O(|POS_A(D)|(|POS_A(D)| + |U′|)). Step 3 extracts the induced matrix M; its time complexity is again O(|POS_A(D)|(|POS_A(D)| + |U′|)). Steps 4–7 solve the 0-1 integer programming problem with |A| variables, so the time complexity is O(2^|A|) in the worst case; however, for the majority of practical cases the time complexity is far less than O(2^|A|) when, for example, a branch-and-cut method [76–78] is used. If |A| is small, an optimal solution can easily be found, but if |A| is large, it usually takes a long time to obtain the optimal solution. The induced matrix M needs to be stored during the process, so the space complexity is O(|A||POS_A(D)|(|POS_A(D)| + |U′|)). Algorithm 1 can calculate the optimal reduct, but if the data set is large it may run out of memory or need a very long run time.

Example 1. The minimal test cost attribute reduction of Table 1 using Algorithm 1 is obtained as follows. First, we compute the positive region POS_A(D) = {x1, x2, x3, x4, x5, x6, x7}; hence U′ = ∅ and the discernibility matrix D[mij] looks as follows.

D[mij] (rows and columns indexed by x1, ..., x7):

  row x1: ∅, ∅, ∅, {a1, a3, a4, a5}, {a2, a3, a4, a6}, {a2, a3, a5, a6}, ∅
  row x2: ∅, ∅, ∅, {a1, a4, a5, a6}, {a2, a4}, {a2, a3, a5}, ∅
  row x3: ∅, ∅, ∅, {a1, a3, a4, a6}, {a2, a3, a4, a5}, {a2, a3}, ∅
  row x4: ∅, ∅, ∅, ∅, ∅, ∅, ∅
  row x5: ∅, ∅, ∅, ∅, ∅, ∅, ∅
  row x6: ∅, ∅, ∅, ∅, ∅, ∅, ∅
  row x7: ∅, ∅, ∅, {a1, a2, a3, a4}, {a2, a3, a4, a5, a6}, {a2, a6}, ∅   (8)


Table 1
An example of TCDS with test cost vector c = [2, 2, 15, 20, 20, 10].

Patient  Headache  Temperature  Lymphocyte  Leukocyte  Eosinophil  Heartbeat  Flu
x1       Yes       High         High        High       High        Normal     Yes
x2       Yes       High         Normal      High       High        Abnormal   Yes
x3       Yes       High         High        High       Normal      Abnormal   Yes
x4       No        High         Normal      Normal     Normal      Normal     No
x5       Yes       Normal       Normal      Low        High        Abnormal   No
x6       Yes       Normal       Low         High       Normal      Abnormal   No
x7       Yes       Low          Low         High       Normal      Normal     Yes

Algorithm 1: A minimal test cost attribute reduction algorithm based on 0-1 integer programming

Input: A test-cost-sensitive decision system S = (U, A, D, V, f, c) with |A| = k.
Output: The attribute reduct of S and its test cost.
1: Compute the positive region POS_A(D) of S and the set U′;
2: Calculate the discernibility matrix mij of S;
3: Extract the induced matrix M from mij;
4: Construct the constraints: B ⊆ A and ∀i = 1, ..., n, ∑_{aj∈B} Mij ≥ 1;
5: Construct the objective function: min_{B⊆A} {∑_{ai∈B} c(ai)}, where c(ai) is the test cost of attribute ai;
6: Solve the following 0-1 integer program:

   Z = min_{B∈R} { ∑_{ai∈B} c(ai) },  where R = {B | ∀i = 1, ..., n, ∑_{aj∈B} Mij ≥ 1}   (7)

7: Obtain its optimal solution x = (x1, x2, ..., x_|A|);
8: Derive the reduct RED and its test cost cost_RED from the optimal solution x;
9: Return RED and cost_RED;
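For small k, steps 4–7 of Algorithm 1 can be cross-checked by exhaustively enumerating all 2^k binary vectors. The sketch below (ours, not the paper's implementation) does exactly that, using the six non-redundant constraint rows and the cost vector of the Table 1 system worked out in Example 1; in practice a branch-and-cut MILP solver replaces the enumeration.

```python
from itertools import product

# Test costs of attributes a1..a6 (Table 1).
cost = [2, 2, 15, 20, 20, 10]

# Non-redundant rows of the induced matrix M for the Table 1 system,
# i.e. the six constraints of Example 1.
M = [
    [0, 1, 0, 0, 0, 1],  # x2 + x6 >= 1
    [0, 1, 0, 1, 0, 0],  # x2 + x4 >= 1
    [0, 1, 1, 0, 0, 0],  # x2 + x3 >= 1
    [1, 0, 0, 1, 1, 1],  # x1 + x4 + x5 + x6 >= 1
    [1, 0, 1, 1, 0, 1],  # x1 + x3 + x4 + x6 >= 1
    [1, 0, 1, 1, 1, 0],  # x1 + x3 + x4 + x5 >= 1
]

def solve_01_program(M, cost):
    """Steps 4-7 of Algorithm 1 by brute force: minimize cost . x
    subject to M x >= 1 over all binary vectors x."""
    best_x, best_z = None, float("inf")
    for x in product((0, 1), repeat=len(cost)):
        feasible = all(sum(m * v for m, v in zip(row, x)) >= 1 for row in M)
        z = sum(c * v for c, v in zip(cost, x))
        if feasible and z < best_z:
            best_x, best_z = x, z
    return best_x, best_z

x, z = solve_01_program(M, cost)
print(x, z)  # (1, 1, 0, 0, 0, 0) 4
```

The search returns x = (1, 1, 0, 0, 0, 0) with Z = 4, matching the reduct {Headache, Temperature} of Example 1.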

We now convert the discernibility matrix D[mij] into the induced matrix M[Mij], whose rows are the non-empty entries of D[mij]:

M[Mij] =
  [1 0 1 1 1 0]
  [1 0 0 1 1 1]
  [1 0 1 1 0 1]
  [0 1 1 1 0 1]
  [0 1 0 1 0 0]
  [0 1 1 1 1 0]
  [0 1 1 0 1 1]
  [0 1 1 0 1 0]
  [0 1 1 0 0 0]
  [1 1 1 1 0 0]
  [0 1 1 1 1 1]
  [0 1 0 0 0 1]   (9)

We need to solve the following 0-1 integer programming optimization problem, which in this case looks as follows:

Z = min_x {x · c^T}  such that  M x ≥ b,   (10)

where x = [x1, ..., x6], xi ∈ {0, 1}, the cost vector is c = [2, 2, 15, 20, 20, 10], and the vector b, which represents R from Eq. (7), is b = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]^T. After deleting some redundant constraints, the 0-1 integer program can be expressed as follows:

Z = min_{x1,...,x6} {2x1 + 2x2 + 15x3 + 20x4 + 20x5 + 10x6}

such that
  x2 + x6 ≥ 1,
  x2 + x4 ≥ 1,
  x2 + x3 ≥ 1,
  x1 + x4 + x5 + x6 ≥ 1,
  x1 + x3 + x4 + x6 ≥ 1,
  x1 + x3 + x4 + x5 ≥ 1,
  xi ∈ {0, 1}, i ∈ [1, 6].   (11)

We applied the mixed-integer linear programming solver of Matlab 2017a to this 0-1 integer program and found the optimal solution x = [1, 1, 0, 0, 0, 0] with optimal objective value Z = 4. Therefore, the minimal test cost reduct is RED = {Headache, Temperature} and its test cost is cost_RED = 4.

4. Binary bat algorithm for test-cost-sensitive attribute reduction

The bat algorithm, introduced by Yang in 2010 [66], is a meta-heuristic search algorithm inspired by the echolocation behavior of bats, with the following idealized rules:

(a) All bats use echolocation to sense distance, and they also 'know' the difference between food/prey and background barriers in some 'magical' way.
(b) Bats fly randomly with velocity vi at position xi with a fixed frequency fmin, varying wavelength λ and loudness A0 to search for prey. They can automatically adjust the wavelength (or frequency) of their emitted pulses and adjust the rate of pulse emission r ∈ [0, 1], depending on the proximity of their target.
(c) The loudness can vary from a large (positive) A0 to a minimum constant value Amin.

Fig. 1 shows a flowchart of the bat algorithm. First, the bat population is initialized: positions xi, velocities vi and frequencies fi. The maximum number of iterations is T, and the updates of a bat's position, velocity and frequency at iteration t are given by Eqs. (12), (13) and (14):

v_i(t + 1) = v_i(t) + (x_i(t) − Gbest) · f_i,   (12)
x_i(t + 1) = x_i(t) + v_i(t + 1),   (13)
f_i = f_min + (f_max − f_min) · β,   (14)

where Gbest is the current global best solution, β ∈ [0, 1] is a uniformly distributed random number, and f_min and f_max are the lower and upper limits of the frequency, respectively. The bat algorithm works similarly to traditional particle swarm optimization, as


Fig. 1. Flowchart of the bat algorithm.

f_i essentially controls the pace and range of the movement of the bats. To improve the variability of the possible solutions, a random walk procedure is also used to perform new exploitation using the following formula:

x_new = x_old + ε · A^t,   (15)

where ε is a random number in [−1, 1] and A^t is the average loudness of all the bats at the current generation t. At each iteration of the algorithm, the loudness A_i and the pulse emission rate r_i are updated; if a random number rand is less than the loudness A_i and fitness(x_i) < fitness(Gbest), where fitness(x) is the fitness function, then the solution is accepted. The update formulas for A_i and r_i are the following:

A_i(t + 1) = α · A_i(t),   (16)
r_i(t + 1) = r_i^0 · [1 − exp(−γ · t)],   (17)

where α and γ are constants. If no more relevant information is given, one can use α = γ = 0.9 in bat algorithms. It can easily be seen that both the loudness and the pulse rate are updated as the bats move toward the best solutions.

4.1. Binary bat algorithm

In the standard bat algorithm, each bat moves toward positions in a search space that may have continuous coordinates. If, however, the search space is discrete, then updating a position just switches coordinates between 0 and 1. This is the case when finding the MTRP. If the conditional attribute set is A = {a1, a2, ..., a_|A|}, we can use binary vectors to define subsets of the conditional attribute set: any set of conditional attributes B ⊆ A can be expressed as a binary vector x_B = [a*_1, a*_2, ..., a*_|A|], where a*_i = 1 if a_i ∈ B and a*_i = 0 if a_i ∉ B. Obviously, a bat's position can be represented by such a binary vector. The binary version of the bat algorithm is called the binary bat algorithm; it restricts the new bat's position to binary values by using a V-shaped transfer function:

V(v_i^k(t)) = |(2/π) · arctan((π/2) · v_i^k(t))|,   (18)

where v_i^k(t) is the velocity of the ith particle at iteration t in the kth dimension. The position updating formula is:

x_i^k(t + 1) = (x_i^k(t))^(−1)   if rand < V(v_i^k(t + 1)),
x_i^k(t + 1) = x_i^k(t)          otherwise.   (19)



In Eq. (19), x_i^k(t) indicates the position of the ith particle at iteration t in the kth dimension, (x_i^k(t))^(−1) is the complement of x_i^k(t), and rand is a uniformly distributed random number.
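Eqs. (18) and (19) translate directly into code. A minimal sketch (helper names ours), where a bit is flipped with probability given by the V-shaped transfer function:

```python
import math
import random

def v_shaped(v):
    """V-shaped transfer function of Eq. (18)."""
    return abs((2 / math.pi) * math.atan((math.pi / 2) * v))

def update_position(x, v, rng=random):
    """Eq. (19): flip bit x_i^k with probability V(v_i^k(t+1))."""
    return [1 - xk if rng.random() < v_shaped(vk) else xk
            for xk, vk in zip(x, v)]

# V is 0 at zero velocity and approaches 1 for large |v|, so large
# velocities make a bit flip almost certain.
print(v_shaped(0.0))                                # 0.0
print(v_shaped(100.0) > 0.99)                       # True
print(update_position([0, 1, 0], [0.0, 0.0, 0.0]))  # [0, 1, 0]: no flips at v = 0
```

The V-shape means a bat sitting still (velocity near zero) keeps its current attribute subset, while a strongly pulled bat is very likely to toggle the corresponding attribute.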

4.2. Fitness function

When calculating the MTRP we have to deal with two issues. On the one hand, the solution must be a reduct. On the other hand, the test cost of the subset of conditional attributes corresponding to the solution should be as small as possible. Because of these two issues, most of the existing fitness functions are designed based on the positive region together with a parameter. The most popular fitness functions are presented in Table 2. All the existing functions from Table 2 have similar shortcomings:

1. There is some indeterminate parameter, which affects the final result. It is necessary to try different values and run the algorithm many times to obtain a better solution.
2. Most of the existing fitness functions contain a positive region factor, a number-of-attributes factor and a test cost factor. However, according to the formal definition of minimal test cost attribute reduction, the fitness function should only take into account the test cost factor and the positive region factor.
3. Computing the positive region brings additional, often lengthy, computation.

To clarify the shortcomings discussed above, we take Dai's fitness function, shown as Eq. (F.2) in Table 2, as an example to illustrate these problems. There are three items on the right side of Eq. (F.2), i.e. |POS_B(D)|/|POS_A(D)|, (|A| − |B|)/|A| and (x_B · c^T)^λ. From the formula for fitness(x_B) in this case, we can see that the larger the value of any of these three items, the better the attribute set B. However, these three items are not independent: if one item grows larger, the other two decrease, so the relationship between these three items and the attribute set B is a little bit murky at least. In practice, Dai's fitness function can either lead to a result that is not a reduct, or to a result that is a reduct but has a rather large test cost. Consider the case from Example 2.

Example 2. We apply Dai's fitness function to evaluate the TCDS from Table 1 for λ ∈ {−3, −2.75, −2.5, −2.25, −2, −1.75, −1.5, −1.25, −1, −0.75, −0.5, −0.25, 0}. The results are presented in Table 3. From Example 1 we know that x_B = [1, 1, 0, 0, 0, 0] is the minimal test cost reduct of the TCDS from Table 1. We evaluate six attribute sets using Dai's fitness function in Table 3. As shown in Table 3, if λ ∈ {−3, −2.75, −2.5, −2.25, −2, −1.75, −1.5, −1.25, −1, −0.75, −0.5}, then the fitness value of {a2} is larger than that of {a1, a2}. However, {a2} is not a reduct! The fitness function value of the minimal test cost reduct {a1, a2} is larger than that of the attribute set {a2} in only two cases, λ = −0.25 and λ = 0. When λ = −0.25 (with λ = 0.0 not counted), the value of the fitness function for {a1, a2} is the largest. The value of the fitness function thus depends on the value of λ, and this example shows that with inappropriately chosen parameters the fitness function may not represent our objectives. All this provides strong motivation for designing a new, more representative, fitness function. We design a new fitness function based on inconsistent object pairs, which contains an inconsistent object pairs factor and a test cost factor, has no uncertain parameter, and avoids computing the positive region.

Definition 3 (Inconsistent Pairs). Let S = (U, A, D, V, f, c) be a test-cost-sensitive decision system. For each set of attributes B ⊆ A, its set of inconsistent pairs is defined as follows:

Pair(B) = {(x, y) | x, y ∈ U ∧ (∀a ∈ B, f(x, a) = f(y, a)) ∧ (∃d ∈ D, f(x, d) ≠ f(y, d))}.   (20)

In Definition 3, if a pair ⟨x, y⟩ belongs to Pair(B), then the values of all conditional attributes in B are the same while the values of some decision attributes differ, which implies Pair(B) = IND(B) − IND(B ∪ D). The inconsistent pairs are clearly an extension of the indiscernibility relation. Hence, B1 ⊆ B2 implies Pair(B2) ⊆ Pair(B1) and |Pair(B2)| ≤ |Pair(B1)|. If |Pair(B2)| ≤ |Pair(B1)| then Pair(B2) ⊆ Pair(B1); in fact, this also implies B1 ⊆ B2. In addition, if |Pair(A)| = 0 then the decision system is consistent; otherwise the decision system is inconsistent.

Proposition 2. Let S = (U, A, D, V, f, c) be a test-cost-sensitive decision system. For each conditional attribute set B ⊆ A, if |Pair(B)| = |Pair(A)| then POS_B(D) = POS_A(D).

Proof. Clearly, if |Pair(B)| = |Pair(A)| then Pair(B) = Pair(A), since B ⊆ A implies Pair(A) ⊆ Pair(B). Assume POS_B(D) ≠ POS_A(D). This means there are x ∈ POS_A(D) − POS_B(D) and y ∈ U such that ⟨x, y⟩ ∈ Pair(B) but ⟨x, y⟩ ∉ Pair(A). Hence Pair(B) ≠ Pair(A), a contradiction. □

By Proposition 2, if |Pair(B)| = |Pair(A)| then B is a relative reduct. Moreover, inconsistent pairs are an extension of the indiscernibility relation. This allows us to define a fitness function that does not have any indeterminate parameter:

fitness(x_B) = (x_B · c^T) / ∑_{a∈A} c(a)            if |Pair(B)| = |Pair(A)|,
fitness(x_B) = (|Pair(B)| + 1) / (|Pair(A)| + 1)     otherwise.   (21)

In Eq. (21), B is the attribute set that corresponds to x_B. Note that if |Pair(B)| = |Pair(A)| then fitness(x_B) ≤ 1, and if |Pair(B)| ≠ |Pair(A)| then fitness(x_B) ≥ 1. When fitness(x_B) ≥ 1, the smaller the value of (|Pair(B)| + 1)/(|Pair(A)| + 1) is, the closer the set B is to being a reduct. We want to obtain a reduct with as low a test cost as possible, so when fitness(x_B) ≤ 1 we want the smallest value of x_B · c^T / ∑_{a∈A} c(a). This means that, in both cases, the smaller the value of the fitness function, the better the set B is.

We also apply the proposed fitness function to evaluate the attribute sets of the TCDS from Table 1; the results are shown in Table 4. In Table 4, if the value of the fitness function is less than or equal to 1, then the corresponding attribute set must be a relative reduct. Moreover, the smaller the value of the fitness function, the better the solution. For example, the relative reduct {a1, a2, a3} is better than the reduct {a2, a4}, which shows that there is no fixed relationship between the number of attributes and their total test cost. The value of the fitness function for {a1, a2} is the smallest one, and it is easy to verify that {a1, a2} is the minimal test cost reduct.

In order to accelerate the computation of the fitness function, we propose a fast algorithm for counting the inconsistent object pairs (so one can avoid computing the positive region). To do this, we need the following convenient result.

Theorem 1. Given a decision table S = (U, A, D, V, f), for B ⊆ A, the number of inconsistent object pairs |Pair(B)| is given by

|Pair(B)| = ∑_{x∈U} (|[x]_B| − |[x]_{B∪D}|).    (22)
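As a concrete illustration, the following is a minimal Python sketch of the fitness function of Eq. (21) on an invented toy decision system (the objects, attribute names and costs below are ours, not from Table 1); inconsistent pairs are counted here by brute force, without any fast counting scheme.

```python
def pair_count(universe, attrs, decision):
    """|Pair(B)|: ordered pairs <x, y> that agree on every attribute in B
    but differ on the decision attribute (brute force, O(|U|^2))."""
    return sum(
        1
        for x in universe
        for y in universe
        if all(x[a] == y[a] for a in attrs) and x[decision] != y[decision]
    )

def fitness(x_bits, attrs_all, costs, universe, decision):
    """Eq. (21): cost ratio if B preserves |Pair(A)|, pair ratio otherwise."""
    chosen = [a for a, bit in zip(attrs_all, x_bits) if bit]
    pb = pair_count(universe, chosen, decision)
    pa = pair_count(universe, attrs_all, decision)
    if pb == pa:
        return sum(costs[a] for a in chosen) / sum(costs.values())
    return (pb + 1) / (pa + 1)

# Invented toy data: two conditional attributes a1, a2 and a decision d.
U = [
    {"a1": 0, "a2": 0, "d": 0},
    {"a1": 0, "a2": 1, "d": 1},
    {"a1": 1, "a2": 0, "d": 0},
]
COSTS = {"a1": 3, "a2": 1}

print(fitness((0, 1), ["a1", "a2"], COSTS, U, "d"))  # {a2} is a reduct: 1/4 = 0.25
print(fitness((1, 0), ["a1", "a2"], COSTS, U, "d"))  # {a1} is not: (2+1)/(0+1) = 3.0
```

As in the paper, reducts always score at most 1 and non-reducts at least 1, so a single minimization handles both cases.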

Please cite this article as: X. Xie, X. Qin, Q. Zhou et al., A novel test-cost-sensitive attribute reduction approach using the binary bat algorithm, Knowledge-Based Systems (2019) 104938, https://doi.org/10.1016/j.knosys.2019.104938.

X. Xie, X. Qin, Q. Zhou et al. / Knowledge-Based Systems xxx (xxxx) xxx


Table 2
Summary of the popular existing fitness functions.

Scholars         Fitness functions
Pan et al. [43]  fitness(x_B) = (|A| − |B|)/|A| + |POS_B(D)| · (x_B · c^T)^λ                               (F.1)
Dai et al. [47]  fitness(x_B) = |POS_B(D)|/|POS_A(D)| + ((|A| − |B|)/|A|) · (x_B · c^T)^λ                  (F.2)
Xie et al. [50]  fitness(x_B) = x_B · c^T / ∑_{a∈A} c(a)   if (|POS_A(D)| − |POS_B(D)|)/|U| ≤ ε_0,
                 and (|POS_A(D)| + µ)/(|POS_B(D)| + µ)   otherwise                                         (F.3)

Table 3
Evaluation with Dai's fitness function (F.2).

Solutions  λ=−3   −2.75  −2.5   −2.25  −2     −1.75  −1.5   −1.25  −1     −0.75  −0.5   −0.25  0      Reduct?
{a1}       0.122  0.145  0.173  0.205  0.244  0.290  0.345  0.410  0.488  0.580  0.690  0.821  0.976  No
{a2}       0.158  0.188  0.223  0.265  0.316  0.375  0.446  0.531  0.631  0.750  0.892  1.061  1.262  No
{a4}       0.000  0.000  0.001  0.001  0.003  0.006  0.013  0.027  0.056  0.118  0.250  0.530  1.119  No
{a1, a2}   0.026  0.037  0.052  0.074  0.104  0.147  0.208  0.295  0.418  0.589  0.833  1.179  1.667  Yes
{a1, a4}   0.000  0.000  0.000  0.001  0.002  0.004  0.009  0.020  0.043  0.094  0.203  0.440  0.952  No
{a2, a4}   0.000  0.000  0.001  0.002  0.003  0.008  0.016  0.035  0.076  0.164  0.355  0.770  1.667  Yes

Table 4
Evaluation of all attribute subsets with our proposed fitness function (Eq. (21)).

Solutions      Value    Reduct?
{a1}           17.000   No
{a2}           7.000    No
{a3}           7.000    No
{a4}           9.000    No
{a5}           13.000   No
{a6}           13.000   No
{a1, a2}       0.058    Yes
{a1, a3}       5.000    No
{a1, a4}       9.000    No
{a1, a5}       9.000    No
{a1, a6}       9.000    No
{a2, a3}       3.000    No
{a2, a4}       0.319    Yes
{a2, a5}       3.000    No
{a2, a6}       3.000    No
{a3, a4}       3.000    No
{a3, a5}       5.000    No
{a3, a6}       3.000    No
{a4, a5}       5.000    No
{a4, a6}       5.000    No
{a5, a6}       7.000    No

Solutions          Value    Reduct?
{a1, a2, a3}       0.275    No
{a1, a2, a4}       0.348    No
{a1, a2, a5}       0.348    No
{a1, a2, a6}       0.203    No
{a1, a3, a4}       3.000    No
{a1, a3, a5}       5.000    No
{a1, a3, a6}       3.000    No
{a1, a4, a5}       5.000    No
{a1, a4, a6}       5.000    No
{a1, a5, a6}       5.000    No
{a2, a3, a4}       0.536    No
{a2, a3, a5}       0.536    Yes
{a2, a3, a6}       0.391    Yes
{a2, a4, a5}       0.609    No
{a2, a4, a6}       0.464    No
{a2, a5, a6}       0.464    Yes
{a3, a4, a5}       3.000    No
{a3, a4, a6}       0.652    Yes
{a3, a5, a6}       3.000    No
{a4, a5, a6}       3.000    No
{a1, a2, a3, a4}   0.565    No

Solutions                  Value    Reduct?
{a1, a2, a3, a5}           0.565    No
{a1, a2, a3, a6}           0.420    No
{a1, a2, a4, a5}           0.638    No
{a1, a2, a4, a6}           0.493    No
{a1, a2, a5, a6}           0.493    No
{a1, a3, a4, a5}           3.000    No
{a1, a3, a4, a6}           0.681    No
{a1, a3, a5, a6}           3.000    No
{a1, a4, a5, a6}           3.000    No
{a2, a3, a4, a5}           0.826    No
{a2, a3, a4, a6}           0.681    No
{a2, a3, a5, a6}           0.681    No
{a2, a4, a5, a6}           0.754    No
{a3, a4, a5, a6}           0.942    No
{a1, a2, a3, a4, a5}       0.855    No
{a1, a2, a3, a4, a6}       0.710    No
{a1, a2, a3, a5, a6}       0.710    No
{a1, a2, a4, a5, a6}       0.783    No
{a1, a3, a4, a5, a6}       0.971    No
{a2, a3, a4, a5, a6}       0.971    No
{a1, a2, a3, a4, a5, a6}   1.000    No

Proof. For each object x ∈ U and each object y ∈ [x]_B − [x]_{B∪D}, the object pair ⟨x, y⟩ must be an inconsistent object pair, i.e., ⟨x, y⟩ ∈ Pair(B). Therefore, the number of inconsistent object pairs associated with object x is |[x]_B| − |[x]_{B∪D}|. Adding up the numbers of inconsistent pairs associated with all objects gives |Pair(B)| = ∑_{x∈U} (|[x]_B| − |[x]_{B∪D}|). □

Theorem 1 provides the means for constructing a fast algorithm that calculates the number of inconsistent pairs, i.e., Algorithm 3. Before calculating the number of inconsistent object pairs, we must know the equivalence partition, and this is provided by Algorithm 2, which uses mappings to compute such a partition. In Algorithm 2, num2str(x_i(B)) converts the attribute values of object x_i into a character string. Algorithm 2 is quite straightforward and its time complexity is clearly O(|U|). Algorithm 3 starts by computing the partition U/B using Algorithm 2. Steps 2–7 record, for each object, the size of its equivalence class on the attribute set B; the time complexity of this process is O(|U|). Step 8 computes the equivalence partition U/(B ∪ D) using U/B, and the time complexity of this process is again O(|U|). Steps 9–13 are similar to steps 2–7, so the time complexity of this process is also O(|U|). Step 14 outputs |Pair(B)| = ∑_{i=1}^{|U|} (Count_B[i] − Count_{B∪D}[i]), and the time complexity of this step is O(|U|) as well. Altogether, the time complexity of Algorithm 3 is O(|U|).

Table 5 compares our proposed fitness function with the previous fitness functions. As one can see, our proposed fitness function is more efficient and does not contain any uncertain parameter.
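The mapping-based counting of Algorithms 2 and 3 can be sketched in Python as follows (our own rendering, assuming objects are given as dictionaries; a Counter plays the role of containers.Map, and tuples of attribute values play the role of num2str keys):

```python
from collections import Counter

def pair_count(universe, attrs, decisions):
    """|Pair(B)| via Theorem 1: sum over x of |[x]_B| - |[x]_{B∪D}|,
    computed with hash maps in a single pass per partition."""
    keys_b = [tuple(x[a] for a in attrs) for x in universe]
    keys_bd = [kb + tuple(x[d] for d in decisions)
               for kb, x in zip(keys_b, universe)]
    size_b = Counter(keys_b)    # |[x]_B| for each equivalence-class key
    size_bd = Counter(keys_bd)  # |[x]_{B∪D}| for each equivalence-class key
    return sum(size_b[kb] - size_bd[kbd]
               for kb, kbd in zip(keys_b, keys_bd))

# Invented toy data: x1 and x2 agree on 'a' but disagree on 'd',
# so they contribute the two ordered pairs <x1, x2> and <x2, x1>.
U = [{"a": 0, "d": 0}, {"a": 0, "d": 1}, {"a": 1, "d": 0}, {"a": 1, "d": 0}]
print(pair_count(U, ["a"], ["d"]))  # 2
```

Because both partitions are built by hashing, the whole computation is linear in |U|, matching the stated complexity of Algorithm 3.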

Example 3. Consider the decision system S = (U, A, D, V, f) from Table 1, with U = {x1, x2, x3, x4, x5, x6, x7}, A = {a1, a2, a3, a4, a5, a6}, and D = {Flu}. The detailed process of calculating |Pair(A)| and |Pair({a1, a2})| using Algorithms 2 and 3 is as follows. First, for B = A we construct the mapping map: map(num2str(x1)) = {1}, map(num2str(x2)) = {2}, map(num2str(x3)) = {3}, map(num2str(x4)) = {4}, map(num2str(x5)) = {5}, map(num2str(x6)) = {6}, and map(num2str(x7)) = {7}. Then we get the equivalence partition U/A = {{1}, {2}, {3}, {4}, {5}, {6}, {7}} and the count array Count_A[7] = [1, 1, 1, 1, 1, 1, 1]. Next we calculate the equivalence partition U/(A ∪ D) = {{1}, {2}, {3}, {4}, {5}, {6}, {7}} and the count array Count_{A∪D}[7] = [1, 1, 1, 1, 1, 1, 1]. Therefore, |Pair(A)| = (1 − 1) + (1 − 1) + (1 − 1) +

Table 5
Comparison of the fitness functions.

Fitness function                 Time complexity  Uncertain parameter  Factors considered
The fitness function in [43]     O(|B||U|)        Yes                  Positive region, size of attribute set, test cost
The fitness function in [47]     O(|B||U|)        Yes                  Positive region, size of attribute set, test cost
The fitness function in [50]     O(|B||U|)        Yes                  Positive region, test cost
Our proposed fitness function    O(|U|)           No                   Inconsistent pairs, test cost

Algorithm 2: Computing the equivalence partition based on mapping

Input: A decision system S = (U, A, D, V, f), an attribute set B ⊆ A
Output: The equivalence partition U/B.
1   initialize a map table map_B = containers.Map;
2   initialize the key counter value = 0 and the equivalence partition objects array U/B;
3   for i = 1 to |U| do
4       if map_B.isKey(num2str(x_i)) then
5           U/B[map_B(num2str(x_i))] = U/B[map_B(num2str(x_i))] ∪ {i};
            /* if the object already has a mapping, it is added to the corresponding objects array */
6       else
7           value = value + 1;
8           map_B(num2str(x_i)) = value;
9           U/B[value] = {i};
            /* if the object does not have a mapping, construct a mapping and initialize the corresponding objects array */
10      end
11  end
12  return U/B;

Algorithm 3: A fast algorithm for calculating the number of inconsistent object pairs

Input: A decision system S = (U, A, D, V, f), an attribute set B ⊆ A
Output: The number |Pair(B)|.
1   compute the equivalence partition U/B using Algorithm 2;
2   initialize two count arrays Count_B[|U|] = 0 and Count_{B∪D}[|U|] = 0;
    /* Count[i] represents the size of the equivalence class of the ith object */
3   for i = 1 to |U/B| do
4       for each k ∈ U/B[i] do
5           Count_B[k] = |U/B[i]|;
            /* the class size of each object equals the size of the corresponding block of the partition */
6       end
7   end
8   compute the equivalence partition U/(B ∪ D) using U/B;
9   for j = 1 to |U/(B ∪ D)| do
10      for each k′ ∈ U/(B ∪ D)[j] do
11          Count_{B∪D}[k′] = |U/(B ∪ D)[j]|;  /* the same as Step 5 */
12      end
13  end
14  return |Pair(B)| = ∑_{i=1}^{|U|} (Count_B[i] − Count_{B∪D}[i]);

(1 − 1) + (1 − 1) + (1 − 1) + (1 − 1) = 0. The process of computing |Pair({a1, a2})| is virtually the same. We first construct the mapping map, and in this case we have map(num2str(x1)) = {1, 2, 3}, map(num2str(x4)) = {4}, map(num2str(x5)) = {5, 6}, and map(num2str(x7)) = {7}. Then we get the equivalence partition U/{a1, a2} = {{x1, x2, x3}, {x4}, {x5, x6}, {x7}} and the count array Count_{a1,a2}[7] = [3, 3, 3, 1, 2, 2, 1]. We calculate the equivalence partition U/({a1, a2} ∪ D) = {{x1, x2, x3}, {x4}, {x5, x6}, {x7}} and the count array Count_{{a1,a2}∪D}[7] = [3, 3, 3, 1, 2, 2, 1]. This means |Pair({a1, a2})| = (3 − 3) + (3 − 3) + (3 − 3) + (1 − 1) + (2 − 2) + (2 − 2) + (1 − 1) = 0.

4.3. Binary bat algorithm for test-cost-sensitive attribute reduction

Having defined the fitness function based on inconsistent pairs and a fast algorithm for counting them, we can apply the binary bat algorithm described in Section 4.1 to test-cost-sensitive attribute reduction. The result is our Algorithm 4.

Algorithm 4: Binary bat algorithm for test-cost-sensitive attribute reduction

Input: A test-cost-sensitive decision system S = (U, A, D, V, f, c), bat population size M, max number of iterations T, pulse frequency f_i, pulse rate r_i, loudness A_i
Output: The minimal attribute reduction and its test cost.
1   Initialize the bat population x_i using Eq. (24) (i = 1, 2, . . . , M);
2   Initialize velocity v_i = 0;
3   while t < T do
4       Adjust frequency and update velocities;
5       Update positions using Eqs. (18) and (19);
6       if rand > r_i then
7           Select a solution Gbest among the best solutions;
8           Change some of the dimensions of the position vector with some of the dimensions of Gbest;
9       end
10      Generate a new solution by flying randomly;
11      if rand < A_i ∧ fitness(x_i) < fitness(Gbest) then
12          Accept the new solution;
13          Increase r_i and reduce A_i;
14      end
15      Rank the bats and find the current Gbest;
16      Set t = t + 1;
17  end
18  Obtain the reduct RED from the solution Gbest and its test cost cost_RED;
19  Return RED and cost_RED;

For the MTRP, an attribute a_i ∈ A is a core attribute if |Pair(A − {a_i})| ≠ |Pair(A)|. If an attribute a_i is a core attribute, then the value of x_i (or a_i* as denoted in Section 4.1, where the vector X_B is discussed) that represents the corresponding bit of a bat's position should be equal to 1, and this has to be taken into account when the bat population is randomly initialized. We define the vector p, where p_i is the probability that x_i = 1, as follows:

       ⎧ 0.5,   if |Pair(A − {a_i})| = |Pair(A)|
p_i =  ⎨                                                    (23)
       ⎩ 0.8,   otherwise
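A minimal sketch of this core-biased initialization together with the sampling rule of Eq. (24) below (function names and the use of Python's random module are our own illustration):

```python
import random

def init_probabilities(is_core):
    # Eq. (23): p_i = 0.8 for core attributes, 0.5 for the rest
    return [0.8 if core else 0.5 for core in is_core]

def init_bat(p, rng):
    # Eq. (24): x_i = 0 if rand >= p_i, and 1 otherwise
    return [1 if rng.random() < p_i else 0 for p_i in p]

rng = random.Random(0)
p = init_probabilities([True, False, False])  # suppose only a1 is a core attribute
population = [init_bat(p, rng) for _ in range(50)]
```

Over many bats, a core attribute's bit is set roughly 80% of the time, so core attributes are present in most initial solutions while the search still explores positions without them.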

Table 6
Description of ten data sets from UCI.

Data sets  Names           Areas     No. of objects  No. of attributes  No. of classes
S1         Soybean(small)  Life      47              35                 4
S2         Zoo             Zoology   101             16                 7
S3         Voting          Social    435             16                 2
S4         Tic-tac-toe     Game      958             9                  2
S5         P-gene          Life      106             57                 2
S6         K-R vs. K-P     Game      3196            36                 2
S7         Mushroom        Botany    8124            22                 2
S8         Letter          Computer  20000           16                 26
S9         CNAE-9          Business  1080            856                9
S10        Musk(Ver.2)     Physical  6598            166                2

Our experiments have shown that the constants 0.5 and 0.8 work well, but of course they are not the only values that could be used. Based on the probability vector p, the bat population is initialized as follows:

       ⎧ 0,   if rand ≥ p_i
x_i =  ⎨                                                    (24)
       ⎩ 1,   otherwise

where rand is a random number satisfying 0 < rand < 1. A detailed pseudo code of the binary bat algorithm for test-cost-sensitive attribute reduction is given in Algorithm 4.

5. Experiments

In this section we present the results of experiments performed to evaluate the performance and usefulness of our binary bat algorithm (denoted by BBA). The experiments used ten real data sets from the UCI repository of machine learning databases [79]. All ten data sets are shown in Table 6; these data sets do not have intrinsic test costs. We use three different distributions to produce random test costs, namely the uniform, normal, and Pareto distributions. We assume that the test costs are integers ranging from M to N (in our experiments, M = 1 and N = 100). Moreover, ⌊x⌋ denotes the largest integer not exceeding x.

Uniform distribution: Let c_u denote a random test cost drawn from the uniform distribution. In this case we have

P(c_u = n) = 1 / (N − M + 1),    (25)

where n is an integer and M ≤ n ≤ N. Let U be a uniformly distributed random number from the interval (0, 1). Then the following simple formula

c_u(M, N, U) = M + ⌊(N − M + 1) · U⌋    (26)

gives discrete and uniformly distributed values on [M, N].

Normal distribution: the normal distribution is defined as

f(x) = (1/√(2πσ²)) · e^{−(x−µ)²/(2σ²)},    (27)

where µ is the mean or expectation of the distribution, σ is the standard deviation, and σ² is the variance. The simplest case, with µ = 0 and σ = 1, is known as the standard normal distribution. Suppose U1 and U2 are independent random variables uniformly distributed over the interval (0, 1). Let

Z0(U1, U2) = √(−2 ln U1) · cos(2π U2)    (28)

and

Z(M, N, U1, U2) = (M + N + 1)/2 + α · Z0(U1, U2),    (29)

where α is a non-negative number (in our experiments, α = 8). It can be shown that the values of Z(M, N, U1, U2) are normally distributed (cf. [36]). The test costs are then set as

                    ⎧ M,                   if y < M
c_n(M, N, U1, U2) = ⎨ N,                   if y > N            (30)
                    ⎩ ⌊Z(M, N, U1, U2)⌋,   otherwise

where y = Z(M, N, U1, U2).

Pareto distribution: the Pareto distribution, especially its bounded version, is often used to describe various types of social, scientific, geophysical, actuarial, and other observable phenomena. Let U be uniformly distributed on (0, 1); then a bounded Pareto-distributed value on (M, N + 1) is given by

p(M, N, U) = ( −(U · (N + 1)^α − U · M^α − (N + 1)^α) / (M^α · (N + 1)^α) )^{−1/α},    (31)

where α is a parameter that determines the shape of the distribution (in our experiments, α = 2). A natural truncation to integers in the closed interval [M, N] gives the test costs:

c_p(M, N, U) = ⌊p(M, N, U)⌋.    (32)

In our experiments the test cost of each attribute is randomly drawn from the interval [1, 100], i.e., M = 1 and N = 100, and each data set generates ten different sets of test costs under each distribution. Therefore, each data set has 30 instances with different test cost settings, and a total of 300 instances needed to be tested. The goal of our experiments was to compare our BBA with other existing techniques for finding optimal solutions. We used the 0-1 integer programming-based approach (denoted by 0-1 ILP) to obtain the optimal solution of each instance, and then compared the results of BBA with the general heuristic algorithm (denoted by GHR) in [48], the discrete particle swarm optimization algorithm (denoted by DPSO) in [47], the set-cover-based algorithm (denoted by SCA), and the immune quantum-behaved particle swarm optimization algorithm (denoted by QIPSO) in [50]. GHR is a general heuristic algorithm with three key processes. The first process computes the core attributes. The next process adds an attribute, based on a significance measure that incorporates the test cost factor, until the attribute set is a reduct; the final process removes redundant attributes in descending order of their test costs. The time complexity of GHR is O(|U|²|C|³), and its results are usually not optimal. SCA is a discernibility matrix-based heuristic algorithm with time complexity O(|U|²|C|²); its main shortcoming is its excessively high space complexity. DPSO and QIPSO apply particle swarm optimization to the MTRP, which also has some known weaknesses. These four algorithms were only tested on small data sets in previous studies, and for small data sets one can quickly obtain

Table 7
The setting of parameters.

Approaches  Parameters
0-1 ILP     Mixed-integer linear programming solver in Matlab 2017(a)
GHR         λ = [0, −0.25, −0.5, . . . , −3]
DPSO        Popsize M = 50, Iterations T = 200, p = 0.05, λ = [0, −0.25, −0.5, . . . , −3], c1 = c2 = 0.35
SCA         No parameters
QIPSO       Popsize M = 50, Iterations T = 200, K = 10, ε0 = 0.01
BBA         Popsize M = 50, Iterations T = 200, A0 = rand(0, 1), r0 = rand(0, 1), ε ∈ [−1, 1], fmin = 0, fmax = 2, α = γ = 0.9

Fig. 2. Comparison of FOF on the first eight data sets.

the optimal solutions in a more straightforward way (such as 0-1 ILP). For a better and more comprehensive evaluation of the compared algorithms and models, we introduce several evaluation metrics, which is also an essential contribution of this paper. The setting of parameters for each approach is shown in Table 7. For DPSO, QIPSO, and BBA, the population size and the maximal number of iterations are the same. Other parameters in each algorithm are set according to the relevant literature, including the parameters ε, fmin, fmax, α and γ for BBA (as they are typical for binary bat algorithms [72]). For BBA, to get more adequate results we also randomly initialize A0 and r0 instead of using the fixed traditional values of 0.25 and 0.5. Each algorithm runs ten times on each instance. We record the minimal test cost of these ten results for each instance and the average computation time on each data set. The results are presented in Tables 8–17. Columns 1 to 10 represent the results of all instances; the ith column represents the result on the ith instance for each approach. In these tables, "–" means that we could not obtain the solution within an acceptable computation time or that the algorithm ran out of memory.

As shown in Tables 8–17, 0-1 ILP quickly finds the optimal solutions for all instances on the data sets Soybean(small), Zoo, Voting, Tic-tac-toe, P-gene, and Mushroom. It takes a long time to find the optimal solutions on K-R vs. K-P and Letter, and it cannot find the optimal solutions on the other two data sets. GHR quickly finds solutions for all instances on all data sets except CNAE-9 and Musk(Ver.2); these solutions are usually not optimal, and those on Soybean(small), P-gene and CNAE-9 are the worst. The performance of SCA is unstable: it is worse than GHR on some data sets, but it may be the best in some instances. DPSO finds the optimal solutions in most cases on the first eight data sets; its solutions on P-gene and Musk(Ver.2) are the worst. QIPSO also obtains the optimal solutions on most instances of the first eight data sets, and its solutions on P-gene and Musk(Ver.2) are better than those of DPSO and GHR. BBA finds all the best solutions except for some instances on Soybean(small), P-gene and Musk(Ver.2). BBA's solutions are the best especially on CNAE-9, and its average elapsed times on all data sets are less than those of DPSO and QIPSO.

5.1. Evaluation metrics with the optimal solutions

In order to evaluate the performance of an algorithm, some statistical metrics are needed. A few such metrics, namely the finding optimal factor, the maximal exceeding factor and the average exceeding factor, were proposed in [36]. All three metrics assume that the optimal solutions can be obtained. All optimal solutions on the first eight data sets are shown in Tables 8–17; they were obtained with 0-1 integer programming, so we can compare the three metrics on all eight data sets. Each data set has ten different cost settings for each cost distribution, i.e., the value of K is 10. The results of FOF, MEF, and AEF on the first eight data sets are shown in Figs. 2–4.

Finding optimal factor (FOF): Let the number of experiments be K, and the number of experiments that successfully find an optimal reduct be k. FOF is defined as

FOF = k / K.    (33)
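Computing FOF from the recorded results is straightforward; a small sketch (the function name and the example numbers are ours):

```python
def finding_optimal_factor(found_costs, optimal_costs):
    """Eq. (33): FOF = k/K, the fraction of instances on which the
    algorithm's minimal test cost equals the optimum (e.g., from 0-1 ILP)."""
    if len(found_costs) != len(optimal_costs):
        raise ValueError("cost lists must have equal length K")
    k = sum(1 for f, o in zip(found_costs, optimal_costs) if f == o)
    return k / len(found_costs)

# Three hypothetical instances: the heuristic matches the optimum on two of them.
print(finding_optimal_factor([5, 97, 45], [5, 82, 45]))  # 2/3
```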

Table 8
Experimental results of different algorithms on Soybean(small) with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        5    82   17   108  91   57   107  59   103  35   0.407
0-1 ILP     Normal         106  75   73   90   88   87   84   84   102  84   0.402
0-1 ILP     Pareto         4    3    2    3    3    3    4    3    3    2    0.410
GHR         Uniform        5    97   45   109  91   108  134  59   119  77   0.777
GHR         Normal         111  102  94   109  97   87   84   85   109  84   0.940
GHR         Pareto         4    4    5    3    5    39   7    4    5    4    0.624
SCA         Uniform        5    104  17   108  114  60   126  65   103  35   0.636
SCA         Normal         170  75   73   90   88   127  116  85   167  84   0.483
SCA         Pareto         5    3    2    5    3    3    4    3    3    2    0.497
DPSO        Uniform        5    106  17   138  99   57   124  65   115  35   2.763
DPSO        Normal         138  84   77   90   88   87   84   84   127  92   2.792
DPSO        Pareto         4    3    3    4    3    3    4    5    3    3    2.785
QIPSO       Uniform        5    87   17   114  91   57   121  59   115  35   2.610
QIPSO       Normal         127  75   73   90   88   87   84   84   105  84   2.717
QIPSO       Pareto         4    5    4    4    3    3    4    4    3    3    2.803
BBA         Uniform        5    82   17   110  91   57   107  59   114  35   2.466
BBA         Normal         106  75   73   90   88   87   84   84   102  84   2.538
BBA         Pareto         4    4    2    4    3    3    4    4    3    2    2.525

Table 9
Experimental results of different algorithms on Zoo with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        156  181  223  118  200  266  102  102  72   169  1.212
0-1 ILP     Normal         234  238  242  226  241  245  255  230  214  243  1.163
0-1 ILP     Pareto         19   5    8    7    8    12   11   7    9    33   1.210
GHR         Uniform        196  191  254  130  227  317  102  110  73   189  1.431
GHR         Normal         246  238  242  249  260  267  258  231  229  262  1.122
GHR         Pareto         19   11   8    8    8    14   23   9    9    33   1.483
SCA         Uniform        156  210  223  118  200  290  102  102  72   169  1.307
SCA         Normal         234  268  242  226  241  245  255  230  214  243  1.480
SCA         Pareto         24   5    8    7    8    16   12   8    10   35   1.300
DPSO        Uniform        156  181  223  118  200  266  102  102  72   169  4.357
DPSO        Normal         234  238  242  226  241  250  255  230  214  243  4.296
DPSO        Pareto         19   5    8    7    8    12   11   7    9    34   4.287
QIPSO       Uniform        156  181  223  118  200  266  102  102  72   169  4.386
QIPSO       Normal         234  238  242  226  241  245  255  230  214  242  4.409
QIPSO       Pareto         19   5    8    7    8    12   11   7    9    33   4.581
BBA         Uniform        156  181  223  118  200  266  102  102  72   169  4.199
BBA         Normal         234  238  242  226  241  245  255  230  214  243  4.135
BBA         Pareto         19   5    8    7    8    12   11   7    9    33   4.199

Table 10
Experimental results of different algorithms on Voting with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        583  500  509  416  332  546  292  447  499  414  3.110
0-1 ILP     Normal         486  478  460  458  442  490  475  484  400  447  3.059
0-1 ILP     Pareto         22   20   21   37   59   39   16   31   20   40   3.073
GHR         Uniform        583  500  509  416  332  546  292  447  499  414  1.432
GHR         Normal         486  478  460  458  442  490  475  484  400  447  1.489
GHR         Pareto         22   20   21   40   59   39   16   31   20   43   1.420
SCA         Uniform        614  514  509  416  332  546  292  447  507  414  3.141
SCA         Normal         516  478  460  458  442  490  511  484  400  447  3.222
SCA         Pareto         22   25   41   45   65   39   16   31   22   47   3.667
DPSO        Uniform        583  514  509  416  332  546  292  447  499  414  18.166
DPSO        Normal         486  478  460  458  442  490  475  484  400  447  18.239
DPSO        Pareto         23   20   21   37   59   39   16   31   20   43   18.067
QIPSO       Uniform        583  506  509  416  332  546  292  447  499  414  18.394
QIPSO       Normal         486  478  460  458  442  490  475  484  400  447  20.381
QIPSO       Pareto         22   20   21   37   59   39   16   31   20   40   20.198
BBA         Uniform        583  500  509  416  332  546  292  447  499  414  18.180
BBA         Normal         486  478  460  458  442  490  475  484  400  447  18.122
BBA         Pareto         22   20   21   37   59   39   16   31   20   40   17.981

Table 11
Experimental results of different algorithms on Tic-tac-toe with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        285  380  352  362  238  353  398  455  474  211  4.275
0-1 ILP     Normal         404  409  386  400  383  389  370  383  401  378  4.289
0-1 ILP     Pareto         17   75   74   12   9    13   14   16   10   24   4.285
GHR         Uniform        285  380  352  368  238  369  407  455  484  211  2.121
GHR         Normal         404  420  386  400  403  390  370  391  403  378  2.092
GHR         Pareto         17   75   74   12   10   13   14   16   10   25   2.137
SCA         Uniform        285  380  352  362  238  353  398  455  474  211  4.290
SCA         Normal         404  409  386  400  383  389  370  383  401  378  4.305
SCA         Pareto         17   75   74   12   9    13   14   16   10   24   4.653
DPSO        Uniform        285  380  352  362  238  353  398  455  474  211  36.833
DPSO        Normal         404  409  386  400  383  389  370  383  401  378  36.530
DPSO        Pareto         17   75   74   12   9    13   14   16   10   24   37.639
QIPSO       Uniform        285  380  352  362  238  353  398  455  474  211  39.108
QIPSO       Normal         404  409  386  400  383  389  370  383  401  378  40.218
QIPSO       Pareto         17   75   74   12   9    13   14   16   10   24   41.008
BBA         Uniform        285  380  352  362  238  353  398  455  474  211  32.192
BBA         Normal         404  409  386  400  383  389  370  383  401  378  32.055
BBA         Pareto         17   75   74   12   9    13   14   16   10   24   33.070

Table 12
Experimental results of different algorithms on P-gene with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        31   26   37   21   37   36   98   41   44   28   1.871
0-1 ILP     Normal         172  168  157  165  180  172  167  167  168  178  7.124
0-1 ILP     Pareto         4    5    4    4    4    4    4    4    4    4    4.652
GHR         Uniform        94   82   37   21   95   72   118  53   122  44   1.731
GHR         Normal         203  174  192  177  194  184  185  186  181  186  1.776
GHR         Pareto         4    6    4    6    7    4    7    4    5    5    1.601
SCA         Uniform        38   29   37   21   37   44   107  49   48   28   5.025
SCA         Normal         223  180  189  184  231  225  190  221  192  191  4.778
SCA         Pareto         4    5    4    5    5    4    5    4    5    5    4.710
DPSO        Uniform        48   45   39   55   59   53   119  42   56   52   5.519
DPSO        Normal         201  195  159  184  183  191  195  201  200  200  5.479
DPSO        Pareto         5    5    5    5    5    5    5    5    5    5    5.416
QIPSO       Uniform        38   30   37   21   41   44   103  41   58   37   6.107
QIPSO       Normal         192  181  157  184  180  191  188  187  185  186  6.121
QIPSO       Pareto         5    5    5    5    5    5    5    4    5    4    6.217
BBA         Uniform        31   26   39   21   37   36   101  41   51   29   4.942
BBA         Normal         190  173  157  178  180  188  167  187  168  189  5.071
BBA         Pareto         4    5    5    5    5    4    4    4    5    5    4.901

Table 13
Experimental results of different algorithms on K-R vs. K-P with three distributions.

Algorithms  Distributions  1     2     3     4     5     6     7     8     9     10    Time
0-1 ILP     Uniform        1272  1384  1478  1566  1305  1379  1555  1462  1390  1296  1611.971
0-1 ILP     Normal         1441  1483  1508  1442  1473  1479  1434  1466  1419  1408  1375.501
0-1 ILP     Pareto         85    101   98    135   345   118   159   66    102   71    1498.126
GHR         Uniform        1287  1437  1489  1597  1312  1388  1608  1462  1406  1309  19.429
GHR         Normal         1441  1483  1508  1442  1483  1479  1434  1466  1419  1408  19.851
GHR         Pareto         85    101   98    135   347   118   159   66    111   71    17.389
SCA         Uniform        1272  1384  1478  1566  1305  1379  1555  1462  1390  1296  99.915
SCA         Normal         1441  1483  1508  1442  1473  1479  1434  1466  1419  1408  106.952
SCA         Pareto         85    101   98    135   345   118   159   66    102   71    91.698
DPSO        Uniform        1272  1384  1478  1597  1305  1394  1572  1520  1390  1309  200.715
DPSO        Normal         1441  1483  1508  1442  1483  1479  1452  1466  1419  1408  215.386
DPSO        Pareto         85    102   98    135   345   118   159   66    102   71    231.068
QIPSO       Uniform        1272  1384  1478  1566  1305  1388  1568  1462  1390  1309  210.921
QIPSO       Normal         1441  1483  1508  1442  1473  1479  1434  1466  1419  1408  236.481
QIPSO       Pareto         85    101   98    135   345   118   159   66    102   71    257.902
BBA         Uniform        1272  1384  1478  1566  1305  1379  1555  1462  1390  1296  185.592
BBA         Normal         1441  1483  1508  1442  1473  1479  1434  1466  1419  1408  190.774
BBA         Pareto         85    101   98    135   345   118   159   66    102   71    197.572

Table 14
Experimental results of different algorithms on Mushroom with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        65   70   136  125  80   66   65   177  196  136  133.435
0-1 ILP     Normal         175  171  184  164  202  167  173  167  181  190  124.323
0-1 ILP     Pareto         8    9    26   4    4    7    4    4    6    6    119.883
GHR         Uniform        65   86   142  162  105  112  65   221  227  136  19.776
GHR         Normal         191  196  186  175  204  167  209  181  198  190  22.616
GHR         Pareto         13   11   28   27   4    7    6    4    7    15   19.216
SCA         Uniform        72   70   151  147  80   71   87   184  271  226  333.033
SCA         Normal         175  171  215  175  202  201  173  167  181  259  310.516
SCA         Pareto         8    13   28   4    4    7    5    5    6    6    307.061
DPSO        Uniform        65   70   137  132  80   66   65   177  196  136  550.908
DPSO        Normal         175  196  184  174  202  167  173  167  181  190  512.951
DPSO        Pareto         8    10   26   4    4    9    4    5    6    6    457.120
QIPSO       Uniform        65   70   136  125  80   66   65   177  196  136  571.012
QIPSO       Normal         175  185  184  172  202  167  173  167  181  190  594.073
QIPSO       Pareto         8    9    26   4    4    7    4    5    6    6    472.863
BBA         Uniform        65   70   136  125  80   66   65   177  196  136  415.615
BBA         Normal         175  171  184  164  202  167  173  167  181  190  430.232
BBA         Pareto         8    9    26   4    4    7    4    4    6    6    400.762

Table 15
Experimental results of different algorithms on Letter with three distributions.

Algorithms  Distributions  1    2    3    4    5    6    7    8    9    10   Time
0-1 ILP     Uniform        582  313  432  510  351  333  575  618  357  445  6277.717
0-1 ILP     Normal         549  556  491  556  521  506  562  546  509  560  5761.459
0-1 ILP     Pareto         19   19   18   19   30   16   22   47   56   23   6018.953
GHR         Uniform        602  313  432  520  351  359  591  644  382  445  89.886
GHR         Normal         599  556  491  563  521  506  568  559  542  566  91.280
GHR         Pareto         20   19   18   19   30   18   22   47   59   23   116.817
SCA         Uniform        582  357  510  510  351  367  626  639  379  473  1952.939
SCA         Normal         598  556  528  562  521  506  603  546  542  566  1718.548
SCA         Pareto         22   20   18   19   31   16   22   50   58   23   1853.832
DPSO        Uniform        582  313  432  510  351  333  575  618  357  445  1325.495
DPSO        Normal         549  559  491  556  521  506  562  546  509  560  1424.011
DPSO        Pareto         19   19   18   19   30   16   24   47   58   23   1263.587
QIPSO       Uniform        582  313  432  510  351  333  575  618  357  445  1418.019
QIPSO       Normal         549  556  491  556  521  506  562  546  509  560  1510.909
QIPSO       Pareto         19   19   18   19   30   16   22   47   58   23   1301.864
BBA         Uniform        582  313  432  510  351  333  575  618  357  445  1060.058
BBA         Normal         549  556  491  556  521  506  562  546  509  560  1112.005
BBA         Pareto         19   19   18   19   30   16   22   47   56   23   1033.678

Table 16
Experimental results of different algorithms on CNAE-9 with three distributions. Each cell gives the test cost of the ith instance under the uniform (U), normal (N), and Pareto (P) distributions; the last row gives the running time. A dash (–) indicates that the optimal solution could not be obtained.

Instance | 0-1 ILP (U N P) | GHR (U N P) | SCA (U N P) | DPSO (U N P) | QIPSO (U N P) | BBA (U N P)
1 | – – – | 19054 14840 1355 | 5259 7012 499 | 12815 14659 785 | 7129 11982 601 | 3420 3696 417
2 | – – – | 16604 16621 912 | 4592 6094 367 | 13262 15111 805 | 6171 9817 634 | 3402 3774 316
3 | – – – | 14459 18251 1334 | 4635 6277 342 | 13005 15519 895 | 5282 8176 693 | 3145 3703 313
4 | – – – | 15831 19707 1397 | 5935 6919 356 | 13632 14384 774 | 7062 11918 684 | 3741 3710 336
5 | – – – | 14384 18631 1073 | 5254 6990 335 | 13372 14092 721 | 8920 9109 578 | 3863 3678 293
6 | – – – | 13578 16698 529 | 4829 6671 268 | 14285 13618 688 | 8745 8093 552 | 3627 3593 227
7 | – – – | 16020 18837 1293 | 4551 5993 329 | 13071 14619 724 | 6176 8192 645 | 3312 3686 267
8 | – – – | 15770 16900 721 | 5314 5892 219 | 13630 14172 731 | 6931 7873 701 | 3688 3717 176
9 | – – – | 18960 13082 1577 | 5113 6256 419 | 13060 14558 809 | 7856 9829 619 | 3330 3610 374
10 | – – – | 18249 15655 1059 | 5025 6101 318 | 12468 13808 744 | 6278 11435 570 | 3490 3628 326
Time | – – – | 6224.255 5144.791 6720.893 | 13671.982 13076.290 12416.092 | 1166.571 1288.866 1438.433 | 1392.970 1491.087 1591.209 | 303.371 307.878 344.535

Please cite this article as: X. Xie, X. Qin, Q. Zhou et al., A novel test-cost-sensitive attribute reduction approach using the binary bat algorithm, Knowledge-Based Systems (2019) 104938, https://doi.org/10.1016/j.knosys.2019.104938.


Table 17
Experimental results of different algorithms on Musk(Ver.2) with three distributions. Each cell gives the test cost of the ith instance under the uniform (U), normal (N), and Pareto (P) distributions; the last row gives the running time. A dash (–) indicates that the optimal solution could not be obtained.

Instance | 0-1 ILP (U N P) | GHR (U N P) | SCA (U N P) | DPSO (U N P) | QIPSO (U N P) | BBA (U N P)
1 | – – – | 28 156 5 | 19 170 4 | 95 175 8 | 28 164 5 | 24 156 4
2 | – – – | 50 165 5 | 84 191 4 | 70 171 9 | 46 165 5 | 43 141 4
3 | – – – | 49 171 4 | 21 182 4 | 136 184 26 | 47 160 5 | 46 151 4
4 | – – – | 47 190 4 | 25 227 4 | 125 164 4 | 30 164 4 | 22 153 4
5 | – – – | 20 139 5 | 12 148 4 | 80 202 4 | 20 129 4 | 12 121 4
6 | – – – | 37 154 4 | 12 173 4 | 66 167 7 | 44 158 4 | 12 163 4
7 | – – – | 59 174 9 | 12 221 4 | 65 167 4 | 37 164 4 | 24 162 4
8 | – – – | 28 161 4 | 8 213 4 | 177 173 4 | 28 161 4 | 11 155 4
9 | – – – | 101 191 4 | 22 203 4 | 196 181 6 | 69 179 5 | 47 179 4
10 | – – – | 9 190 12 | 10 204 4 | 136 190 6 | 37 187 5 | 10 187 5
Time | – – – | 351.809 369.447 380.428 | 819.048 811.993 351.767 | 489.785 501.163 496.394 | 531.604 577.343 539.852 | 449.719 438.095 460.285

Fig. 3. Comparison of MEF on the first eight data sets.

Maximal exceeding factor (MEF): Let R′ be the optimal reduct. The exceeding factor of a reduct R is defined as

ef(R) = \frac{\sum_{a \in R} c(a) - \sum_{a \in R'} c(a)}{\sum_{a \in R'} c(a)}.   (34)

The exceeding factor provides a quantitative metric for evaluating the performance of a reduct: it indicates how far a reduct is from optimal, and it equals 0 whenever R is an optimal reduct. Let the number of experiments be K. MEF is defined as

MEF = \max_{1 \le i \le K} ef(R_i).   (35)

Average exceeding factor (AEF): Similarly, if the number of experiments is K, AEF is defined as

AEF = \frac{\sum_{i=1}^{K} ef(R_i)}{K}.   (36)
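Eqs. (34)–(36) translate directly into a few lines of code. The Python sketch below is purely illustrative: the cost values are invented, and it assumes the optimal reduct cost is known (as it is in our experiments via the 0-1 integer programming solver).

```python
# Illustrative sketch of Eqs. (34)-(36); the costs below are made up.

def exceeding_factor(reduct_cost: float, optimal_cost: float) -> float:
    """ef(R) = (c(R) - c(R')) / c(R') for an optimal reduct R' (Eq. 34)."""
    return (reduct_cost - optimal_cost) / optimal_cost

def mef(reduct_costs, optimal_cost):
    """Maximal exceeding factor over K experiments (Eq. 35)."""
    return max(exceeding_factor(c, optimal_cost) for c in reduct_costs)

def aef(reduct_costs, optimal_cost):
    """Average exceeding factor over K experiments (Eq. 36)."""
    efs = [exceeding_factor(c, optimal_cost) for c in reduct_costs]
    return sum(efs) / len(efs)

# Toy run: optimal cost 100, three experiments.
costs = [100, 110, 130]
print(mef(costs, 100))  # 0.3
print(aef(costs, 100))  # ≈ 0.1333
```

An algorithm that always finds the optimal reduct would score 0 on both metrics.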

Fig. 2 shows the results of FOF with all three distributions on the first eight data sets. FOF describes the probability of finding the optimal reduct; the larger its value, the better the performance of the corresponding method. According to Fig. 2, BBA obtains higher FOF values than GHR, SCA, DPSO, and QIPSO, while GHR obtains the lowest FOF values on the eight data sets except for Voting, P-gene, and Letter. SCA gets the lowest FOF values on Voting and Letter, while it gets the best FOF values on Tic-tac-toe and K-R vs. K-P. DPSO has the lowest FOF values on P-gene, but it performs better than GHR and SCA on most data sets in terms of FOF. QIPSO obtains higher FOF values than GHR, SCA, and DPSO on the eight data sets, and it achieves the same FOF values as BBA on the data sets Zoo, Voting, Tic-tac-toe, and K-R vs. K-P.

Fig. 3 shows the results of MEF for the three distributions discussed at the beginning of this section, i.e. uniform, normal, and Pareto, on the first eight data sets. This metric reflects the exceeding cost factor relative to the optimal solution; the smaller the value, the better the performance of the corresponding method. From Fig. 3, we can see that BBA performs best in terms of MEF. GHR obtains the highest MEF values on all eight data sets except for Voting for all three distributions, Letter for all three distributions, and K-R vs. K-P for the uniform distribution and normal


Fig. 4. Comparison of AEF on the first eight data sets.

distribution. SCA achieves the highest MEF values on Voting for all three distributions and on Letter for all three distributions, and it performs better than GHR on the other six data sets. DPSO obtains the highest MEF values on K-R vs. K-P for the uniform and normal distributions, and it performs better than GHR and SCA on most data sets. QIPSO obtains the lowest MEF values on Zoo, Voting, Tic-tac-toe, K-R vs. K-P, P-gene for the normal distribution, Mushroom for the uniform distribution, and Letter for the uniform and normal distributions, but it performs worse than BBA on P-gene for all three distributions, Mushroom for the normal and Pareto distributions, and Letter for the Pareto distribution.

Fig. 4 shows the results of AEF for all three distributions on the first eight data sets. AEF reflects the average exceeding test cost relative to the optimal solution. As with MEF, the smaller the value, the better the performance of the corresponding method. According to the data in Fig. 4, BBA tends to obtain lower AEF values than the other algorithms. GHR obtains the highest AEF values on all eight data sets except for Soybean(small) for the normal distribution, Voting for all three distributions, P-gene for the normal distribution, Mushroom for the normal distribution, and Letter for all three distributions. SCA performs better than GHR on Soybean(small) for the uniform and Pareto distributions, Zoo for all three distributions, Tic-tac-toe for all three distributions, P-gene for the uniform and Pareto distributions, Mushroom for the uniform and Pareto distributions, and K-R vs. K-P for all three distributions. DPSO obtains the highest AEF values on Voting for the uniform distribution, and it performs better than GHR and SCA on most data sets. QIPSO achieves the lowest AEF values on Zoo, Voting, Tic-tac-toe, K-R vs. K-P for the normal and Pareto distributions, Mushroom for the uniform distribution, and Letter for the uniform and normal distributions, and it performs better than GHR, SCA, and DPSO in most cases.

Summing up, the results of FOF, MEF, and AEF on the first eight data sets indicate that our method based on the binary bat algorithm performs better than the other algorithms. GHR and SCA are the worst; GHR is better than SCA on some data sets, and SCA is better than GHR on others. DPSO is worse than QIPSO and our BBA, but it performs better than GHR and SCA on most data sets. QIPSO is better than GHR, SCA, and DPSO in most cases, and the performance of QIPSO is the same as that of BBA on some data sets.

5.2. Evaluation metrics without optimal solutions

While obtaining the optimal solutions is our goal, it is not always possible. We were unable to obtain the optimal solutions for the data sets CNAE-9 and Musk(Ver.2). Therefore, we also need evaluation metrics that do not refer to optimal solutions. In [47], four such metrics were proposed: winner factor (WF), loser factor (LF), maximal reducing factor (MRF), and average reducing factor (ARF). Later in this section we define four additional metrics that do not need optimal solutions, namely: finding best factor (FBF), finding worst factor (FWF), maximal exceeding factor with the best results (MEFb), and distance factor with the best results (DF). Let us first start with the metrics from [47].

Definition 4. Let a1 and a2 be two algorithms generating the reducts R and Q, respectively; let K be the total number of experiments, k be the number of experiments satisfying c(R) < c(Q), and l be the number of experiments satisfying c(R) > c(Q). Then we define:

Winner factor (WF) of a1 to a2 as:

WF = \frac{k}{K}   (37)

Loser factor (LF) of a1 to a2 as:

LF = \frac{l}{K}   (38)

Reducing factor of a1 to a2 as:

ef(R, Q) = \frac{c(Q) - c(R)}{c(Q)}   (39)

Maximal reducing factor (MRF) as:

MRF = \max_{1 \le i \le K} ef(R_i, Q_i)   (40)

Average reducing factor (ARF) as:

ARF = \frac{\sum_{i=1}^{K} ef(R_i, Q_i)}{K}   (41)
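The quantities of Definition 4 are equally easy to compute. In the sketch below, the paired cost lists `r` and `q` for algorithms a1 and a2 are hypothetical and not taken from the tables above.

```python
# Hedged sketch of Eqs. (37)-(41); the cost lists are made up.

def wf_lf(r, q):
    """Winner and loser factors of a1 relative to a2 (Eqs. 37-38)."""
    K = len(r)
    k = sum(1 for ri, qi in zip(r, q) if ri < qi)  # a1 strictly cheaper
    l = sum(1 for ri, qi in zip(r, q) if ri > qi)  # a1 strictly dearer
    return k / K, l / K

def reducing_factor(ri, qi):
    """ef(R, Q) = (c(Q) - c(R)) / c(Q) (Eq. 39)."""
    return (qi - ri) / qi

def mrf(r, q):
    """Maximal reducing factor over K experiments (Eq. 40)."""
    return max(reducing_factor(ri, qi) for ri, qi in zip(r, q))

def arf(r, q):
    """Average reducing factor over K experiments (Eq. 41)."""
    return sum(reducing_factor(ri, qi) for ri, qi in zip(r, q)) / len(r)

r = [80, 100, 90, 100]   # costs found by a1
q = [100, 100, 100, 90]  # costs found by a2
print(wf_lf(r, q))  # (0.5, 0.25): a1 wins twice, loses once, ties once
print(mrf(r, q))    # 0.2
```

Note that ties count toward neither WF nor LF, which is why WF + LF ≤ 1.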

First note that we always have WF + LF ≤ 1, and clearly, if WF ≥ LF then a1 has a better performance than a2 . The reducing factor ef (R, Q ) is a relative measure which shows relative changes


Fig. 5. Comparing WF, LF, MRF, and ARF of BBA with GHR on all data sets.

of the cost function. Since we intend to select a reduct with as low a cost as possible, a reducing factor value bigger than zero means that algorithm a1 performs better than a2; the bigger the value of ef(R, Q), the bigger the superiority of a1 over a2. MRF represents the maximal value of ef(R, Q) over K experiments, and ARF represents the average value of ef(R, Q) over K experiments.

The detailed results of WF, LF, MRF, and ARF comparing BBA with the other algorithms are shown in Figs. 5–8. As shown in Fig. 5, the WF values of BBA are always bigger than the corresponding LF values. The MRF values are always positive, and BBA tends to obtain lower costs when compared with GHR. The ARF values are positive in most cases, and BBA again tends to obtain lower costs than GHR.

From Fig. 6 it follows that the WF values of BBA are bigger than the corresponding LF values on all instances except for Soybean(small) with the uniform distribution and Musk(Ver.2) with the uniform and Pareto distributions. The MRF values are always positive, and BBA tends to obtain lower costs than the set-cover-based algorithm in terms of maximal reducing costs. The ARF values are positive in most cases, and BBA again achieves lower costs than the set-cover-based algorithm.

Fig. 7 shows the WF, LF, MRF, and ARF of BBA compared with DPSO on all data sets. From Fig. 7, the WF values of BBA relative to the particle swarm optimization algorithm tend to be bigger than the corresponding LF values. The MRF values are always positive, and BBA tends to obtain lower costs than the particle swarm optimization algorithm in terms of maximal reducing costs. The ARF values are positive in most cases, and BBA tends to obtain lower costs than the particle swarm optimization algorithm.

As Fig. 8 shows, the WF values of BBA relative to the immune quantum-behaved particle swarm optimization algorithm tend to be bigger than the corresponding LF values. The MRF values are always positive, and BBA tends to obtain lower costs than this algorithm in terms of maximal reducing costs. The ARF values are positive in most cases, and BBA tends to obtain lower costs in terms of average reducing costs.

We now introduce our new metrics.

Definition 5. Let K be the number of experiments, and let k be the number of experiments that reach the best reduct. Then the Finding Best Factor (FBF) is defined as:

FBF = \frac{k}{K}   (42)

Let w be the number of experiments that reach the worst reduct. We then define the Finding Worst Factor (FWF) as:

FWF = \frac{w}{K}   (43)

Let the best results of all instances be given by a vector bestc = [bestc_1, ..., bestc_K], and let the results of the algorithm be represented by a vector c = [c_1, ..., c_K]. The Maximal Exceeding Factor with the best results (MEFb) is then given by:

MEFb = \max_{1 \le i \le K} \frac{c_i - bestc_i}{bestc_i}   (44)

The Distance Factor (DF) with the best results is the Euclidean distance between the best results of all algorithms and the results of the algorithm, i.e.

DF = \sqrt{\sum_{i=1}^{K} \left( \frac{c_i}{bestc_i} - 1 \right)^2}   (45)
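The four metrics of Definition 5 can be sketched compactly. Here `best` and `worst` stand for the cheapest and most expensive reducts found by any compared algorithm on each instance; all values below are invented for illustration.

```python
import math

# Illustrative sketch of Eqs. (42)-(45); the vectors are made up.

def fbf_fwf(costs, best, worst):
    """FBF and FWF (Eqs. 42-43): fractions of experiments in which the
    algorithm hits the best / worst cost among all compared algorithms."""
    K = len(costs)
    k = sum(1 for c, b in zip(costs, best) if c == b)
    w = sum(1 for c, wc in zip(costs, worst) if c == wc)
    return k / K, w / K

def mefb(costs, best):
    """Maximal exceeding factor w.r.t. the best results (Eq. 44)."""
    return max((c - b) / b for c, b in zip(costs, best))

def df(costs, best):
    """Euclidean distance to the best results (Eq. 45)."""
    return math.sqrt(sum((c / b - 1.0) ** 2 for c, b in zip(costs, best)))

best  = [100, 200, 50]
worst = [150, 260, 90]
costs = [100, 220, 90]   # one best hit, one worst hit
print(fbf_fwf(costs, best, worst))  # both 1/3
print(mefb(costs, best))            # 0.8 (= (90 - 50) / 50)
```

An algorithm that matches the best result on every instance scores FBF = 1, MEFb = 0, and DF = 0.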

Directly from Definition 5, we have FBF + FWF ≤ 1. When the number of tested algorithms equals n, then \sum_{i=1}^{n} FBF_i = 1 and \sum_{i=1}^{n} FWF_i = 1, where FBF_i and FWF_i are the Finding Best Factor and Finding Worst Factor of the ith algorithm. If the value of FBF_i is the biggest one, then the ith algorithm is the best among the tested algorithms, and if the value of FWF_i is the biggest one, then the ith algorithm is the worst among the tested ones. MEFb_i and DF_i are the Maximal Exceeding Factor and Distance Factor of the ith algorithm. If the value of MEFb_i is the smallest one, then the ith algorithm is the best one, and if the value of DF_i is the smallest one, then the ith algorithm is also the best one among the tested algorithms.

The detailed results of FBF, FWF, MEFb, and DF are shown in Figs. 9–12. Fig. 9 shows the results of FBF with all three distributions on all data sets. This metric describes the probability of finding the best reduct among all algorithms; the larger the value, the better the performance of the corresponding method. According to Fig. 9, we can conclude that BBA tends to obtain higher FBF values than GHR, SCA, DPSO, and QIPSO. GHR obtains the lowest FBF values on all instances except for Zoo with the Pareto distribution, Voting with the uniform distribution, P-gene with all three distributions, K-R vs. K-P with the normal distribution, Mushroom with the uniform distribution, Letter with all three distributions, and Musk(Ver.2) also with all three distributions. SCA achieves the lowest FBF values on Zoo with the Pareto distribution, Voting with the uniform distribution, Mushroom with the uniform distribution, Letter with all three distributions, and CNAE-9 with the uniform and normal distributions. It performs best on Tic-tac-toe with all three distributions, K-R vs. K-P with all three distributions, and Musk(Ver.2) with the uniform and Pareto distributions. DPSO obtains the highest FBF values


Fig. 6. Comparing WF, LF, MRF, and ARF of BBA with SCA on all data sets.

Fig. 7. Comparing WF, LF, MRF, and ARF of BBA with DPSO on all data sets.

Fig. 8. Comparing WF, LF, MRF, and ARF of BBA with QIPSO on all data sets.

Fig. 9. Comparison of FBF on all data sets.


Fig. 10. Comparison of FWF on all data sets.

Fig. 11. Comparison of MEFb on all data sets.

Fig. 12. Comparison of DF on all data sets.

on Zoo with the uniform distribution, Voting with the normal distribution, Tic-tac-toe with all three distributions, and Letter with the uniform distribution. It performs poorly on P-gene, CNAE-9, and Musk(Ver.2) with all three distributions in terms of FBF. The FBF values of QIPSO are higher than those of GHR, SCA, and DPSO on all instances except for P-gene with all three distributions, CNAE-9 with the Pareto distribution, and Musk(Ver.2) with the uniform and Pareto distributions, but it does not perform better than BBA on any instance in terms of FBF.

Fig. 10 shows the results of FWF with all three distributions on all data sets. This metric describes the probability of finding the worst reduct among all algorithms. Contrary to FBF, the smaller the value, the better the performance of the corresponding method. From Fig. 10, we can conclude that BBA obtains smaller FWF values than the other four algorithms. GHR obtains the highest FWF values on all instances except for Voting with all three distributions, P-gene with the normal distribution, K-R vs. K-P with the normal distribution, Letter with the uniform and Pareto distributions, and Musk(Ver.2) with all three distributions. SCA is worse than GHR on Voting with all three distributions, Letter with the uniform and Pareto distributions, and Musk(Ver.2) with the uniform and normal distributions. It is better than BBA on P-gene with the uniform and Pareto distributions. The FWF values of DPSO are the highest on P-gene with the normal and Pareto distributions and on Musk(Ver.2) with the uniform and Pareto distributions. It is better than GHR in most cases, but it is worse than SCA on about half of the instances from the viewpoint of FWF. QIPSO obtains the lowest FWF values in all cases except for P-gene with the normal and Pareto distributions, K-R vs. K-P with the uniform distribution, and Mushroom with the Pareto distribution. It is better than GHR, SCA, and DPSO in most cases, but there is only one instance where QIPSO is better than BBA.


Fig. 11 shows the results of MEFb for all three distributions on all data sets. MEFb reflects the exceeding cost factor relative to the best solution among all algorithms. As with MEF (the corresponding metric with optimal solutions), the smaller the value, the better the performance of the corresponding method. According to Fig. 11, we can conclude that BBA gets smaller MEFb values than the other four algorithms. GHR obtains the highest MEFb values on all instances except for Musk(Ver.2) with all three distributions. SCA achieves the highest MEFb values in most cases with the normal distribution, as well as on Voting with the Pareto distribution and Mushroom with the Pareto distribution. DPSO has the highest MEFb values on Musk(Ver.2) with all three distributions. The MEFb values of QIPSO are smaller than those of GHR, SCA, and DPSO in most cases, but it does not perform better than BBA on any instance in terms of MEFb.

Fig. 12 shows the results of DF for all three distributions on all data sets. DF describes the distance between the results of an algorithm and the best results of all algorithms. Apparently, the smaller its value, the better the performance of the corresponding method. The curves in Fig. 12 are roughly the same as the curves in Fig. 11, so the analysis is virtually the same.

5.3. Classification accuracy

We have also conducted experiments to compare the classification accuracy of the reducts generated by BBA with those generated by GHR, SCA, DPSO, and QIPSO. The classification accuracy was evaluated for selected attribute reducts found by the five algorithms with the classifiers 3NN and SVM. All classification accuracies are obtained with 10-fold cross-validation: the given data set is randomly divided into ten nearly equally sized subgroups; nine subgroups are used as the training set, and the remaining subgroup is retained as the testing set to assess the classification accuracy.
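The 10-fold protocol just described can be sketched in a few lines. The 1-nearest-neighbour stand-in and the toy data below are our own illustration, deliberately simpler than the 3NN and SVM classifiers used in the experiments.

```python
import random

# Minimal sketch of 10-fold cross-validation: shuffle, split into ten
# near-equal folds, train on nine and test on the remaining one, then
# average the ten accuracies.

def k_fold_indices(n, k=10, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # deterministic shuffle
    return [idx[i::k] for i in range(k)]   # k near-equal folds

def cross_validate(X, y, classify, k=10):
    folds = k_fold_indices(len(X), k)
    accs = []
    for f in folds:
        test = set(f)
        train = [i for i in range(len(X)) if i not in test]
        labelled = [(X[j], y[j]) for j in train]
        correct = sum(classify(X[i], labelled) == y[i] for i in f)
        accs.append(correct / len(f))
    return sum(accs) / k

def nn1(x, labelled):
    """1-nearest-neighbour stand-in classifier (squared Euclidean)."""
    return min(labelled,
               key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

# Toy, well-separated data: 1NN classifies it perfectly.
X = [(i, i) for i in range(20)] + [(i + 100, i) for i in range(20)]
y = [0] * 20 + [1] * 20
print(cross_validate(X, y, nn1))  # 1.0
```

In practice one would use a library implementation (e.g. stratified folds), but the rotation of the held-out subgroup is exactly as above.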
Experimental results of classification accuracy are listed in Tables 18 and 19, where the average classification accuracy is expressed as a percentage and the numbers in parentheses rank the five methods for each data set, with the best-performing feature subset ranked 1, the second best ranked 2, and so on. Table 18 indicates that BBA obtains a comparable classification accuracy for the classifier 3NN. The results of BBA matched the best classification accuracies in 15 out of 30 cases; the results of GHR and DPSO have inferior classification accuracies compared with BBA, each matching the best classification accuracies in 9 out of 30 cases. The results of SCA matched the best classification accuracies in 7 out of 30 cases, and the results of QIPSO are the worst of the four compared algorithms, matching the best classification accuracies in only 4 out of 30 cases. From Table 19, we can conclude that BBA achieves a comparable classification accuracy for the classifier SVM. The results of BBA and GHR matched the best classification accuracies in 15 out of 30 cases, while the results of SCA, DPSO, and QIPSO only matched the best classification accuracies in 2 out of 30 cases.

To show the statistical significance of our results, a nonparametric Friedman test has been performed. The Friedman test is a statistical test that uses the rank of each algorithm on each data set. The Friedman statistic is expressed as follows (cf. [80]):

\chi_F^2 = \frac{12N}{k(k+1)} \left( \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right),   (46)

F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2},   (47)

where k is the number of algorithms, N is the number of data sets, and R_j is the average rank of algorithm j over all data sets. F_F


is distributed according to the Fisher (F) distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. The Friedman test is used to verify whether the five feature selection algorithms are equivalent with respect to classification accuracy on each distribution; we expected the algorithms not to be equivalent. We use a confidence level α = 0.1, the number of algorithms k = 5, and the number of data sets N = 10. For the classifier 3NN, the p-values calculated for our three distributions are 0.0744, 0.0586, and 0.0245, respectively. For the classifier SVM, the p-values for these three distributions are 0.00283, 0.0137, and 0.0107. Therefore, the null hypotheses have to be rejected at α = 0.1, and these five algorithms should be regarded as different with respect to classification accuracy on each distribution.

Next, we performed a Nemenyi post hoc test (cf. [81]) to verify again whether the performance of the algorithms can be considered statistically different. The results are shown in Figs. 13 and 14, where algorithms connected by a horizontal line are those whose difference is not statistically significant at the level of 0.1; the overall result is the same as for the Friedman test, but let us discuss it in more detail. As shown in Fig. 13, there is no consistent evidence of statistical differences between any two of the five algorithms for the classifier 3NN with the uniform distribution, and the same conclusion also holds for the normal distribution. For the Pareto distribution, the classification accuracy of BBA is statistically better than that of QIPSO for 3NN, and there is no consistent evidence of statistical differences between any two of BBA, GHR, SCA, and DPSO at the level of 0.1. From Fig.
14, there is no consistent evidence of statistical differences between any two of BBA, GHR, and SCA for the classifier SVM with the uniform distribution, while the classification accuracy of BBA is statistically better than that of QIPSO; the same is true for BBA versus DPSO. For the normal distribution, there is no consistent evidence of statistical differences between any two of BBA, GHR, SCA, and DPSO for SVM, and the classification accuracy of BBA is statistically better than that of QIPSO. The same holds for the Pareto distribution with α = 0.1.

5.4. Results for high-dimensional data sets

The numbers of dimensions of the ten data sets in Table 6 are not very large. In some domains, however, such as text categorization and genomic microarray analysis, real-world data often contain thousands of features. To further demonstrate the effectiveness of the new approach, we also evaluate the performance of our proposed approach on five benchmark high-dimensional data sets from [82]. These data sets are shown in Table 20, and the test cost of each attribute is again randomly drawn from the interval [1, 100]. Each data set generates ten different sets of test costs under each distribution; therefore, each data set has 30 instances with different test-cost settings, and a total of 150 instances need to be tested. We ran ten tests for each instance and recorded the test costs of these ten results and the average running time for each data set. The results of GHR and SCA on the data sets Pcmac, Relathe, and Basehock are not reported here, since both algorithms are very time-consuming on them. DPSO, QIPSO, and BBA are population-based meta-heuristic algorithms; for the high-dimensional data sets, we set the population size M = 200 and the maximum number of iterations T = 1000 in these three algorithms. The other parameter settings are the same as those in Table 7.
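For reference, the Friedman statistic of Eqs. (46)–(47) reduces to a few lines of arithmetic over the average ranks R_j. The ranks below are hypothetical (they sum to k(k+1)/2 = 15 as average ranks must) and do not reproduce the reported p-values.

```python
# Sketch of Eqs. (46)-(47): Friedman chi-square and the Iman-Davenport
# F statistic, computed from the average ranks of k algorithms over N
# data sets. The ranks are illustrative only.

def friedman(avg_ranks, N):
    k = len(avg_ranks)
    chi2 = (12 * N / (k * (k + 1))) * (
        sum(R * R for R in avg_ranks) - k * (k + 1) ** 2 / 4)   # Eq. (46)
    FF = (N - 1) * chi2 / (N * (k - 1) - chi2)                  # Eq. (47)
    return chi2, FF

# k = 5 algorithms, N = 10 data sets, hypothetical average ranks.
chi2, FF = friedman([2.0, 3.1, 3.4, 3.6, 2.9], 10)
print(round(chi2, 3), round(FF, 3))  # 6.16 1.638
```

F_F is then compared against the F distribution with k − 1 = 4 and (k − 1)(N − 1) = 36 degrees of freedom (e.g. via `scipy.stats.f.sf`) to obtain the p-value.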
The results of FBF, FWF, MEFb, and DF are shown in Fig. 15. From this figure it follows that GHR performs best on Colon with the normal distribution, and SCA performs best on Colon with the uniform and Pareto distributions. GHR and SCA


Table 18
Classification accuracies (%) of 3NN on three distributions; ranks in parentheses.

Data set | Distribution | GHR | SCA | DPSO | QIPSO | BBA
Soybean(small) | Uniform | 99.20 (5) | 94.60 (1) | 99.35 (2.5) | 95.45 (4) | 99.35 (2.5)
Soybean(small) | Normal | 100.00 (2) | 97.45 (5) | 99.80 (4) | 100.00 (2) | 100.00 (2)
Soybean(small) | Pareto | 98.20 (2) | 92.75 (5) | 99.50 (1) | 94.40 (4) | 97.05 (3)
Zoo | Uniform | 89.14 (5) | 90.02 (2) | 90.11 (1) | 89.48 (4) | 89.55 (3)
Zoo | Normal | 90.03 (1) | 89.24 (4) | 89.41 (2) | 88.95 (5) | 89.37 (3)
Zoo | Pareto | 90.18 (2) | 89.64 (4) | 89.42 (5) | 89.97 (3) | 90.66 (1)
Voting | Uniform | 92.96 (5) | 93.15 (3) | 93.24 (2) | 93.08 (4) | 93.34 (1)
Voting | Normal | 93.20 (1) | 93.17 (2) | 93.03 (3) | 92.96 (5) | 92.99 (4)
Voting | Pareto | 92.98 (4) | 90.21 (5) | 93.15 (2) | 93.08 (3) | 93.21 (1)
Tic-tac-toe | Uniform | 76.76 (1.5) | 76.17 (5) | 75.76 (1.5) | 76.19 (4) | 76.25 (3)
Tic-tac-toe | Normal | 76.55 (1) | 76.35 (4) | 76.39 (2.5) | 76.27 (5) | 76.39 (2.5)
Tic-tac-toe | Pareto | 76.36 (1) | 76.34 (2) | 76.20 (4) | 76.00 (5) | 76.28 (3)
P-gene | Uniform | 61.19 (2) | 56.18 (5) | 56.24 (4) | 60.51 (3) | 62.92 (1)
P-gene | Normal | 67.71 (1) | 59.47 (5) | 62.44 (4) | 62.87 (3) | 65.07 (2)
P-gene | Pareto | 71.05 (2) | 67.25 (3) | 59.42 (5) | 63.73 (4) | 71.83 (1)
K-R vs. K-P | Uniform | 92.16 (3) | 92.22 (1) | 92.04 (4) | 91.98 (5) | 92.17 (2)
K-R vs. K-P | Normal | 92.13 (4) | 92.26 (3) | 92.30 (1) | 92.12 (5) | 92.28 (2)
K-R vs. K-P | Pareto | 92.13 (2) | 92.07 (3.5) | 92.04 (5) | 92.07 (3.5) | 92.15 (1)
Mushroom | Uniform | 99.99 (5) | 100.00 (2.5) | 100.00 (2.5) | 100.00 (2.5) | 100.00 (2.5)
Mushroom | Normal | 100.00 (3) | 100.00 (3) | 100.00 (3) | 100.00 (3) | 100.00 (3)
Mushroom | Pareto | 99.90 (5) | 100.00 (2.5) | 100.00 (2.5) | 100.00 (2.5) | 100.00 (2.5)
Letter | Uniform | 93.42 (1.5) | 93.42 (1.5) | 93.09 (4.5) | 93.09 (4.5) | 93.11 (3)
Letter | Normal | 93.15 (3) | 93.55 (2) | 92.84 (5) | 92.98 (4) | 93.71 (1)
Letter | Pareto | 93.21 (4) | 93.42 (1) | 93.23 (3) | 92.93 (5) | 93.25 (2)
CNAE-9 | Uniform | 85.86 (2) | 85.44 (3) | 84.54 (5) | 85.38 (4) | 85.87 (1)
CNAE-9 | Normal | 83.77 (5) | 84.52 (3) | 85.20 (1) | 84.51 (4) | 84.64 (2)
CNAE-9 | Pareto | 84.63 (4) | 84.71 (3) | 84.78 (2) | 84.47 (5) | 84.99 (1)
Musk(Ver.2) | Uniform | 92.29 (3) | 91.89 (4) | 94.92 (1) | 91.41 (5) | 92.36 (2)
Musk(Ver.2) | Normal | 92.01 (4.5) | 92.61 (2) | 92.07 (3) | 92.01 (4.5) | 94.52 (1)
Musk(Ver.2) | Pareto | 91.64 (4) | 91.00 (5) | 94.58 (2) | 94.55 (3) | 95.14 (1)

Table 19
Classification accuracies (%) of SVM on three distributions; ranks in parentheses.

Data set | Distribution | GHR | SCA | DPSO | QIPSO | BBA
Soybean(small) | Uniform | 98.70 (3) | 96.25 (5) | 98.90 (4) | 99.15 (2) | 99.20 (1)
Soybean(small) | Normal | 100.00 (2) | 98.90 (5) | 99.80 (4) | 100.00 (2) | 100.00 (2)
Soybean(small) | Pareto | 99.60 (1.5) | 97.15 (5) | 99.60 (1.5) | 98.40 (4) | 98.55 (3)
Zoo | Uniform | 91.52 (5) | 91.92 (2) | 91.85 (3) | 91.70 (4) | 92.22 (1)
Zoo | Normal | 90.26 (5) | 90.35 (4) | 90.57 (3) | 90.81 (2) | 90.84 (1)
Zoo | Pareto | 91.56 (4) | 92.38 (1) | 92.28 (2) | 91.39 (5) | 91.72 (3)
Voting | Uniform | 93.81 (1) | 93.67 (3) | 93.52 (5) | 93.66 (4) | 93.72 (2)
Voting | Normal | 93.63 (4) | 93.64 (3) | 93.66 (2) | 93.59 (5) | 93.67 (1)
Voting | Pareto | 93.80 (1) | 92.25 (5) | 93.77 (2) | 93.68 (4) | 93.72 (3)
Tic-tac-toe | Uniform | 65.35 (1.5) | 65.34 (4) | 65.34 (4) | 65.34 (4) | 65.35 (1.5)
Tic-tac-toe | Normal | 65.34 (4) | 65.35 (1.5) | 65.34 (4) | 65.34 (4) | 65.35 (1.5)
Tic-tac-toe | Pareto | 65.34 (4) | 65.34 (4) | 65.34 (4) | 65.35 (1.5) | 65.35 (1.5)
P-gene | Uniform | 61.05 (1) | 58.04 (5) | 59.38 (4) | 60.09 (3) | 61.01 (2)
P-gene | Normal | 67.58 (1) | 56.52 (5) | 65.39 (2) | 59.66 (4) | 62.04 (3)
P-gene | Pareto | 66.73 (1) | 65.99 (2) | 54.76 (5) | 56.48 (4) | 63.57 (3)
K-R vs. K-P | Uniform | 95.67 (4) | 95.73 (2) | 95.68 (3) | 95.59 (5) | 95.74 (1)
K-R vs. K-P | Normal | 95.69 (2) | 95.65 (3) | 95.64 (4) | 95.62 (5) | 95.73 (1)
K-R vs. K-P | Pareto | 95.71 (2.5) | 95.63 (5) | 95.75 (1) | 95.66 (4) | 95.71 (2.5)
Mushroom | Uniform | 83.57 (3) | 82.59 (4) | 83.65 (2) | 82.56 (5) | 86.77 (1)
Mushroom | Normal | 86.39 (1) | 86.00 (3) | 86.19 (2) | 84.78 (5) | 84.97 (4)
Mushroom | Pareto | 86.72 (2) | 82.26 (4) | 84.34 (3) | 81.19 (5) | 87.77 (1)
Letter | Uniform | 78.38 (1) | 78.06 (2) | 77.45 (3.5) | 77.43 (5) | 77.45 (3.5)
Letter | Normal | 78.95 (1) | 78.38 (2) | 77.56 (5) | 77.57 (4) | 77.87 (3)
Letter | Pareto | 78.61 (1) | 77.66 (4) | 77.77 (3) | 77.54 (5) | 78.17 (2)
CNAE-9 | Uniform | 91.88 (1) | 91.70 (3) | 91.65 (4) | 91.61 (5) | 91.80 (2)
CNAE-9 | Normal | 91.50 (2) | 91.49 (3) | 91.42 (4) | 91.19 (5) | 91.86 (1)
CNAE-9 | Pareto | 92.60 (1) | 91.36 (4) | 91.75 (2) | 91.28 (5) | 91.56 (3)
Musk(Ver.2) | Uniform | 71.15 (3) | 69.78 (4) | 61.85 (5) | 74.52 (2) | 75.51 (1)
Musk(Ver.2) | Normal | 75.25 (2) | 72.60 (3) | 62.73 (5) | 67.82 (4) | 77.97 (1)
Musk(Ver.2) | Pareto | 76.51 (1) | 76.49 (2) | 62.00 (5) | 63.23 (4) | 64.08 (3)


Fig. 13. Accuracy comparison of classifier 3NN with BBA against the others with the Nemenyi test.

Fig. 14. Accuracy comparison of classifier SVM with BBA against the others with the Nemenyi test.

Table 20
Benchmark high-dimensional data sets.

Data sets | Names | Areas | No. of objects | No. of attributes | No. of classes
D1 | Colon | Biological | 62 | 2000 | 2
D2 | Pcmac | Text | 1943 | 3289 | 2
D3 | Relathe | Text | 1427 | 4322 | 2
D4 | Basehock | Text | 1993 | 4862 | 2
D5 | Leukemia | Biological | 72 | 7070 | 2

obtain better FBF, FWF, MEFb, and DF values than the other algorithms in all the cases mentioned above. BBA obtains higher FBF values than DPSO and QIPSO on Colon with the normal and Pareto distributions, and it achieves better FWF, MEFb, and DF values on Colon with all three distributions. DPSO is the worst among these five algorithms on Colon, although it obtains a better MEFb value than GHR with the Pareto distribution. For the data set Pcmac, BBA is the best and DPSO is the worst; QIPSO obtains better MEFb and DF values than GHR with all three distributions. For the data set Relathe, BBA is the best in terms of FBF, FWF, MEFb, and DF, and DPSO is the worst in terms of FBF and FWF with all three distributions. DPSO achieves better MEFb and DF values than QIPSO with the uniform and normal distributions, but its MEFb and DF values are the worst with the Pareto distribution. BBA obtains better FBF, FWF, MEFb, and DF values than the other algorithms on Basehock with all three distributions, and DPSO is the worst in terms of the four evaluation metrics on Basehock. For the data set Leukemia, the FBF value of BBA is the best with the uniform distribution, and GHR achieves the best FBF values with the normal and Pareto distributions. The evaluation metric values of DPSO are the worst, and SCA is better than DPSO and QIPSO in terms of the four evaluation metrics with all three distributions.

Fig. 16 records the average running times of the five algorithms on the benchmark high-dimensional data sets. We can observe that GHR and SCA are more efficient, especially on the data sets Colon and Leukemia. The average running times on Pcmac, Relathe, and Basehock further verify the superior computational efficiency of our proposed algorithm in comparison with DPSO and QIPSO.
In summary, the results for both the existing metrics and our proposed metrics indicate that our method based on the binary bat algorithm outperforms all the other algorithms. It obtains the highest FOF and FBF values and the lowest AEF, MEF, FWF, MEFb, and DF values with all three distributions, and the results of WF, LF, MRF, and ARF also demonstrate that BBA performs better than the other four algorithms. According to the analysis of classification accuracy, GHR, SCA, DPSO, and BBA are almost identical for the 3NN classifier, and GHR, SCA, and BBA are almost identical for the SVM classifier. However, our algorithm can achieve a reduct with comparable classification accuracy and a lower test cost in a much shorter time. The superiority of the results and the high performance of BBA are due to the following reasons. First, we define a more suitable and accurate fitness function, which is one of the main reasons that BBA performs better than the other algorithms. Second, the V-shaped transfer function obliges bats to switch their positions when they move into an unpromising area of the search space. Finally, the loudness and the pulse emission rate balance exploration and exploitation, helping BBA avoid local minima and converge quickly toward the global optimum over the iterations. The results on the high-dimensional data sets also indicate that BBA is better than the other population-based meta-heuristic algorithms, DPSO and QIPSO. Colon and Leukemia are two high-dimensional imbalanced data sets, and the results on them show that the population-based meta-heuristic algorithms are worse than the general heuristic algorithms GHR and SCA.

6. Conclusion

In this paper, we present an efficient minimal test cost attribute reduction algorithm together with a more adequate evaluation model. To obtain the optimal solution, a 0-1 integer programming approach is designed for this specific attribute reduction problem.
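As a hedged sketch of such a 0-1 formulation (the paper's exact program is not reproduced here): each pair of objects with different decisions contributes one covering constraint, and the objective minimizes the total test cost of the selected attributes. For illustration, exhaustive enumeration over 0-1 assignments stands in for an ILP solver; the function name, the toy decision table, and the test costs are all hypothetical.

```python
from itertools import combinations, product

def minimal_test_cost_reduct(objects, decisions, costs):
    """Brute-force stand-in for the 0-1 integer program: pick the cheapest
    attribute subset that discerns every pair of objects with different
    decisions (one set-cover-style constraint per such pair)."""
    m = len(costs)
    # One constraint row per pair needing discernment: the set of
    # attributes on which the two objects differ.
    rows = [
        {j for j in range(m) if u[j] != v[j]}
        for (u, du), (v, dv) in combinations(zip(objects, decisions), 2)
        if du != dv
    ]
    best, best_cost = None, float("inf")
    for bits in product([0, 1], repeat=m):  # enumerate all 0-1 assignments
        if all(any(bits[j] for j in row) for row in rows):
            cost = sum(c for j, c in enumerate(costs) if bits[j])
            if cost < best_cost:
                best, best_cost = bits, cost
    return best, best_cost

# Toy decision table: 3 condition attributes with test costs 5, 2, 3.
objs = [(0, 0, 1), (0, 1, 1), (1, 0, 0)]
decs = [0, 1, 1]
print(minimal_test_cost_reduct(objs, decs, costs=[5, 2, 3]))  # ((0, 1, 1), 5)
```

The enumeration is exponential in the number of attributes, which matches the paper's observation that the exact 0-1 approach does not scale to high-dimensional data; a real implementation would hand the same objective and constraints to an ILP solver.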

Please cite this article as: X. Xie, X. Qin, Q. Zhou et al., A novel test-cost-sensitive attribute reduction approach using the binary bat algorithm, Knowledge-Based Systems (2019) 104938, https://doi.org/10.1016/j.knosys.2019.104938.


Fig. 15. Comparison of FBF, FWF, MEFb, and DF on high-dimensional data sets with three distributions.

Regarding the test cost factor, we give a fitness function, based on inconsistent object pairs and free of any uncertain parameter, which expresses the relationship between a reduct and its fitness value more accurately. Then, a fast algorithm for counting these pairs is provided to accelerate the calculation of the proposed fitness function. Next, to achieve the minimal test cost reduct, we propose an efficient algorithm

based on our fitness function and the binary bat algorithm. To address the limitations of existing evaluation metrics, we develop four evaluation metrics that assess performance when the optimal solutions are unavailable. Experiments on broadly used benchmark data sets show that our proposed algorithm performs better than other state-of-the-art algorithms.
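The fast pair counting mentioned above can be sketched as follows, assuming (as inferred from the text) that an inconsistent pair is a pair of objects that agree on every selected condition attribute but differ in decision. Within an equivalence class of size n the count is C(n, 2) minus the pairs sharing a decision, so no pair needs to be enumerated explicitly; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistent_pairs(objects, decisions, selected):
    # Group objects into equivalence classes by their values on the
    # selected condition attributes.
    groups = defaultdict(list)
    for obj, d in zip(objects, decisions):
        key = tuple(obj[a] for a in selected)
        groups[key].append(d)
    # Within each class of size n, the inconsistent pairs are all pairs
    # minus the pairs sharing a decision: C(n, 2) - sum_d C(n_d, 2).
    total = 0
    for ds in groups.values():
        n = len(ds)
        total += n * (n - 1) // 2
        for cnt in Counter(ds).values():
            total -= cnt * (cnt - 1) // 2
    return total

# Toy decision table: 4 objects, 2 condition attributes, binary decision.
objs = [(0, 1), (0, 1), (0, 1), (1, 0)]
decs = [0, 1, 1, 0]
print(inconsistent_pairs(objs, decs, selected=[0, 1]))  # 2
```

A reduct drives this count down to that of the full attribute set, so the count can serve as the consistency component of a fitness function without any tunable weight.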


Fig. 16. Comparison of the average running time on high-dimensional data sets.

This holds for both the existing evaluation metrics and the proposed ones. At the same time, the reducts generated by our proposed algorithm achieve comparable classification accuracy. From the results on the high-dimensional data sets, we can also conclude that our new algorithm finds a better solution more quickly than the other population-based meta-heuristic algorithms. The following research topics deserve further investigation. First, exact algorithms for test-cost-sensitive attribute reduction are worth studying in depth, since 0-1 integer linear programming cannot obtain the optimal solution for high-dimensional data. Second, the population-based meta-heuristic algorithms have no distinct advantage over general heuristic algorithms for imbalanced data; thus, test-cost-sensitive attribute reduction for imbalanced data sets is also a good topic. Third, a comprehensive evaluation model for test-cost-sensitive attribute reduction should be established.

Acknowledgments

The authors thank the five anonymous reviewers for their valuable comments and suggestions. The research was supported by the National Natural Science Foundation of China (grant nos. 61373015, 61728204), the China Scholarship Council (grant no. 201806830058), the State Key Laboratory for Smart Grid Protection and Operation Control Foundation, China, Science and Technology Funds from National State Grid Ltd., China (The Research on Key Technologies of Distributed Parallel Database Storage and Processing based on Big Data), NUPTSF, China (grant no. NY219142), and NSERC of Canada (Discovery grant no. 6466-15).


