A graph approach for fuzzy-rough feature selection

Jinkun Chen(a,b), Jusheng Mi(b,*), Yaojin Lin(c,d)

a School of Mathematics and Statistics, Minnan Normal University, Zhangzhou 363000, P. R. China
b College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, P. R. China
c School of Computer Science, Minnan Normal University, Zhangzhou 363000, P. R. China
d Key Laboratory of Data Science and Intelligence Application, Fujian Province University, Zhangzhou 363000, P. R. China

* Corresponding author. Email addresses: [email protected] (Jinkun Chen), [email protected] (Jusheng Mi), [email protected] (Yaojin Lin)
Abstract

Rough sets, especially fuzzy-rough sets, have proven to be a powerful tool for dealing with vagueness and uncertainty in data analysis. Fuzzy-rough feature selection has been shown to be highly useful for data dimensionality reduction. However, many fuzzy-rough feature selection algorithms are still time-consuming when dealing with large-scale data sets. In this paper, the problem of feature selection in fuzzy-rough sets is studied within the framework of graph theory, and a new mechanism for fuzzy-rough feature selection is proposed. It is shown that finding an attribute reduct of a fuzzy decision system can be translated into finding a transversal of a derivative hypergraph. Based on this graph-representation model, a novel graph-theoretic algorithm for fuzzy-rough feature selection is proposed. The performance of the proposed method is compared with that of state-of-the-art methods on various classification tasks. Experimental results show that the proposed technique outperforms the other feature selection methods in terms of computation time and demonstrates particularly promising performance on large-scale data sets. Moreover, the proposed method can achieve better classification accuracies while using a small number of features.

Keywords: Feature selection; Fuzzy-rough sets; Hypergraphs; Transversals
1. Introduction

Feature selection has been an active field of research for the past two decades in pattern recognition [6, 19, 20, 53, 54, 55, 56]. In the framework of rough sets [40], feature selection is also called attribute reduction, which aims to find a minimal feature subset that provides the same discriminating information as the whole set of features [30, 31, 32, 43]. The feature selection methods of traditional rough set theory are only suitable for handling symbolic data. However, real-world data often contain either a mixture of symbolic and real-valued attributes or only real-valued attributes, and such data are difficult to handle with the traditional rough set model. Fuzzy-rough sets [13, 14], as a combination of fuzzy sets [66] and rough sets, have proven capable of modeling such mixed data [35, 36, 45, 63, 64, 71]. Over the past ten years, fuzzy-rough sets have become a topic of investigation for many researchers [23, 49, 58, 59, 68] and have been successfully applied in neural networks [46], gene expression [11], mining stock prices [52] and dimensionality reduction [4, 7, 25, 62, 69].

A beautiful theoretical foundation for fuzzy-rough feature selection is based on the discernibility matrix [51]. However, applying these theoretical results directly is very time-consuming. In fact, as shown in [48, 60], finding the set of all reducts (or finding an optimal reduct) is an NP-hard problem. To support efficient feature selection, many heuristic methods such as positive-region methods [22, 44], information entropy methods [21] and discernibility matrix methods [5, 51] have been developed. However, in most heuristic methods, one needs to construct a fitness function to evaluate the significance of every attribute. Using the fitness function, the fittest attribute is added to the attribute subset in each loop to form a reduct. A large amount of time is spent calculating
the significance measure of attributes. Thus, most heuristic methods suffer from the same problem of heavy computational cost when dealing with large-scale data sets. Therefore, it is desirable to establish a more effective method for addressing this problem.

The graph-based approach appears to be a useful tool for describing the reduction information of features [8, 9, 28, 37]. For example, Kulaga et al. [28] introduced a graph framework for feature selection with rough sets, although without theoretical analysis or numerical simulations. It should be noted that this graph-based reduction method is only suitable for symbolic data. Chen et al. [8, 9] established the relationship between attribute reducts in rough sets and minimal vertex covers of graphs, and paid much attention to studying the minimal vertex cover problem using the rough set method. Furthermore, similar to the construction of a discernibility matrix, a graph needs to be generated before further analysis; this process is time-consuming and requires much more storage. To overcome the aforementioned weaknesses, we propose a new framework based on graph theory for fuzzy-rough feature selection. We show that the problem of finding the attribute reduction of a fuzzy decision system is equivalent to the problem of finding a transversal of a hypergraph. In particular, we avoid generating the hypergraph when using the graph-based method for finding a subset of features. Therefore, the time complexity of finding reducts can be largely reduced when dealing with large-scale data sets.

The remainder of this paper is organized as follows. In Section 2, the existing research on fuzzy-rough feature selection is summarized. In Section 3, some basic notions related to fuzzy-rough sets and graph theory are introduced. In Section 4, a new hypergraph induced from a given fuzzy decision system is constructed, and some properties of the derivative hypergraph are studied. Furthermore, a novel algorithm for fuzzy-rough feature selection based on graph theory is presented. In Section 5, numerical experiments are given to show the effectiveness of the proposed method. Finally, some conclusions are drawn in the last section.

2. Related work

As concluded in [5, 50], the existing methods for fuzzy-rough feature selection can be grouped into two broad classes: fuzzy discernibility matrix-based (FDM) approaches and those that are not. In [51], a discernibility matrix was constructed to obtain all the reducts of a fuzzy decision system. Although this method is time-consuming, the notion of the discernibility matrix is a mathematical foundation for reduct construction. Indeed, many researchers have studied attribute reduction using the information contained in the discernibility matrix [5, 18, 26]. For example, Jensen et al. [26] extended the crisp discernibility matrix to the fuzzy case by employing fuzzy clauses; the method used to find reducts for crisp discernibility matrices was then applied in the fuzzy case. In [18], He et al. developed an algorithm based on the discernibility matrix to search for reducts with general fuzzy-rough sets [64]. A modified form of the original FDM approach based on the nearest neighbour was given in [27]. As stated in [5], only the minimal elements, rather than all the elements, of the fuzzy discernibility matrix are sufficient and necessary to find reducts.
Additionally, a novel algorithm based on the minimal elements was designed to obtain a reduct, and its time cost was greatly reduced compared with the other FDM approaches.

Different from the FDM methods, various measures have been employed to develop heuristic algorithms for fuzzy-rough feature selection. The idea was first introduced in [24], where a dependence function was defined to measure the discernibility power of attributes and a technique called the fuzzy-rough QUICKREDUCT algorithm was given to find a reduct. Much of the research on fuzzy-rough feature selection has focused on improving the results of [24]. For example, Bhatt et al. [2] put forward the concept of fuzzy-rough sets on a compact computational domain and improved the computational efficiency of the algorithm in [24]. Hu et al. [21] constructed a forward hybrid attribute reduction algorithm called the fuzzy positive region reduction for fuzzy-rough sets. Based on fuzzy similarity relations, Jensen et al. [26] proposed three new fuzzy-rough feature selection algorithms. In particular, they developed a fuzzy boundary region-based method which not only considers the information of lower approximations, but also takes into account the information of upper approximations. A new definition of fuzzy-rough sets based on a divergence measure was introduced and an algorithm for feature selection using the fuzzy positive region was presented in [47]. Information entropy also plays an important role in feature selection. Hu and Yu et al. extended entropy to measure the information quantity in fuzzy sets and applied this information measure to dimensionality reduction for hybrid data [22, 65]. As shown in [27, 44], these methods are still time-consuming for large-scale data sets. In [27], an alternative approach (nnFRFS), which recalculates the membership degrees to the approximations by only
considering the closest neighbours to improve the standard fuzzy-rough feature selection (FRFS), was presented. The results demonstrated that the number of calculations of nnFRFS can be reduced drastically and that it is suitable for handling large-scale data. Qian et al. [44] developed an accelerated algorithm based on the fuzzy positive region (FA-FPR) for large-scale tasks. The results showed that FA-FPR is much faster than its original counterpart and that the gain becomes more visible when dealing with large-scale data. Furthermore, based on the accelerator (positive approximation) and fuzzy information entropy, another fuzzy-rough feature selection algorithm called FA-FSCE was presented; it was shown that the modified algorithm takes far less time than the original one. Zhang et al. [70] investigated fuzzy-rough feature selection using representative instances. There are also some fuzzy-rough feature selection methods designed to model more complex situations [23, 62, 67].

3. Preliminaries

In this section, we recall some basic notions related to fuzzy-rough sets and graph theory [3, 5, 10, 13, 29, 41, 51].

3.1. Feature selection with fuzzy-rough sets

Formally, an information system can be seen as a pair (U, A), where U is a nonempty finite set of samples and A is a nonempty finite set of attributes (features) such that a : U → V_a for every a ∈ A. A decision system is a special information system of the form (U, A ∪ D), where (U, A) is an information system and A ∩ D = ∅. Usually, A is called the conditional attribute set and D the decision attribute set. A fuzzy binary relation R is called a fuzzy similarity (or equivalence) relation if R is reflexive (R(x, x) = 1), symmetric (R(x, y) = R(y, x)) and sup-min transitive (R(x, y) ≥ sup_{z∈U} min{R(x, z), R(z, y)}).
As a generalization of the crisp rough set, the first fuzzy-rough set model was introduced by Dubois and Prade [13, 14]. In fact, there are several other kinds of fuzzy-rough set models [19, 20, 22, 64]. For our purpose, we only recall the first one proposed in [13]. Let R be a fuzzy similarity relation on the universe U. A fuzzy-rough set is a pair (R_*(F), R^*(F)) of fuzzy sets on U such that for every x ∈ U:

    R_*(F)(x) = inf_{y∈U} max{1 − R(x, y), F(y)},
    R^*(F)(x) = sup_{y∈U} min{R(x, y), F(y)}.     (1)
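As an aside, Eq. (1) can be evaluated directly on relation matrices. The following minimal Python sketch (our own illustration, not the authors' code; the array names are assumptions) computes the lower and upper approximations of a fuzzy set F under a fuzzy similarity relation R, both stored as NumPy arrays.

```python
import numpy as np

def fuzzy_rough_approximations(R, F):
    """Dubois-Prade lower/upper approximations of Eq. (1).

    R : (n, n) array, fuzzy similarity relation with R[x, y] in [0, 1]
    F : (n,) array, membership degrees of the fuzzy set F
    Returns (lower, upper), each an (n,) array.
    """
    # lower[x] = min_y max(1 - R[x, y], F[y])
    lower = np.min(np.maximum(1.0 - R, F[None, :]), axis=1)
    # upper[x] = max_y min(R[x, y], F[y])
    upper = np.max(np.minimum(R, F[None, :]), axis=1)
    return lower, upper
```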
In the theory of fuzzy-rough sets, attribute reduction is one of the key processes for knowledge discovery. Many approaches to fuzzy-rough feature selection have been developed in the literature. For our purpose, we introduce the following method based on the discernibility matrix and logical operations [5, 51]. By means of the discernibility matrix method, one can obtain all the reducts of a fuzzy decision system.

Definition 1 ([51]). Let (U, A ∪ D) be a decision system and R be a family of fuzzy similarity relations resulting from the conditional attribute set A. Then S = (U, R ∪ D) is called a fuzzy decision system.

There are many methods to translate a decision system into a fuzzy decision system [20]. Denote

    Sim(R) = ∩{R : R ∈ R},     (2)

where Sim(R)(x, y) = min{R(x, y) : R ∈ R}. It is easy to check that Sim(R) is also a fuzzy similarity relation.

Definition 2 ([51]). Let U/D = {D_1, D_2, · · · , D_r} be the set of equivalence classes induced by the decision attribute set D. POS_{Sim(R)}(D) = ∪_{k=1}^{r} Sim(R)_*(D_k) is called the positive region of D w.r.t. R. Then P ⊆ R is a reduct of R relative to D if P is a minimal subset such that POS_{Sim(P)}(D) = POS_{Sim(R)}(D).
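To make Definitions 1 and 2 concrete, here is a small Python sketch (our own illustration under our own naming, not part of the paper) that builds Sim(R) as the element-wise minimum of the relation matrices and evaluates the fuzzy positive region through the lower approximations of the decision classes, assuming each class is given as a crisp 0/1 indicator vector.

```python
import numpy as np

def sim_relation(relations):
    """Sim(R): element-wise minimum of a list of fuzzy similarity matrices (Eq. (2))."""
    return np.minimum.reduce(relations)

def fuzzy_positive_region(sim, decision_classes):
    """POS_{Sim(R)}(D): pointwise maximum of the lower approximations of the decision classes."""
    lowers = [np.min(np.maximum(1.0 - sim, Dk[None, :]), axis=1) for Dk in decision_classes]
    return np.max(np.stack(lowers), axis=0)
```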
Definition 3 ([5, 51]). Let S = (U, R ∪ D) be a fuzzy decision system. For any x, y ∈ U, we define

    M(x, y) = {R ∈ R : 1 − R(x, y) ≥ λ_x},  if y ∉ [x]_D,
    M(x, y) = ∅,                            otherwise,     (3)

where [x]_D denotes the equivalence class containing x w.r.t. the decision attribute set D and λ_x = Sim(R)_*([x]_D)(x). Then M(x, y) is referred to as the discernibility attribute set of x and y in R. By M_S we denote the |U| × |U| matrix with entries M(x, y), which is called the discernibility matrix of S.

A discernibility function f_S for a fuzzy decision system S is a Boolean function of m variables R_1^*, R_2^*, · · · , R_m^* corresponding to the fuzzy attributes R_1, R_2, · · · , R_m, respectively, and it is defined as follows:

    f_S(R_1^*, R_2^*, · · · , R_m^*) = ∧{∨M(x, y) : M(x, y) ∈ M_S, M(x, y) ≠ ∅},     (4)

where ∨M(x, y) is the disjunction of all variables R^* such that R ∈ M(x, y). By means of the operators of disjunction (∨) and conjunction (∧), Tsang et al. [51] showed that the computation of attribute reducts can be translated into the calculation of the prime implicants of a Boolean function. The set of all attribute reducts of a fuzzy decision system S is denoted by Red(S).

Lemma 1 ([51]). Let S = (U, R ∪ D) be a fuzzy decision system. A fuzzy attribute subset P ⊆ R is a reduct of R iff ∧_{R_i ∈ P} R_i^* is a prime implicant of the discernibility function f_S.
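The construction of Definition 3 can be sketched in Python as follows (an illustrative implementation with names of our own choosing, not the authors' code): for every pair of samples from different decision classes, it collects the attributes whose fuzzy relations separate the pair strongly enough.

```python
import numpy as np

def discernibility_matrix(relations, labels, lam):
    """Nonempty entries M(x, y) of Eq. (3).

    relations : list of (n, n) fuzzy similarity matrices, one per attribute
    labels    : (n,) array of decision labels
    lam       : (n,) array with lam[x] = Sim(R)_*([x]_D)(x)
    Returns a dict {(x, y): set of attribute indices}.
    """
    n = len(labels)
    M = {}
    for x in range(n):
        for y in range(n):
            if labels[x] == labels[y]:
                continue  # M(x, y) is the empty set when y belongs to [x]_D
            entry = {i for i, R in enumerate(relations) if 1.0 - R[x, y] >= lam[x]}
            if entry:
                M[(x, y)] = entry
    return M
```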
From Lemma 1, we can see that if

    f_S(R_1^*, R_2^*, · · · , R_m^*) = ∧{∨M(x, y) : M(x, y) ∈ M_S, M(x, y) ≠ ∅} = ∨_{i=1}^{t} ( ∧_{j=1}^{s_i} R_j^* ),     (5)

where ∧_{j=1}^{s_i} R_j^*, i ≤ t, are all the prime implicants of the discernibility function f_S, then P_i = {R_j : j ≤ s_i}, i ≤ t, are all the reducts of S.
Without any confusion, we will write R_i instead of R_i^* in the sequel.

3.2. Transversals of hypergraphs

A hypergraph is a pair H = (V, E) of a finite vertex set V and a family E of subsets of V. The elements of E are called hyperedges (or simply edges). Note that hyperedges are arbitrary sets of vertices and can therefore contain an arbitrary number of vertices, while in a traditional graph, edges are pairs of elements of V. Thus, hypergraphs can be thought of as a generalization of traditional graphs. A transversal of H is a set K ⊆ V that has a nonempty intersection with each edge of H; in other words, a transversal is a set of vertices that covers all the edges. A transversal K is minimal if no proper subset of K is a transversal. A minimum transversal is a transversal with the least number of vertices. The set of all minimal transversals of H is denoted by T(H). A vertex v ∈ V is isolated if there is no edge E ∈ E such that v ∈ E. An edge E ∈ E with identical ends, i.e., |E| = 1, is called a loop. For E_i, E_j ∈ E, if E_i = E_j, then we say that the edges E_i and E_j are parallel. The degree of a vertex v in a hypergraph H, denoted by d_H(v), is the number of edges of H incident with v.

Similar to the attribute reduction method in fuzzy-rough sets, all the minimal transversals of a hypergraph can also be obtained via Boolean formulae. Given a hypergraph H = (V, E), we define a function f_H for H as follows, which is a Boolean function of m Boolean variables v_1^*, v_2^*, · · · , v_m^* corresponding to the vertices v_1, v_2, · · · , v_m, respectively:

    f_H(v_1^*, v_2^*, · · · , v_m^*) = ∧{∨E : E ∈ E},     (6)

where ∨E is the disjunction of all variables v^* such that v ∈ E. The following lemma gives a method for computing all the minimal transversals of a hypergraph.

Lemma 2 ([3]). Let H = (V, E) be a hypergraph. Then a subset K ⊆ V is a minimal transversal of H iff ∧_{v_i ∈ K} v_i^* is a prime implicant of the Boolean function f_H.

Lemma 2 demonstrates that if

    f_H(v_1^*, v_2^*, · · · , v_m^*) = ∧{∨E : E ∈ E} = ∨_{i=1}^{t} ( ∧_{j=1}^{l_i} v_j^* ),     (7)

where ∧_{j=1}^{l_i} v_j^*, i ≤ t, are all the prime implicants of the Boolean function f_H, then K_i = {v_j : j ≤ l_i}, i ≤ t, are all the minimal transversals of H.
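For very small hypergraphs, the minimal transversals of Lemma 2 can also be enumerated directly by brute force. The following Python sketch is our own illustration (not the authors' method, which relies on prime implicants); the example hypergraph used below is the one that will appear later in Example 1 of Section 4.

```python
from itertools import combinations

def minimal_transversals(vertices, edges):
    """Enumerate all minimal transversals of a hypergraph by brute force (small instances only)."""
    found = []
    for r in range(1, len(vertices) + 1):
        for K in combinations(vertices, r):
            if all(set(K) & set(E) for E in edges):
                # K is minimal iff no smaller transversal already found is contained in it
                if not any(set(T) <= set(K) for T in found):
                    found.append(K)
    return found

# Hypergraph with edges {Ra}, {Rb, Rc}, {Rb, Rc, Rd}, {Ra, Rd}:
print(minimal_transversals(['Ra', 'Rb', 'Rc', 'Rd'],
                            [{'Ra'}, {'Rb', 'Rc'}, {'Rb', 'Rc', 'Rd'}, {'Ra', 'Rd'}]))
# prints [('Ra', 'Rb'), ('Ra', 'Rc')]
```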
We will also write v_i instead of v_i^* in the discussion to follow.

Lemma 2 gives a mathematical foundation for solving the hypergraph transversal problem (i.e., finding a transversal with the minimum number of vertices). However, this problem is NP-hard. There are many approximation algorithms for solving it [10, 16].

4. A graph method for fuzzy-rough feature selection

In this section, we first construct a hypergraph induced from a fuzzy decision system and then discuss the relationship between attribute reduction of the fuzzy decision system and the minimal transversals of the derivative hypergraph. Finally, a new fuzzy-rough feature selection algorithm based on graph theory is developed.

4.1. The derivative hypergraph of a fuzzy decision system

Definition 4. Let S = (U, R ∪ D) be a fuzzy decision system, M_S the discernibility matrix of S and M^* = {M ∈ M_S : M ≠ ∅}. We call the pair H = (R, M^*) an induced hypergraph from S.

Recall that a hypergraph is a generalization of the traditional graph. The induced graph H from a fuzzy decision system may be a traditional graph, but in most cases it is not. Since the following results hold not only for traditional graphs but also for hypergraphs, we will call the induced graph in Definition 4 a hypergraph if there is no danger of ambiguity. It is worth noting that each edge of the induced hypergraph H is a nonempty entry of the discernibility matrix M_S. Since some entries of the discernibility matrix coincide, the induced hypergraph may have parallel edges.

Combining Lemmas 1 and 2, we have the following result.

Theorem 1. Let H = (R, M^*) be an induced hypergraph from a fuzzy decision system S = (U, R ∪ D). Then Red(S) = T(H).

Proof. By Lemmas 1 and 2, we only need to show that f_S(R_1, R_2, · · · , R_m) = f_H(R_1, R_2, · · · , R_m). In fact, according to Definition 4, the edges of the induced hypergraph are exactly the nonempty entries of the discernibility matrix of the fuzzy decision system. Thus, by the definitions of f_S and f_H, the equation holds.

Theorem 1 shows that the problem of finding a reduct of a fuzzy decision system is equivalent to finding a minimal transversal of its derivative hypergraph. This provides the motivation for obtaining the attribute reduction of fuzzy-rough sets via graph theory. In the following, we use an example to illustrate the result obtained above.

Example 1. Let S = (U, R ∪ D) be a fuzzy decision system induced from a decision system (U, A ∪ D), where U = {x_1, x_2, x_3, x_4} and R = {R_a, R_b, R_c, R_d} is a family of fuzzy similarity relations resulting from the conditional attribute set A = {a, b, c, d}. U/D = {D_1, D_2} with D_1 = {x_1, x_4} and D_2 = {x_2, x_3}. The elements of R are defined as follows:

    R_a = ( 1    0.3  0.2  0.6 )      R_b = ( 1    0.2  0.3  0.3 )
          ( 0.3  1    0.2  0.3 )            ( 0.2  1    0.2  0.2 )
          ( 0.2  0.2  1    0.2 )            ( 0.3  0.2  1    1   )
          ( 0.6  0.3  0.2  1   ),           ( 0.3  0.2  1    1   ),

    R_c = ( 1    0.2  0.5  0.3 )      R_d = ( 1    0.5  0.5  0.2 )
          ( 0.2  1    0.2  0.2 )            ( 0.5  1    0.7  0.2 )
          ( 0.5  0.2  1    0.3 )            ( 0.5  0.7  1    0.2 )
          ( 0.3  0.2  0.3  1   ),           ( 0.2  0.2  0.2  1   ).

It is easy to compute that

    Sim(R) = ( 1    0.2  0.2  0.2 )
             ( 0.2  1    0.2  0.2 )
             ( 0.2  0.2  1    0.2 )
             ( 0.2  0.2  0.2  1   ),

    Sim(R)_*(D_1)(x) = 0.8 for x ∈ {x_1, x_4} and 0 otherwise,
    Sim(R)_*(D_2)(x) = 0.8 for x ∈ {x_2, x_3} and 0 otherwise.

The discernibility matrix of S is (we use a separator-free form for sets, e.g., R_bR_c stands for {R_b, R_c}):

    M = ( ∅         R_bR_c     R_a     ∅        )
        ( R_bR_c    ∅          ∅       R_bR_cR_d )
        ( R_a       ∅          ∅       R_aR_d   )
        ( ∅         R_bR_cR_d  R_aR_d  ∅        ).

By Definition 4, the induced hypergraph H = (R, M^*) from S has the form R = {R_a, R_b, R_c, R_d} and M^* = {E_1, E_2, · · · , E_8} (see Fig. 1).
Figure 1: The hypergraph of Example 1

By Lemmas 1 and 2, it is easy to see that Red(S) = T(H) = {{R_a, R_b}, {R_a, R_c}}. In other words, the decision system (U, A ∪ D) has two reducts: {a, b} and {a, c}.

As is well known, core attributes play an important role in decision making. The following theorem characterizes core attributes in terms of graph theory.

Theorem 2. Let H = (R, M^*) be an induced hypergraph from a fuzzy decision system S = (U, R ∪ D). For any R ∈ R, R is a core attribute of S iff R has a loop in H.

Proof. "⇒" Suppose R ∈ R is a core attribute of S, and assume to the contrary that R is a vertex of H without loops. Since a core attribute is included in every reduct, Theorem 1 implies that R is included in each minimal transversal of H. Since R has no loops, V − {R} is a transversal of H, and hence there is a minimal transversal K such that K ⊆ V − {R}. This implies R ∉ K, which contradicts the fact that R is included in each minimal transversal of H. Therefore, R has a loop in H.

"⇐" Assume R is a vertex with a loop in H. If R were not a core attribute of S, then by Theorem 1 again there would exist a minimal transversal K ⊆ V such that R ∉ K. Since R ∉ K and {R} is an edge (a loop) of H, K would not cover this loop, contradicting the fact that K is a transversal. Thus, R is a core attribute of S.

By Theorem 1, we know that the problem of finding a reduct of a fuzzy decision system can be translated into finding a transversal of a hypergraph. The degrees of the hypergraph vertices play an important role in solving the hypergraph transversal problem. To obtain the degrees of the vertices of the induced hypergraph, we first introduce the following notion. Given a fuzzy decision system S = (U, R ∪ D), for any R ∈ R, we define a crisp relation (B)_R = (b_xy)_{|U|×|U|} such that

    b_xy = 1, if R(x, y) ≤ 1 − λ_x and y ∉ [x]_D,
    b_xy = 0, otherwise.     (8)

Theorem 3. Let H = (R, M^*) be an induced hypergraph from a fuzzy decision system S = (U, R ∪ D). For any R ∈ R, the degree of the vertex R is d_H(R) = Σ_{b ∈ (B)_R} b.
Proof. For any vertex R ∈ R, note that the degree of R is the number of edges of H incident with R. By Definition 4, the edges of H are the nonempty entries of the discernibility matrix of S. For any edge E of H, R is incident with E (i.e., R ∈ E) iff there are x, y ∈ U such that R ∈ M(x, y). By Definition 3 and the definition of (B)_R, we have d_H(R) = Σ_{b ∈ (B)_R} b.

Theorem 3 shows that the degree of a hypergraph vertex R is the sum of the entries of the crisp relation matrix (B)_R. In other words, the degree of a hypergraph vertex can be obtained without generating the hypergraph.

Example 2. Continuing Example 1, it is easy to obtain

    (B)_{R_a} = ( 0 0 1 0 )        (B)_{R_d} = ( 0 0 0 0 )
                ( 0 0 0 0 )                    ( 0 0 0 1 )
                ( 1 0 0 1 )                    ( 0 0 0 1 )
                ( 0 0 1 0 ),                   ( 0 1 1 0 ).

From Fig. 1, we know that the degrees of the vertices R_a and R_d are both 4. Furthermore, Σ_{b ∈ (B)_{R_a}} b = Σ_{b ∈ (B)_{R_d}} b = 4. Thus, we have d_H(R_a) = Σ_{b ∈ (B)_{R_a}} b and d_H(R_d) = Σ_{b ∈ (B)_{R_d}} b.
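To illustrate Theorem 3 on the data of Example 1, the following Python sketch (our own illustration, not the authors' code) computes (B)_{R_a} from Eq. (8) and verifies that the sum of its entries equals the degree d_H(R_a) = 4.

```python
import numpy as np

R_a = np.array([[1.0, 0.3, 0.2, 0.6],
                [0.3, 1.0, 0.2, 0.3],
                [0.2, 0.2, 1.0, 0.2],
                [0.6, 0.3, 0.2, 1.0]])
labels = np.array([1, 2, 2, 1])   # D_1 = {x1, x4}, D_2 = {x2, x3}
lam = np.full(4, 0.8)             # lambda_x values from Example 1

diff_class = labels[:, None] != labels[None, :]
# (B)_{R_a} of Eq. (8): 1 where R_a(x, y) <= 1 - lambda_x and y lies in a different class
B_a = ((R_a <= 1.0 - lam[:, None]) & diff_class).astype(int)

print(B_a)          # the matrix (B)_{R_a} of Example 2
print(B_a.sum())    # prints 4, i.e., d_H(R_a)
```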
For any R ∈ R, we call H_R = (R, M^* − {E ∈ M^* : R ∈ E}) an edge-deleted subgraph, obtained by deleting from H all the edges incident to the vertex R while leaving the vertices and the remaining edges intact. Another crisp relation, called the τ-cut of a given fuzzy relation R, is denoted by R_τ, where

    R_τ(x, y) = 1, if R(x, y) ≥ τ,
    R_τ(x, y) = 0, if R(x, y) < τ.     (9)

Theorem 4. Let H = (R, M^*) be an induced hypergraph from a fuzzy decision system S = (U, R ∪ D). For a given R^* ∈ R, let H_{R^*} be the corresponding edge-deleted subgraph. For any R ∈ R, denote C = (B)_R − (B)_{R^*}. Then d_{H_{R^*}}(R) = Σ_{c ∈ C_0} c.

Proof. For a given vertex R^* ∈ R, by the definition of H_{R^*}, all the edges containing the vertex R^* have been deleted from H. Thus, for any R ∈ R, we have d_{H_{R^*}}(R) = d_H(R) − |{E ∈ M^* : R, R^* ∈ E}|. By Theorem 3, the degree of the vertex R can be described by (B)_R, and |{E ∈ M^* : R, R^* ∈ E}| equals the sum of the entries of (B)_R ∩ (B)_{R^*}. This implies that d_{H_{R^*}}(R) = Σ_{b ∈ (B)_R} b − Σ_{b ∈ (B)_R ∩ (B)_{R^*}} b. By the definition of C, we have d_{H_{R^*}}(R) = Σ_{c ∈ C_0} c.
Theorem 4 indicates that the degree of the vertex R in H_{R^*} is the sum of the entries of the zero-cut relation matrix C_0.

Example 3. Continuing Example 1, the edge-deleted subgraph H_{R_d} is shown in Fig. 2.

Figure 2: The edge-deleted subgraph of the hypergraph in Example 1

From Fig. 2, it is easy to see that d_{H_{R_d}}(R_a) = 2. Furthermore, for R_a, by Example 2, we obtain

    C = (B)_{R_a} − (B)_{R_d} = (  0  0  1  0 )        C_0 = ( 0 0 1 0 )
                                (  0  0  0 −1 )              ( 0 0 0 0 )
                                (  1  0  0  0 )              ( 1 0 0 0 )
                                (  0 −1  0  0 ),             ( 0 0 0 0 ).

We have d_{H_{R_d}}(R_a) = Σ_{c ∈ C_0} c = 2.
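The update step of Theorem 4 can be written as a one-line matrix operation. The following Python sketch is our own illustration (names and layout are assumptions): after a vertex R* is selected, the degree of any other vertex in the edge-deleted subgraph is the sum of the positive entries of C = (B)_R − (B)_{R*}.

```python
import numpy as np

def updated_degree(B_R, B_Rstar):
    """Degree of a vertex R in the edge-deleted subgraph H_{R*} (Theorem 4)."""
    C = B_R - B_Rstar
    C0 = np.where(C > 0, C, 0)   # keep only the pairs still discerned by R but not by R*
    return int(C0.sum())

# With the matrices (B)_{R_a} and (B)_{R_d} of Examples 2 and 3, this returns 2.
```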
4.2. A graph-based algorithm for fuzzy-rough feature selection

We now propose the following feature selection algorithm for fuzzy decision systems based on graph theory.

Algorithm 1 A graph-based algorithm for fuzzy-rough attribute reduction (GFAR)
Input: A fuzzy decision system S = (U, R ∪ D);
Output: One reduct red.
1:  Calculate Sim(R); // see Eq. (2)
2:  for each x_i ∈ U do
3:      Calculate λ_{x_i} = Sim(R)_*([x_i]_D)(x_i); // see Eqs. (1) and (2)
4:  end for
5:  Let s = 0, red = ∅;
6:  for each R ∈ R do
7:      Calculate (B)_R and Σ_{b ∈ (B)_R} b; // the degrees of the vertices, see Theorem 3
8:      s = s + Σ_{b ∈ (B)_R} b;
9:  end for
10: while s ≠ 0 do
11:     Select the vertex R^* with the maximum degree Σ_{b ∈ (B)_{R^*}} b;
12:     red = red ∪ {R^*};
13:     s = 0;
14:     for each R ∈ R do
15:         Calculate C = (B)_R − (B)_{R^*} and let (B)_R = C_τ; // where τ = 0, see Theorem 4
16:         Calculate Σ_{b ∈ (B)_R} b and s = s + Σ_{b ∈ (B)_R} b; // the degrees of the vertices
17:     end for
18: end while
19: Return red.
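For readers who prefer an executable form, here is a compact Python sketch of GFAR (our own re-implementation for illustration only, not the authors' Matlab code; variable names and the NumPy layout are ours, and ties in Step 11 are broken arbitrarily by argmax). It follows Algorithm 1: compute the matrices (B)_R once, then greedily pick the attribute with the largest degree and discount the pairs it already separates.

```python
import numpy as np

def gfar(relations, labels):
    """Greedy fuzzy-rough attribute reduction (a sketch of Algorithm 1, GFAR).

    relations : list of (n, n) fuzzy similarity matrices, one per attribute
    labels    : (n,) array of decision labels
    Returns the indices of the selected attributes.
    """
    sim = np.minimum.reduce(relations)                  # Sim(R), Eq. (2)
    diff = labels[:, None] != labels[None, :]           # y not in [x]_D
    # lambda_x = Sim(R)_*([x]_D)(x) = min over y outside [x]_D of (1 - sim(x, y))
    lam = np.where(diff, 1.0 - sim, 1.0).min(axis=1)
    # (B)_R for every attribute, Eq. (8)
    B = [((R <= 1.0 - lam[:, None]) & diff).astype(int) for R in relations]
    degrees = np.array([b.sum() for b in B])

    red = []
    while degrees.sum() != 0:
        star = int(np.argmax(degrees))                  # vertex with maximum degree
        red.append(star)
        B_star = B[star].copy()
        for i in range(len(B)):
            B[i] = np.where(B[i] - B_star > 0, 1, 0)    # zero-cut of C, Theorem 4
            degrees[i] = B[i].sum()
    return red
```

On the data of Example 1, with the relations passed in the order [R_a, R_b, R_c, R_d] and labels np.array([1, 2, 2, 1]), this sketch returns [0, 1], i.e., the reduct {a, b} found above.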
Theorem 5. Let S = (U, R ∪ D) be a fuzzy decision system. Then the time complexity of GFAR is O(|U||R|(1 + ln(|R|))).

Proof. It is easy to see that Steps 1-4 of Algorithm 1 need O(|U||R|). Assume that k (1 ≤ k ≤ |R|) attributes are selected by the proposed algorithm GFAR. It should be noted that Steps 10-18 form a greedy search similar to the one described in [10]. From the result presented in [10] and Definition 4, we know that k ≤ H(max_{M ∈ M^*} |M|) k_opt, where H(n) = Σ_{i=1}^{n} 1/i and k_opt denotes the cardinality of an optimal solution. According to the result stated in [10] again, we can obtain k ≤ ln(|R|) k_opt. Note that k_opt ≤ |R|; thus we get k < |R| ln(|R|). Hence, Steps 10-18 can be done in O(|U||R| ln(|R|)). Therefore, we conclude that GFAR needs O(|U||R|(1 + ln(|R|))). This completes the proof.

It is worth noting that ln(|R|) ≪ |R| for high-dimensional data. Thus, by Theorem 5, GFAR needs a time complexity of O(|U||R|) on high-dimensional data. This means that GFAR can effectively handle large-scale data compared with the algorithms described in [5, 20, 26, 44], which need O(|U|^2 |R|) or O(|U||R|^2).

It should be emphasized that the proposed algorithm GFAR has two advantages. One is that we do not generate the hypergraph: generating the induced hypergraph from a fuzzy decision system has a space complexity of O(|U|^2 |R|)
and requires additional computation time. The other is that the degrees of the hypergraph vertices are computed by means of the fuzzy similarity relation matrices, again without generating the hypergraph.

5. Experimental analysis

In this section, we empirically evaluate our approach against other current state-of-the-art methods on hybrid data.

5.1. Experimental setup and data sets

To evaluate the performance of GFAR and compare it with other well-known fuzzy-rough feature selection algorithms, we set up the following experimental procedure.

1) 18 publicly available data sets¹ are used in our experiments. Their sample sizes vary from 148 to 14980, and the numbers of features range from 13 to 49151. The statistical information of these data sets is shown in Table 1. The experiments are performed on a personal computer with an Intel (R) Core (TM) i7-4790 CPU at 3.60 GHz. All the algorithms are implemented in Matlab.

Table 1: Data description used in this experiment
No   Data sets      Samples   Features   Classes
1    Bands          365       19         2
2    Biodeg         1055      41         2
3    EEG            14980     14         2
4    Firm           10800     19         2
5    Gearbox        6412      72         4
6    GLA-BRA-180    180       49151      2
7    Heart          270       13         2
8    Hiva           384       1617       2
9    Ionosphere     351       33         2
10   Lymphography   148       18         4
11   Segment        2310      19         7
12   Sonar          208       60         2
13   Texture        5500      40         13
14   TOX-171        171       5748       4
15   Urban          507       147        9
16   Vowel          990       13         11
17   Wine           178       13         3
18   Wpbc           194       33         2
2) Before reduction, each data set needs to be converted into a corresponding fuzzy decision system. If a is a nominal attribute, the fuzzy similarity degree R_a(x, y) between the objects x and y w.r.t. a is measured as

    R_a(x, y) = 1, if a(x) = a(y),
    R_a(x, y) = 0, otherwise.     (10)

If a is a numerical attribute, the fuzzy similarity degree R_a(x, y) is calculated as [26]

    R_a(x, y) = max( min( (a(y) − a(x) + σ)/σ, (a(x) − a(y) + σ)/σ ), 0 ),     (11)

where σ² is the variance of the attribute a.

¹ The data sets used in the experiment are from: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi, http://eps.upo.es/bigs/datasets.html, http://archive.ics.uci.edu/ml/datasets.html, http://sci2s.ugr.es/keel/datasets.php
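A possible Python rendering of Eqs. (10)-(11) is given below (our own illustration; the function names and column handling are assumptions, and a non-constant numerical attribute is assumed so that σ > 0).

```python
import numpy as np

def nominal_similarity(a):
    """Eq. (10): R_a(x, y) = 1 iff a(x) = a(y)."""
    a = np.asarray(a)
    return (a[:, None] == a[None, :]).astype(float)

def numerical_similarity(a):
    """Eq. (11): triangular similarity scaled by the standard deviation of attribute a."""
    a = np.asarray(a, dtype=float)
    sigma = a.std()                      # sigma^2 is the variance of a
    d = a[:, None] - a[None, :]
    return np.maximum(np.minimum((d + sigma) / sigma, (-d + sigma) / sigma), 0.0)
```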
The above relation R_a is not necessarily a fuzzy equivalence relation. However, we can obtain a fuzzy equivalence relation from R_a with the fuzzy transitive closure operation.

3) Four different types of classification algorithms are employed to test the quality of the feature subsets: NaiveBayes [17], the tree-based CART [39], SVM [42] and the nearest-neighbour classifier KNN [1]. The classification performances on the raw data and the reduced data are obtained with 10-fold cross-validation. That is, the given data set is randomly divided into 10 subsets of the same size, 9 for training and the remaining one for testing. For each fold, the feature selection method is performed on the training set, and the classification accuracies on the reduced training and testing sets are obtained with the classifier. After 10 rounds, the average result is obtained. This process is independently repeated 10 times, and the means and standard deviations are taken as the final performance. For the data sets with few samples, such as GLA-BRA-180 and TOX-171, ten stratified randomisations of 3-fold cross-validation are employed.

4) To further assess the statistical significance of the results, the Friedman test [15] and the Nemenyi post-hoc test [38] are used. The W/D/L record [57] is also provided, i.e., the numbers of data sets for which the proposed algorithm performs better than, equal to and worse than the other algorithms, respectively. The Friedman test is a statistical test that uses the rank of each algorithm on each data set. Given k algorithms and N data sets, the Friedman statistic is defined as [15]

    χ²_F = (12N / (k(k + 1))) ( Σ_{i=1}^{k} r_i² − k(k + 1)²/4 )  and  F_F = (N − 1)χ²_F / (N(k − 1) − χ²_F),     (12)

where r_i is the average rank of algorithm i over all data sets. F_F follows a Fisher distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. If the null hypothesis, under which all algorithms perform equivalently, is rejected by the Friedman test, post-hoc tests such as the Nemenyi test can be used to further explore which algorithms perform statistically differently. In fact, two algorithms are considered significantly different if the distance between their average ranks exceeds the critical value CD_α = q_α √(k(k + 1)/(6N)), where q_α is a tabulated value proposed in [12].

5) The proposed algorithm is compared with six different types of representative fuzzy-rough feature selection algorithms: the sample pair selection-based algorithm (SPS) [5], the fuzzy boundary region-based algorithm (B-FRFS) [26], the fuzzy positive region-based algorithm (FA-FPR) [44], the fuzzy information entropy-based algorithm (FA-FSCE) [44], the fuzzy mutual information-based min-Redundancy Max-Relevance algorithm (FMI-mRMR) [65] and the nearest neighbour-based algorithm (nnFRFS) [27].

5.2. Experimental results and analysis

In this subsection, we present the experimental results in terms of the number of selected features, the running time and the classification accuracy.

5.2.1. The average size of selected features

Table 2 records the average sizes of the selected features of the seven feature selection algorithms for each data set. In this table, "-" means that FMI-mRMR fails to produce a result due to its long running time (exceeding 70 hours). From the table we can see that the proposed algorithm GFAR achieves a significant reduction of dimensionality in most cases compared with the original data set.
Especially for the data sets with many features, such as GLA-BRA-180, there is a 99.99% decrease in dimensionality. This implies that the redundant features contained in the feature space can be effectively removed by GFAR. We can observe similar results for the other algorithms. In relation to B-FRFS, SPS, FA-FSCE, FMI-mRMR and nnFRFS, the W/D/L records show that there is a decrease in the size of the feature subsets obtained by GFAR. Furthermore, GFAR is never worse than SPS, FMI-mRMR and nnFRFS in the size of the selected features. Finally, GFAR obtains the best average size of selected features, 16.49.

To further explore whether the reduct sizes of the seven algorithms are significantly different, the Friedman test is applied. With seven algorithms and 18 data sets, F_F is distributed with 6 and 102 degrees of freedom. Note that the critical value of the Fisher distribution F(6, 102) at α = 0.1 is 1.83. According to the Friedman test, the test statistic F_F is 31.61, which is clearly greater than 1.83. This means that at α = 0.1 there is evidence to reject the null hypothesis that all the algorithms are equivalent in terms of reduct size. To further identify which feature selection algorithms have statistically different reduct sizes, the Nemenyi test is then conducted. For α = 0.1, the corresponding critical distance is CD_α = 1.94.
Table 2: Average sizes of selected features of the seven algorithms

Data sets      All      GFAR         B-FRFS       SPS          FA-FPR       FA-FSCE      FMI-mRMR     nnFRFS
Bands          19       16.59±0.62   16.69±0.63   17.09±0.67   16.13±0.51   17.82±0.48   17.85±0.46   17.80±0.40
Biodeg         41       33.89±0.42   34.07±0.56   34.71±0.54   31.05±1.10   40.65±1.49   40.67±1.41   35.95±0.26
EEG            14       9.23±2.18    9.23±2.18    9.23±2.18    7.50±3.15    9.63±2.46    9.93±2.45    11.43±1.14
Firm           19       12.20±1.35   12.43±1.33   16.00±0.00   12.43±1.33   12.07±1.34   16.00±0.00   14.43±0.57
Gearbox        72       27.67±1.23   28.50±1.31   32.33±2.31   26.42±1.51   72.00±0.00   72.00±0.00   54.33±1.30
GLA-BRA-180    49151    7.14±0.38    8.43±0.53    15.24±1.13   10.86±0.38   8.00±0.00    -            18.38±0.59
Heart          13       10.16±0.47   10.63±0.49   12.31±0.46   10.67±0.59   12.37±1.10   12.81±0.39   12.96±0.20
Hiva           1617     8.23±0.86    18.30±4.00   23.43±2.54   18.30±4.00   8.97±1.56    23.43±2.54   57.37±15.05
Ionosphere     33       25.83±1.06   25.34±1.05   28.39±1.19   24.59±2.59   32.00±0.00   32.00±0.00   31.82±0.54
Lymphography   18       8.04±0.55    8.00±0.40    9.35±0.61    8.89±0.79    7.97±0.36    9.51±0.69    12.82±0.41
Segment        19       15.99±0.12   16.99±0.12   17.11±0.36   15.08±0.28   17.38±0.51   17.45±0.49   18.00±0.00
Sonar          60       16.20±1.18   17.11±1.64   30.03±2.67   17.64±1.43   16.67±1.54   30.03±2.67   39.57±1.70
Texture        40       37.00±0.00   37.00±0.00   37.00±0.00   28.13±1.91   40.00±0.00   40.00±0.00   37.00±0.00
TOX-171        5748     8.00±0.00    8.43±0.50    12.87±0.86   10.03±0.49   8.13±0.35    -            14.60±0.97
Urban          147      18.47±1.22   18.87±1.46   34.43±3.14   20.80±1.19   18.83±1.12   34.43±3.14   51.97±3.21
Vowel          13       12.00±0.00   12.00±0.00   12.10±0.30   12.00±0.00   12.06±0.24   12.14±0.35   13.00±0.00
Wine           13       9.75±0.48    10.01±0.61   11.50±0.56   9.89±0.71    12.31±1.24   12.56±0.78   11.95±0.22
Wpbc           33       20.42±1.28   20.32±1.58   24.83±1.36   18.57±1.52   30.77±4.10   31.37±2.54   31.66±1.05
Average        3170.56  16.49        17.35        21.00        16.61        20.98        24.73        26.95
Average rank   -        1.92         2.81         4.53         2.33         4.08         6.25         6.08
W/D/L          18/0/0   -            12/3/3       16/2/0       9/1/8        16/0/2       18/0/0       17/1/0
Fig. 3 shows the results with α = 0.1 on the 18 data sets. The figure plots the average ranks of the seven feature selection algorithms, and the groups of algorithms that are not significantly different are connected with a red line. The statistical test shown in Fig. 3 reveals that the size of the features selected by GFAR is statistically smaller than those of SPS, FA-FSCE, FMI-mRMR and nnFRFS, while there is no consistent evidence to indicate statistical differences between GFAR, B-FRFS and FA-FPR, respectively.
Figure 3: Comparison of the reduct sizes of all the algorithms against each other with the Nemenyi test
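As an aside, the Friedman statistic of Eq. (12) and the Nemenyi critical distance used throughout this section can be computed with a few lines of Python (our own sketch; the tabulated value q_α must be supplied by the user, e.g., from [12]).

```python
import numpy as np

def friedman_statistics(ranks):
    """Friedman chi^2 and F_F of Eq. (12).

    ranks : (N, k) array of ranks, one row per data set, one column per algorithm.
    """
    N, k = ranks.shape
    r = ranks.mean(axis=0)                                    # average rank of each algorithm
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4.0)
    F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, F_F

def nemenyi_cd(q_alpha, k, N):
    """Nemenyi critical distance CD_alpha = q_alpha * sqrt(k(k+1)/(6N))."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
```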
5.2.2. Runtime analysis

Table 3 reports the runtimes of the seven feature selection algorithms for each data set. From it we can see that the two feature selection algorithms GFAR and nnFRFS are consistently faster than the other five algorithms. GFAR is never worse than the other algorithms (except nnFRFS) in runtime. The average runtime of GFAR is only 7.96% of that of B-FRFS, 2.06% of that of SPS, 14.33% of that of FA-FPR, 5.75% of that of FA-FSCE, 5.01% of that of FMI-mRMR, and 126.06% of that of nnFRFS, respectively. Especially for large-scale data, the difference is profoundly large. For example, for the data set Gearbox, the runtime reductions with respect to the other six algorithms reach 2684.24s, 3693.14s, 1335.84s, 4824.04s, 4182.24s, and 20.43s, respectively. Furthermore, it should be noted that the differences between the runtime performances of GFAR and nnFRFS are small in most cases. Finally, the W/D/L records show that GFAR outperforms the other algorithms as well.

Table 3: Runtimes of the seven algorithms (in seconds)
Data sets      GFAR          B-FRFS           SPS               FA-FPR           FA-FSCE          FMI-mRMR        nnFRFS
Bands          0.07±0.01     0.75±0.03        0.72±0.04         0.58±0.04        1.13±0.07        1.53±0.04       0.11±0.06
Biodeg         3.46±0.55     34.00±7.12       11.40±0.45        28.48±2.56       74.65±14.18      51.54±1.26      2.76±0.36
EEG            45.43±6.83    618.47±73.94     4683.59±728.86    420.27±192.94    1820.75±248.91   859.96±10.45    104.41±6.56
Firm           50.64±5.22    461.46±55.75     2321.20±280.89    202.64±29.65     775.26±149.34    529.68±39.38    77.69±10.33
Gearbox        184.46±11.33  2868.70±184.11   3877.60±697.79    1520.30±339.58   5008.50±506.25   4366.70±66.73   204.89±43.95
GLA-BRA-180    270.60±62.42  1344.90±192.24   16957.00±1811.40  1407.80±44.79    1708.00±99.01    -               17.17±2.04
Heart          0.03±0.00     0.27±0.02        0.35±0.01         0.20±0.01        0.39±0.02        0.41±0.01       0.05±0.01
Hiva           5.42±1.98     100.02±26.65     30.54±1.55        70.65±25.66      69.79±15.35      3496.74±75.43   16.47±7.13
Ionosphere     0.24±0.05     2.98±0.47        1.38±0.17         2.62±0.45        4.73±0.85        5.39±0.50       0.36±0.08
Lymphography   0.01±0.00     0.23±0.02        0.20±0.02         0.09±0.01        0.14±0.01        0.22±0.01       0.02±0.00
Segment        3.65±0.07     129.48±2.59      103.32±1.82       43.74±1.00       108.79±2.15      59.57±1.10      8.94±0.16
Sonar          0.16±0.01     2.01±0.21        1.09±0.03         1.42±0.14        2.08±0.20        6.98±0.10       0.25±0.01
Texture        63.22±1.56    2254.90±75.08    2330.10±128.75    662.33±46.64     1349.40±51.63    1005.80±17.64   64.41±4.51
TOX-171        3.29±0.11     83.54±6.59       280.04±9.72       35.19±3.44       43.87±2.86       -               1.47±0.12
Urban          0.99±0.08     34.64±2.78       9.41±0.31         10.61±0.76       14.67±1.19       125.92±3.76     1.56±0.15
Vowel          0.42±0.01     6.74±0.29        10.08±0.22        3.23±0.13        5.41±0.15        5.95±0.06       0.80±0.03
Wine           0.01±0.00     0.19±0.01        0.21±0.01         0.09±0.01        0.16±0.01        0.24±0.01       0.03±0.00
Wpbc           0.06±0.00     0.86±0.05        0.50±0.01         0.55±0.04        1.28±0.11        1.76±0.04       0.09±0.01
Average        35.12         441.34           1701.04           245.04           610.50           700.55          27.86
Average rank   1.17          5.11             4.97              3.44             5.39             6.08            1.83
W/D/L          -             18/0/0           18/0/0            18/0/0           18/0/0           18/0/0          15/0/3
Fig. 4 presents the detailed runtime changes of the six feature selection algorithms other than FMI-mRMR (since FMI-mRMR fails on the GLA-BRA-180 data set) as the size of the data increases. Here, we take four large data sets into consideration. In each sub-figure, the x-axis gives the proportion of the data, whereas the y-axis gives the computational time (in seconds). It appears from Fig. 4 that the runtime of each of the six algorithms grows monotonically with the size of the data. However, it should be pointed out that GFAR and nnFRFS take less time to select a new feature subset than the other four algorithms. Furthermore, we can use the standard deviation to evaluate the stability of the algorithms (the lower, the better). As seen from Table 3 and Fig. 5, GFAR and nnFRFS both have far lower means and standard deviations than those produced by the other algorithms, which implies that GFAR and nnFRFS both provide much better robustness than the other four algorithms. Therefore, we can draw the conclusion that GFAR and nnFRFS are both suitable for handling large-scale data sets.

To further explore whether the runtimes of the seven algorithms are significantly different, we perform a Friedman test. The test statistic F_F is 54.93, which is greater than 1.83. This means that at α = 0.1 we can reject the null hypothesis that all seven algorithms perform equally well in terms of runtime. Thus the Nemenyi test is conducted. The test result shown in Fig. 6 demonstrates that the runtime of the proposed algorithm GFAR is statistically better than those of B-FRFS, SPS, FA-FPR, FA-FSCE and FMI-mRMR with α = 0.1, and there is no consistent evidence to indicate a statistical difference between GFAR and nnFRFS.

5.2.3. Classification accuracy results

Tables 4, 5, 6 and 7 show the classification accuracies obtained with the seven feature selection algorithms for the four different types of classifiers NaiveBayes, CART, SVM and KNN, where "All" denotes the accuracies on the original data. From these tables we can observe the following.

(1) For NaiveBayes, the average classification accuracy is improved by GFAR by 3.97% when compared with the original data, while the other six algorithms increase the classification accuracies by 3.50%, 2.21%, 2.74%, 3.77%, 1.92% and 2.46%, respectively. The proposed algorithm GFAR outperforms the original data 13 times over the 18 data sets. GFAR ranks first, with a margin of 0.20% to the second best average accuracy of 77.67% obtained by FA-FSCE and a margin of 2.05% to the worst average accuracy of 75.82% obtained by FMI-mRMR. Meanwhile, the W/D/L records show that GFAR also outperforms the other six algorithms.
Figure 4: Runtimes of the six algorithms on the four large-scale data sets ((a) EEG, (b) Gearbox, (c) GLA-BRA-180, (d) Texture; x-axis: proportion of samples, y-axis: computational time in seconds; curves: GFAR, B-FRFS, SPS, FA-FPR, FA-FSCE, nnFRFS)
Figure 5: Means and standard deviations of the six algorithms on the four large-scale data sets

(2) According to Table 5, which uses CART, all the algorithms except GFAR show a slight decrease in classification accuracy when compared with the original data. We can see that GFAR is superior to the other six feature selection algorithms on most data sets. Furthermore, the average ranks for GFAR, B-FRFS, SPS, FA-FPR, FA-FSCE, FMI-mRMR and nnFRFS are 2.03, 4.31, 4.28, 4.56, 4.19, 4.67 and 3.97, respectively. Accordingly, we can conclude that the proposed method GFAR achieves highly competitive performance against the other six feature selection algorithms when the CART classifier is used.
Figure 6: Runtime comparisons of all the algorithms against each other with the Nemenyi test

(3) Similar to the results obtained with NaiveBayes, all seven algorithms show a slight increase in the classification accuracy of SVM when compared with the original data. For nearly 83% of the listed data sets, the GFAR method offers better or the same SVM accuracy as the other algorithms. Furthermore, GFAR obtains average improvements of 0.69%, 2.47%, 1.84%, 1.20%, 3.37% and 1.96% in comparison with B-FRFS, SPS, FA-FPR, FA-FSCE, FMI-mRMR and nnFRFS, respectively. Finally, for the SVM classifier, GFAR ranks first with a margin of 0.69% to the second best average accuracy of 79.58% obtained by B-FRFS.

(4) As to KNN, all seven algorithms show a slight decrease in classification accuracy when compared with the original data. It should be emphasized that this decrease is small when compared with the reduction in dimensionality. Notably, for the GFAR method, there is a decrease of up to 99.48% in the average dimensionality, whereas the corresponding decrease in the average accuracy is only 0.12%. We can see similar results for the other feature selection algorithms. Moreover, when compared with the other six feature selection algorithms, the W/D/L records show that the proposed algorithm GFAR outperforms all of them.

To further explore whether the accuracies of each classifier with the seven algorithms are significantly different, we perform the corresponding Friedman tests. The test results show that at α = 0.1 we can accept the alternative hypothesis that the seven algorithms differ in terms of accuracy. Thus the Nemenyi tests are conducted. The test result shown in Fig. 7 (a) demonstrates that the accuracy of NaiveBayes with GFAR is statistically better than that with FMI-mRMR at α = 0.1, but there is no consistent evidence to indicate statistical differences between GFAR and B-FRFS, SPS, FA-FPR, FA-FSCE and nnFRFS, respectively. Fig. 7 (b) shows that the proposed method GFAR is statistically superior to all the feature selection methods except nnFRFS w.r.t. the CART classifier. From Fig. 7 (c), we see that the accuracy of SVM with GFAR is statistically better than those with FA-FPR, FA-FSCE, FMI-mRMR and nnFRFS, respectively, but there is no consistent evidence to indicate statistical accuracy differences between GFAR, B-FRFS and SPS. Finally, from Fig. 7 (d), we observe that there is no consistent evidence to indicate statistical differences between the proposed method and B-FRFS, FA-FSCE, FMI-mRMR and nnFRFS, respectively. However, it shows that the accuracy of KNN with GFAR is statistically better than those with FA-FPR and SPS, respectively.

5.2.4. Scalability test

In this subsection, we conduct experiments to examine the scalability of the proposed method. Fig. 8 shows the results of running the proposed algorithm on data sets with the proportion of objects or features varying from 20% to 100% in steps of 20%. From Fig. 8, we can see that the computation time increases as the data set gets larger, and it increases almost linearly with the number of objects or features. Therefore, we can draw the conclusion that the proposed method GFAR scales well with an increasing number of objects or features.
Table 4: Accuracies of NaiveBayes with the seven algorithms (%). (Each of the following lines lists one column of the table; the first line gives the row labels.)
Data sets Bands Biodeg EEG Firm Gearbox GLA-BRA-180 Heart Hiva Ionosphere Lymphography Segment Sonar Texture TOX-171 Urban Vowel Wine Wpbc Average Average rank W/D/L
All 67.44±7.42 62.63±3.82 68.23±0.81 83.47±0.34 99.12±0.26 45.00±5.14 78.89±8.02 96.35±1.37 91.34±5.18 75.14±11.69 88.29±2.35 74.58±10.27 80.45±1.15 20.94±4.32 64.58±3.21 72.41±3.74 96.35±4.85 65.05±10.08 73.90 13/2/3
GFAR 67.86±7.13 60.10±4.15 63.62±3.41 83.47±0.34 99.41±0.18 64.07±3.13 81.19±7.13 96.35±1.37 91.08±5.23 75.22±12.39 88.97±2.18 74.92±10.50 80.71±1.13 57.95±5.52 78.58±3.07 72.82±3.92 96.91±4.31 68.46±9.97 77.87 2.72 -
B-FRFS 67.85±7.24 60.21±4.10 63.62±3.41 83.47±0.34 99.38±0.17 59.26±4.50 80.22±7.34 96.35±1.37 91.25±5.44 76.60±11.72 88.61±2.40 72.81±9.84 80.71±1.13 58.54±5.82 77.30±3.25 72.74±3.83 96.62±4.21 67.65±10.12 77.40 3.67 10/4/4
SPS 67.74±7.00 60.24±4.18 63.62±3.41 83.47±0.34 99.37±0.19 63.33±10.31 79.63±7.44 96.35±1.37 91.42±5.29 70.81±11.82 88.70±2.42 71.59±8.99 80.71±1.13 42.51±6.07 74.20±4.04 72.76±3.76 96.46±4.81 67.12±10.09 76.11 4.36 12/4/2
FA-FPR 67.04±7.23 59.83±4.41 64.40±3.73 83.47±0.34 99.49±0.17 55.56±5.27 80.33±7.98 96.35±1.37 90.99±5.68 72.57±13.77 89.13±2.24 73.06±9.46 81.44±1.21 53.39±8.40 74.77±4.22 72.98±3.86 96.47±4.88 68.32±10.11 76.64 3.72 11/2/5
FA-FSCE 67.69±7.44 62.46±3.85 63.74±3.53 83.47±0.34 99.12±0.26 61.85±6.26 79.93±8.16 96.28±1.40 91.34±5.18 76.01±11.83 88.56±2.38 73.10±9.76 80.45±1.15 61.17±6.14 78.24±3.05 72.73±3.81 96.35±4.85 65.56±10.55 77.67 4.19 12/1/5
FMI-mRMR 67.44±7.42 62.63±3.82 68.23±0.81 83.47±0.34 99.12±0.26 79.89±8.02 96.35±1.37 91.34±5.18 75.14±11.69 88.29±2.35 74.58±10.27 80.45±1.15 65.94±2.86 72.41±3.74 96.35±4.85 65.05±10.08 75.82 4.92 13/2/3
nnFRFS 67.35±7.48 61.67±4.15 67.97±1.92 83.47±0.34 99.21±0.21 64.26±9.36 79.89±8.09 96.35±1.37 91.34±5.18 69.72±10.84 88.29±2.35 72.89±8.85 80.71±1.13 44.33±5.99 72.09±3.25 72.41±3.74 96.68±4.57 65.78±10.56 76.36 4.42 11/3/4
Table 5: Accuracies of CART with the seven algorithms (%). (Each of the following lines lists one column of the table; the first line gives the row labels.)
Data sets Bands Biodeg EEG Firm Gearbox GLA-BRA-180 Heart Hiva Ionosphere Lymphography Segment Sonar Texture TOX-171 Urban Vowel Wine Wpbc Average Average rank W/D/L
All 62.73±8.15 81.36±3.51 82.26±0.61 99.95±0.04 99.46±0.18 58.70±5.94 75.67±8.02 94.24±1.77 88.06±6.52 74.44±9.78 95.79±1.31 70.51±10.61 91.08±0.70 56.96±5.75 74.60±3.78 76.61±4.07 90.30±6.64 69.02±10.84 80.10 11/0/7
GFAR 64.56±7.77 81.86±3.78 76.36±3.67 99.97±0.03 99.42±0.19 55.74±4.65 76.89±7.93 95.50±1.76 87.83±6.29 77.31±9.60 95.91±1.37 73.80±10.01 91.25±0.59 53.68±6.19 76.27±3.82 76.46±4.17 91.36±6.19 68.91±9.76 80.17 2.03 -
B-FRFS 63.89±7.95 81.52±3.76 76.33±3.68 99.96±0.03 99.44±0.17 52.04±5.26 75.59±8.28 95.47±2.00 88.43±5.95 73.91±11.06 95.94±1.35 70.67±10.01 91.19±0.66 51.46±7.19 75.21±2.92 76.39±4.07 91.20±6.66 67.63±10.60 79.24 4.31 15/0/3
SPS 63.31±7.85 81.49±3.58 76.34±3.68 81.75±0.81 99.38±0.17 53.52±7.52 75.85±7.87 95.47±1.62 88.12±6.57 78.30±10.29 95.87±1.34 73.28±10.39 91.05±0.67 39.18±7.02 75.52±3.94 76.43±4.06 91.30±6.35 68.09±10.66 78.01 4.28 16/0/2
FA-FPR 63.13±7.72 81.77±3.51 74.02±6.36 99.96±0.03 99.40±0.21 54.63±3.80 75.85±8.16 95.47±2.00 87.33±5.59 74.63±10.68 95.68±1.24 71.36±10.64 91.16±0.66 49.71±9.14 72.76±3.86 76.13±4.52 91.03±6.37 70.13±11.05 79.12 4.56 17/0/1
FA-FSCE 62.83±8.41 81.51±3.22 76.80±3.93 99.95±0.03 99.47±0.20 55.19±5.68 75.33±7.99 94.82±2.39 88.32±6.56 72.07±11.60 95.88±1.34 71.65±9.37 91.29±0.60 53.51±5.54 75.27±3.51 76.41±4.03 90.18±6.43 68.50±10.43 79.39 4.19 14/0/4
FMI-mRMR 62.79±8.82 81.30±3.56 82.30±0.56 99.95±0.04 99.45±0.19 75.63±8.02 94.53±1.64 88.24±6.49 74.12±9.71 95.80±1.39 70.95±10.71 91.07±0.67 75.38±3.55 76.67±4.12 90.92±6.73 68.90±10.52 78.85 4.67 14/0/4
nnFRFS 62.90±8.13 81.52±3.86 80.47±1.69 99.97±0.03 99.41±0.21 52.41±7.82 75.63±7.95 95.73±1.43 88.06±6.52 75.78±10.37 95.79±1.31 71.49±10.93 91.08±0.70 40.53±4.86 75.35±3.95 76.54±4.15 90.47±6.23 68.86±10.81 79.00 3.97 13/1/4
Table 6: Accuracies of SVM with the seven algorithms (%)
Data sets Bands Biodeg EEG Firm Gearbox GLA-BRA-180 Heart Hiva Ionosphere Lymphography Segment Sonar Texture TOX-171 Urban Vowel Wine Wpbc Average Average rank W/D/L
All 66.23±7.92 86.96±3.09 47.72±5.21 99.96±0.03 99.90±0.04 42.41±17.86 83.22±6.83 95.18±1.59 84.30±5.11 81.24±11.06 96.28±1.18 78.30±9.53 99.05±0.18 34.39±8.79 34.64±5.16 79.78±3.88 95.17±4.46 71.13±13.34 76.44 13/1/4
GFAR 67.84±7.81 87.03±3.22 54.55±3.11 99.96±0.03 99.86±0.07 50.74±9.28 84.44±6.96 96.35±1.37 84.76±5.36 75.08±10.92 96.30±1.18 73.81±10.33 99.08±0.14 55.91±5.50 68.13±9.40 79.23±3.79 96.23±4.60 75.52±10.96 80.27 2.14 -
B-FRFS 67.03±8.62 87.02±3.19 52.84±4.13 99.96±0.03 99.86±0.07 48.33±5.27 84.44±6.80 96.30±1.39 85.59±5.15 77.10±11.38 96.30±1.16 74.31±9.64 99.08±0.14 54.39±8.13 60.45±11.42 79.15±4.03 96.12±4.36 74.12±12.11 79.58 3.11 10/5/3
SPS 66.04±9.01 86.98±3.17 53.95±4.27 83.47±0.34 99.87±0.06 45.19±5.10 84.19±6.78 96.28±1.32 84.56±5.20 74.88±10.29 96.27±1.19 76.37±8.99 99.08±0.14 40.99±5.91 63.02±7.84 79.29±4.05 95.45±4.36 74.57±11.81 77.80 4.03 14/1/3
FA-FPR 66.94±8.74 86.82±3.35 53.69±4.19 99.96±0.03 99.85±0.08 46.11±13.44 84.11±6.99 96.30±1.39 83.87±6.35 74.59±10.08 96.19±1.17 73.58±9.59 98.97±0.14 51.40±8.85 50.12±11.25 79.17±3.92 96.12±4.37 73.86±12.21 78.43 5.08 17/1/0
FA-FSCE 66.75±8.13 86.97±3.08 53.91±4.36 99.96±0.03 99.90±0.04 41.67±9.05 82.67±7.20 95.99±1.61 84.30±5.11 76.49±11.59 96.27±1.19 73.70±9.79 99.05±0.18 55.32±7.40 64.06±8.69 79.18±4.04 95.28±4.30 71.78±12.87 79.07 4.56 15/1/2
FMI-mRMR 67.03±7.35 86.96±3.09 47.88±5.70 99.96±0.03 99.90±0.04 83.22±6.83 94.69±1.87 84.30±5.11 81.24±11.06 96.28±1.20 78.30±9.53 99.05±0.18 35.42±8.02 79.79±3.88 95.17±4.46 72.33±12.93 76.90 4.61 13/1/4
nnFRFS 66.57±7.71 86.97±3.20 50.78±5.28 99.96±0.03 99.89±0.05 42.41±7.03 83.22±6.83 95.91±1.43 84.02±5.43 74.35±12.07 96.28±1.18 76.89±9.49 99.08±0.14 42.22±5.27 63.39±6.28 79.79±3.87 95.28±4.37 72.57±13.42 78.31 4.47 13/2/3
[22] Q. Hu, D. Yu, Z. Xie, Information-preserving hybrid data reduction based on fuzzy-rough techniques, Pattern Recognit. Lett. 27 (5) (2006) 414–423.
[23] Q. Hu, L. Zhang, Y. Zhou, W. Pedrycz, Large-scale multimodality attribute reduction with multi-kernel fuzzy rough sets, IEEE Trans. Fuzzy Syst. 26 (1) (2018) 226–238.
[24] R. Jensen, Q. Shen, Fuzzy-rough sets for descriptive dimensionality reduction, in: Proc. 2002 IEEE Int. Conf. Fuzzy Syst., 2002, pp. 29–34.
[25] R. Jensen, Q. Shen, Fuzzy-rough sets assisted attribute selection, IEEE Trans. Fuzzy Syst. 15 (1) (2007) 73–89.
[26] R. Jensen, Q. Shen, New approaches to fuzzy-rough feature selection, IEEE Trans. Fuzzy Syst. 17 (4) (2009) 824–838.
[27] R. Jensen, N. Mac Parthaláin, Towards scalable fuzzy-rough feature selection, Inf. Sci. 323 (2015) 1–15.
[28] P. Kulaga, P. Sapiecha, S. Krzysztof, Approximation algorithm for the argument reduction problem, in: Computer Recognition Systems, Springer, Berlin Heidelberg, 2005, pp. 243–248.
[29] G.H. Lan, G.W. DePuy, G.E. Whitehouse, An effective and simple heuristic for the set covering problem, Eur. J. Oper. Res. 176 (3) (2007) 1387–1403.
[30] G. Lang, D. Miao, M. Cai, Z. Zhang, Incremental approaches for updating reducts in dynamic covering information systems, Knowl.-Based Syst. 134 (2017) 85–104.
[31] G. Lang, D. Miao, M. Cai, Three-way decision approaches to conflict analysis using decision-theoretic rough set theory, Inf. Sci. 406–407 (2017) 185–207.
[32] J.Y. Liang, F. Wang, C.Y. Dang, Y.H. Qian, A group incremental approach to feature selection applying rough set technique, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 294–308.
[33] Y. Lin, Q. Hu, J. Liu, J. Li, X. Wu, Streaming feature selection for multilabel learning based on fuzzy mutual information, IEEE Trans. Fuzzy Syst. 25 (6) (2017) 1491–1507.
[34] Y. Lin, Y. Li, C. Wang, J. Chen, Attribute reduction for multi-label learning with fuzzy rough set, Knowl.-Based Syst. 152 (2018) 51–61.
[35] J.S. Mi, W.X. Zhang, An axiomatic characterization of a fuzzy generalization of rough sets, Inf. Sci. 160 (1) (2004) 235–249.
[36] J.S. Mi, Y. Leung, H.Y. Zhao, T. Feng, Generalized fuzzy rough sets determined by a triangular norm, Inf. Sci. 178 (16) (2008) 3203–3213.
[37] M. Moshkov, M. Piliszczuk, Graphical representation of information on the set of reducts, in: Rough Sets and Knowledge Technology, Lect. Notes Comput. Sci. 4481 (2007) 372–378.
[38] B. Nemenyi, Distribution-free multiple comparison, PhD thesis, Princeton University, 1963.
[39] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.
[40] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (5) (1982) 341–356.
[41] U.N. Peled, B. Simeone, An O(nm)-time algorithm for computing the dual of a regular Boolean function, Discrete Appl. Math. 49 (1) (1994) 309–323.
[42] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods, 1999, pp. 185–208.
[43] Y. Qian, J. Liang, W. Pedrycz, C. Dang, Positive approximation: an accelerator for attribute reduction in rough set theory, Artif. Intell. 174 (2010) 597–618.
[44] Y. Qian, Q. Wang, H. Cheng, J. Liang, C. Dang, Fuzzy-rough feature selection accelerator, Fuzzy Sets Syst. 258 (2015) 61–78.
[45] Y. Qian, Y. Li, J. Liang, G. Lin, C. Dang, Fuzzy granular structure distance, IEEE Trans. Fuzzy Syst. 23 (6) (2015) 2245–2259.
[46] M. Sarkar, B. Yegnanarayana, Fuzzy-rough neural networks for vowel classification, in: Proc. 1998 IEEE Int. Conf. Syst. Man Cybern., 1998, pp. 4160–4165.
Table 7: Accuracies of KNN (K=10) with the seven algorithms (%)
Data sets Bands Biodeg EEG Firm Gearbox GLA-BRA-180 Heart Hiva Ionosphere Lymphography Segment Sonar Texture TOX-171 Urban Vowel Wine Wpbc Average Average rank W/D/L
All 59.11±9.26 80.69±3.47 94.73±0.30 98.27±0.25 99.88±0.08 68.70±5.12 65.63±7.88 96.35±1.37 84.06±5.31 75.56±11.64 92.58±1.65 70.27±9.68 97.76±0.32 56.49±4.21 41.32±3.93 62.93±5.40 70.05±10.56 75.41±10.33 77.21 10/1/7
GFAR 59.86±9.02 80.23±3.81 87.60±4.57 97.20±0.64 99.86±0.07 63.52±5.17 65.78±7.83 96.35±1.37 84.12±5.41 75.63±12.25 92.61±1.62 73.25±9.90 97.92±0.33 55.67±6.28 44.75±6.06 67.91±9.01 70.16±10.44 75.15±9.94 77.09 2.53 -
B-FRFS 59.80±9.11 80.25±3.82 87.60±4.57 97.40±0.71 99.88±0.06 60.56±3.33 65.56±7.91 96.12±1.35 83.69±5.63 73.70±12.85 92.45±1.67 74.20±10.07 97.92±0.33 52.40±8.67 40.93±4.49 63.37±6.51 70.16±10.43 74.89±10.39 76.16 3.86 11/4/3
SPS 59.47±9.10 80.16±3.83 87.60±4.57 83.95±0.55 99.86±0.09 54.07±4.65 65.63±7.92 96.35±1.37 84.01±5.47 71.52±12.27 92.56±1.66 73.98±9.14 97.92±0.33 45.09±5.70 44.26±3.28 63.10±5.61 69.83±10.59 74.63±9.93 74.67 4.83 13/4/1
FA-FPR 59.77±9.19 79.78±3.66 83.13±8.68 97.40±0.71 99.87±0.09 57.04±7.35 65.37±7.97 96.12±1.35 83.30±5.57 74.74±13.40 92.55±1.65 72.24±9.25 98.02±0.34 48.25±9.07 40.43±4.61 78.34±4.81 70.05±10.36 74.96±10.17 76.19 4.58 14/0/4
FA-FSCE 59.22±9.16 80.68±3.50 88.22±4.92 97.11±0.70 99.88±0.08 62.22±3.82 65.56±7.73 96.35±1.37 84.06±5.31 72.97±12.56 92.53±1.62 72.71±10.23 97.76±0.32 54.56±8.67 40.10±4.90 63.13±5.82 69.93±10.72 75.20±10.25 76.23 4.17 13/1/4
FMI-mRMR 59.11±9.26 80.69±3.47 94.73±0.30 98.27±0.25 99.88±0.08 65.63±7.88 96.35±1.37 84.06±5.31 75.56±11.64 92.58±1.65 70.27±9.68 97.76±0.32 40.34±4.03 62.93±5.40 70.05±10.56 75.41±10.33 75.67 3.78 12/1/5
nnFRFS 59.11±9.26 80.21±3.75 92.70±1.44 97.78±0.51 99.85±0.08 53.33±5.27 65.63±7.92 96.35±1.37 84.06±5.28 71.32±12.61 92.58±1.65 72.29±9.18 97.92±0.33 45.09±5.66 46.11±3.36 62.93±5.40 69.99±10.64 75.40±10.41 75.70 4.25 12/2/4
[47] T.K. Sheeja, A.S. Kuriakose, A novel feature selection method using fuzzy rough sets, Comput. Ind. 97 (2018) 111–121.
[48] A. Skowron, C. Rauszer, The discernibility matrices and functions in information systems, in: Intelligent Decision Support, 1992, pp. 331–362.
[49] B. Sun, W. Ma, Y. Qian, Multigranulation fuzzy rough set over two universes and its application to decision making, Knowl.-Based Syst. 123 (2017) 61–74.
[50] K. Thangavel, A. Pethalakshmi, Dimensionality reduction based on rough set theory: a review, Appl. Soft Comput. 9 (2009) 1–12.
[51] E.C.C. Tsang, D.G. Chen, D.S. Yeung, J.W.T. Lee, X.Z. Wang, Attribute reduction using fuzzy rough sets, IEEE Trans. Fuzzy Syst. 16 (5) (2008) 1130–1141.
[52] Y.F. Wang, Mining stock price using fuzzy rough set system, Expert Syst. Appl. 24 (1) (2003) 13–23.
[53] C. Wang, Y. Qi, M. Shao, Q. Hu, D. Chen, Y. Qian, Y. Lin, A fitting model for feature selection with fuzzy rough sets, IEEE Trans. Fuzzy Syst. 25 (4) (2017) 741–753.
[54] C. Wang, Y. Huang, M. Shao, D. Chen, Uncertainty measures for general fuzzy relations, Fuzzy Sets Syst. 360 (2019) 82–96.
[55] C. Wang, Y. Shi, X. Fan, M. Shao, Attribute reduction based on k-nearest neighborhood rough sets, Int. J. Approx. Reason. 106 (2019) 18–31.
[56] C. Wang, Y. Huang, M. Shao, X. Fan, Fuzzy rough set-based attribute reduction using distance measures, Knowl.-Based Syst. 164 (2019) 205–212.
[57] G.I. Webb, Multiboosting: a technique for combining boosting and wagging, Mach. Learn. 40 (2) (2000) 159–196.
[58] W.Z. Wu, W.X. Zhang, Constructive and axiomatic approaches of fuzzy approximation operators, Inf. Sci. 159 (3–4) (2004) 233–254.
[59] W.Z. Wu, Y. Leung, J.S. Mi, On characterizations of (I,T)-fuzzy rough approximation operators, Fuzzy Sets Syst. 154 (1) (2005) 76–102.
[60] S.K.M. Wong, W. Ziarko, On optimal decision rules in decision tables, Bulletin of the Polish Academy of Sciences 33 (1985) 693–696.
[61] Y. Yang, D. Chen, H. Wang, E.C. Tsang, D. Zhang, Fuzzy rough set based incremental attribute reduction from dynamic data with sample arriving, Fuzzy Sets Syst. 312 (2017) 66–86.
[62] Y. Yang, D. Chen, H. Wang, X. Wang, Incremental perspective for feature selection based on fuzzy rough sets, IEEE Trans. Fuzzy Syst. 26 (3) (2018) 1257–1273.
[63] Y.Y. Yao, A comparative study of fuzzy sets and rough sets, Inf. Sci. 109 (1998) 21–47.
[64] D.S. Yeung, D. Chen, E.C.C. Tsang, J.W.T. Lee, X. Wang, On the generalization of fuzzy rough sets, IEEE Trans. Fuzzy Syst. 13 (3) (2005) 343–361.
[65] D. Yu, S. An, Q. Hu, Fuzzy mutual information based min-redundancy and max-relevance heterogeneous feature selection, Int. J. Comput. Intell. Syst. 4 (4) (2011) 619–633.
[66] L.A. Zadeh, Fuzzy sets, Inform. Control 8 (1965) 338–353.
[67] A.P. Zeng, T.R. Li, D. Liu, J.B. Zhang, H. Chen, A fuzzy rough set approach for incremental feature selection on hybrid information systems, Fuzzy Sets Syst. 258 (2015) 39–60.
[68] J. Zhan, M.I. Ali, N. Mehmood, On a novel uncertain soft set model: Z-soft fuzzy rough set model and corresponding decision making methods, Appl. Soft Comput. 56 (2017) 446–457.
[69] X. Zhang, C. Mei, D. Chen, J. Li, Feature selection in mixed data: a method using a novel fuzzy rough set-based information entropy, Pattern Recognit. 56 (2016) 1–15.
[70] X. Zhang, C. Mei, D. Chen, Y. Yang, A fuzzy rough set-based feature selection method using representative instances, Knowl.-Based Syst. 151 (2018) 216–229.
[71] S. Zhao, E.C.C. Tsang, D. Chen, The model of fuzzy variable precision rough sets, IEEE Trans. Fuzzy Syst. 17 (2) (2009) 451–467.
(a) NaiveBayes   (b) CART   (c) SVM   (d) KNN
Figure 7: Accuracy comparisons of all the algorithms against each other with the Nemenyi test.
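Figure 7 is obtained with the Friedman/Nemenyi procedure [12, 15, 38]: each algorithm is ranked on every data set (rank 1 for the best accuracy, ties averaged), the ranks are averaged over the data sets to give the "Average rank" values reported in the tables, and two algorithms are regarded as significantly different when their average ranks differ by more than the critical difference CD = q_alpha * sqrt(k(k+1)/(6N)), where k is the number of compared algorithms and N the number of data sets. The sketch below is only an illustration of this computation: it uses the Bands, Biodeg and EEG rows of Table 6 (in the paper the ranks are averaged over all 18 data sets), the constant q_alpha is the standard tabulated value for k = 7 at the 0.05 level, and the W/D/L count merely shows one way such a column can be produced, not the paper's exact pairing.

import numpy as np
from scipy.stats import rankdata

# Accuracy matrix: rows = data sets (Bands, Biodeg, EEG from Table 6),
# columns = GFAR, B-FRFS, SPS, FA-FPR, FA-FSCE, FMI-mRMR, nnFRFS.
acc = np.array([
    [67.84, 67.03, 66.04, 66.94, 66.75, 67.03, 66.57],
    [87.03, 87.02, 86.98, 86.82, 86.97, 86.96, 86.97],
    [54.55, 52.84, 53.95, 53.69, 53.91, 47.88, 50.78],
])
n_datasets, k = acc.shape

# Rank the algorithms on every data set: rank 1 = highest accuracy,
# ties receive the average of the ranks they span.
ranks = np.vstack([rankdata(-row) for row in acc])
avg_rank = ranks.mean(axis=0)            # cf. the "Average rank" values

# Nemenyi critical difference: average ranks differing by more than CD
# indicate a significant difference at the chosen level.
q_alpha = 2.949                          # tabulated constant for k = 7, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

# One way a W/D/L column can be read: wins/draws/losses of the first
# algorithm against the second, counted over the data sets (assumption).
wins = int((acc[:, 0] > acc[:, 1]).sum())
draws = int((acc[:, 0] == acc[:, 1]).sum())
losses = int((acc[:, 0] < acc[:, 1]).sum())

print("average ranks:", np.round(avg_rank, 2))
print("critical difference:", round(cd, 3))
print(f"W/D/L: {wins}/{draws}/{losses}")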
Figure 8: Scalability test in the runtime. Panels (a) Biodeg, (b) Segment and (c) Texture plot the computational time (s) against the proportion of samples (20%–100%); panels (d) GLA-BRA-180, (e) Hiva and (f) TOX-171 plot the computational time (s) against the proportion of features (20%–100%).
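Figure 8 reports how the running time grows as an increasing portion of a data set is fed to the algorithms, on the sample side for Biodeg, Segment and Texture and on the feature side for GLA-BRA-180, Hiva and TOX-171. The following is a minimal sketch of such a scalability measurement, under the assumption that select_features(X, y) stands for any one of the compared reduction algorithms; it is a placeholder, not code from the paper.

import time
import numpy as np

def scalability_curve(X, y, select_features, axis="samples",
                      fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Time a feature-selection routine on growing subsets of (X, y).

    `axis` chooses whether samples or features are subsampled;
    `select_features(X, y)` is an assumed interface for the algorithm."""
    rng = np.random.default_rng(seed)
    times = []
    for frac in fractions:
        if axis == "samples":
            n = max(1, int(frac * X.shape[0]))
            idx = rng.choice(X.shape[0], size=n, replace=False)
            Xs, ys = X[idx], y[idx]
        else:  # growing proportion of features
            n = max(1, int(frac * X.shape[1]))
            idx = rng.choice(X.shape[1], size=n, replace=False)
            Xs, ys = X[:, idx], y
        start = time.perf_counter()
        select_features(Xs, ys)                      # run the algorithm once
        times.append(time.perf_counter() - start)
    return list(fractions), times

Plotting the returned times against the fractions for each algorithm yields runtime curves of the kind shown in panels (a)–(f).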