Feature selection using rough set-based direct dependency calculation by avoiding the positive region



Muhammad Summair Raza, Usman Qamar


Department of Computer Engineering, College of Electrical and Mechanical Engineering (E&ME), National University of Sciences and Technology (NUST), Pakistan


Article info

Article history:
Received 7 January 2017
Received in revised form 8 August 2017
Accepted 13 October 2017
Available online xxxx

Keywords:
Positive region
Rough set theory
Dependency rules
Feature selection
Reducts


Abstract

Feature selection is the process of selecting a subset of features from the entire dataset such that the selected subset can be used on behalf of the entire dataset to reduce further processing. Many approaches have been proposed for feature selection, and recently rough set-based feature selection approaches have become dominant. The majority of such approaches use attribute dependency as the criterion to determine feature subsets. However, this measure uses the positive region to calculate dependency, which is a computationally expensive job, consequently affecting the performance of feature selection algorithms that use this measure. In this paper, we propose a new heuristic-based dependency calculation method. The proposed method comprises a set of two rules, called Direct Dependency Calculation (DDC), to calculate attribute dependency. Direct dependency calculates the number of unique/non-unique classes directly from the attribute values: unique classes are accurate predictors of the decision class, while non-unique classes are not. Calculating unique/non-unique classes in this manner lets us avoid the time-consuming calculation of the positive region, which helps increase the performance of the algorithms that use it. A two-dimensional grid was used as an intermediate data structure to calculate dependency. We have used the proposed method with a number of feature selection algorithms on various publicly available datasets to justify it, and a comparison framework was used for the analysis. Experimental results have shown the efficiency and effectiveness of the proposed method: execution time was reduced by 63% for calculation of the dependency using DDCs, a 65% decrease was observed in the case of feature selection algorithms based on DDCs, and the required runtime memory was decreased by 95%. © 2017 Elsevier Inc. All rights reserved.


1. Introduction


Feature selection is the process of selecting a subset of features from the entire dataset such that the selected subset can be presented on behalf of the entire dataset. Selecting feature subsets thus lets us reduce datasets to a manageable size by eliminating unnecessary and redundant information, providing a subset of features that contains most of the useful information. The reduced set consequently lowers the execution time of further tasks, thus enhancing performance, and helps to reduce the number of noisy and irrelevant features.




In the past two decades, the dimensionality of the datasets involved in machine learning and data mining applications has increased explosively. There are two main approaches to deal with such an increase in data dimensionality: transform-based reduction, also called attribute reduction, and selection-based reduction, also called feature selection. Transform-based reduction, as the name implies, transforms the underlying semantics of the data. Selection-based reduction, i.e., feature selection, selects features to represent the data instead of transforming them, so the underlying semantics are preserved. Feature subset selection in these domains has helped to reduce the dimensionality of the feature space, improve the predictive accuracy of classification algorithms, and improve the visualization and comprehensibility of the induced concepts. Feature selection implies not only dimensionality reduction, i.e., reduction of the number of attributes that should be considered when building a model, but also the choice of attributes, i.e., attributes can be selected or discarded based on criteria specifying their usefulness.

Data in the real world may contain far more information than necessary; e.g., a database table may contain many attributes of a customer, of which only a few are necessary to perform a certain type of analysis. Therefore, feature selection has become a necessary step to make the analysis more manageable and to extract useful knowledge regarding a given domain [1]. It is very important in the analysis of high-dimensional data [2], where it serves several purposes, such as reducing the dimensionality of a dataset, decreasing the computational time required for classification, and enhancing the classification accuracy of a classifier by removing redundant and misleading or erroneous features [3]. Various feature selection techniques have been proposed in the literature. These include correlation-based feature selection [4,5], mutual information-based feature selection [6,7], heterogeneous feature selection [8], consistency-based feature selection [9], the graph theoretic approach [10], ACO-based feature selection [11], feature selection in possibilistic modelling [12], and SVM-based feature selection [13].

Rough Set Theory (RST), proposed by Pawlak [14,15], is a mathematical tool for data analysis. RST-based approaches for attribute reduction [16–19] and feature selection [20–24] have become dominant. For feature selection, RST provides a positive region-based measure called "attribute dependency". Attribute dependency determines how uniquely the value of an attribute determines the value of a dependent attribute. Its value ranges from zero (0) to one (1), where zero means that an attribute does not depend on the other at all and one means that it fully depends on the other. However, this approach uses the positive region to calculate dependency, which is a time-consuming and complex step; it adversely affects the performance of feature selection algorithms that use this measure and makes them almost impossible to use once datasets grow beyond smaller sizes. Rough set-based dependency requires three steps: calculation of the equivalence classes using the decision attribute (decision classes), calculation of the equivalence classes using the conditional attributes, and finally calculation of the positive region. Performing these tasks is computationally expensive and becomes inappropriate for datasets having large numbers of attributes or large numbers of examples.
To overcome this issue, we require an alternate method to calculate dependency for which these computationally expensive steps are unnecessary. However, the accuracy of such an alternate approach should be exactly the same as that of the original positive region-based dependency measure so that it can be applied to any feature selection algorithm without affecting its accuracy. This research proposes a new method called Direct Dependency Calculation (DDC), which directly calculates the dependency measure without performing the time-consuming positive region calculation. It directly scans the number of unique/non-unique classes in a dataset using attribute values and calculates dependency from these counts. Calculating dependency in this manner lets us avoid the positive region, which makes DDC-based feature selection algorithms suitable for average and larger datasets. The proposed approach is an alternative to the conventional positive region-based dependency measure and can be safely used in any feature selection algorithm that uses a rough set-based dependency measure.

The rest of the paper is organized as follows. Section 2 discusses preliminaries of rough set theory, and section 3 describes various related works. In section 4, DDC is discussed in detail. In section 5, various feature selection algorithms using DDC are presented. The remaining sections present the experimental analysis.

2. Rough set theory preliminaries

Rough Set Theory (RST) was proposed by Pawlak [14,15]. Since its inception, it has been used in various domains for data analysis, including economics and finance [25], medical diagnosis [26], medical imaging [27], banking [28], and data mining [29].

2.1. Information system


An information system is the basic structure for representing the underlying information in RST. It comprises objects and their attributes. Formally, an information system I = (U, A), where U is a non-empty finite set of objects representing the universe and A is a non-empty finite set of attributes, also called features. Every attribute a ∈ A is a function a: U → V_a, where V_a is called the value set of attribute "a". Table 1 is an Information System (IS) where A = {Age, Income} and U = {X1, X2, X3, X4, X5, X6, X7}.


2.2. Decision system


A decision system is an information system that additionally has decision attribute(s). A decision attribute, also called a "class", is a feature whose value depends on the other attributes, called conditional attributes.


Table 1
Information system.

Customer   Age     Income
X1         35–40   30000–40000
X2         35–40   30000–40000
X3         45–50   50000–60000
X4         25–35   20000–30000
X5         45–50   50000–60000
X6         25–35   20000–30000
X7         25–35   20000–30000

Table 2
Decision system.

Customer   Age     Income        Policy
X1         35–40   30000–40000   Platinum
X2         35–40   30000–40000   Platinum
X3         40–45   50000–60000   Gold
X4         25–35   20000–30000   Silver
X5         40–45   50000–60000   Gold
X6         25–35   20000–30000   Silver
X7         25–35   20000–30000   Gold

Formally, a decision system α = (U, C ∪ D), where "C" represents the conditional attribute(s) or features and "D" represents the decision attribute(s). Table 2 shows a sample decision system with {Age, Income} as conditional attributes and {Policy} as the decision attribute.


2.3. Indiscernibility


26

Indiscernibility identifies the objects that, with regard to certain attributes, cannot be distinguished from each other. It is simply an equivalence relation between objects. For a decision system α = (U, C ∪ D), there may exist an indiscernibility relation IND_α(C):

IND_α(C) = {(x_i, x_j) ∈ U² | ∀c ∈ C, c(x_i) = c(x_j)}    (1)

IND_α(C) is called a "C-indiscernibility" relation: two objects x_i and x_j are indiscernible with respect to C if (x_i, x_j) ∈ IND_α(C). Considering Table 2, objects {X1, X2} are indiscernible with respect to the attribute "Age". Similarly, objects X3 and X5 are indiscernible with respect to the attribute "Income". Figs. 1, 2, 3 and 4 (taken from [30]) show four types of indiscernibility. Note that the dotted line represents the part of the universe covered by an indiscernibility relation, whereas the solid lines represent the universe or a part of it. Fig. 1 shows an indiscernibility relation not related to a particular concept or decision attribute; the corresponding reducts are minimal subsets that discern all of the cases from each other to the same extent as the full set of conditional attributes. Fig. 2 shows an indiscernibility relation relative to a decision attribute but not to a particular case; these reducts are minimal sets of attributes that provide the same classification as that obtained by the entire set of conditional attributes. Fig. 3 shows an indiscernibility relation relative to a case or object X but not to a decision attribute; reducts of this type are minimal subsets of attributes that can discern object X from the other objects to the same extent as the full set of conditional attributes. Fig. 4 shows an indiscernibility relation relative to both a case and a decision attribute; these reducts let us determine the outcome of a case as determined by the full set of conditional attributes.
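The partition induced by IND_α(C) can be computed by simply grouping objects on their C-values. The following is a minimal Python sketch of this idea (our own illustration, not part of the paper; the helper name `partition` is ours and the data is Table 2):

```python
from collections import defaultdict

# Table 2: (Age, Income, Policy) for objects X1..X7
table2 = [
    ("35-40", "30000-40000", "Platinum"),
    ("35-40", "30000-40000", "Platinum"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("25-35", "20000-30000", "Gold"),
]

def partition(rows, attr_idx):
    """Equivalence classes of IND(C): objects grouped by their values on the attributes C."""
    classes = defaultdict(list)
    for obj, row in enumerate(rows, start=1):
        key = tuple(row[i] for i in attr_idx)
        classes[key].append(f"X{obj}")
    return list(classes.values())

print(partition(table2, [0]))      # w.r.t. Age: [['X1', 'X2'], ['X3', 'X5'], ['X4', 'X6', 'X7']]
print(partition(table2, [0, 1]))   # w.r.t. {Age, Income}: same classes for this table
```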


2.4. Approximations


A rough set cannot be defined in a crisp manner; it is described by a lower and an upper approximation. For B ⊆ C and X ⊆ U, the B-lower approximation (denoted by B̲X) and the B-upper approximation (denoted by B̄X) of X are:

B̲X = {x | [x]_B ⊆ X}    (2)

B̄X = {x | [x]_B ∩ X ≠ ∅}    (3)

The B-lower approximation B̲X, also known as the positive region, contains the objects that can be concluded with complete certainty to belong to concept X with respect to the information in the attribute set B; it is the union of the equivalence classes [x]_B that are contained in X. The B-upper approximation B̄X contains the objects that can possibly be concluded to belong to concept X with respect to the information in B. The set BN_B(X) = B̄X − B̲X is called the boundary region and consists of those objects that we cannot decisively classify into X on the basis of knowledge in B. Fig. 5 shows the approximation of the set of objects belonging to the "Gold" policy from Table 2.
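As a concrete illustration, the following Python sketch (ours, not from the paper) computes the lower and upper approximations of the "Gold" concept of Table 2 with respect to B = {Age, Income}:

```python
from collections import defaultdict

# Table 2: (Age, Income, Policy) for objects X1..X7
table2 = [
    ("35-40", "30000-40000", "Platinum"),
    ("35-40", "30000-40000", "Platinum"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("25-35", "20000-30000", "Gold"),
]

def approximations(rows, b_idx, dec_idx, concept):
    """Return (lower, upper) approximation of the set of objects whose decision equals `concept`."""
    blocks = defaultdict(set)                          # equivalence classes [x]_B
    for obj, row in enumerate(rows, start=1):
        blocks[tuple(row[i] for i in b_idx)].add(obj)
    target = {obj for obj, row in enumerate(rows, start=1) if row[dec_idx] == concept}
    lower = set().union(*(blk for blk in blocks.values() if blk <= target))
    upper = set().union(*(blk for blk in blocks.values() if blk & target))
    return lower, upper

lower, upper = approximations(table2, [0, 1], 2, "Gold")
print(lower)           # {3, 5}            -> certainly Gold
print(upper)           # {3, 4, 5, 6, 7}   -> possibly Gold
print(upper - lower)   # {4, 6, 7}         -> boundary region
```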


Fig. 1. Indiscernibility relation not related to a particular concept and decision attribute.


Fig. 2. Indiscernibility relation related to a decision attribute but not relative to a particular case.


Fig. 3. Indiscernibility relation related to a case or object X but not relative to a decision attribute.


Fig. 4. Indiscernibility relation related to both a case and a decision attribute.


Fig. 5. Approximation of the set of objects belonging to the “Gold” policy.


2.5. Dependency


Dependency is the most commonly used measure to perform data analysis. Dependency defines how uniquely the value of an attribute determines the value of another attribute. Thus, the degree “K” by which the attribute “C” determines the value of attribute “D” is calculated by


k = γ(C, D) = |POS_C(D)| / |U|    (4)

where

POS_C(D) = ⋃_{X ∈ U/D} C̲(X)    (5)

POS_C(D) is called the positive region of the partition U/D with respect to C; it is the set of all elements of U that can be uniquely classified into blocks of the partition U/D by means of C. K is the degree of dependency: K = 1 means that D fully depends on C, K = 0 means that D does not depend on C, and 0 < K < 1 implies that D partially depends on C. Calculating dependency using this conventional method requires calculation of the positive region to find the value of K. We demonstrate the procedure with the help of an example: we calculate γ({Age, Income}, Policy), i.e., the dependency of the decision class "Policy" on the attribute set {Age, Income} given in Table 2, by using the positive region. Three steps are required to calculate this conventional dependency measure; each step is presented in detail below.

In the first step, equivalence classes are identified (i.e., we construct the equivalence class structure) using the decision attribute ("Policy" in our case). An equivalence class specifies the objects that cannot be distinguished with respect to a given attribute. We will have three decision classes in our case:


Q1 = {X1, X2}
Q2 = {X3, X5, X7}
Q3 = {X4, X6}

The equivalence class Q1 shows that, with respect to the attribute "Policy", we cannot distinguish objects X1 and X2 because both have the same value. In Step-2, we construct the equivalence class structure using the set of conditional attributes, i.e., {Age, Income}:

P1 = {X1, X2}
P2 = {X3, X5}
P3 = {X4, X6, X7}

Finally, in the third step, the positive region is calculated by finding the equivalence classes identified in Step-2 that are subsets of those identified in Step-1. Thus, we check which of P1, P2 and P3 are subsets of Q1, Q2 or Q3. In our case:


P1 ⊆ Q1
P2 ⊆ Q2

P3 is not a subset of any of Q1, Q2 and Q3. Thus, as per equation (4), the dependency will be:

k = γ(C, D) = |POS_C(D)| / |U| = (|P1| + |P2|) / |U|    (6)

k = γ({Age, Income}, Policy) = 4/7    (7)

2.6. Reducts


Reducts are subsets of conditional attributes that contain the same information as the entire set of conditional attributes. An ideal reduct provides the same classification accuracy as the entire dataset and can therefore be used in its place for any further task. Formally, an attribute subset will be a reduct if:





γ(C′, D) = γ(C, D),  for C′ ⊆ C    (8)

i.e., for decision class "D", an attribute subset C′ ⊆ C will be a reduct if the dependency of "D" on "C′" is equal to the dependency of "D" on "C", i.e., on the entire set of conditional attributes.
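A small Python sketch of this reduct test (ours; it restates the dependency of equation (4) in compact form) applied to Table 2:

```python
from collections import defaultdict

table2 = [
    ("35-40", "30000-40000", "Platinum"),
    ("35-40", "30000-40000", "Platinum"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("25-35", "20000-30000", "Gold"),
]

def gamma(rows, cond_idx, dec_idx=2):
    """Positive region-based dependency gamma(C, D) of equation (4)."""
    cond, dec = defaultdict(set), defaultdict(set)
    for o, row in enumerate(rows):
        cond[tuple(row[i] for i in cond_idx)].add(o)
        dec[row[dec_idx]].add(o)
    pos = sum(len(p) for p in cond.values() if any(p <= q for q in dec.values()))
    return pos / len(rows)

full = gamma(table2, [0, 1])                 # dependency on all conditional attributes
for subset in ([0], [1], [0, 1]):            # {Age}, {Income}, {Age, Income}
    print(subset, gamma(table2, subset) == full)
# For this toy table, even a single attribute already preserves the full dependency of 4/7.
```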


3. Related work


Many approaches have been proposed in the literature that use rough sets to perform feature selection while avoiding the conventional positive region-based dependency measure. Here, we present a few representative approaches.

In [31], the authors present an incremental attribute reduction approach for feature selection using information granularity in decision systems with variation of attributes. They first introduce an incremental granularity mechanism, on the basis of which two new feature selection algorithms using knowledge granularity are presented. The first is a matrix-based incremental reduction algorithm with variation of attributes in a decision system. When a new attribute set P is added to a decision system, the algorithm computes the new knowledge granularity, selects the attribute with the highest outer significance and adds it to the reduct set; finally, the algorithm deletes the redundant attribute set. The second proposed algorithm is incremental in the sense that it first calculates the initial knowledge granularity and, after the addition of new attributes, calculates the granularity again. If the new granularity is the same as the previous one, the reduct set remains unchanged; otherwise, the previous reduct is updated. The algorithms use knowledge granularity instead of the conventional positive region-based dependency measure, but the effectiveness of this measure still needs to be compared with the conventional positive region-based approach.

In [32], the authors presented two general reduction algorithms using relative discernibility in inconsistent decision tables, whose efficiency was further enhanced in two quick general reduction algorithms. The two general reduction algorithms are based on forward and backward search strategies. Instead of using the positive region-based approach, the algorithms use the proposed relative discernibility measure. To improve the efficiency of the algorithms, two acceleration approaches called quick general reduction algorithms were presented; they reduce the radix sort time and thus increase the efficiency of the reduction process. Overall, the research presents a general reduction mechanism that can be used with five different reduct selection measures, including positive region-based dependency. However, no results were given to compare the significance of these measures with respect to each other.

In [33], the authors presented a new rough set-based feature selection algorithm via the hybridization of exhaustive and heuristic search strategies to take advantage of both approaches. Exhaustive search provides an optimal solution, but it is computationally too expensive; heuristic search increases the performance, but the selected set of reducts is not optimal. In the proposed hybridized approach, a heuristic search is used as a pre-processor to find the initial set of reducts, which is then optimized using exhaustive search; the exhaustive search, in this case, removes redundant features. Evaluation of attributes is based on relative dependency instead of the conventional positive region-based approach. Relative dependency avoids calculation of the positive region, but this measure cannot be used directly for inconsistent datasets: the dataset first needs to be split into consistent datasets before the measure can be used.
In [34], Qian et al. proposed an accelerator called forward approximation, which combines sample reduction and dimensionality reduction at the same time. The proposed method is based on fuzzy rough sets, and the authors used it to enhance three representative heuristic fuzzy-rough feature selection algorithms. They also designed an improved algorithm based on the proposed accelerator, which operates in three steps: first, it initializes the variables, including the reduct set; second, starting with the empty reduct set, it continues to add attributes based on the significance measure used; finally, it outputs the reduct set. The process is called the forward reduction algorithm, and it has time complexity O(|C||U|²). The proposed forward approximation combines sample reduction and dimensionality reduction, but the datasets used were very small, and the approach still needs to be analysed for large datasets. Furthermore, the reduct set obtained by this algorithm may not be optimal and may contain many superfluous attributes.

In [35], the authors proposed a new feature selection algorithm using rough set theory and attribute clustering for unsupervised data. They used rough set theory-based relative dependency to compute similarities between the attributes. The clustering approach combines distance-based classification with prototype-based clustering, with the intention of grouping similar features without requiring the number of clusters as an input. The proposed algorithm combines the K-means and K-NN clustering algorithms to take advantage of both approaches and overcome their deficiencies; in particular, it can run over samples that arrive continually, because it can update or reconfigure the clusters whenever a new sample arrives, and thus also has the property of dynamic clustering. Again, however, relative dependency can be used for consistent datasets only.

The authors in [36] proposed the new concept of a compactness discernibility information tree (CDI-tree) to find redundant and pointless elements, and then proposed a complete algorithm for attribute reduction based on the CDI-tree. The algorithm intends to delete unimportant attributes; for evaluation purposes, it uses the significance of each attribute, defined as the number of times each condition attribute symbol appears in the CDI-tree and denoted Sig(a) for attribute "a". The overall time complexity of the algorithm is O(|C|² · |U|²). Although the CDI-tree allows non-empty elements to map to one path and numerous non-empty elements to share a compact structure, which is used to store the non-empty elements of the discernibility matrix, obtaining the minimal number of nodes in the CDI-tree still needs to be explored. Furthermore, it still needs to be proved theoretically that the algorithm finds optimal reducts.

Tan et al. [37] proposed a matrix-based method for computing set approximations and reducts of a covering decision information system. Initially, they defined some matrices and matrix-based operations to compute the set approximations.


Then, the minimal and maximal descriptions were employed to construct a new discernibility matrix, and a matrix-based algorithm was designed for finding one suboptimal reduct. In the discernibility matrix, the total number of discernibility sets that need to be computed is dramatically reduced, but it should be noted that the algorithm can only be used for covering decision-based systems.

Zhang et al. [38] used information entropy based on fuzzy rough sets to perform feature selection in mixed datasets. They proposed a conditional entropy-based heuristic algorithm for this purpose based on the λ-conditional entropy. After an attribute is added to the conditional attribute subset, the λ-conditional entropy decreases monotonically, and this decrease is used to measure the significance of an attribute. The overall time complexity of the algorithm is polynomial, namely O(|U|²|A|²). The research presents a new approach for feature selection using information entropy, but the significance of this measure still needs to be justified compared to other counterparts, especially the positive region-based dependency measure.

In [39], Raza et al. proposed a new heuristic-based dependency calculation technique, called Incremental Dependency Classes (IDCs), to help increase the performance of feature selection algorithms. The proposed method comprises four dependency classes that describe how the dependency of decision attribute(s) "D" on conditional attributes "C" changes as new records are read. The method helps to increase the performance of feature selection algorithms while maintaining the accuracy of the dependency value, but four dependency classes must be tracked for each record, which implies a complex implementation. Although IDCs present an effective method for avoiding the calculation of the positive region, each record needs to be tested against each class to determine which class it belongs to, and the classes need to be further generalized.

Table 3 provides a summary of the related work.

4. Direct Dependency Calculation (DDC)

The conventional dependency measure uses the positive region to calculate dependency. It is a three-step process: the first step calculates the equivalence class structure for the decision attribute(s), the second step calculates the equivalence class structure for the conditional attribute(s), and the third and final step calculates the positive region, i.e., the cardinality of the equivalence classes calculated in the second step that are subsets of the equivalence classes calculated in the first step. This three-step process is computationally expensive, which makes it almost inappropriate for use with larger datasets. To overcome this deficiency, we have proposed a new method called Direct Dependency Calculation (DDC) to calculate attribute dependency.

4.1. Unique dependency class

A unique dependency class defines a set of objects, all of which lead to the same decision class for the same values of conditional attributes. Mathematically:

k = γ(C, D) = (total number of unique classes) / |U|    (9)

For example, in Table 2, if we consider the concept Policy = {Gold}, the objects X3 and X5 always lead to the same decision class on the basis of the information in the conditional attributes {Age, Income}. Dependency, in this case, will be the total number of unique decision classes regarding all the concepts in the decision attributes.

4.2. Non-unique dependency class

A non-unique dependency class defines a set of objects that lead to more than one decision class for the same values of conditional attributes. Mathematically:

k = γ(C, D) = 1 − (total number of non-unique classes) / |U|    (10)

For example, in Table 2, if we consider the concept Policy = {Gold}, the objects {X4, X6, X7} belong to more than one decision class for the same values of conditional attributes. The dependency in this case will be one (1) minus the ratio of the total number of non-unique classes for all possible concepts to the size of the universe. With the help of DDC, we can successfully avoid calculation of the positive region and thus provide a significant increase in the performance of feature selection algorithms that use DDC for calculation of attribute dependency. It should be noted that the number of unique classes is directly proportional to the degree of dependency, i.e., the greater the number of unique classes, the higher the dependency value; the opposite holds for non-unique dependency classes. For a decision class D, the dependency K of D on C is shown in Table 4. For a decision system:

number of unique classes + number of non-unique classes = size of the universe    (11)
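To make the two rules concrete on Table 2 (a quick check of equations (9)–(11); the counts are the ones derived in the worked example of section 4.4): objects X1, X2, X3 and X5 fall into unique classes, while X4, X6 and X7 fall into a non-unique class, so

number of unique classes + number of non-unique classes = 4 + 3 = 7 = |U|

k = 4/7 (by equation (9)) = 1 − 3/7 (by equation (10)) ≈ 0.571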



Table 3
Summary of the related work.

Algorithm: An incremental attribute reduction approach based on knowledge granularity under the attribute generalization [31].
Technique used: Information granularity.
Advantages: Performs feature selection for dynamic datasets where attributes are continually added to the dataset.
Disadvantages: The effectiveness of the information granularity measure still needs to be compared with the conventional positive region-based approach.

Algorithm: Quick general reduction algorithms for inconsistent decision tables [32].
Technique used: A general reduction technique based on the unified decision table model.
Advantages: General reduction algorithm for five reducts.
Disadvantages: The five types of reducts still need to be compared to identify the most effective technique.

Algorithm: A hybrid feature selection approach based on heuristic and exhaustive algorithms using rough set theory [33].
Technique used: Hybridization of exhaustive and heuristic search.
Advantages: Provides optimal feature subset selection.
Disadvantages: Based on relative dependency, so it can only be used for consistent datasets. Inconsistent datasets first need to be converted to consistent ones.

Algorithm: Fuzzy-rough feature selection accelerator [34].
Technique used: Fuzzy rough set-based forward approximation.
Advantages: Forward approximation combines both sample reduction and dimensionality reduction.
Disadvantages: Very small datasets were used; the approach still needs to be analysed for larger datasets.

Algorithm: Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery [35].
Technique used: RST-based relative dependency.
Advantages: Combines classification based on distance with clustering based on prototype to group similar features, without requiring the number of clusters as an input.
Disadvantages: Relative dependency requires the dataset to be consistent. Inconsistent datasets first need to be split, a step that affects performance.

Algorithm: Minimal attribute reduction with rough set based on compactness discernibility information tree [36].
Technique used: Compactness discernibility information tree.
Advantages: The CDI-tree has the ability to map non-empty elements into one path and allows numerous non-empty elements to share the same prefix.
Disadvantages: Getting the minimal number of nodes in the CDI-tree still needs to be explored. Furthermore, it theoretically still needs to be proved that the algorithm finds minimal reducts.

Algorithm: Matrix-based set approximations and reductions in covering decision information systems [37].
Technique used: Matrix-based method for reduction of covering decision systems.
Advantages: The total number of discernibility sets that need to be computed is dramatically reduced.
Disadvantages: Targets only covering-based decision systems.

Algorithm: Feature selection in mixed data: a method using a novel fuzzy rough set-based information entropy [38].
Technique used: Fuzzy rough set-based information entropy.
Advantages: A new approach for feature selection in mixed data.
Disadvantages: The significance of fuzzy rough set-based information entropy still needs to be justified compared with other counterparts.

Algorithm: Rough set-based incremental dependency calculation technique [39].
Technique used: Heuristic-based rules to measure dependency.
Advantages: Heuristic-based rules offer a significant increase in performance compared to the positive region-based dependency measure.
Disadvantages: Uses a set of four classes, and each record needs to be tested against each class to determine which class it belongs to.

Table 4
How DDC calculates dependency.

Dependency   Number of unique/non-unique classes
0            If there is no unique class
1            If there is no non-unique class
n            Otherwise, where 0 < n < 1

Thus, we need to calculate the number of either unique classes or non-unique classes. The algorithm for Direct Dependency Calculation (DDC) is shown in Fig. 6, and Fig. 7 shows the flow chart of the proposed DDC-based dependency calculation method. The algorithm reads the dataset from the first record to the last one. Every time it reads a record that is not yet present in the grid, the record is inserted with its status set to unique. If a record is already present in the grid, the algorithm compares the decision classes: if the decision class matches, the record remains marked as unique; otherwise, it is marked as non-unique. At the end, the total number of instances in unique or non-unique records is counted, depending on whether we are using unique or non-unique dependency classes.



DDC using non-unique classes:

Function FindNonUniqueDependency
Begin
  Step-1: Update Grid
  1)  InsertInGrid(X1)
  2)  For i = 2 to TotalUniverseSize
  3)    If AlreadyExistsInGrid(Xi)
  4)      Index = FindIndexInGrid(Xi)
  5)      Grid(Index, INSTANCECOUNT) += 1
  6)      If DecisionClassMatched(Index, i) = False
  7)        UpdateUniquenessStatus(Index)
  8)      End-If
  9)    Else
  10)     InsertInGrid(Xi)
  11)   End-If

  Step-2: Calculate Dependency
  12) Dep = 0
  13) For i = 1 to TotalRecordsInGrid
  14)   If Grid(i, CLASSSTATUS) = 1            // non-unique row
  15)     Dep = Dep + Grid(i, INSTANCECOUNT)
  16)   End-If
  17) Return 1 - (Dep / TotalRecords)
End-Function

DDC using unique classes:

Function FindUniqueDependency
Begin
  Step-1: Update Grid
  (identical to statements 1-11 above)

  Step-2: Calculate Dependency
  12) Dep = 0
  13) For i = 1 to TotalRecordsInGrid
  14)   If Grid(i, CLASSSTATUS) = 0            // unique row
  15)     Dep = Dep + Grid(i, INSTANCECOUNT)
  16)   End-If
  17) Return Dep / TotalRecords
End-Function

Helper functions (common to both variants):

Function InsertInGrid(Xi)
  18) For j = 1 to TotalAttributesInX
  19)   Grid(NextRow, j) = Xi(j)
  20) End-For
  21) Grid(NextRow, DECISIONCLASS) = Di
  22) Grid(NextRow, INSTANCECOUNT) = 1
  23) Grid(NextRow, CLASSSTATUS) = 0           // 0 => unique
End-Function

Function IfAlreadyExistsInGrid(Xi)
  24) For i = 1 to TotalRecordsInGrid
  25)   RecordMatched = TRUE
  26)   For j = 1 to TotalAttributesInX
  27)     If Grid(i, j) <> Xi(j) Then RecordMatched = FALSE
  28)   End-For
  29)   If RecordMatched = TRUE Then Return True
  30) End-For
  31) Return False
End-Function

Function FindIndexInGrid(Xi)
  32) For i = 1 to TotalRecordsInGrid
  33)   RecordMatched = TRUE
  34)   For j = 1 to TotalAttributesInX
  35)     If Grid(i, j) <> Xi(j) Then RecordMatched = FALSE
  36)   End-For
  37)   If RecordMatched = TRUE
  38)     Return i
  39)   End-If
  40) End-For
  41) Return -1                                // not found
End-Function

Function DecisionClassMatched(Index, i)
  42) If Grid(Index, DECISIONCLASS) = Di
  43)   Return True
  44) Else
  45)   Return False
  46) End-If
End-Function

Function UpdateUniquenessStatus(Index)
  47) Grid(Index, CLASSSTATUS) = 1             // 1 => non-unique
End-Function

Fig. 6. Pseudocode for directly finding the dependency using unique and non-unique classes.
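For readers who prefer executable code, the following is a minimal Python sketch of the non-unique variant of DDC (our own illustration, not the authors' implementation; the grid is modelled as a dictionary keyed by the tuple of conditional-attribute values, and the function name `ddc_dependency` is ours):

```python
from typing import Hashable, Sequence

def ddc_dependency(rows: Sequence[Sequence[Hashable]],
                   cond_idx: Sequence[int],
                   dec_idx: int) -> float:
    """Direct Dependency Calculation using non-unique classes (equation (10)).

    rows     -- the decision table, one sequence of attribute values per object
    cond_idx -- column indices of the conditional attributes C
    dec_idx  -- column index of the decision attribute D
    """
    # grid: conditional values -> [decision class, INSTANCECOUNT, CLASSSTATUS]
    # CLASSSTATUS 0 => unique so far, 1 => non-unique
    grid = {}
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        if key in grid:
            entry = grid[key]
            entry[1] += 1                       # INSTANCECOUNT += 1
            if entry[0] != row[dec_idx]:        # decision class mismatch
                entry[2] = 1                    # mark as non-unique
        else:
            grid[key] = [row[dec_idx], 1, 0]    # insert with status unique
    non_unique = sum(count for _, count, status in grid.values() if status == 1)
    return 1 - non_unique / len(rows)

# Table 2 from the paper: (Age, Income, Policy)
table2 = [
    ("35-40", "30000-40000", "Platinum"),
    ("35-40", "30000-40000", "Platinum"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("40-45", "50000-60000", "Gold"),
    ("25-35", "20000-30000", "Silver"),
    ("25-35", "20000-30000", "Gold"),
]
print(ddc_dependency(table2, cond_idx=[0, 1], dec_idx=2))   # 4/7 ≈ 0.571
```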



Fig. 7. Flow chart of the proposed DDC-based dependency calculation method.


4.3. How DDC computes dependency


To calculate dependency, we use the “grid”, a two-dimensional matrix, as an intermediate data structure. The dimensions of the grid will be calculated as follows:


number of rows = number of records in the dataset    (12)

number of columns = number of conditional attributes + number of decision attributes + 2    (13)

Thus, if a dataset has four conditional attributes, one decision attribute and five objects, the dimension of the grid will be 5 × 7, i.e., five rows and seven columns. The first four columns will store the four conditional attributes (one column for each attribute), and the fifth column will store the decision class. The sixth and seventh columns will store INSTANCECOUNT and CLASSSTATUS, where INSTANCECOUNT represents the total number of objects in the dataset with the same values of the conditional attributes and CLASSSTATUS signifies whether these values represent a unique or non-unique decision class. After the grid is filled with all the attribute values and their statuses, we read the total number of instances in either unique or non-unique classes, depending on whether we are using unique or non-unique classes.
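For Table 2 (two conditional attributes, one decision attribute, seven objects), the same layout gives a 7 × 5 grid. A small sketch of the row produced for X1 (the Python representation is ours; the values are those shown in Fig. 8):

```python
# Grid dimensions for Table 2, from equations (12) and (13)
n_rows = 7             # number of records in the dataset
n_cols = 2 + 1 + 2     # conditional + decision + INSTANCECOUNT + CLASSSTATUS = 5

# Column layout of one grid row: [Age, Income, DECISIONCLASS, INSTANCECOUNT, CLASSSTATUS]
row_x1 = ["35-40", "30000-40000", "Platinum", 1, 0]   # CLASSSTATUS 0 => unique
```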



Instances   Conditional attributes    Decision class   INSTANCECOUNT   CLASSSTATUS
1           35–40   30000–40000       Platinum         1               0

Fig. 8. Grid contents after X1.


Instances   Conditional attributes    Decision class   INSTANCECOUNT   CLASSSTATUS
1           35–40   30000–40000       Platinum         2               0
2           40–45   50000–60000       Gold             1               0

Fig. 9. Grid contents after X3.


Instances   Conditional attributes    Decision class   INSTANCECOUNT   CLASSSTATUS
1           35–40   30000–40000       Platinum         2               0
2           40–45   50000–60000       Gold             2               0
3           25–35   20000–30000       Silver           2               0

Fig. 10. Grid contents after X4, X5 and X6.


Instances   Conditional attributes    Decision class   INSTANCECOUNT   CLASSSTATUS
1           35–40   30000–40000       Platinum         2               0
2           40–45   50000–60000       Gold             2               0
3           25–35   20000–30000       Silver           3               1

Fig. 11. Grid contents after X7.


The functions "FindNonUniqueDependency" and "FindUniqueDependency" are the most important. They insert a record into the grid each time a new combination of attribute values is found in the dataset. The functions then search for further occurrences of the same combination of values and update its status in the grid depending on whether they lead to the same decision class or not; CLASSSTATUS is updated accordingly. Finally, the cardinality (INSTANCECOUNT) of the unique or non-unique classes is calculated. The following example shows how the DDC method works. We consider the decision system given in Table 2, so the size of the grid will be 7 × 5. Initially, we read the first record, i.e., X1, and insert it into the grid. The contents of the grid are shown in Fig. 8. Note that this was the first record, so INSTANCECOUNT is set to "1" and CLASSSTATUS is set to "0". We now read the second record, i.e., X2. As it already exists in the grid and leads to the same decision class, INSTANCECOUNT is incremented while CLASSSTATUS remains the same. After X2, we read X3; it does not exist in the grid, so we insert it as shown in Fig. 9. In the same manner, reading X4, X5 and X6 yields the grid contents given in Fig. 10. However, when we read X7, it already exists in the grid but leads to a different class in the dataset, so INSTANCECOUNT is incremented and CLASSSTATUS is set to 1. Thus, the grid contents become as shown in Fig. 11. Now, if we consider non-unique classes, dependency will be calculated using equation (10) as follows:

k = γ(C, D) = 1 − 3/7 = 4/7

However, if we consider the number of unique classes, dependency will simply be the ratio of the number of unique classes (four in our case) to the size of the universe:

k = γ(C, D) = 4/7


Table 5
A simple decision system.

            Conditional attributes   Decision
Instance    a          b             z
X1          0          0             1
X2          1          0             1
X3          1          0             0
X4          0          0             1

Instances   Conditional attribute   DECISIONCLASS   INSTANCECOUNT   CLASSSTATUS
(no rows yet)

Fig. 12. Initial grid structure.


Instances   Conditional attribute   DECISIONCLASS   INSTANCECOUNT   CLASSSTATUS
1           0                       1               1               0

Fig. 13. Grid contents after inserting X1.

Instances   Conditional attribute   DECISIONCLASS   INSTANCECOUNT   CLASSSTATUS
1           0                       1               1               0
2           1                       1               1               0

Fig. 14. Grid contents after inserting X2.

Instances   Conditional attribute   DECISIONCLASS   INSTANCECOUNT   CLASSSTATUS
1           0                       1               1               0
2           1                       1               2               1

Fig. 15. Grid contents after inserting X3.

4.4. Calculate dependency using DDC: an example

On the basis of the information discussed in the preceding sections, we now explain the step-by-step execution (dry run) of the non-unique portion of the DDC algorithm using the decision system given in Table 5. In this decision system, there are two conditional attributes, i.e., {a, b}, and one decision class, i.e., {z}. There are four records in total. We will calculate the dependency of "z" on "a". As there are four records, one conditional attribute and one decision attribute, the grid size is calculated using equations (12) and (13) as follows:

number of rows = number of records in the dataset = 4
number of columns = number of conditional attributes + number of decision attributes + 2 = 1 + 1 + 2 = 4

Fig. 12 shows the initial grid structure. Now we calculate the dependency of the decision attribute "z" on "a". We insert the first record into the grid. Because the record is being inserted for the first time, INSTANCECOUNT will be "1" and CLASSSTATUS will be "0" (unique). Fig. 13 shows the grid contents after inserting "X1".

Now, a for-loop is executed from "X2" to "X4". "X2" is read first. Statement 3 checks whether this record already exists in the grid. In our case, the next value of "a" is "1", which does not exist in the grid, so the else-part is executed and "X2" is inserted into the grid. Because the record now appears in the grid for the first time, the grid contents are updated as in Step-1. Fig. 14 shows the grid contents after inserting "X2".

The next iteration of the for-loop reads "X3". Statement 3 checks whether the value of "a" for "X3" already exists in the grid; value "1" already exists. Statement 4 gets its index, which is "2" (the row number where this value exists in the grid). Statement 5 updates INSTANCECOUNT. Statement 6 checks whether the decision class of "X3" in the dataset is the same as that recorded in the grid. In this case, the decision class does not match, i.e., the condition in statement 6 becomes true, so CLASSSTATUS is updated in the grid by statement 7. The grid contents now become as shown in Fig. 15.

The next iteration of the for-loop reads object "X4". The value of attribute "a" for "X4" is "0". Statement 3 checks whether a similar record exists in the grid; in our case, a record exists at row index "1". Statement 4 gets this index, and INSTANCECOUNT is updated by statement 5. Statement 6 matches the decision class in the dataset with that existing in the grid at index "1". The decision class matches, so CLASSSTATUS is not updated. The grid contents thus become as shown in Fig. 16.

The next part of the algorithm iterates through the grid and sums up INSTANCECOUNT where CLASSSTATUS is "1" (non-unique). In our case, the sum is "2". The formula implemented in statement 17 calculates the dependency as 0.5. Similarly, we can calculate the dependency using DDC by counting unique classes.

The current research further generalizes IDCs [39] by reducing the heuristic rules from four to two and decreasing the execution time by 21%, as the results have shown for six publicly available datasets. Reducing the number of classes means a more generalized, more manageable implementation that is easier to code yet more effective compared to IDCs. Table 9 shows the experimental comparison of DDCs and IDCs.


Instances   Conditional attribute   DECISIONCLASS   INSTANCECOUNT   CLASSSTATUS
1           0                       1               2               0
2           1                       1               2               1

Fig. 16. Grid contents after inserting X4.
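The result of the dry run can be checked with a few lines of Python (a sketch only; the variable names are ours):

```python
from collections import defaultdict

# Table 5, keeping only attribute "a" and decision "z": (a, z) for X1..X4
records = [(0, 1), (1, 1), (1, 0), (0, 1)]

groups = defaultdict(set)          # value of "a" -> set of decision classes observed
counts = defaultdict(int)          # value of "a" -> number of instances
for a, z in records:
    groups[a].add(z)
    counts[a] += 1

non_unique = sum(counts[a] for a in groups if len(groups[a]) > 1)
print(1 - non_unique / len(records))   # 0.5, as computed by statement 17
```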


5. Applying DDC to feature selection algorithms

We have re-implemented five state-of-the-art feature selection algorithms. Originally, these algorithms used the conventional positive region-based dependency measure; in our re-implementations, every dependency calculation was replaced with the proposed DDC-based method. The details of each algorithm follow.

5.1. Supervised PSO-based QuickReduct using DDC

Originally, the Supervised PSO-based QuickReduct (PSO-QR) algorithm [20] uses a positive region-based dependency measure to perform selection. The algorithm is a hybrid approach that implements the particle swarm optimization method with RST. It initially constructs a random swarm of particles and calculates their fitness; each particle represents a potential solution, and the rough set-based dependency measure is used as the fitness. The particle with the highest fitness in the current swarm is considered the local best (pbest). The algorithm then compares pbest with the population's overall highest fitness, and the particle with the highest fitness becomes the new global best (gbest). Finally, the algorithm updates the velocity and position of each particle, and it terminates when a particle with ideal fitness is found; that particle is then taken as the reduct set output by the algorithm. In its DDC-based version, i.e., PSO-QR (DDC), we re-implemented it using the DDC method in all steps where conventional dependency was calculated. A simplified sketch of this scheme is given below.
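A minimal sketch of how DDC can serve as the fitness function inside a binary PSO loop (this is our own illustrative simplification, not the exact PSO-QR (DDC) implementation; the function names, the sigmoid position update and the parameter values are assumptions):

```python
import math
import random

def ddc_dependency(rows, cond_idx, dec_idx):
    """DDC (non-unique variant): dependency of the decision attribute on the chosen attributes."""
    grid = {}
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        entry = grid.setdefault(key, [row[dec_idx], 0, 0])   # [decision, INSTANCECOUNT, CLASSSTATUS]
        entry[1] += 1
        if entry[0] != row[dec_idx]:
            entry[2] = 1                                     # mark as non-unique
    non_unique = sum(count for _, count, status in grid.values() if status == 1)
    return 1 - non_unique / len(rows)

def pso_feature_selection(rows, n_cond, dec_idx,
                          swarm_size=10, iterations=50, w=1.0, c1=2.0, c2=2.0):
    """Binary PSO: each particle is a 0/1 vector over the conditional attributes."""
    def fitness(bits):
        chosen = [i for i, b in enumerate(bits) if b]
        return ddc_dependency(rows, chosen, dec_idx) if chosen else 0.0

    particles = [[random.randint(0, 1) for _ in range(n_cond)] for _ in range(swarm_size)]
    velocity = [[0.0] * n_cond for _ in range(swarm_size)]
    pbest = [p[:] for p in particles]
    gbest = max(pbest, key=fitness)

    for _ in range(iterations):
        for p in range(swarm_size):
            for d in range(n_cond):
                r1, r2 = random.random(), random.random()
                v = (w * velocity[p][d]
                     + c1 * r1 * (pbest[p][d] - particles[p][d])
                     + c2 * r2 * (gbest[d] - particles[p][d]))
                velocity[p][d] = max(-4.0, min(4.0, v))            # clamp to a common Vmax
                prob = 1 / (1 + math.exp(-velocity[p][d]))         # sigmoid transfer function
                particles[p][d] = 1 if random.random() < prob else 0
            if fitness(particles[p]) > fitness(pbest[p]):
                pbest[p] = particles[p][:]
        gbest = max(pbest, key=fitness)
        if fitness(gbest) == 1.0:                                  # ideal fitness: full dependency
            break
    return [i for i, b in enumerate(gbest) if b]
```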

5.2. Genetic algorithm using DDC

Genetic Algorithm using DDC, hereafter abbreviated GA (DDC), resembles the conventional genetic algorithm [21] in many respects: the chromosomes represent candidate solutions, each gene represents the presence of an attribute in the chromosome, the fitness of a chromosome represents the dependency of the decision attribute on the conditional attribute set represented by that chromosome, and the mutation process is the same. Some changes, however, were made in GA (DDC), as follows (a sketch of the crossover ordering is given after the list):

1. The crossover order was redefined. Crossover in the GA (DDC) implementation is performed in decreasing order of fitness. This step is based on our observation that the crossover of chromosomes with higher fitness results in higher-quality offspring compared to the crossover of low-quality (low-fitness) chromosomes. The same process was used in the implementation of GA [40].
2. The fitness score of each chromosome is calculated using the DDC method instead of the positive region-based approach.

5.3. Incremental feature selection algorithm (IFSA) using DDC

IFSA [22] performs feature selection in dynamic datasets. The algorithm, in its original form, is based on the conventional dependency measure using the positive region and performs feature selection in the three steps given below:


1. Starting with the original feature subset P, it incrementally computes a new dependency function under P.
2. If the new dependency of P is equal to that of the entire dataset, then P is also a new feature subset; otherwise, a new feature subset is generated from P.
3. Finally, the algorithm removes the unnecessary features by eliminating features one by one. This step ensures that the algorithm outputs an optimal reduct set.

The algorithm in its original form [22] also uses the conventional rough set-based dependency measure, which makes it suitable only for smaller datasets. In IFSA (DDC), we replaced all these steps with those of the DDC-based method, which substantially enhanced the efficiency of the algorithm while keeping its accuracy intact.


5.4. Fish swarm algorithm (FSA) using DDC

FSA [23] uses swarm-based optimization to perform feature selection. A fish represents a potential solution. The algorithm randomly generates a swarm of fish and places it in the sample space. Fish move around in the sample space to find a position with more food and a lower population concentration. Fish in a swarm may show three types of behaviour:

1. Searching behaviour, in which a fish moves from position Xi to Xj in search of food if F(Xj) > F(Xi), where F(Xj) is the fitness of position Xj and F(Xi) is the fitness of position Xi.
2. Swarming behaviour, in which a fish moves towards the swarm centre from position Xi to Xi+1 based on the fitness of the centre of the swarm.
3. Following behaviour, in which every fish tries to follow other fish that have found greater amounts of food.

FSA originally used the conventional dependency measure to define all of the above-mentioned behaviours. In FSA (DDC), we re-implemented the algorithm with the DDC-based method.


5.5. Rough set improved harmony search QuickReduct (RS-IHS-QR) using DDC

RS-IHS-QR [24] is a hybrid approach for feature selection based on RST hybridized with a harmony search algorithm, which mimics the musical improvisation process. The algorithm uses five steps to perform feature selection; it starts with a random construction of the harmony memory, which is updated with the passage of time. The algorithm in its original form uses the positive region-based dependency measure, but we have used RS-IHS-QR with the DDC-based method, hereafter abbreviated RS-IHS-QR (DDC), to overcome the complexity of the positive region-based dependency measure.

5.6. Parameter setting

In all algorithms, the intention was to compare the DDC-based approach with the conventional positive region-based method, so only very slight changes were made to the original parameters. In both PSO-QR and PSO-QR (DDC), the range [0.9, 1.2] was used for the inertia weight "w", as recommended by [41], because "w" in this range results in a lower chance of failing to find the global optimum. For GA and GA (DDC), the chromosome size was fixed, i.e., if five genes in a chromosome in GA represented the presence of an attribute, the same number of genes was used in GA (DDC), to prevent biased analysis. Furthermore, the order of crossover, both in GA and GA (DDC), was set in descending order of dependency to ensure high-quality offspring in fewer generations. Similarly, in FSA and FSA (DDC), the swarm was initialized with the same fish positions, and the algorithm was terminated as soon as the first fish found the ideal position, to ensure that the algorithm terminates as soon as possible. For IFSA and IFSA (DDC), 5–10 attributes were initially considered as original features, and the rest were taken as additional features; this step ensures that the algorithm runs through the maximum number of attributes. In the case of RS-IHS-QR and RS-IHS-QR (DDC), all of the parameter values, including HM, HMCR, PARmin and PARmax, were kept the same as given in the original paper [24], as follows:


HM = 10

48

HMCR = 0.95

50

PARmin = 0.4 PARmax = 0.9

49 51 52 53 54 55

6. Experimental analysis

57 58

8

47

55 56

7

32

In all algorithms, the intention was to compare the DDC-based approach with the conventional positive region-based method, so very slight changes were made in the original parameters. In both PSO-QR and PSO-QR (DDC), the range [0.9, 1.2] was used for the inertia weight “w”, as recommended by [41], because “w” in this range results in lesser chances of failure to find the global optimum. For GA and GA (DDC), the chromosome size was fixed, i.e., if five genes in a chromosome in GA represented the presence of an attribute, the same amount of genes was used in GA (DDC) to prevent biased analysis. Furthermore, the order of crossover, both in GA and GA (DDC), was set in descending order of dependency to ensure high-quality offspring in fewer generations. Similarly, in FSA and FSA (DDC), the swarm was initialized with the same fish position, and the algorithm was terminated as soon as the first fish found the ideal position to ensure that the algorithm terminates as soon as possible. For IFSA and IFSA (DDC), 5–10 attributes were initially considered as original features, and the rest were taken as additional features. This step was taken to ensure that the algorithm runs through the maximum number of attributes. In the cases of RS-IHS-QR and RS-IHS-QR (DDC), all of the parameter values, including HM, HMCR, RARmin and RARmax , were kept the same, as given in the original paper [25] as follows:
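For reference, these settings can be bundled into a small configuration object; the structure below is merely an illustrative way of recording the values quoted above and is not code from [24].

from dataclasses import dataclass

@dataclass
class HarmonySearchConfig:
    hm_size: int = 10      # harmony memory size (HM)
    hmcr: float = 0.95     # harmony memory considering rate (HMCR)
    par_min: float = 0.4   # minimum pitch adjustment rate (PARmin)
    par_max: float = 0.9   # maximum pitch adjustment rate (PARmax)

config = HarmonySearchConfig()   # used identically for RS-IHS-QR and RS-IHS-QR (DDC)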

6. Experimental analysis
To justify the efficiency and effectiveness of the proposed DDC method and algorithms using DDC, we performed detailed analysis using various datasets from the UCI [41] repository. The details of the datasets are given in Table 6. A comparison framework was designed for this purpose. Note that, for multivariate datasets, a discretization process was used to normalize the dataset to discrete values.


Table 6
Sample datasets used.

Dataset              Instances   Attributes   Dataset characteristics/Attribute characteristics     Classes
Chess                3196        37           Multivariate/Integer                                   2
Handwriting          1593        266          Multivariate/Real                                      2
Optidigits           1797        65           Multivariate/Integer                                   10
Phishing             11055       31           Multivariate/Integer, Real                             2
Landsat-satellite    2000        37           Multivariate, Sequential, Time-Series/Integer, Real    2
Vehicle              846         19           Multivariate/Categorical, Integer, Real                4

6.1. Comparison framework

A comparison framework [39] was used for experimental analysis. It comprises three variables: the percentage decrease in execution time, the percentage decrease in memory used, and accuracy. The following subsections describe each of these variables in detail.

6.1.1. Percentage decrease in execution time
The percentage decrease in execution time signifies the runtime efficiency of an algorithm. It specifies how much of a decrease in execution time one algorithm yields compared to another and is the basic measure used to justify execution efficiency. Mathematically:

percentage decrease in execution time = 100 - (E1 / E2) * 100

where E1 is the execution time of one algorithm and E2 is the execution time of the algorithm it is compared against. The measure is taken as a percentage, e.g., an algorithm A reduces the execution time by 50% compared to an algorithm B if A's execution time is 0.5 s and B's execution time is 1 s. The execution times of the algorithms were measured using a system stopwatch, which was started after providing the input and stopped immediately after receiving the output.

6.1.2. Accuracy
Accuracy specifies the correctness of an algorithm in terms of its output. An algorithm A is 100% accurate with respect to an algorithm B if it produces the same output as B for the same input. This measure was used to justify the correctness of the proposed DDC method and of the algorithms using DDC with respect to their counterparts using the conventional positive region-based approach.

6.1.3. Percentage decrease in memory
The percentage decrease in memory specifies the reduction in runtime memory that an algorithm yields compared to another algorithm. Mathematically:

percentage decrease in memory = 100 - (M1 / M2) * 100

where M1 is the memory used by one algorithm and M2 is the memory used by the other. For this measure, only the major data structures were considered, while intermediate temporary variables were ignored. If an algorithm A requires 5 MB of runtime memory and another algorithm B requires 15 MB, the percentage decrease in runtime memory required by algorithm A is calculated as follows:

percentage decrease in memory = 100 - (5 / 15) * 100 = 66.67%

6.1.4. Runtime context
The runtime context specifies the overall context in which the execution of the algorithms was performed. It particularly specifies the hardware configuration, including the total available memory, processor, operating system platform, etc. The runtime context remained the same throughout the analysis.
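For reference, the two percentage-decrease measures defined in section 6.1 can be computed as follows (a minimal sketch reproducing the formulas and the worked examples above):

def percentage_decrease(new_value, baseline_value):
    """Percentage decrease of new_value relative to baseline_value,
    e.g. execution time or memory of the DDC-based variant vs. the
    positive region-based variant."""
    return 100.0 - (new_value / baseline_value) * 100.0

# Example from Section 6.1.3: 5 MB vs. 15 MB -> 66.67% decrease
print(round(percentage_decrease(5, 15), 2))   # 66.67
# Example from Section 6.1.1: 0.5 s vs. 1 s -> 50% decrease
print(percentage_decrease(0.5, 1.0))          # 50.0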


6.2. Experimental results

To verify the proposed approach, experiments were carried out in two phases. In the first phase, the efficiency and accuracy of the proposed solution were verified, whereas in the second phase, feature selection algorithms using the DDC method were analysed. The comparison framework discussed above was used as the analysis tool.


For i = 1 to Number of records - 1
  For j = i + 1 to Number of records
    If x(i) = x(j)
      Put x(j) in x(i) set
    End-If
  End-For
End-For

Fig. 17. Step-1 of calculating Dep(P).

For i = 1 to Number of records - 1
  For j = i + 1 to Number of records
    For k = 1 to Number of attributes
      // match attributes one by one and put all records with similar attribute values in separate sets
    End-For
  End-For
End-For

Fig. 18. Step-2 of calculating Dep(P).

For i = 1 to total number of decision attribute based equivalence classes
  For j = 1 to total number of conditional attribute based equivalence classes
    // find cardinality of all sets in conditional attribute based equivalence classes that are subsets of decision attribute based equivalence classes
  End-For
End-For

Fig. 19. Step-3 of calculating Dep(P).


6.2.1. Efficiency and accuracy of the proposed DDC method
To measure the efficiency and accuracy of the proposed DDC method, all three components of the comparison framework were measured and compared for both the DDC-based approach and the positive region-based approach. The results show that the DDC-based approach substantially reduces the execution time and the required runtime memory without compromising accuracy.

For experimental purposes, we used various datasets from UCI; three attribute sets were derived from each dataset, each comprising a different number of attributes. Dependency was measured against each attribute set using both the DDC-based and the positive region-based methods, and the execution time, required runtime memory and accuracy were recorded. Table 7 shows the results of the experiment. The first two columns specify the dataset name and the number of instances and attributes in each dataset. The third and fourth columns specify the attribute set name and the number of attributes in it. The fifth to tenth columns specify the dependency value, time taken and memory used for the DDC-based approach and the positive region-based method, respectively. The eleventh and twelfth columns specify the percentage decrease in execution time and the percentage decrease in memory achieved by the DDC-based method.

Regarding execution time, experiments conducted using 18 attribute sets show that Dep(DDC) reduces the execution time by almost 62.4% compared to the positive region-based approach, i.e., Dep(P). The basic reason is that Dep(DDC) calculates dependency directly, thus avoiding the time-consuming positive region calculation. We now explain the exact reason behind the performance of the DDC-based approach. Calculating Dep(P) requires three steps:

Step-1: Calculate the equivalence class structure using the decision attribute (Qualification). Programmatically, this step requires a nested loop, as shown in Fig. 17.
Step-2: Calculate the equivalence class structure using the conditional attribute set on which the dependency of the decision attribute is to be calculated. The attribute values of record i must be matched with those of records i + 1, i + 2, ..., n, which requires three nested loops, as shown in Fig. 18.
Step-3: Calculate the positive region, i.e., the cardinality of the equivalence classes (based on conditional attributes) that are subsets of the equivalence classes (based on decision attributes). Programmatically, this requires two main nested loops plus the internal logic to test subsets, as shown in Fig. 19.

Calculating Dep(DDC), by contrast, requires only two steps:

Step-1: Identify all the unique/non-unique classes, as shown in Fig. 20.
Step-2: Count all the unique/non-unique classes, as shown in Fig. 21.

From this comparison, it can clearly be seen that the programmatic logic required by the DDC method is far simpler than that of the positive region-based approach. Dep(DDC) requires only one triple-nested loop block, followed by a single loop to count the unique/non-unique classes, whereas Dep(P) requires nested loops in all three steps. This is where the significant reduction in execution time comes from.
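To make the three steps of Figs. 17-19 concrete, the following Python sketch computes the positive region-based dependency. It is an illustrative, unoptimized rendering that groups records by hashing rather than by pairwise comparison; records are assumed to be tuples whose last element is the decision value.

from collections import defaultdict

def positive_region_dependency(records):
    """records: list of tuples (a1, ..., ak, d); returns the dependency of d on the
    conditional attributes via the positive region."""
    # Step 1 (Fig. 17): equivalence classes of the decision attribute
    dec_classes = defaultdict(set)
    for idx, rec in enumerate(records):
        dec_classes[rec[-1]].add(idx)
    # Step 2 (Fig. 18): equivalence classes of the conditional attributes
    cond_classes = defaultdict(set)
    for idx, rec in enumerate(records):
        cond_classes[rec[:-1]].add(idx)
    # Step 3 (Fig. 19): positive region = union of conditional classes that are
    # subsets of some decision class
    positive = sum(len(block) for block in cond_classes.values()
                   if any(block <= dclass for dclass in dec_classes.values()))
    return positive / len(records)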


For i = 1 to Number of records - 1
  For j = i + 1 to Number of records
    For k = 1 to Number of attributes
      // match conditional attributes with those in the grid
      If (conditional attributes match)
        If (decision attribute matches)
          mark current record as unique
        Else
          mark current record as non-unique
        End-If
      End-If
    End-For
  End-For
End-For

Fig. 20. Step-1 of calculating Dep(DDCs).


For i = 1 to total rows in grid
  Count all unique classes in grid
Next

Fig. 21. Step-2 of calculating Dep(DDCs).
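The two DDC steps of Figs. 20 and 21 can likewise be sketched as follows. The dictionary-based grid, the field names and the sample records are illustrative assumptions; the sketch only mirrors the unique/non-unique counting idea, not the authors' grid implementation.

def dep_ddc(records):
    """records: list of tuples (a1, ..., ak, d); returns the dependency of d on the
    conditional attributes using unique/non-unique classes only."""
    grid = {}                      # conditional values -> [decision seen first, instance count, status]
    for rec in records:
        cond, dec = rec[:-1], rec[-1]
        entry = grid.setdefault(cond, [dec, 0, "unique"])
        entry[1] += 1
        if entry[0] != dec:        # same conditional values, different decision class
            entry[2] = "non-unique"
    unique = sum(count for _, count, status in grid.values() if status == "unique")
    return unique / len(records)

# Quick check: both methods agree, e.g. dependency 0.5 when half the records
# fall in non-unique classes (the sample values below are illustrative only).
sample = [(1, 'a', 'yes'), (1, 'a', 'no'), (2, 'b', 'yes'), (3, 'c', 'no')]
print(dep_ddc(sample))   # 0.5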

Platinum   2   5   X1   X2
Gold       3   6   X3   X5   X7
Silver     2   5   X4   X6

Fig. 22. Runtime equivalence class structure [X]_D.


Regarding the decrease in memory, it was noted that Dep(DDC) consumes less memory than Dep(P). The results show that Dep(DDC) reduces the required runtime memory by almost 96% for the eighteen attribute sets with different numbers of attributes and records. The reason is that Dep(DDC) does not require the equivalence class structures built in the first two steps of Dep(P), which take a substantial amount of memory. We explain this with the help of the example discussed in section 2.5. To calculate the equivalence class structure for the decision attribute, we need memory of a size calculated as follows:

M[X]_D = number of decision classes * (max. number of records in any class + 3)    (14)

The memory is used in the form of a two-dimensional matrix with a number of rows equal to the number of decision classes and a number of columns equal to the maximum number of records in any class plus three extra columns. The three extra columns are for control purposes: they contain the decision attribute value, the total number of objects in the current class and the index of the last object in the current row. In Table 2, there are three decision classes (i.e., Platinum, Gold and Silver), and the "Gold" class has the maximum number (three) of records in it. Thus, the size of the matrix required to calculate [X]_D will be:

M[X]_D = {3, 6}

If the memory taken by one matrix element is 4 bytes, then the total memory required to calculate [X]_D will be:

M[X]_D = 3 * 6 * 4 = 72 bytes.

The matrix will have the runtime contents shown in Fig. 22. Similarly, to calculate [X]_C, we require a two-dimensional matrix having rows equal to the number of records in the dataset and columns equal to the number of records plus three (the extra three columns again being for control purposes). Thus, the memory required by [X]_C in our case will be:

M[X]_C = 7 * 10 * 4 = 280 bytes.

Overall, we therefore need 352 bytes of runtime memory to calculate Dep(P). To calculate Dep(DDC), we need only a single grid, as discussed in section 3. The grid is again a two-dimensional matrix with the following dimensions:

number of rows = number of records in the dataset
number of columns = number of conditional attributes + number of decision attributes + 2

In our example:

number of rows = 7
number of columns = 2 + 1 + 2 = 5
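The sizes above can be checked with a few lines of arithmetic (4 bytes per matrix element, as assumed in the text); the snippet simply reproduces the worked example.

BYTES_PER_ELEMENT = 4

# Dep(P): [X]_D is a (decision classes) x (max records in any class + 3) matrix,
# [X]_C is a (records) x (records + 3) matrix.
mem_x_d = 3 * (3 + 3) * BYTES_PER_ELEMENT          # 72 bytes
mem_x_c = 7 * (7 + 3) * BYTES_PER_ELEMENT          # 280 bytes
print(mem_x_d + mem_x_c)                           # 352 bytes in total for Dep(P)

# Dep(DDC): a single grid of (records) x (conditional + decision attributes + 2).
mem_grid = 7 * (2 + 1 + 2) * BYTES_PER_ELEMENT
print(mem_grid)                                    # 140 bytes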


Table 7
DDCs vs. conventional dependency.

Dataset (Inst./Att.)          Attr. set ID   Attr. set size   Dep(DDC): Dep. / Time (s) / Mem. (MB)   Dep(P): Dep. / Time (s) / Mem. (MB)   % Dec. in time   % Dec. in memory
Chess (3196/37)               CH_1           10               0.121 / 0.62  / 0.475                   0.121 / 0.529  / 38.9                 74.8             98.8
                              CH_2           20               0.467 / 0.624 / 0.475                   0.467 / 2.340  / 38.9                 73.3             98.8
                              CH_3           30               0.751 / 1.576 / 0.475                   0.751 / 3.354  / 38.9                 53               98.8
Handwriting (1593/266)        HND_1          80               1     / 0.748 / 0.814                   1     / 1.217  / 9.6                  38.5             91.5
                              HND_2          160              1     / 0.748 / 0.814                   1     / 1.280  / 9.6                  41.6             91.5
                              HND_3          240              1     / 0.765 / 0.814                   1     / 1.435  / 9.6                  46.7             91.5
Optidigits (1797/65)          OPT_1          20               1     / 0.734 / 0.459                   1     / 1.622  / 12.3                 54.7             96.3
                              OPT_2          40               1     / 0.811 / 0.459                   1     / 1.669  / 12.3                 51.4             96.3
                              OPT_3          60               1     / 0.749 / 0.459                   1     / 1.654  / 12.3                 54.7             96.3
Phishing (11055/31)           PHI_1          10               0.393 / 0.500 / 1.476                   0.393 / 9.407  / 466.2                94.7             99.7
                              PHI_2          20               0.833 / 2.808 / 1.476                   0.833 / 13.713 / 466.2                79.5             99.7
                              PHI_3          30               0.967 / 6.84  / 1.476                   0.967 / 19.344 / 466.2                64.6             99.7
Landsat-satellite (2000/37)   LND_1          5                0.957 / 0.531 / 0.297                   0.957 / 1.451  / 15.2                 63.4             98
                              LND_2          15               1     / 0.593 / 0.297                   1     / 1.529  / 15.2                 61.2             98
                              LND_3          30               1     / 0.593 / 0.297                   1     / 1.622  / 15.2                 63.4             98
Vehicle (846/19)              VEH_1          5                1     / 0.109 / 0.067                   1     / 0.390  / 2.73                 72.1             97.5
                              VEH_2          10               1     / 0.125 / 0.067                   1     / 0.421  / 2.73                 70.3             97.5
                              VEH_3          15               1     / 0.125 / 0.067                   1     / 0.375  / 2.73                 66.7             97.5

Table 8
DDCs vs. the conventional method for erroneous input.

Dataset        % Ambiguity (class overlapping)   Dependency (P)   Dep(DDC)
Handwriting    59.8%                             0.401            0.401
Musk           39.9%                             0.600            0.600
Optidigits     74.9%                             0.250            0.250
Sat-log        50%                               0.5              0.5
Vehicle        24.8%                             0.751            0.751


Thus,

required memory = 7 * 5 * 4 = 140 bytes.

It is clear from the above example that Dep(DDC) takes less than half the memory required by Dep(P) for the example given in Table 2.

Regarding the accuracy of the proposed solution, it is clear from Table 7 that Dep(DDC) provides 100% accurate dependency values compared to those provided by the conventional positive region-based approach. The reason is that Dep(DDC) identifies the same unique/non-unique classes that constitute the positive region; however, instead of building the equivalence class structures and calculating the positive region, it determines these unique/non-unique classes directly from the decision class that the attribute values lead to.

As far as the robustness of the proposed DDC-based method is concerned, it gives the same results as the conventional positive region-based dependency calculation. To demonstrate this, we introduced ambiguity (class overlapping) into five datasets and calculated the dependency using both methods. Table 8 shows the results. Note that the proposed DDC-based dependency calculation produced the same dependency values as the conventional method. In the case of almost 60% ambiguity (Handwriting dataset), the dependency was 0.401 (i.e., almost 40%); similarly, in the case of 39.9% ambiguity (Musk dataset), the dependency was 0.600 (i.e., 60%). Thus, noise/ambiguity does not affect the results of the proposed algorithm.

Table 9 shows a comparison of the DDC-based dependency calculation method and the IDC [39]-based method. DDCs resulted in a 21% decrease in execution time compared to the IDC-based approach, with the same accuracy and required runtime memory.


6.2.2. Efficiency and accuracy of algorithms using DDC
Experimental analysis shows that algorithms using the DDC-based approach are more efficient in terms of both execution time and memory while still maintaining 100% accuracy. Tables 10 through 14 show the results of the analysis. The first three columns specify the dataset name and the numbers of instances and attributes. Columns four, five, six and seven show the total reducts and the memory consumed for the DDC-based and the conventional versions, respectively. Columns eight and nine show how many times faster the proposed DDC-based approach is and the percentage decrease in execution time it yields, while the last column reports its accuracy.


Table 9
DDCs vs. IDCs.

Dataset (Inst./Att.)          Attr. set ID   Attr. set size   Dep(DDC): Dep. / Time (s) / Mem. (MB)   Dep(IDC): Dep. / Time (s) / Mem. (MB)   % Dec. in time
Chess (3196/37)               CH_1           10               0.121 / 0.62  / 0.475                   0.121 / 0.71 / 0.475                    25.5
                              CH_2           20               0.467 / 0.624 / 0.475                   0.467 / 0.71 / 0.475                    12.1
                              CH_3           30               0.751 / 1.576 / 0.475                   0.751 / 2.01 / 0.475                    21.6
Handwriting (1593/266)        HND_1          80               1     / 0.748 / 0.814                   1     / 0.91 / 0.814                    17.8
                              HND_2          160              1     / 0.748 / 0.814                   1     / 0.93 / 0.814                    19.6
                              HND_3          240              1     / 0.765 / 0.814                   1     / 0.97 / 0.814                    21.1
Optidigits (1797/65)          OPT_1          20               1     / 0.734 / 0.459                   1     / 0.87 / 0.459                    15.6
                              OPT_2          40               1     / 0.811 / 0.459                   1     / 0.95 / 0.459                    14.6
                              OPT_3          60               1     / 0.749 / 0.459                   1     / 0.98 / 0.459                    23.6
Phishing (11055/31)           PHI_1          10               0.393 / 0.500 / 1.476                   0.393 / 0.71 / 1.476                    29.6
                              PHI_2          20               0.833 / 2.808 / 1.476                   0.833 / 3.5  / 1.476                    19.8
                              PHI_3          30               0.967 / 6.84  / 1.476                   0.967 / 7.77 / 1.476                    12.0
Landsat-satellite (2000/37)   LND_1          5                0.957 / 0.531 / 0.297                   0.957 / 0.68 / 0.297                    21.9
                              LND_2          15               1     / 0.593 / 0.297                   1     / 0.69 / 0.297                    14.1
                              LND_3          30               1     / 0.593 / 0.297                   1     / 0.68 / 0.297                    12.8
Vehicle (846/19)              VEH_1          5                1     / 0.109 / 0.067                   1     / 0.17 / 0.067                    35.9
                              VEH_2          10               1     / 0.125 / 0.067                   1     / 0.17 / 0.067                    26.5
                              VEH_3          15               1     / 0.125 / 0.067                   1     / 0.19 / 0.067                    34.2


Table 10
IFSA (DDC) vs. IFSA.

Dataset       Instances   Attributes   IFSA (DDC) total reducts   Memory used (MB)   IFSA total reducts   Memory used (MB)   X times faster   % Decrease in time   Accuracy
Chess         3196        37           36                         0.243              36                   19.5               4.1              75.4                 100%
Handwriting   1593        266          28                         0.817              28                   4.84               2.8              64.2                 100%
Optidigits    1797        65           20                         0.233              20                   6.16               2.1              53.3                 100%
Phishing      11055       31           30                         0.716              30                   233.47             12.6             92                   100%
Sat           2000        37           20                         0.164              20                   7.63               1.6              38.6                 100%
Vehicle       846         19           5                          0.035              5                    1.36               3.8              73.7                 100%

Table 11
GA (DDC) vs. GA.

Dataset       Instances   Attributes   GA (DDC) total reducts   Memory used (MB)   GA total reducts   Memory used (MB)   X times faster   % Decrease in time   Accuracy
Chess         3196        37           36                       0.243              36                 19.5               22.6             95.6                 100%
Handwriting   1593        266          28                       0.817              28                 4.84               2.4              58.7                 100%
Optidigits    1797        65           20                       0.233              20                 6.16               2.8              64.4                 100%
Phishing      11055       31           30                       0.716              30                 233.47             24.5             95.9                 100%
Sat           2000        37           20                       0.164              20                 7.63               3                66.8                 100%
Vehicle       846         19           5                        0.035              5                  1.36               2.2              54.3                 100%

Table 12
RS-IHS-QR (DDC) vs. RS-IHS-QR.

Dataset       Instances   Attributes   RS-IHS-QR (DDC) total reducts   Memory used (MB)   RS-IHS-QR total reducts   Memory used (MB)   X times faster   % Decrease in time   Accuracy
Chess         3196        37           36                              0.243              36                        19.5               1.7              40.3                 100%
Handwriting   1593        266          28                              0.817              28                        4.84               2.1              53.4                 100%
Optidigits    1797        65           20                              0.233              20                        6.16               1.6              39                   100%
Phishing      11055       31           30                              0.716              30                        233.47             3.3              69.4                 100%
Sat           2000        37           20                              0.164              20                        7.63               2.4              58.5                 100%
Vehicle       846         19           5                               0.035              5                         1.36               2.1              51.8                 100%


Table 13
FSA (DDC) vs. FSA.

Dataset       Instances   Attributes   FSA (DDC) total reducts   Memory used (MB)   FSA total reducts   Memory used (MB)   X times faster   % Decrease in time   Accuracy
Chess         3196        37           36                        0.243              36                  19.5               1.5              33.3                 100%
Handwriting   1593        266          28                        0.817              28                  4.84               1.6              38.7                 100%
Optidigits    1797        65           20                        0.233              20                  6.16               1.9              46.6                 100%
Phishing      11055       31           30                        0.716              30                  233.47             6                83.2                 100%
Sat           2000        37           20                        0.164              20                  7.63               91.2             98.9                 100%
Vehicle       846         19           5                         0.035              5                   1.36               1.9              47.6                 100%


Table 14
PSO-QR (DDC) vs. PSO-QR.

Dataset       Instances   Attributes   PSO-QR (DDC) total reducts   Memory used (MB)   PSO-QR total reducts   Memory used (MB)   X times faster   % Decrease in time   Accuracy
Chess         3196        37           5                            0.243              4                      19.5               9.8              89.7                 100%
Handwriting   1593        266          23                           0.817              23                     4.84               43.3             97.7                 100%
Optidigits    1797        65           12                           0.233              37                     6.16               12.3             91.9                 100%
Phishing      11055       31           12                           0.716              18                     233.47             37.6             97.3                 100%
Sat           2000        37           18                           0.164              20                     7.63               2.9              65.3                 100%
Vehicle       846         19           12                           0.035              13                     1.36               1.3              22.2                 100%


Fig. 23. Comparison of execution time between IFSA (DDC) & IFSA.


Regarding the percentage decrease in execution time, the experiments show that algorithms using the DDC-based method yield a significant reduction in execution time: we observed a decrease of almost 65% across the six datasets. This justifies our claim that algorithms using the DDC method are more efficient than those using the positive region-based approach. As PSO, IHS, GA and FSA are random in nature and may produce their results in different iterations in different runs, we used the average time of a single iteration to ensure an unbiased analysis. The main reason behind the improvement is the efficiency of the DDC method, which avoids the positive region: the positive region-based approaches suffer from the time-consuming task of positive region calculation, while the DDC-based methods simply calculate the dependency from the number of unique/non-unique classes. Figs. 23 through 27 show the execution time comparison of both versions of each algorithm.

It was also noted that algorithms using DDC require less memory than the positive region-based approaches. The results show an almost 95% decrease in memory, on average, for the six datasets. The reason is that DDC-based approaches require only one matrix to calculate the number of unique and non-unique classes, whereas positive region-based approaches require two matrices to calculate the equivalence class structures in their first two steps, as discussed in section 2.5.

DDC-based approaches showed 100% accurate results. The attributes in the reduct sets calculated by genetic and other randomized algorithms may differ between runs; however, for accuracy to be 100%, the resulting subset should exhibit the same dependency as the entire set of conditional attributes, which was verified and found to be the case. This justifies our claim that DDC-based approaches can successfully be used in any feature selection algorithm with full accuracy.
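The accuracy check described above can be stated compactly: a reduct produced by any of the randomized algorithms is accepted as 100% accurate if its dependency equals that of the full conditional attribute set. A minimal sketch, assuming a dependency helper such as the one sketched after Fig. 21:

def project(records, attributes):
    """Keep only the chosen conditional attributes, with the decision value kept last."""
    return [tuple(rec[a] for a in attributes) + (rec[-1],) for rec in records]

def reduct_is_accurate(records, reduct, all_attributes, dep):
    """True if the candidate reduct yields the same dependency as the full conditional
    attribute set (dep = a dependency function such as dep_ddc above)."""
    return dep(project(records, reduct)) == dep(project(records, all_attributes))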


Fig. 24. Comparison of execution time between GA (DDC) & GA.


Fig. 25. Comparison of execution time between RS-IHS-QR (DDC) & RS-IHS-QR.


Fig. 26. Comparison of execution time between FSA (DDC) & FSA.


Fig. 27. Comparison of execution time between PSO-QR (DDC) & PSO-QR.


7. Conclusions and further work

In this paper, we have proposed a direct method for rough set-based dependency calculation. The conventional approach for calculating attribute dependency uses positive region calculation, which is time consuming because it requires three complex steps: calculating the equivalence class structure using the decision attribute(s), calculating the equivalence class structure using the conditional attribute(s), and finally calculating the positive region. Direct Dependency Calculation, alternately, calculates dependency directly by counting the number of unique/non-unique classes in the dataset. We then re-implemented five feature selection algorithms using the proposed DDC method. A complete comparison framework was used to perform the experimental analysis. The results obtained on various publicly available datasets show that the proposed DDC method is more effective than the positive region-based approach in terms of execution time, required runtime memory and accuracy, and that the DDC-based approach can safely be used in any of the feature selection algorithms.

To date, DDC has been used only for supervised datasets to measure the significance of an attribute, because dependency classes require class labels to predict the degree of dependency as a new record is read. However, we may come across situations where data are not labelled, i.e., unsupervised learning, where the application of dependency classes is challenging and remains to be tested. If dependency classes prove successful for unsupervised models, they will help perform unsupervised feature selection in large datasets by enhancing the efficiency and effectiveness of the underlying algorithms.

Here, we have used DDC for feature selection algorithms only. However, a number of other algorithms, including prediction, decision-making and rule extraction algorithms, also use the positive region-based rough set dependency measure, so dependency classes can be applied in these algorithms as well; their effectiveness there still needs to be analysed, representing an avenue of future work to further prove the benefits of dependency classes. Similar heuristic rules can also be derived for other extensions of rough sets, e.g., tolerance-based or fuzzy rough set approaches, and we will also focus on deriving such rules for these extensions.

The concept of DDC in this paper was intended for filter methods, where the process of subset selection is independent of the learning algorithm. The effectiveness of DDC for wrapper techniques, where feedback from the classification algorithm is used to measure the quality of the selected features, still needs to be tested. In the future, efforts will be made to integrate DDC with wrapper-based algorithms.


References

[1] N. Dessì, B. Pes, Similarity of feature selection methods: an empirical study across data intensive classification tasks, Expert Syst. Appl. 42 (10) (2015) 4632–4642.
[2] T.P. Hong, C.H. Chen, F.S. Lin, Using group genetic algorithm to improve performance of attribute clustering, Appl. Soft Comput. 29 (2015) 371–378.
[3] S. Paul, S. Das, Simultaneous feature selection and weighting – an evolutionary multi-objective optimization approach, Pattern Recognit. Lett. 65 (2015) 51–59.
[4] K.O. Akande, T.O. Owolabi, S.O. Olatunji, Investigating the effect of correlation-based feature selection on the performance of support vector machines in reservoir characterization, J. Nat. Gas Sci. Eng. 22 (2015) 515–522.
[5] I. Koprinska, M. Rana, V.G. Agelidis, Correlation and instance based feature selection for electricity load forecasting, Knowl.-Based Syst. 82 (2015) 29–40.
[6] W. Qian, W. Shu, Mutual information criterion for feature selection from incomplete data, Neurocomput. 168 (2015) 210–220.
[7] M. Han, W. Ren, Global mutual information-based feature selection approach using single-objective and multi-objective optimization, Neurocomput. 168 (2015) 47–54.
[8] M. Wei, T.W.S. Chow, R.H. Chan, Heterogeneous feature subset selection using mutual information-based feature transformation, Neurocomput. 168 (2015) 706–718.
[9] M. Dash, H. Liu, Consistency-based search in feature selection, Artif. Intell. 151 (1) (2003) 155–176.
[10] P. Moradi, M.A. Rostami, Graph theoretic approach for unsupervised feature selection, Eng. Appl. Artif. Intell. 44 (2015) 33–45.
[11] P. Moradi, M. Rostami, Integration of graph clustering with ant colony optimization for feature selection, Knowl.-Based Syst. 84 (2015) 144–161.
[12] S.A. Bouhamed, I.K. Kallel, D.S. Masmoudi, B. Solaiman, Feature selection in possibilistic modeling, Pattern Recognit. 48 (11) (2015) 3627–3640.
[13] M.L. Samb, F. Camara, S. Ndiaye, Y. Slimani, M.A. Esseghir, A novel RFE-SVM-based feature selection approach for classification, Int. J. Adv. Sci. Tech. 43 (2012) 27–36.
[14] Z. Pawlak, A. Skowron, Rudiments of rough sets, Inf. Sci. 177 (1) (2007) 3–27.
[15] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341–356.
[16] P.R.K. Varma, V.V. Kumari, S.S. Kumar, A novel rough set attribute reduction based on ant colony optimization, Int. J. Intell. Syst. Tech. Appl. 14 (3–4) (2015) 330–353.
[17] C. Wang, M. Shao, B. Sun, Q. Hu, An improved attribute reduction scheme with covering based rough sets, Appl. Soft Comput. 26 (2015) 235–243.
[18] X. Jia, L. Shang, B. Zhou, Y. Yao, Generalized attribute reduct in rough set theory, Knowl.-Based Syst. 91 (2016) 204–218.
[19] Y. Kusunoki, M. Inuiguchi, Structure-based attribute reduction: a rough set approach, Feat. Sel. Data Pattern Recognit. (2015) 113–160.
[20] H.H. Inbarani, A.T. Azar, G. Jothi, Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis, Comput. Methods Programs Biomed. 113 (1) (2014) 175–185.
[21] K. Zuhtuogullari, N. Allahvardi, N. Arikan, Genetic algorithm and rough sets based hybrid approach for reduction of the input attributes in medical systems, Int. J. Innov. Comput. Inf. Control 9 (2013) 3015–3037.
[22] W. Qian, W. Shu, B. Yang, C. Zhang, An incremental algorithm to feature selection in decision systems with the variation of feature set, Chin. J. Electron. 24 (2015) 128–133.
[23] Y. Chen, Q. Zhu, H. Xu, Finding rough set reducts with fish swarm algorithm, Knowl.-Based Syst. 81 (2015) 22–29.
[24] H.H. Inbarani, M. Bagyamathi, A.T. Azar, A novel hybrid feature selection method based on rough set and improved harmony search, Neural Comput. Appl. 26 (8) (2015) 1859–1880.
[25] M. Podsiadło, H. Rybiński, Rough sets in economy and finance, Trans. Rough Sets XVII (2014) 109–173.
[26] V. Prasad, T.S. Rao, M.S.P. Babu, Thyroid disease diagnosis via hybrid architecture composing rough data sets theory and machine learning algorithms, Soft Comput. 20 (3) (2016) 1179–1189.
[27] C.H. Xie, Y.J. Liu, J.Y. Chang, Medical image segmentation using rough set and local polynomial regression, Multimed. Tools Appl. 74 (6) (2015) 1885–1914.
[28] G.A. Montazer, S. ArabYarmohammadi, Detection of phishing attacks in Iranian e-banking using a fuzzy–rough hybrid system, Appl. Soft Comput. 35 (2015) 482–492.
[29] M.P. Francisco, J.V. Berna-Martinez, A.F. Oliva, M.A.A. Ortega, Algorithm for the detection of outliers based on the theory of rough sets, Decis. Support Syst. 75 (2015) 63–75.
[30] J. Komorowski, Z. Pawlak, L. Polkowski, A. Skowron, Rough sets: a tutorial, in: Rough Fuzzy Hybridization: A New Trend in Decision-Making, Springer, 1999, pp. 3–98.
[31] Y. Jing, T. Li, J. Huang, Y. Zhang, An incremental attribute reduction approach based on knowledge granularity under the attribute generalization, Int. J. Approx. Reason. 76 (2016) 80–95.
[32] H. Ge, L. Li, Y. Xu, C. Yang, Quick general reduction algorithms for inconsistent decision tables, Int. J. Approx. Reason. 82 (2017) 56–80.
[33] M.S. Raza, U. Qamar, A hybrid feature selection approach based on heuristic and exhaustive algorithms using rough set theory, in: Proceedings of the International Conference on Internet of Things and Cloud Computing, ACM, 2016.
[34] Y. Qian, Q. Wang, H. Cheng, J. Liang, C. Dang, Fuzzy-rough feature selection accelerator, Fuzzy Sets Syst. 258 (2015) 61–78.
[35] F. Pacheco, M. Cerrada, R.V. Sánchez, D. Cabrera, C. Li, J.V.D. Oliveira, Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery, Expert Syst. Appl. 71 (2017) 69–86.
[36] Y. Jiang, Y. Yu, Minimal attribute reduction with rough set based on compactness discernibility information tree, Soft Comput. 20 (6) (2016) 2233–2243.
[37] A. Tan, J. Li, Y. Lin, G. Lin, Matrix-based set approximations and reductions in covering decision information systems, Int. J. Approx. Reason. 59 (2015) 68–80.
[38] X. Zhang, C. Mei, D. Chen, J. Li, Feature selection in mixed data: a method using a novel fuzzy rough set-based information entropy, Pattern Recognit. 56 (2016) 1–15.
[39] M.S. Raza, U. Qamar, An incremental dependency calculation technique for feature selection using rough sets, Inf. Sci. 343 (2016) 41–65.
[40] Y. Shi, R. Eberhart, A modified particle swarm optimizer, in: IEEE International Conference on Evolutionary Computation, Anchorage, Alaska, 1998, pp. 69–73.
[41] K. Bache, M. Lichman, UCI Machine Learning Repository, Irvine, CA, http://archive.ics.uci.edu/ml (accessed 11 April 2017).
