The effect of linguistic hedges on feature selection: Part 2


Expert Systems with Applications 37 (2010) 6102–6108


Bayram Cetisli
Department of Computer Engineering, Süleyman Demirel University, Isparta 32260, Turkey


Keywords: Linguistic hedge (LH); Adaptive neuro-fuzzy classifier (ANFC); Feature selection (FS); Pattern recognition; Multi-dimensional data analysis

Abstract

The effects of linguistic hedges (LHs) on a neuro-fuzzy classifier were shown in Part 1. This paper presents a fuzzy feature selection (FS) method based on the LH concept. The values of LHs can be used to express the importance degree of fuzzy sets. When this property is used for classification problems and every class is defined by one fuzzy classification rule, the LHs of the fuzzy sets denote the importance degrees of the input features. If the LH values of a feature are close to concentration values, the feature is important or relevant and can be selected; on the contrary, if they are close to dilation values, the feature is unimportant and can be eliminated. According to the LH values of the features, redundant and noisy features can be eliminated and significant features can be selected. For this aim, a new LH-based FS algorithm is proposed that uses an adaptive neuro-fuzzy classifier (ANFC). In this study, the meanings of the LHs are used to determine the relevant and irrelevant features of real-world databases. The experimental studies show the success of using LHs in the FS algorithm.

1. Introduction

Nowadays, due to the rapid advancement of computer and database technologies, it is very important to obtain true or desirable knowledge. The importance of knowledge is the reason why new scientific branches such as data mining, machine intelligence, knowledge discovery, and statistics have appeared. Dimension reduction and feature selection (FS) are common preprocessing steps used in pattern recognition and classification applications (Theodoridis & Kotroumbas, 1999). In some problems, a large number of features can be used. If irrelevant features are used in combination with good features, the classifier will not perform as well as it would with only the good features. Therefore, the goal should be to choose a discriminative subset of features. There are many potential benefits of dimensionality reduction and feature selection: facilitating data visualization and data understanding, reducing measurement and storage requirements, decreasing computational complexity, and reducing training and utilization times.

Dimensionality reduction of a feature set is a preprocessing technique commonly used on multi-dimensional data. There are two different approaches to feature reduction, namely feature extraction and feature selection (Liu, 2005; Theodoridis & Kotroumbas, 1999). In the feature extraction approach, the popular methods are principal component analysis (the Karhunen–Loeve transform), independent component analysis, singular value decomposition, manifold learning, factor analysis, and Fisher linear discriminant analysis (Liu, 2005).

However, these methods have the disadvantage that measurements from all of the original features are used in the projection to the lower-dimensional space, so the meaning of the original features can be lost. In some applications, it is desirable to pick a subset of the original features rather than to find a mapping that uses all of them. In these cases, FS methods should be used instead of feature extraction methods. In the FS approach, relevant features are selected from the original features without any projection. There are various well-known measures for obtaining the relevant features, such as heuristic stepwise analysis (Liu, 2005), statistical hypothesis testing (Cord, Ambroise, & Cocquerez, 2006), genetic algorithms (Cord et al., 2006), neural networks (Kwak & Choi, 2002), support vector machines (Sindhwani & Rakshit, 2004), and fuzzy systems (Chakraborty & Pal, 2004; Jensen & Shen, 2007; Lee, Chen, Chen, & Jou, 2001; Li, Mukaidono, & Turksen, 2002; Sankar, Rajat, & Basak, 2000; Thawonmas & Abe, 1997; Tsang, Yeung, & Wang, 2003).

FS algorithms may also be categorized into two groups based on their evaluation procedure: filters and wrappers (Liu, 2005; Uncu & Türkşen, 2007). If the FS algorithm is run without any learning algorithm, it is a filter approach. Filter-based approaches select features using an estimation criterion based on the statistics of the learning data and are independent of the induction classifier; essentially, irrelevant features are filtered out before induction. Filters tend to be applicable to most domains, as they are not tied to any particular induction algorithm. If the evaluation procedure is tied to the task of the learning algorithm, the FS algorithm employs the wrapper approach.


Wrappers may produce better results, though they are expensive to run and can break down when the number of features is very large. This is due to the use of learning algorithms in the evaluation of subsets, some of which encounter problems when dealing with large datasets. Fuzzy systems such as the fuzzy entropy based method (Lee et al., 2001), fuzzy-rough sets (Jensen & Shen, 2007), optimal fuzzy-valued feature subset selection (OFFSS) (Tsang et al., 2003), and fuzzy weights (Li et al., 2002) have been used for FS in the literature. In this paper, a novel FS method based on an adaptive neuro-fuzzy classifier (ANFC) with LHs is proposed; according to the explanations above, the proposed method is a wrapper approach. The classifier with LHs is given in Part 1.

This paper is organized as follows. The capability of LHs to select the relevant features is presented in Section 2, together with some constraints applied to the LHs. The proposed FS algorithm is then presented in Section 3. Experimental results are presented in Section 4, and the discussion and conclusion are given in Section 5.

2. The effects of linguistic hedges on FS

Shannon (1938) described binary functions of two-valued variables. Two of these functions select either the A1 or the A2 input under every condition; they are given in Table 1. When the function F1 is investigated, it can be seen that it follows the variable A1 {F1 = f1(A1, A2) = A1}. This means that the function F1 depends only on the variable A1, irrespective of the value of A2. A similar case holds for the function F2, which follows the variable A2 {F2 = f2(A1, A2) = A2}. These two functions are examples of FS and can be defined by the product and power operators

F1 = A1^1 · A2^0 = A1  and  F2 = A1^0 · A2^1 = A2.        (1)

If the power of a variable is zero, the powered variable always takes the value one; if the power is one, the variable is used with its original value. These conditions are given in Table 2. These Boolean functions can also be defined in fuzzy algebra. Let A1 and A2 be fuzzy sets on the features x1 and x2, respectively, let y be the output, and let p1 and p2 be the LH values of those fuzzy sets. In this case, a general fuzzy classification rule can be defined instead of the Boolean function as

IF x1 is A1 with p1 hedge AND x2 is A2 with p2 hedge THEN y is C1,

where C1 represents the class label of the output.

Table 1. Shannon's binary selection functions.

A1  A2 | F1  F2
0   0  | 0   0
0   1  | 0   1
1   0  | 1   0
1   1  | 1   1

According to this fuzzy rule, the functions F1 and F2 are redefined in fuzzy logic with a similar meaning:

R1: IF x1 is A1 with p1 = 1 hedge AND x2 is A2 with p2 = 0 hedge THEN y is F1.
R2: IF x1 is A1 with p1 = 0 hedge AND x2 is A2 with p2 = 1 hedge THEN y is F2.

These rules can be reduced to the following rules:

R1: IF x1 is A1 with p1 = 1 hedge THEN y is F1.
R2: IF x2 is A2 with p2 = 1 hedge THEN y is F2.

The reduced rules contain only the selected features, and the LH values play the active role in the selection. It can easily be said that if the LH value of the fuzzy set of a feature for a class equals one, the feature is important for that class; otherwise, it is not. This selection criterion is easy to see with binary values, but in real applications binary LH values cannot always be obtained: when the LHs are tuned, they take real values in a wide range. For that reason, a crucial point is needed to give a decision, and in this study this point is taken as 0.5. If the LH value of the fuzzy set of the jth feature is greater than or equal to the crucial point (pj ≥ 0.5), the jth feature is important; if it is smaller than the crucial point (pj < 0.5), the jth feature is not important.

In the fuzzy literature, LHs are generally employed with constant values that are also described by linguistic words. In this study, however, the LHs used for FS and classification are variable and can change within a determined range, so it is not possible to attach a word to every LH value. Nevertheless, the words "more recessive", "recessive", "neutral", "dominant", and "more dominant" can be used for the ranges pj = 0, 0 < pj < 0.5, pj = 0.5, 0.5 < pj < 1, and pj = 1, respectively.

In classification problems, when a fuzzy rule with LHs is defined for every class, the LHs of a feature take different values for the different classes. This means that a feature may be relevant for one class but irrelevant for the other classes. Therefore, an FS algorithm should be built on the LHs.
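To make this interpretation concrete, the following sketch (illustrative Python, not the classifier of Part 1; the membership parameters and function names are arbitrary assumptions) shows how a hedge power modifies a Gaussian membership value and how a tuned hedge value maps to the words above:

```python
import numpy as np

def gaussian_membership(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy set."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def hedged_membership(x, center, sigma, p):
    """Membership modified by a linguistic hedge p: mu(x)**p.
    p = 0 -> always 1 (the feature is ignored); p = 1 -> original membership."""
    return gaussian_membership(x, center, sigma) ** p

def hedge_word(p):
    """Map a tuned hedge value to the words used in the text."""
    if p == 0.0:
        return "more recessive"
    if p < 0.5:
        return "recessive"
    if p == 0.5:
        return "neutral"
    if p < 1.0:
        return "dominant"
    return "more dominant"

# The same observation under different hedges (illustrative numbers only).
x = 5.1
print(hedged_membership(x, center=5.0, sigma=0.5, p=0.0))  # 1.0: feature eliminated
print(hedged_membership(x, center=5.0, sigma=0.5, p=1.0))  # original membership
print(hedge_word(0.8))                                     # "dominant": feature important
```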

3. FS Algorithm using LHs

First of all, only one fuzzy classification rule is defined for every class; when the number of rules is larger than the number of classes, the LH values are not sufficient for the selection of features. Furthermore, it is easy to force the LH values toward binary values. There are two different cases: feature selection and feature elimination. For feature elimination, the LH values of a feature should be smaller than 0.5 for every class and should also be close to zero. For feature selection there are two criteria: one is to select the features that have the biggest hedge value for some class, and the other is to select the features that have a large hedge value for every class, because a single feature cannot be selective for every class. For that reason, a selection function is defined from the hedge values of each feature,

Pj = ∏_{i=1}^{K} pij,        (2)

Table 2. The use of powers to describe the F1 and F2 functions.

A1  A2 | p1  p2  F1 = A1^p1 ∧ A2^p2 | p1  p2  F2 = A1^p1 ∧ A2^p2
0   0  | 1   0   0                  | 0   1   0
0   1  | 1   0   0                  | 0   1   1
1   0  | 1   0   1                  | 0   1   0
1   1  | 1   0   1                  | 0   1   1

where Pj denotes the selection value of the jth feature and K is the number of classes. To force the hedge values toward binary values, the initial value of every hedge is taken as 0.5. After the hedges are tuned, if the hedge value of a feature increases towards one, the feature is selective for the corresponding class; if it decreases towards zero, the feature is irrelevant for that class.



According to the above cases, the FS and classification algorithms are given as follows.

3.1. Feature selection

1. Describe only one fuzzy classification rule for every class, using Gaussian membership functions.
2. Set pij = 0.5 for i = 1, 2, ..., K and j = 1, 2, ..., D, where K is the number of classes and D is the number of features.
3. Set the number of selected features (L).
4. Train the neuro-fuzzy classifier with LHs. In training, 0 ≤ pij ≤ 1.
5. For i = 1 to K: find the feature j with the maximum p value for the ith class and take it into the set of individually discriminative features.
6. Select the (L − K) features with the biggest P values as the common discriminative features.
7. There are now L discriminative features. The new training set Xnew and the testing data are created from the original data using the selected features.

Both in the FS part and in the classification part, the ANFC with LHs is used, but with different settings for the constraints on the LHs, the number of rules, and the initial values used to construct the ANFC–LH. The classification algorithm is given as follows.

3.2. Classification

1. Set the number of fuzzy rules (V) for every class. The total number of fuzzy rules is then U = V × K.
2. Set pij = 1 for i = 1, 2, ..., U and j = 1, 2, ..., D.
3. Determine the initial values of the nonlinear parameters of the ANFC–LH by K-means clustering.
4. Train the ANFC–LH with the Xnew training set. In training, the pij values should be greater than or equal to zero for every feature and fuzzy rule (pij ≥ 0).
5. Obtain the training and testing classification results.
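A minimal sketch of selection steps 5–7 is given below (Python with NumPy). It operates on an already-tuned hedge matrix; the training of the ANFC–LH itself, described in Part 1, is not reproduced here, and the function name is illustrative rather than the author's implementation. The example matrix uses the WBC hedge values of Table 6 and reproduces the feature subset 1–2–6–9 reported in Section 4.2.

```python
import numpy as np

def select_features_by_hedges(p, L):
    """Steps 5-7 of the FS algorithm applied to a tuned hedge matrix p (K x D).

    Step 5: for every class, take the feature with the maximum hedge value.
    Step 6: fill the remaining slots with the features having the largest
            selection value P_j = prod_i p_ij (Eq. (2)).
    Step 7: return the feature indices used to build the reduced dataset Xnew.
    """
    K, D = p.shape
    selected = []

    # Step 5: individually discriminative features, one per class.
    for i in range(K):
        j = int(np.argmax(p[i]))
        if j not in selected:
            selected.append(j)

    # Step 6: common discriminative features by the product criterion.
    P = np.prod(p, axis=0)                    # P_j for every feature j
    for j in np.argsort(P)[::-1]:             # largest P_j first
        if len(selected) >= L:
            break
        if j not in selected:
            selected.append(int(j))

    return sorted(selected)

# Hedge matrix of the WBC dataset (benign and malignant rows of Table 6).
p = np.array([
    [0.20, 0.51, 0.00, 0.50, 0.15, 0.94, 0.21, 0.42, 0.66],   # benign
    [1.00, 0.66, 0.71, 0.27, 0.27, 0.80, 0.82, 0.77, 0.71],   # malignant
])
print(select_features_by_hedges(p, L=4))   # -> [0, 1, 5, 8], i.e. F1, F2, F6, F9
```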

4. Experimental results

To demonstrate the applicability of the proposed FS and classification methods, six real-world datasets were analyzed. All datasets are in the public domain: Iris, Wisconsin breast cancer, hepatitis disease, wine, Cleveland heart disease, and Escherichia coli (E. coli). They were obtained from the well-known UCI machine learning repository (UCI, 2009). In this study, the ANFC–LH is run 10 times for both the FS and the classification parts. In the FS part of the experiments, the ANFC–LH is trained using all data instances, without a testing set. In the classification part, 5-fold cross-validation is used to compute the recognition rates. The number of fuzzy rules is determined according to the number of classes. In some problems, increasing the number of fuzzy rules per class can decrease the recognition rates owing to some logical features.

4.1. Iris dataset

The Iris dataset is a common benchmark in classification and pattern recognition studies. It contains 150 measurements of four features: sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW). The dataset is equally divided into three classes: Iris setosa, Iris versicolor, and Iris virginica (UCI, 2009). Table 3 gives the LH values obtained for the Iris dataset after training.

Table 3. The LH values of the Iris dataset for every class and every feature.

Class/feature   SL    SW    PL    PW
Setosa          0.5   0.5   0.5   0.5
Versicolor      0.0   0.0   1.0   1.0
Virginica       0.0   0.0   1.0   1.0
P value         0.0   0.0   0.5   0.5

From Table 3 it can be seen that, according to the FS algorithm, there are two important features, PL and PW: their P values are the biggest, and their LH (p) values are the maximum for every class. The SL and SW features are irrelevant for the versicolor and virginica classes, and the setosa class is easily distinguished from the other classes. If the classification rules are expressed for each class, the rules are:

R1: IF SL is A11 with p11 = 0.5 AND SW is A12 with p12 = 0.5 AND PL is A13 with p13 = 0.5 AND PW is A14 with p14 = 0.5 THEN class is Setosa.
R2: IF SL is A21 with p21 = 0 AND SW is A22 with p22 = 0 AND PL is A23 with p23 = 1 AND PW is A24 with p24 = 1 THEN class is Versicolor.
R3: IF SL is A31 with p31 = 0 AND SW is A32 with p32 = 0 AND PL is A33 with p33 = 1 AND PW is A34 with p34 = 1 THEN class is Virginica.

The FS rules can also be expressed with adjectives:

R1: IF SL is neutral A11 AND SW is neutral A12 AND PL is neutral A13 AND PW is neutral A14 THEN class is Setosa.
R2: IF SL is more recessive A21 AND SW is more recessive A22 AND PL is more dominant A23 AND PW is more dominant A24 THEN class is Versicolor.
R3: IF SL is more recessive A31 AND SW is more recessive A32 AND PL is more dominant A33 AND PW is more dominant A34 THEN class is Virginica.

After the FS and classification steps, these rules can be reduced to the following rules:

R1: IF PL is A13 with p13 = 0.5 AND PW is A14 with p14 = 0.5 THEN class is Setosa.
R2: IF PL is A23 with p23 = 1.1 AND PW is A24 with p24 = 1.1 THEN class is Versicolor.
R3: IF PL is A33 with p33 = 1.0 AND PW is A34 with p34 = 1.2 THEN class is Virginica.

After the classification step, some of the hedge values are bigger than 1, because the hedge values are not constrained in the classification step. These classification rules can also be expressed with the adjectives described in Part 1:

R1: IF PL is minus A13 AND PW is minus A14 THEN class is Setosa.
R2: IF PL is plus A23 AND PW is plus A24 THEN class is Versicolor.
R3: IF PL is plus A33 AND PW is plus A34 THEN class is Virginica.

As a result, these fuzzy classification rules are more meaningful and more distinctive. In addition, one of the aims of fuzzy theory, the computing-with-words concept, is realized by using adaptive LHs in this study. The classification results of the Iris dataset obtained by the ANFC–LH method are given in Table 4; each class is intuitively defined with 1 and 2 fuzzy rules.
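As an illustration of how such reduced rules classify a sample, the sketch below evaluates the three reduced Iris rules with hedged Gaussian memberships. The hedge values are taken from the reduced rules above, but the rule centers and widths are rough, illustrative values and are not taken from the paper:

```python
import numpy as np

# Reduced Iris rules: centers c = (PL, PW) and hedges p = (PL hedge, PW hedge).
# Centers and widths are illustrative only; hedges follow the reduced rules.
rules = {
    "Setosa":     dict(c=(1.5, 0.25), p=(0.5, 0.5)),
    "Versicolor": dict(c=(4.3, 1.30), p=(1.1, 1.1)),
    "Virginica":  dict(c=(5.5, 2.00), p=(1.0, 1.2)),
}
SIGMA = (0.8, 0.4)   # illustrative Gaussian widths for PL and PW

def firing_strength(pl, pw, rule):
    """Product (AND) of the hedged Gaussian memberships of one rule."""
    strengths = []
    for x, c, s, p in zip((pl, pw), rule["c"], SIGMA, rule["p"]):
        mu = np.exp(-0.5 * ((x - c) / s) ** 2)
        strengths.append(mu ** p)
    return float(np.prod(strengths))

def classify(pl, pw):
    """Assign the class whose reduced rule fires most strongly."""
    return max(rules, key=lambda k: firing_strength(pl, pw, rules[k]))

print(classify(1.4, 0.2))   # -> Setosa
print(classify(4.5, 1.5))   # -> Versicolor
print(classify(6.0, 2.3))   # -> Virginica
```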


Table 4. The ANFC–LH classification results of the Iris dataset.

Features   Cluster size for each class   Testing recognition rate (%)
All        1                             94.66
PL, PW     1                             94.66
All        2                             94.66
PL, PW     2                             97.33

Table 7. The ANFC–LH classification results of the WBC dataset.

Features   Cluster size for each class   Testing recognition rate (%)
All        1                             99.02
1–2–6–9    1                             98.53
All        2                             98.04
1–2–6–9    2                             98.53
1–6        1                             95.60
1–6        2                             95.60

Table 4 shows that the discriminative power of the selected features is better than that of all features: the selected features increase the recognition rate on the test set. This means that some overlapping classes can be distinguished more easily with the selected features. The proposed method is compared with other methods. One of them is an efficient fuzzy classifier with feature selection based on a fuzzy entropy measure (FEBFC) (Lee et al., 2001). Another is the exception-ratio method, which represents the degree of overlap in the class regions (Thawonmas & Abe, 1997). The last method, optimal fuzzy-valued feature subset selection (OFFSS), measures the quality of a subset of features (Tsang et al., 2003). The comparison results are given in Table 5. The Iris classification is widely used to demonstrate the selective power of FS methods, and many papers report that PL and PW are the relevant features for the Iris classification (Chakraborty & Pal, 2004; Lee et al., 2001; Li et al., 2002; Thawonmas & Abe, 1997; Tsang et al., 2003).

4.2. Wisconsin breast cancer (WBC) dataset

This dataset consists of 699 patterns, each described by nine features: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses (UCI, 2009). After the 16 instances with missing values are removed, 683 instances remain. The dataset is divided into two classes: benign and malignant. Table 6 gives the LH values of the WBC dataset. According to the FS algorithm, clump thickness (F1) for the malignant class and bare nuclei (F6) for the benign class are individually relevant features. Bare nuclei is also a common relevant feature for both classes; the other common relevant features are mitoses (F9) and uniformity of cell size (F2), respectively. The ANFC–LH classification results of the WBC dataset are given in Table 7; each class is intuitively defined with 1 and 2 fuzzy rules. In the WBC classification results, the selected features still give high recognition rates, and the clump thickness and bare nuclei features are very important for the WBC classification.

Table 5. The comparison results of different methods for the Iris classification.

Number of selected features   Method                                     Testing recognition rate (%)
2                             FEBFC (Lee et al., 2001)                   97.12
2                             Exception ratio (Thawonmas & Abe, 1997)    97.33
2                             OFFSS (Tsang et al., 2003)                 95.33
2                             ANFC–LH                                    97.33

Table 8. The comparison results of different methods for the WBC classification.

Number of selected features   Method                                     Testing recognition rate (%)
6                             FEBFC (Lee et al., 2001)                   95.14
6                             OFFSS (Tsang et al., 2003)                 93.58
6                             Fuzzy-rough (Hu et al., 2006)              95.26
15                            FS_SFS (Liu & Zheng, 2006)                 92.90
4                             ANFC–LH                                    98.53

The comparison results of the WBC classification for the different methods are given in Table 8. One of the compared methods selects features using an information measure based on a fuzzy-rough set model (Hu, Yu, & Xie, 2006); the other is a support-vector-based filtered and supported sequential forward search method (FS_SFS) (Liu & Zheng, 2006). Although the ANFC–LH method uses the fewest features, it gives the best testing recognition rate.

4.3. Hepatitis disease dataset

This dataset requires determining whether patients with hepatitis will live or die. It contains 19 features and 155 samples (UCI, 2009). The hepatitis dataset has a high percentage of instances with missing values. For that reason, two different strategies are used in this study instead of deleting those instances: one is to replace a missing value with the mean of the corresponding feature within the sample's class, and the other is to set it to zero. The classification results in Tables 9–11 are obtained using the first strategy; the classification result obtained with the second strategy is also given in Table 11. The LH values of the selected features of the hepatitis dataset are given in Table 9. According to Table 9, the F1 (age), F5 (fatigue), F11 (spiders), F12 (ascites), F17 (albumin), and F19 (histology) features are very important for the hepatitis disease classification problem. The ANFC–LH classification results of the hepatitis disease dataset are given in Table 10; each class is intuitively defined with 2 and 3 fuzzy rules. The classification accuracies of the hepatitis disease dataset with the selected features still keep the recognition rates high.
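As a sketch of the first strategy (assuming NumPy; the function name and the toy data are illustrative, not from the paper), missing entries can be replaced by the class-wise feature means as follows:

```python
import numpy as np

def impute_class_mean(X, y):
    """Replace missing (NaN) entries of X by the mean of that feature
    computed over the samples of the same class (the first strategy above)."""
    X = X.copy()
    for label in np.unique(y):
        rows = (y == label)
        block = X[rows]                          # samples of this class (a copy)
        class_means = np.nanmean(block, axis=0)  # per-feature mean, ignoring NaNs
        nan_rows, nan_cols = np.where(np.isnan(block))
        block[nan_rows, nan_cols] = class_means[nan_cols]
        X[rows] = block                          # write the filled block back
    return X

# Toy example: one missing value per class.
X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 8.0], [np.nan, 6.0]])
y = np.array([0, 0, 1, 1])
print(impute_class_mean(X, y))   # [[1. 4.] [3. 4.] [5. 8.] [5. 6.]]

# The second strategy simply sets missing entries to zero:
#   X[np.isnan(X)] = 0.0
```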

Table 6. The LH values of the WBC dataset for every class and every feature.

Class/feature   F1     F2      F3     F4      F5      F6      F7      F8      F9
Benign          0.20   0.51    0.00   0.50    0.15    0.94    0.21    0.42    0.66
Malignant       1.00   0.66    0.71   0.27    0.27    0.80    0.82    0.77    0.71
P value         0.20   0.342   0.00   0.136   0.042   0.756   0.175   0.331   0.474



Table 9. The LH values of the selected features of the hepatitis disease dataset.

Class/feature   F1     F5     F11    F12    F17   F19
Die             0.74   0.88   0.86   0.67   1.0   0.55
Live            0.66   0.47   0.56   0.81   1.0   0.80
P value         0.49   0.41   0.49   0.55   1.0   0.44

Table 13. The ANFC–LH classification results of the wine dataset.

Features    Cluster size for each class   Testing recognition rate (%)
All         1                             94.38
7–10–13     1                             94.38
All         2                             100
7–10–13     2                             96.62

Table 10. The ANFC–LH classification results of the hepatitis disease dataset.

Features         Cluster size for each class   Testing recognition rate (%)
All              2                             89.61
1–5–11–17–19     2                             91.65
All              3                             89.74
1–5–11–17–19     3                             92.44

Table 14. The comparison results of different methods for the wine classification.

Number of selected features   Method                              Testing recognition rate (%)
6                             Fuzzy-rough (Hu et al., 2006)       94.87
4                             Robust clust. (Cord et al., 2006)   94.20
6                             Pairwise C. (Zhang et al., 2008)    80.8
3                             ANFC–LH                             96.62

Table 11. The comparison results of different methods for the hepatitis disease classification.

Number of selected features   Method                                      Testing recognition rate (%)
9                             FS with fuzzy-AIRS (Polat & Güneş, 2007a)   81.82
10                            FS with fuzzy-AIRS (Polat & Güneş, 2007b)   94.12
5                             PCA with AIRS (Polat & Güneş, 2007)         94.12
10                            ANFC–LH (replace mean)                      94.87
5                             ANFC–LH (replace mean)                      92.44
5                             ANFC–LH (set zero)                          85.89

For this problem, two similar methods, based on C4.5 decision-tree feature selection with a fuzzy artificial immune recognition system (AIRS) classifier (Polat & Güneş, 2007a, 2007b), are used for comparison; Polat and Güneş (2007a, 2007b) obtained two different results with the same approach. In another method, Polat and Güneş (2007) used principal component analysis (PCA) with the AIRS classifier for the same problem. The comparisons are shown in Table 11, which shows that the proposed FS method with the ANFC–LH gives results similar to those of the FS methods with fuzzy-AIRS.

4.4. Wine dataset

The wine dataset contains the results of a chemical analysis of 178 wines grown in the same region of Italy but derived from three different vineyards (UCI, 2009); there are three types of wine. The 13 continuous features available for classification are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The LH values of the wine dataset and the ANFC–LH classification results are given in Tables 12 and 13, respectively.

Here, each class for the ANFC–LH is intuitively defined with 1 and 2 fuzzy rules. From Table 12 it is obvious that proline (F13) is relevant for Class 1 and that color intensity (F10) is relevant for both Classes 2 and 3; color intensity is also the best common feature for all classes. According to Table 12, the F7, F10, and F13 features of the wine dataset can be selected as relevant features. Although the number of selected features is reduced from 13 to 3, the classification is still successful, as seen in Table 13. The comparison results for the wine dataset among the different methods are given in Table 14. One method is a wrapper FS process proposed in the context of robust clustering based on Laplace mixture models (Cord et al., 2006); the other is FS with pairwise constraints (Zhang, Chen, & Zhou, 2008). Table 14 shows that although the ANFC–LH method uses the fewest features, it gives the best classification result for the wine data.

4.5. Cleveland heart disease dataset

The Cleveland heart disease dataset consists of heart disease diagnoses (UCI, 2009). It contains 76 features, but all published experiments have used a subset of just 14 of them. The purpose of the dataset is to predict the presence or absence of heart disease, given the results of various medical tests carried out on a patient. The dataset originally contained 303 samples, but 6 of them contained missing class values and were discarded, leaving 297. The LH values of the heart disease dataset and the ANFC–LH classification results are given in Tables 15 and 16, respectively; each class is intuitively defined with 2 and 4 fuzzy rules. From Table 15 it is obvious that chest pain type (F3), resting blood pressure (F4), serum cholesterol (F5), number of major vessels (F12), and thal (F13) are relevant and are the best common features for the classes. According to Table 15, the F3, F4, F5, F12, and F13 features of the heart disease dataset can be selected for classification.

Table 12. The LH values of the wine dataset.

Class/feature   F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F11   F12   F13
Class 1         0.6   0.7   0.5   0.5   0.8   0.7   0.9   0.5   0.6   0.8   0.6   0.6   0.1
Class 2         0.6   0.7   0.5   0.4   0.4   0.4   0.6   0.5   0.4   1.0   0.5   0.7   0.4
Class 3         0.5   0.4   0.5   0.5   0.4   0.5   0.7   0.5   0.5   0.9   0.6   0.6   0.6
P value         0.2   0.2   0.1   0.0   0.1   0.1   0.4   0.1   0.1   0.7   0.2   0.2   0.2


Table 15. The LH values of the heart disease dataset.

Class/feature   F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F11   F12   F13
Absence         0.0   0.5   0.7   0.7   0.6   0.5   0.5   0.3   0.5   0.5   0.4   0.4   0.7
Presence        0.0   0.5   0.5   0.9   0.6   0.5   0.5   0.2   0.5   0.3   0.6   0.8   0.2
P value         0     0.2   0.3   0.6   0.3   0.2   0.3   0.1   0.2   0.2   0.3   0.3   0.1

Table 16. The ANFC–LH classification results of the heart disease dataset.

Features         Cluster size for each class   Testing recognition rate (%)
All              2                             86.51
3–4–5–12–13      2                             85.48
All              4                             87.54
3–4–5–12–13      4                             86.72

Table 19. The ANFC–LH classification results of the E. coli promoter dataset.

Features            Cluster size for each class   Testing recognition rate (%)
All                 1                             90.56
15–16–17–18–39      1                             96.22
15–39               1                             84.90
All                 2                             92.45
15–16–17–18–39      2                             98.11
15–39               2                             86.79

Although the number of selected features is reduced from 13 to 5, the classification is still successful, as seen in Table 16. The comparison results for the heart disease classification are given in Table 17. One FS method used for comparison is robust clustering based on Laplace mixtures (Cord et al., 2006). Although the proposed method selects fewer features than FS with fuzzy-AIRS, its classification accuracy is lower; note, however, that only 270 instances of the dataset were used in that evaluation (Polat & Güneş, 2007a).

4.6. Escherichia coli promoter dataset

The prediction of Escherichia coli (E. coli) promoter gene sequences is an important issue in the field of molecular biology (UCI, 2009). Promoters are DNA sequences that affect the frequency and location of transcription initiation through interaction with RNA polymerase. The dataset contains 106 DNA sequences, including 53 promoter sequences and 53 non-promoter sequences, all of length 57. The DNA sequences consist of four types of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T). Each sequence starts at position −50 and ends at position +7. The LH values of the E. coli dataset and the ANFC–LH classification results are given in Tables 18 and 19, respectively; each class for the ANFC–LH is intuitively defined with 1 and 2 fuzzy rules.

Table 18. The LH values of the selected features of the E. coli promoter dataset.

Class/feature   F15    F16    F17    F18    F39
Promoter        1.00   0.88   1.00   0.49   1.00
Non-promoter    1.00   0.70   0.10   0.82   1.00
P value         1.00   0.56   0.10   0.40   1.00

In Table 18, only the LH values of the selected features are given because of the large number of features in this dataset. According to the FS algorithm, F15 (nucleotide position −36) and F17 (position −34) for the promoter class and F39 (position −12) for the non-promoter class are individually selected; F16 and F18 are also common relevant features for both classes. As a result, F15 and F39 are very important features for the E. coli promoter dataset. It should also be emphasized that the −35 and −10 locations of the sequences are meaningful for the promoter structure (Polat & Güneş, 2007), and the nucleotide positions selected by the LHs are close to these locations. Although the size of the input space is decreased from 57 to 5, the recognition rates increase for each test. This means that most of the features of the E. coli dataset cause the classes to overlap; when they are eliminated, the classification success improves. For this problem, a C4.5 decision-tree-based FS method with a least squares support vector machine classifier (FS_LSSVM) (Polat & Güneş, 2007) and a fuzzy-based FS method with a fuzzy-AIRS classifier (Polat & Güneş, 2009) are also used for comparison (Table 20). As can be seen from Table 20, the proposed FS method with the ANFC–LH is superior to the other methods.

Table 20. The comparison results of different methods for the E. coli promoter classification.

Number of selected features   Method                                      Testing recognition rate (%)
4                             FS_LSSVM (Polat & Güneş, 2007)              84.62
4                             FS with fuzzy-AIRS (Polat & Güneş, 2009)    86.54
5                             ANFC–LH                                     98.22

Table 17. The comparison results of different methods for the heart disease classification.

Number of selected features   Method                                      Testing recognition rate (%)
8.6                           FS_SFS (Liu & Zheng, 2006)                  84.80
9                             FS with fuzzy-AIRS (Polat & Güneş, 2007a)   92.59
2                             Laplace mixture (Cord et al., 2006)         69.50
5                             ANFC–LH                                     86.72

5. Discussion and conclusion

In this study, the positive effect of linguistic hedges on an adaptive neuro-fuzzy classifier is presented. In the proposed method, linguistic hedges are used in the fuzzy classification rules and are adapted during the training of the system. Besides contributing linguistic meaning to the fuzzy classification, the linguistic hedge values can be used for feature selection. Experimental results show that when the linguistic hedge value of the fuzzy set of a feature is close to 1, the feature is relevant for that class; otherwise it may be irrelevant.



The proposed feature selection algorithm considerably decreases the number of features in classification problems, and this characteristic of the method helps to simplify complex problems. It is concluded that while a feature may be relevant for a particular class, it might be irrelevant to the other classes. In this case, the fuzzy sets of this feature should be treated differently in the different fuzzy rules, and this difference is provided by the linguistic hedges in this study. In some cases, the adaptive linguistic hedges can also increase the classification accuracy rates. The experimental studies showed that the proposed algorithm successfully selects the relevant features and can also eliminate the irrelevant ones. When the numerical values of the linguistic hedges are replaced with words, the effects of the features on the classes are clearly represented, as in the classification of the Iris dataset. In this study, computing with words is applied to classification problems in progressive stages. In the future, computing with words is likely to emerge as a major field in its own right, and adaptive linguistic hedges can make an important contribution to this concept. The method can also be applied to microarray gene selection, internet, and other data mining problems, where impressive results may be obtained with the proposed method.

References

Chakraborty, D., & Pal, N. R. (2004). A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification. IEEE Transactions on Neural Networks, 15(1), 110–123.
Cord, A., Ambroise, C., & Cocquerez, J. P. (2006). Feature selection in robust clustering based on Laplace mixture. Pattern Recognition Letters, 27, 627–635.
Hu, Q., Yu, D., & Xie, Z. (2006). Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters, 27, 414–423.
Jensen, R., & Shen, Q. (2007). Fuzzy-rough sets assisted attribute selection. IEEE Transactions on Fuzzy Systems, 15(1), 73–89.
Kwak, N., & Choi, C. C. H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13(1), 143–159.
Lee, H. M., Chen, C. M., Chen, J. M., & Jou, Y. L. (2001). An efficient fuzzy classifier with feature selection based on fuzzy entropy. IEEE Transactions on Systems, Man and Cybernetics, Part B, 31(3), 426–432.
Li, R. P., Mukaidono, M., & Turksen, I. B. (2002). A fuzzy neural network for pattern classification and feature selection. Fuzzy Sets and Systems, 130, 101–108.
Liu, H., et al. (2005). Evolving feature selection. IEEE Intelligent Systems, 20, 64–76.
Liu, Y., & Zheng, Y. F. (2006). FS_SFS: A novel feature selection method for support vector machines. Pattern Recognition, 39, 1333–1345.
Polat, K., & Güneş, S. (2007b). Medical decision support system based on artificial immune recognition immune system (AIRS), fuzzy weighted pre-processing and feature selection. Expert Systems with Applications, 33(2), 484–490.
Polat, K., & Güneş, S. (2007). Prediction of hepatitis disease based on principal component analysis and artificial immune recognition system. Applied Mathematics and Computation, 189(2), 1282–1291.
Polat, K., & Güneş, S. (2007). A novel approach to estimation of E. coli promoter gene sequences: Combining feature selection and least square support vector machine (FS_LSSVM). Applied Mathematics and Computation, 190(2), 1574–1582.
Polat, K., & Güneş, S. (2007a). A hybrid approach to medical decision support systems: Combining feature selection, fuzzy weighted pre-processing and AIRS. Computer Methods and Programs in Biomedicine, 88, 164–174.
Polat, K., & Güneş, S. (2009). A new method to forecast of E. coli promoter gene sequences: Integrating feature selection and fuzzy-AIRS classifier system. Expert Systems with Applications, 36, 57–64.
Sankar, K. P., Rajat, K. D., & Basak, J. (2000). Unsupervised feature evaluation: A neuro-fuzzy approach. IEEE Transactions on Neural Networks, 11(2), 366–376.
Shannon, C. E. (1938). A symbolic analysis of relay and switching circuits. Transactions of the American Institute of Electrical Engineers, 57, 713–723.
Sindhwani, V., Rakshit, S., et al. (2004). Feature selection in MLPs and SVMs based on maximum output information. IEEE Transactions on Neural Networks, 15(4), 937–948.
Thawonmas, R., & Abe, S. (1997). A novel approach to feature selection based on analysis of class regions. IEEE Transactions on Systems, Man and Cybernetics, Part B, 27(2), 196–207.
Theodoridis, S., & Kotroumbas, K. (1999). Pattern recognition. San Diego: Academic Press.
Tsang, E. C. C., Yeung, D. S., & Wang, X. Z. (2003). OFFSS: Optimal fuzzy-valued feature subset selection. IEEE Transactions on Fuzzy Systems, 11(2), 202–213.
UCI Repository of Machine Learning Databases and Domain Theories [Online]. Available from: (Last accessed 02 January 2009).
Uncu, Ö., & Türkşen, I. B. (2007). A novel feature selection approach: Combining feature wrappers and filters. Information Sciences, 177, 449–466.
Zhang, D., Chen, S., & Zhou, Z. H. (2008). Constraint score: A new filter method for feature selection with pairwise constraints. Pattern Recognition, 41, 1440–1451.