Two cooperative ant colonies for feature selection using fuzzy models



Expert Systems with Applications 37 (2010) 2714–2723


Susana M. Vieira a,*, João M.C. Sousa a, Thomas A. Runkler b

a Center of Intelligent Systems, IDMEC, LAETA, Instituto Superior Técnico, Technical University of Lisbon, Av. Rovisco Pais 1, 1049-001 Lisbon, Portugal
b Siemens AG, Corporate Technology, Learning Systems CT IC 4, 81739 Munich, Germany


Keywords: Feature selection; Ant colony optimization; Fuzzy modeling

Abstract

The available set of potential features in real-world databases is sometimes very large, and it can be necessary to find a small subset for classification purposes. One of the most important techniques in data pre-processing for classification is feature selection. Less relevant or highly correlated features decrease, in general, the classification accuracy and increase the complexity of the classifier. The goal is to find a reduced set of features that reveals the best classification accuracy for a classifier. Rule-based fuzzy models can be acquired from numerical data and used as classifiers. Since rule-based structures have proven to be a useful qualitative description for classification systems, this work uses fuzzy models as classifiers. This paper proposes an algorithm for feature selection based on two cooperative ant colonies, which minimizes two objectives: the number of features and the classification error. Two pheromone matrices and two different heuristics are used for these objectives. The performance of the method is compared with other feature selection methods, achieving equal or better performance.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Feature selection has been an active research area in the data mining, pattern recognition and statistics communities. The main idea of feature selection is to choose a subset of the available features, by eliminating features with little or no predictive information and also redundant features that are strongly correlated. Many practical pattern classification tasks (e.g., medical diagnosis) require learning an appropriate classification function that assigns a given input pattern (typically represented by a vector of feature values) to one of a set of classes. The choice of features used for classification has an impact on the accuracy of the classifier and on the time required for classification. The challenge is to select the minimum subset of features with little or no loss of classification accuracy. The feature subset selection problem consists of identifying and selecting a useful subset of features from a larger set of often mutually redundant, possibly irrelevant, features with different associated importance (Motoda & Liu, 1998).

The methods found in the literature can generally be divided into two main groups: model-free methods and model-based methods (Guyon & Elisseeff, 2003). Model-free methods use the available data only and are based on statistical tests, properties of functions, etc. These methods do not need to develop models to find significant inputs. The methods discussed in this paper belong to the group of model-based methods. Models with different sets of features are compared, and the model that minimizes the model output error is selected. These methods include exhaustive search (Guyon & Elisseeff, 2003), in which all combinations of subsets are evaluated. This method guarantees an optimal solution, but finding the optimal subset of features is NP-hard. Decision tree search methods, with proper branch conditions, limit the search space to the best-performing branches, but do not guarantee finding the global best solution (Mendonça, Vieira, & Sousa, 2007). For a large number of features, evaluating all states is computationally infeasible, requiring heuristic search methods (Boz, 2002). Feature selection algorithms have been reviewed in Liu and Yu (2005). More recently, nature-inspired algorithms have been used to select features, namely: particle swarm optimization (Wang, Yang, Teng, Xia, & Jensen, 2007), GA-based attribute reduction (Bazan, Nguyen, Nguyen, Synak, & Wroblewski, 2000; Pulkkinen & Koivisto, 2008), and ant colonies (Al-Ani, 2005; Jensen & Shen, 2005; Sivagaminathan & Ramakrishnan, 2007).

Nature-inspired algorithms like ant colony optimization have been successfully applied to a large number of difficult combinatorial problems, such as quadratic assignment, traveling salesman problems, routing in telecommunication networks, scheduling, machine learning and feature selection (Dorigo, Birattari, & Stützle, 2006). Ant colony optimization is particularly attractive for feature selection since no reliable heuristic is available for finding the optimal feature subset, so it is expected that the ants discover good feature combinations as they proceed through the search space. An ACO approach for feature selection problems was presented in Al-Ani (2005), where a term called updated selection measure (USM) is used for selecting features. A major application of the algorithm in Al-Ani (2005) is in the field of texture classification and the classification of speech segments.



Another application of ACO to feature selection can be found in Jensen and Shen (2004), where an entropy-based modification of the original rough set-based approach for feature selection problems was presented. In Schreyer and Raidl (2002), an ACO approach is used for labeling point features. Further, in Sivagaminathan and Ramakrishnan (2007) a relatively simple model of ACO is presented, where the major difference from previous works is in the calculation of the heuristic value, which is treated as a simple function of the cost (classification accuracy).

This paper proposes a new algorithm for feature selection using two cooperative ant colonies, which are used to cope with two different objectives. The two objectives considered are minimizing the number of features and minimizing the classification error. Two pheromone matrices and two different heuristics are used, one for each objective. A novel approach for computing a heuristic value is proposed to determine the size of the subset of features. This new approach is an extension of previous work, where a single objective function was used and the subset size was not determined by the algorithm (Vieira et al., 2007). The approach described in this paper is an extended version of Vieira, Sousa, and Runkler (2008).

The paper is organized as follows. The next section presents a brief description of fuzzy classification. An introduction to ACO applications in feature selection problems is given in Section 3, where the proposed methodology is presented. In Section 4, the results are presented. Some conclusions are drawn in Section 5, where possible future work is also discussed.

2. Fuzzy models for classification

Rule-based expert systems are often applied to classification problems in fault detection, biology, medicine, etc. Fuzzy logic improves classification and decision support systems by allowing the use of overlapping class definitions, and improves the interpretability of the results by providing more insight into the classifier structure and decision making process (Roubos, Setnes, & Abonyi, 2003; Sousa & Kaymak, 2002). The automatic determination of fuzzy classification rules from data has been approached by several different techniques: neuro-fuzzy methods, genetic-algorithm based rule selection, and fuzzy clustering in combination with GA-optimization (Setnes & Roubos, 2000). In this paper, an approach that addresses both simplicity and accuracy issues is described. Interpretable fuzzy rule-based classifiers are obtained from observation data.

2.1. Model structure

Takagi–Sugeno (TS) fuzzy models (Takagi & Sugeno, 1985), in which the consequents are crisp functions of the antecedent variables, are applied. Rule antecedents are fuzzy descriptions in the $n$-dimensional feature space and rule consequents are values representing the degree of approximation to a given class label from the set $\{1, 2, \ldots, N_c\}$, where $N_c$ is the number of classes. There is a classification (output) for each class in the range $[0, 1]$: zero (0) means the pattern does not belong to the class and one (1) means it belongs completely to the class. The chosen class is the output with the highest membership degree:

$$R_i:\ \text{If } x_1 \text{ is } A_{i1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{in} \text{ then } y_{i\ell} = b_{i\ell}, \quad i = 1, \ldots, K,\ \ \ell = 1, \ldots, N_c \qquad (1)$$

Here, $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ is the feature vector, $y_{i\ell}$ is the output of the $i$th rule, and $A_{i1}, \ldots, A_{in}$ are the antecedent fuzzy sets. The and connective is modeled by the product operator. The degree of activation of the $i$th rule is calculated as $\beta_i(\mathbf{x}) = \prod_{j=1}^{n} \mu_{A_{ij}}(x_j)$, where $\mu_{A_{ij}} \in [0, 1]$ is the membership function of the fuzzy set $A_{ij}$ in the antecedent of $R_i$. The model output $y_\ell$ is computed by aggregating the individual rule contributions:

$$y_\ell = \frac{\sum_{i=1}^{K} \beta_i\, y_{i\ell}}{\sum_{i=1}^{K} \beta_i}, \qquad (2)$$

Each class is considered an output of the model. The output of the classifier is given by the following classification decision:

$$\hat{y} = \left\{\ell : y_\ell = \max\{y_1, y_2, \ldots, y_{N_c}\}\right\} \qquad (3)$$
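The rule structure in (1)–(3) maps directly onto a few array operations. The following is a minimal sketch of that inference step, assuming Gaussian membership functions for the antecedent fuzzy sets; the parameter names and shapes are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def classify(x, centers, sigmas, b):
    """x: (n,) feature vector; centers, sigmas: (K, n) antecedent parameters;
    b: (K, Nc) rule consequents. Returns (class index, class memberships)."""
    # Membership of each feature in each rule's antecedent fuzzy set A_ij
    mu = np.exp(-0.5 * ((x - centers) / sigmas) ** 2)        # (K, n)
    beta = mu.prod(axis=1)                                   # degree of activation (product "and")
    y = (beta[:, None] * b).sum(axis=0) / beta.sum()         # aggregated outputs y_l, Eq. (2)
    return int(np.argmax(y)), y                              # classification decision, Eq. (3) (0-based index)
```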

2.2. Parameter estimation

Given $N$ available input–output data pairs $(\mathbf{x}_k, u_k)$, the $n$-dimensional pattern matrix $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^T$ and the corresponding class vector $\mathbf{u} = [u_{1\ell}, \ldots, u_{N\ell}]^T$ are constructed. The number of rules $K$, the antecedent fuzzy sets $A_{ij}$, and the consequent parameters $b_{i\ell}$ are determined by means of fuzzy clustering in the product space of the input and output variables (Sousa & Kaymak, 2002). Hence, the data set $Z$ to be clustered is composed from $X$ and $\mathbf{u}$:

$$Z = [X, \mathbf{u}]^T. \qquad (4)$$

Vector $\mathbf{u}$ contains the binary assignment for each existing class. Each pattern $u_k$ is $(1, 0, 0, 0, \ldots)$ for Class 1, $(0, 1, 0, 0, \ldots)$ for Class 2, $(0, 0, 1, 0, \ldots)$ for Class 3, and so on. Given the data $Z$ and the number of clusters $K$, several fuzzy clustering algorithms can be used. In this paper, the fuzzy c-means (FCM) clustering algorithm (Bezdek, 1981) is used to compute the fuzzy partition matrix $U$. The matrix $Z$ provides a description of the system in terms of its local characteristic behavior in regions of the data identified by the clustering algorithm, and each cluster defines a rule. The fuzzy sets in the antecedent of the rules are obtained from the partition matrix $U$, whose $ik$th element $\mu_{ik} \in [0, 1]$ is the membership degree of the data object $\mathbf{z}_k$ in cluster $i$. One-dimensional fuzzy sets $A_{ij}$ are obtained from the multidimensional fuzzy sets defined point-wise in the $i$th row of the partition matrix by projections onto the space of the input variables $x_j$:

$$\mu_{A_{ij}}(x_{jk}) = \operatorname{proj}_j(\mu_{ik}), \qquad (5)$$

where $\operatorname{proj}$ is the point-wise projection operator (Klir & Yuan, 1995).
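As a small illustration of (4) and the binary class assignment just described, the sketch below builds the clustering data set Z with one-hot class columns; variable names and shapes are assumptions for the example only, and the fuzzy c-means step itself is left to any FCM implementation.

```python
import numpy as np

def build_clustering_data(X, labels, n_classes):
    """X: (N, n) pattern matrix; labels: (N,) class indices in {0, ..., Nc-1}.
    Returns Z = [X, u] with u holding the binary (one-hot) class assignment."""
    u = np.zeros((X.shape[0], n_classes))
    u[np.arange(X.shape[0]), labels] = 1.0   # (1,0,0,...) for class 1, (0,1,0,...) for class 2, ...
    return np.hstack([X, u])                 # data set Z of Eq. (4), ready for fuzzy c-means
```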

The point-wise defined fuzzy sets $A_{ij}$ are approximated by suitable parametric functions in order to compute $\mu_{A_{ij}}(x_j)$ for any value of $x_j$. The consequent parameters for each rule are obtained as a weighted ordinary least-squares estimate. Let $\theta_i^T = [b_i]$, let $X_e$ denote the matrix $[X, \mathbf{1}]$, and let $W_i$ denote a diagonal matrix having the degree of activation $\beta_i(\mathbf{x}_k)$ as its $k$th diagonal element. Assuming that the columns of $X_e$ are linearly independent and $\beta_i(\mathbf{x}_k) > 0$ for $1 \le k \le N$, the weighted least-squares solution of $\mathbf{y} = X_e \theta + \epsilon$ becomes

$$\theta_i = \left[X_e^T W_i X_e\right]^{-1} X_e^T W_i \mathbf{y}. \qquad (6)$$
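A minimal sketch of the weighted least-squares estimate in (6), fitting one rule at a time with the activation degrees as weights; this is an illustrative reading of the formula, not the original code.

```python
import numpy as np

def consequent_parameters(X, y, beta):
    """X: (N, n) inputs; y: (N,) target column for one class; beta: (K, N)
    rule activation degrees. Returns theta: (K, n+1), one row per rule."""
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])   # extended matrix [X, 1]
    theta = []
    for b_i in beta:                                # one weighted LS problem per rule
        W = np.diag(b_i)                            # W_i: beta_i(x_k) on the diagonal
        theta.append(np.linalg.solve(Xe.T @ W @ Xe, Xe.T @ W @ y))   # Eq. (6)
    return np.array(theta)
```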

Rule bases constructed from clusters can be redundant due to the fact that the rules defined in the multidimensional premise are overlapping in one or more dimensions. A possible approach to solve this problem is to reduce the number of features n of the model, as addressed in this paper. The number of fuzzy rules (or clusters) that best suits the data must be determined prior to the construction of the fuzzy classifiers. For this purpose the following criterion, as proposed in Sugeno and Yasukawa (1993), is used to determine the number of clusters:



$$S(K) = \sum_{k=1}^{N} \sum_{i=1}^{K} (\mu_{ik})^m \left(\|\mathbf{x}_k - \mathbf{v}_i\|^2 - \|\mathbf{v}_i - \bar{\mathbf{x}}\|^2\right), \qquad (7)$$

where $N$ is the number of data points to be clustered, $K$ is the number of clusters ($K \ge 2$), $\mathbf{x}_k$ is the $k$th data point (usually a vector), $\bar{\mathbf{x}}$ is the mean value of the inputs, $\mathbf{v}_i$ is the center of the $i$th cluster, $\mu_{ik}$ is the grade of the $k$th data point belonging to the $i$th cluster, and $m$ is an adjustable weight. The parameter $m$ has a great importance in this criterion: the bigger $m$, the bigger the optimum number of clusters. Usually, this value is around 2. The number of clusters $K$ is increased from two up to the number that gives the minimum value of $S(K)$. Note that this minimum can be local. However, this procedure diminishes the number of rules and consequently the complexity of the fuzzy model. At each iteration, the clusters are determined using the fuzzy c-means algorithm (Bezdek, 1981), and the process stops when $S(K)$ increases from one iteration to the next. The first term of the right-hand side of (7) is the variance of the data in a cluster and the second term is the variance of the clusters themselves. The optimal clustering is the one that minimizes the variance in each cluster and maximizes the variance between clusters.

The performance criterion used to evaluate the fuzzy model is the classification accuracy $C_a$, given by the percentage of correct classifications:

$$C_a = \frac{N_n - N_e}{N_n} \times 100\%, \qquad (8)$$

where $N_n$ is the number of used samples and $N_e$ is the number of classification errors in the test samples (misclassifications).

3. Proposed ant feature selection

Ant algorithms were first proposed by Dorigo (1992) as a multi-agent approach to difficult combinatorial optimization problems, such as the traveling salesman problem, the quadratic assignment problem or supply chain management (Silva, Sousa, & Runkler, 2007; Silva, Sousa, Runkler, & Sá da Costa, 2006, 2009). The ACO methodology is an optimization method suited to find minimum cost paths in optimization problems described by graphs (Dorigo et al., 2006).

This paper presents a new implementation of ACO applied to feature selection, where the best number of features is determined automatically. In this approach, two objectives are considered: minimizing the number of features and minimizing the classification error. Two cooperative ant colonies optimize each objective: the first colony determines the number (cardinality) of features and the second selects the features based on the cardinality given by the first colony. Thus, two pheromone matrices and two different heuristics are used. A novel approach for computing a heuristic value is proposed to determine the cardinality of features. The heuristic value is computed using the Fisher discriminant criterion for feature selection (Duda, Hart, & Stork, 2001), which ranks the features based on their relative importance, and is described in more detail in Section 3.1.3. The best number of features is called the features cardinality $N_f$. The determination of the features cardinality is addressed in the first colony, which shares the same minimization cost function with the second colony; this cost function aggregates both the maximization of the classification accuracy and the minimization of the features cardinality. Hence, the first colony determines the size of the subsets of the ants in the second colony, and the second colony selects the features that will be part of the subsets.

3.1. Proposed algorithm

The algorithm proposed in this paper deals with the feature selection problem as a multi-criteria problem with a single objective function. Therefore, a pheromone matrix is computed for each criterion, and different heuristics are used. The objective function of this optimization algorithm aggregates both criteria, the minimization of the classification error rate and the minimization of the features cardinality:

$$J^k = \frac{N_e^k}{N_n} + \frac{N_f}{n}, \qquad (9)$$

where $N_n$ is the number of used data samples, $n$ is the total number of features and $k$ is a given ant. The approach is schematically represented in Fig. 1. To evaluate the classification error, a fuzzy classifier is built for each solution following the procedure described in Section 2. Let $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ be the set of the given $n$ features, and $\mathbf{w} = [w_1, w_2, \ldots, w_{N_f}]^T$ be a subset of features, where $\mathbf{w} \subseteq \mathbf{x}$. It is desirable that $N_f \ll n$. Table 1 describes the variables used in the algorithm.
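The cost in (9) reduces to two normalized terms. A minimal sketch, with argument names chosen for the example only; the evaluation of the fuzzy model that produces the error count is assumed, not shown:

```python
def cost(n_errors, n_samples, n_selected, n_total_features):
    """J^k = N_e^k / N_n + N_f / n : classification error rate of the fuzzy
    model built on ant k's subset, plus a penalty on the subset size."""
    return n_errors / n_samples + n_selected / n_total_features
```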

3.1.1. Probabilistic rule

Consider a problem with $N_f$ nodes and two colonies of $g$ ants. First, the $g$ ants of the first colony randomly select the number of nodes $N_f$ to be used by the $g$ ants of the second colony. The probability that an ant $k$ chooses the features cardinality $N_f(k)$ is given by

$$p_i^k(t) = \frac{[\tau_{n_i}]^{\alpha_n}\,[\eta_{n_i}]^{\beta_n}}{\sum_{l \in C_n^k} [\tau_{n_l}]^{\alpha_n}\,[\eta_{n_l}]^{\beta_n}}, \qquad (10)$$

where $\tau_{n_i}$ is the pheromone concentration matrix and $\eta_{n_i}$ is the heuristic function matrix for path $(i)$. The values of the pheromone matrix are limited to $[\tau_{n_{\min}}, \tau_{n_{\max}}]$, with $\tau_{n_{\min}} = 0$ and $\tau_{n_{\max}} = 1$. $C_n^k$ is the feasible neighborhood of ant $k$ (available numbers of features to be selected), which acts as the memory of the ants and contains all the trails that the ants have not passed and can still be chosen. The parameters $\alpha_n$ and $\beta_n$ measure the relative importance of trail pheromone and heuristic knowledge, respectively.

Fig. 1. Proposed cooperative ant algorithm. (Block diagram: the features are ranked, the two colonies update pheromones for cardinality and for selection, fuzzy models are built and tested on X_test/Y_test, and the cost is minimized.)



Table 1
Variables definition.

General
  n         Number of features
  N         Number of samples
  N_n       Number of samples used for validation
  I         Number of iterations
  K         Number of rules/clusters of the fuzzy model
  N_c       Number of existing classes in the database
  g         Number of ants
  x         Set with all the features
  w         Subset of features selected to build classifiers
  J^k       Cost of the solution for each ant k
  J^q       Cost of the winner ant q

Ant colony for cardinality of features
  N_f       Features cardinality (number of selected features)
  N_f(k)    Features cardinality of ant k
  I_n       Number of iterations with the same features cardinality
  α_n       Pheromone weight of features cardinality
  β_n       Heuristic weight of features cardinality
  τ_n       Pheromone trails for features cardinality
  η_n       Heuristic of features cardinality
  ρ_n       Evaporation of features cardinality
  C_n^k     Feasible neighborhood of ant k (features cardinality availability)
  Q_i       Amount of pheromone laid on the features cardinality of the best solution

Ant colony for selecting the subset of features
  L_f^k(t)  Feature subset for ant k at tour t
  α_f       Pheromone weight of features
  β_f       Heuristic weight of features
  τ_f       Pheromone trails for feature selection
  η_f       Heuristic of features
  ρ_f       Evaporation of features
  C_f^k     Feasible neighborhood of ant k (features availability)
  Q_j       Amount of pheromone laid on the features of the best solution

After all the $g$ ants from the first colony have chosen the features cardinality $N_f(k)$, each ant $k$ from the second colony selects $N_f(k)$ features (nodes). The probability that an ant $k$ chooses feature $j$ as the next feature to visit is given by

$$p_j^k(t) = \frac{[\tau_{f_j}(t)]^{\alpha_f}\,[\eta_{f_j}]^{\beta_f}}{\sum_{l \in C_f^k} [\tau_{f_l}(t)]^{\alpha_f}\,[\eta_{f_l}]^{\beta_f}}, \qquad (11)$$

where $\tau_{f_j}$ is the pheromone concentration matrix and $\eta_{f_j}$ is the heuristic function matrix for the path $(j)$. Again, the pheromone matrix values are limited to $[\tau_{f_{\min}}, \tau_{f_{\max}}]$, with $\tau_{f_{\min}} = 0$ and $\tau_{f_{\max}} = 1$. $C_f^k$ is the feasible neighborhood of ant $k$ (available features), which contains all the features that the ants have not yet selected and can still choose. Again, the parameters $\alpha_f$ and $\beta_f$ measure the relative importance of trail pheromone and heuristic knowledge, respectively.
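Both transition rules (10) and (11) are roulette-wheel draws over the feasible neighborhood, with pheromone weighted by alpha and the heuristic by beta. A minimal sketch shared by the two colonies; names and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def choose_node(tau, eta, feasible, alpha, beta):
    """tau, eta: 1-D arrays over all nodes; feasible: array of indices still available."""
    w = (tau[feasible] ** alpha) * (eta[feasible] ** beta)
    p = w / w.sum()                      # normalized probabilities of Eqs. (10)/(11)
    return rng.choice(feasible, p=p)     # chosen cardinality (first colony) or feature (second colony)

# Example: an ant of the second colony builds its subset without repetition.
# remaining = np.arange(n); subset = []
# for _ in range(Nf_k):
#     j = choose_node(tau_f, eta_f, remaining, alpha_f, beta_f)
#     subset.append(j); remaining = remaining[remaining != j]
```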

3.1.2. Updating rule

After a complete tour, when all the $g$ ants have visited all the $N_f(k)$ nodes, both pheromone concentrations on the trails are updated by

$$\tau_{n_i}(t+1) = \tau_{n_i}(t)\,(1 - \rho_n) + \Delta\tau_{n_i}(t) \qquad (12)$$
$$\tau_{f_j}(t+1) = \tau_{f_j}(t)\,(1 - \rho_f) + \Delta\tau_{f_j}(t) \qquad (13)$$

where $\rho_n \in [0, 1]$ is the pheromone evaporation of the features cardinality, $\rho_f \in [0, 1]$ is the pheromone evaporation of the features, and $\Delta\tau_{n_i}$ and $\Delta\tau_{f_j}$ are the pheromone deposited on the trails $(i)$ and $(j)$, respectively, by the ant $q$ that found the best solution $J^q$ for this tour:

$$\Delta\tau_{n_i}^q = \begin{cases} Q_i & \text{if node } (i) \text{ is used by the ant } q \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$
$$\Delta\tau_{f_j}^q = \begin{cases} Q_j & \text{if node } (j) \text{ is used by the ant } q \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$
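A minimal sketch of the update in (12)–(15): evaporation on every trail, deposit only on the nodes used by the tour-best ant q. Q stands for the deposit amount (Q_i or Q_j from Table 1); variable names are illustrative.

```python
import numpy as np

def update_pheromone(tau, rho, best_nodes, Q):
    """tau: 1-D pheromone vector of one colony; best_nodes: nodes used by the best ant q."""
    tau = tau * (1.0 - rho)              # evaporation, Eqs. (12)-(13)
    tau[np.asarray(best_nodes)] += Q     # deposit Delta-tau on the best ant's trail, Eqs. (14)-(15)
    return np.clip(tau, 0.0, 1.0)        # keep within [tau_min, tau_max] = [0, 1]
```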

The number of nodes $N_f(k)$ that each ant $k$ has to visit on each tour $t$ is only updated every $I_n$ tours (iterations), in order to allow the search for the best features for each features cardinality $N_f$. The algorithm runs $I$ times. Both colonies share the same cost function, given in (9).

3.1.3. Heuristics

The heuristic value used for each feature (the ants' visibility) in the second colony is computed as

$$\eta_{f_j} = 1/N_{e_j} \qquad (16)$$

for $j = 1, \ldots, n$. For the features cardinality (first colony), the heuristic value is computed using the Fisher discriminant criterion for feature selection (Duda et al., 2001). Considering a classification problem with two possible classes, class 1 and class 2, the Fisher discriminant criterion is described as

$$F(i) = \frac{\left(\mu_1(i) - \mu_2(i)\right)^2}{\sigma_1^2 + \sigma_2^2}, \qquad (17)$$

where $\mu_1(i)$ and $\mu_2(i)$ are the mean values of feature $i$ for the samples in class 1 and class 2, and $\sigma_1^2$ and $\sigma_2^2$ are the variances of feature $i$ for the samples in class 1 and class 2. The score aims to maximize the between-class difference and minimize the within-class spread. Other currently proposed rank-based criteria generally come from similar considerations and show similar performance (Duda et al., 2001). Since our goal is to work with several classification problems, which can contain two or more possible classes, a one-versus-all strategy is used to rank the features (a small sketch is given below).
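A minimal sketch of this ranking: the two-class Fisher score of (17), accumulated one-versus-all over the classes as formalized in (18) below. The small constant added to the denominator is only a numerical safeguard for the example.

```python
import numpy as np

def fisher_scores(X, labels):
    """X: (N, n) data; labels: (N,) class indices. Returns one score per feature."""
    scores = np.zeros(X.shape[1])
    for c in np.unique(labels):                      # one-versus-all comparison per class
        a, b = X[labels == c], X[labels != c]
        scores += (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return scores                                    # used to rank features for the cardinality heuristic
```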



Thus, for a $C$-class prediction problem, a particular class is compared with the other $C - 1$ classes, which are considered together. The features are weighted according to the total score summed over all $C$ comparisons:

$$F(i) = \sum_{j=1}^{C} F_j(i), \qquad (18)$$

where $F_j(i)$ denotes the Fisher discriminant score for the $i$th feature at the $j$th comparison.

Algorithm 1 presents the description of the ant feature selection algorithm.

Algorithm 1. Ant feature selection
  /* Initialization */
  set the parameters ρ_f, ρ_n, α_f, α_n, β_f, β_n, I, I_n, g
  for t = 1 to I do
    for k = 1 to g do
      Choose the subset size N_f(k) of each ant k using (10)
    end for
    for l = 1 to I_n do
      for k = 1 to g do
        Build the feature set L_f^k(t) by choosing N_f(k) features using (11)
        Compute the fuzzy model using the path L_f^k(t) selected by ant k
        Compute the cost function J^k(t)
        Update J^q
      end for
      Update the pheromone trails τ_n_i(t+1) and τ_f_j(t+1), as defined in (12) and (13)
    end for
  end for

4. Experimental results

4.1. Data sets and parameters

The effectiveness of the proposed method is tested on data sets taken from some well known benchmarks in the UCI repository (Asuncion & Newman, 2007). Four real data sets (Wine, Wisconsin Breast Cancer, Vote and Sonar) and one artificial data set (M-of-N) were used to test the ant feature selection algorithm. The obtained results are compared with other algorithms, namely: a particle swarm optimization rough set-based feature selection algorithm (PSORSFS) (Wang et al., 2007); a positive region-based attribute reduction algorithm (POSAR) (Jensen & Shen, 2003); conditional entropy-based attribute reduction (CEAR) (Wang & Zhao, 2004); discernibility matrix-based attribute reduction (DISMAR) (Hu, Lu, & Shi, 2003); GA-based attribute reduction (GAAR) (Bazan et al., 2000); top-down (TD) and bottom-up (BU) decision tree search feature selection (Mendonça et al., 2007); and a fuzzy GA-based rule selection method (Ishibuchi & Nojima, 2007). More details can be found in the cited references.

For all the data sets, the presented results are obtained using ten runs of the proposed algorithm. The parameters used in the experiments for the databases are presented in Table 2. The classification error rates of the used classifiers are obtained by performing cross validation (10-fold cross validation, CV10). For the sake of comparison with other published algorithms, the results presented for the wine data set are not obtained using cross validation; in this case a test data set and a training set are used. The experimental results are presented as the best, the worst and the mean value of correct classifications $C_a$ out of 10 independent runs. These 10 runs were obtained using the same test data set.

4.2. Cross validation

For the cross validation procedure, the data set with $N_n$ samples is divided into 10 mutually exclusive sets of approximately equal size, with each subset consisting of approximately the same proportion of labels as the original data set, known as stratified cross validation (Kohavi, 1995). The classifier is trained 10 times, each time with a different subset left out as the test set and the other samples used to train the classifier. During the training phase, the classifier is trained on nine out of the 10 folds, in which the classification accuracy defined in (8) is used. The prediction performance of the classifier is estimated by considering the average classification accuracy of the 10 cross-validation experiments, described as

$$E_{CV} = \left(\frac{1}{N_n} \sum_{i=1}^{10} C_i\right) \times 100\%, \qquad (19)$$

where $C_i$ is the number of correctly classified samples:

$$C_i = N_n - N_e. \qquad (20)$$
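A minimal sketch of the stratified 10-fold estimate in (19)–(20), assuming scikit-learn's StratifiedKFold for the splits and any fit/predict classifier standing in for the fuzzy model built by AFS:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv10_accuracy(model, X, y):
    correct = 0
    for train, test in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
        model.fit(X[train], y[train])
        correct += int((model.predict(X[test]) == y[test]).sum())   # C_i = N_n - N_e per fold
    return 100.0 * correct / len(y)                                 # E_CV of Eq. (19)
```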

Table 2
Values of parameters used in the experiments.

Data set        α_n   β_n   ρ_n    τ_n0   α_f   β_f   ρ_f    τ_f0   I     I_n   g
Wine            0.3   0.9   0.05   0.5    0.3   0.9   0.1    0.5    50    5     20
Breast cancer   0.3   0.9   0.05   0.5    0.3   0.9   0.1    0.5    50    5     20
Vote            0.2   0     0.06   0.5    0.3   0.9   0.09   0.5    100   5     30
M-of-N          0.4   0.9   0.05   0.5    0.3   0.9   0.1    0.5    50    5     20
Sonar           0.4   0.9   0.05   0.5    0.3   0.9   0.1    0.5    200   5     20

4.3. Results for different data sets

4.3.1. Wine data set

The wine data set is a widely used classification data set, available online in the repository of the University of California (Asuncion & Newman, 2007), and contains the chemical analysis of 178 wines grown in the same region in Italy, derived from three different cultivars. Thirteen continuous attributes are available for classification: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. The AFS algorithm is applied to select the relevant features within the wine classification data set and is compared to other approaches available in the literature. The data set characteristics are presented in Table 3.

Table 3
Description of the used data sets.

Data set        # Features   # Classes   # Samples
Wine            13           3           178
Breast cancer   9            2           699
Vote            16           2           300
M-of-N          13           2           1000
Sonar           60           2           208

As can be seen in Table 4, the obtained results are better than those of feature selection based on interclass separability (Roubos et al., 2003) and of classification without feature selection using a fuzzy classifier (Ishibuchi, Nakashima, & Murata, 1999). Further, the obtained classifier uses far fewer rules than the first two approaches (3 compared to 60) and fewer features. Ant feature selection uses a similar number of features as in Roubos et al. (2001), obtaining better classification accuracy. The approach in Mendonça et al. (2007) has better classification accuracy using a tree search method with the top-down approach, using 11 features. The bottom-up approach in Mendonça et al. (2007) uses slightly fewer features, and its classification accuracy is worse. Finally, the number of rules is very low and comparable to the other approaches.

An example of the process of the ant feature selection searching for optimal solutions for the wine data set is given in Figs. 2 and 3, where it can be observed that the average classification error decreases, indicating the convergence of the algorithm. In Fig. 2 the average number of incorrect classifications for all the k ants at each iteration t is presented. Fig. 3 shows the minimum number of incorrect classifications of all ants at each iteration t. Finally, Fig. 4 presents the evolution of the search for the best number of features.


Table 4
Classification rates on the Wine data for 10 independent runs.

Method                                Number of features   Number of rules   Classification accuracy (%)
                                                                              Max.    Mean   Min.
AFS approach                          4–8                  3                 100     99.8   98.8
Corcoran and Sen (1994)               13                   60                100     99.5   98.3
Ishibuchi et al. (1999)               13                   6                 99.4    98.5   97.8
Roubos et al. (2001)                  4–7                  3                 99.4    –      98.3
Mendonça et al. (2007) (Top-down)     11                   2                 100     99.9   99.4
Mendonça et al. (2007) (Bottom-up)    4                    2                 100     98.5   92.7

4.3.2. Breast cancer data set

The breast cancer data set is also available in the repository of the University of California (Asuncion & Newman, 2007) and was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. The Wisconsin breast cancer data is widely used to test the effectiveness of classification algorithms. The aim of the classification is to distinguish between benign and malignant cancers based on the available nine measurements (attributes): clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The attributes have integer values in the range [1, 10]. The original database contains 699 instances; however, 16 of these are omitted because they are incomplete, which is common in data mining. The class distribution is 65.5% benign and 34.5% malignant.

In Figs. 5 and 6, an example of the evolution of the algorithm tours for the breast cancer data set is presented. In Fig. 5 the average of the number of incorrect classifications for all ants at each iteration t is shown. Fig. 6 presents the minimum number of incorrect classifications of all the k ants at each iteration t. In Fig. 7, the evolution of the search for the best feature cardinality is presented.

Fig. 3. Minimum number of errors for each iteration in the wine data set.

Fig. 4. Best number of features for each iteration in the wine data set.

Fig. 2. Average classification accuracy for each iteration in the wine data set.

Fig. 5. Average classification accuracy for each iteration in the breast cancer data set.



Fig. 6. Minimum number of errors for each iteration in the breast cancer data set.

Fig. 7. Best number of features for each iteration in the breast cancer data set.

As can be seen in Table 5, the obtained results are always better than those in Wang et al. (2007), using PSO and rough set-based feature selection, and in Abonyi and Szeifert (2003), using a Gath–Geva clustering based classifier, supervised clustering, and feature selection based on interclass separability. This result is remarkable, as the number of features is smaller and the number of rules is smaller or equal when compared to the other classification techniques.

4.3.3. Vote data set

The vote data set is a widely used classification data set available in the repository of the University of California (Asuncion & Newman, 2007); it is a binary class problem (democrat vs. republican) with sixteen binary features and 435 instances. In the vote data set, the two classes are not evenly represented. There are some missing values in this data set. The samples with missing values were discarded, and the total number of instances used is 300. Table 6 presents the results obtained with AFS and compares them to other algorithms presented in the literature.

In Figs. 8 and 9, an example of the evolution of the algorithm tours for the vote data set is presented, where the average of the number of incorrect classifications for all the k ants at each iteration t and the minimum number of incorrect classifications of all the k ants at each iteration t are shown. In Fig. 10, the evolution of the search for the best feature cardinality is presented. In this example, the algorithm started to converge to the best solution after 30 iterations (see Figs. 8–10). For this particular data set, the number of iterations needed to guarantee convergence was set to 100. The obtained results are always better than those in Wang et al. (2007); recall that these algorithms were described in Section 4.1. Note that the number of features used by the AFS approach is lower than in the other algorithms, the number of rules is lower, and the obtained accuracy is better.

4.3.4. M-of-N data set

An artificial data set representing the M-of-N concept (Murphy & Pazzani, 1991) was also tested: an at-least-M-of-N concept among 13 features, with 1000 instances used. Table 7 shows the results of the feature selection algorithm. Ten runs with one half of the instances for training and the other half for testing were used to validate the results. In Fig. 11, an example of the evolution of the algorithm tours for the M-of-N data set is presented. Note that for this data set the minimum number of errors for each iteration is always zero. In Fig. 12, the evolution of the search for the best feature cardinality is presented for one of the runs of the proposed ACO algorithm.

As can be seen in Table 7, the obtained results are similar to those in Wang et al. (2007), while using far fewer fuzzy rules. In this data set the number of features used by AFS is higher than in the other algorithms, but the accuracy is maintained.

4.3.5. Sonar data set

The sonar data set is also available in the repository of the University of California (Asuncion & Newman, 2007) and contains information on 208 objects and 60 attributes. The objects are classified into two classes: "rock" and "mine". A data frame with 208 observations and 61 variables is used. The first 60 variables represent the energy within a particular frequency band, integrated over a certain period of time. The last column contains the class labels: 0 if the object is a rock, and 1 if the object is a mine (metal cylinder). The value of each attribute ranges from 0.0 to 1.0. This data set is an interesting challenge for the proposed algorithm: the number of features is larger than in the usual benchmark examples, giving the opportunity to test the ant algorithm.

Table 5
Classification rates on the Breast Cancer data for 10-fold cross validation.

Method                                    Number of features   Number of rules   Classification accuracy (%)
                                                                                  Max.     Mean    Min.
AFS approach                              3                    2                 100.0    96.4    91.3
Wang et al. (2007) (POSAR)                4                    67                95.94    –       –
Wang et al. (2007) (CEAR)                 4                    75                94.20    –       –
Wang et al. (2007) (DISMAR)               5                    67                95.94    –       –
Wang et al. (2007) (GAAR)                 4                    64                95.65    –       –
Wang et al. (2007) (PSORSFS)              4                    64                95.80    –       –
Abonyi and Szeifert (2003) (GG: R = 2)    8–9                  2                 95.71    90.99   84.28
Abonyi and Szeifert (2003) (Sup: R = 2)   7–9                  2                 98.57    92.56   84.28
Abonyi and Szeifert (2003) (GG: R = 4)    9                    2                 98.57    95.14   88.57
Abonyi and Szeifert (2003) (Sup: R = 4)   8–9                  2                 98.57    95.57   90.00


Table 6
Classification rates on the Vote data for 10-fold cross validation.

Method                          Number of features   Number of rules   Classification accuracy (%)
                                                                        Max.    Mean   Min.
AFS approach                    4                    3                 100     94.3   87.1
Wang et al. (2007) (POSAR)      9                    25                94.3    –      –
Wang et al. (2007) (CEAR)       11                   25                92.3    –      –
Wang et al. (2007) (DISMAR)     8                    28                93.7    –      –
Wang et al. (2007) (GAAR)       9                    25                94.0    –      –
Wang et al. (2007) (PSORSFS)    8                    25                95.3    –      –

Table 7
Classification rates on the M-of-N data for 10-fold cross validation.

Method                          Number of features   Number of rules   Classification accuracy (%)
                                                                        Max.   Mean   Min.
AFS approach                    9                    3                 100    100    100
Wang et al. (2007) (POSAR)      7                    35                100    –      –
Wang et al. (2007) (CEAR)       7                    35                100    –      –
Wang et al. (2007) (DISMAR)     6                    35                100    –      –
Wang et al. (2007) (GAAR)       6                    35                100    –      –
Wang et al. (2007) (PSORSFS)    6                    35                100    –      –

Fig. 8. Average classification accuracy for each iteration in the vote data set.

Fig. 9. Minimum number of errors for each iteration in the vote data set.

Fig. 10. Best number of features for each iteration in the vote data set.

Fig. 11. Average classification accuracy for each iteration in the M-of-N data set.

Fig. 12. Best number of features for each iteration in the M-of-N data set.



Table 8
Classification rates on the Sonar data for 10-fold cross validation.

Method                        Number of features   Number of rules (average)   Classification accuracy (%), average best
AFS approach                  15–31                3                           83.1
Ishibuchi and Nojima (2007)   all                  10.01                       82.7

In Table 8, the average best error rates are presented for the sake of comparison. The accuracy results achieved in this work are comparable with the ones obtained in Ishibuchi and Nojima (2007). Note that the number of features used by the AFS approach to build the fuzzy model is relatively small when compared to the number of available features, and yet the accuracy of the classification model is maintained. Further, the average number of rules is substantially smaller than in Ishibuchi and Nojima (2007). Figs. 13 and 14 present an example of the evolution of the algorithm tours for the sonar data set. The evolution of the search for the best feature cardinality is presented in Fig. 15.

Fig. 13. Average classification accuracy for each iteration in the sonar data set.

Fig. 15. Best number of features for each iteration in the sonar data set.

4.4. Discussion

The feature selection algorithm proposed in this paper was applied to four real data sets (Wine, Wisconsin Breast Cancer, Vote and Sonar) and one artificial data set (M-of-N). It can be observed that real data sets can have many weakly relevant features rather than strongly relevant or totally irrelevant features. The ant feature selection algorithm discards a bigger percentage of features for the real data sets than for the artificial data set. However, the selected features are not always the same, since there are features that are weakly relevant and have a similar influence on the classifier.

The time of convergence of the presented algorithm can be reduced by using a lower number of ants. This number is related to the number of features in the data set. The number of fuzzy rules in the classifiers is generally quite small when compared to the other approaches. The ant feature selection algorithm was run on an Intel Core 2, 2.4 GHz CPU, with 2 GB of RAM. The computational time for the presented algorithm is 830 s for the wine data set, 567 s for breast cancer, 434 s for vote, 125 s for M-of-N and 1343 s for sonar. The data sets with a bigger number of classes tend to require a larger computational effort. This can be explained by the fact that about 90% of the computational time is spent in obtaining the fuzzy models. In general, it can be concluded that the computational effort of the proposed algorithm is not too high.

The performance of the fuzzy models with ant feature selection is equal to or better than the performance of the fuzzy models using all features, as can be seen in Tables 4–7. In the M-of-N data set the performance is the same, but AFS has the advantage of selecting the more relevant features and discarding the non-relevant ones. For all data sets, the proposed algorithm performs equal to or better than the other algorithms cited in the literature, as can be observed in Tables 4–7.

Fig. 14. Minimum number of errors for each iteration in the sonar data set.

5. Conclusions

A feature selection algorithm based on two cooperative ant colonies is proposed in this paper. The problem is divided into two objectives: choosing the features cardinality and selecting the most relevant features. The feature selection algorithm uses fuzzy classifiers. The proposed algorithm was applied to five classification databases that are considered benchmarks, and its performance was compared to previous works. The ant-based feature selection algorithm yielded similar or better classification rates. In the near future, the proposed feature selection algorithm will be applied to classification problems with a larger number of features. Moreover, as the proposed ant feature selection algorithm is generic, it can be combined with other intelligent classifiers, such as neural network classifiers.


Acknowledgements

This work is supported by the Portuguese Government and FEDER under the programs: Programa de financiamento Plurianual das Unidades de I&D da FCT (POCTI-SFA-10-46-IDMEC) and by the FCT grant SFRH/25381/2005, Fundação para a Ciência e a Tecnologia, Ministério do Ensino Superior, da Ciência e da Tecnologia, Portugal.

References

Abonyi, J., & Szeifert, F. (2003). Supervised fuzzy clustering for the identification of fuzzy classifiers. Pattern Recognition Letters, 24(14), 2195–2207.
Al-Ani, A. (2005). Feature subset selection using ant colony optimization. International Journal of Computational Intelligence, 2(1), 53–58.
Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository.
Bazan, J., Nguyen, H. S., Nguyen, S. H., Synak, P., & Wroblewski, J. (2000). Rough set algorithms in classification problem. In L. Polkowski, S. Tsumoto, & T. Y. Lin (Eds.), Rough set methods and applications (pp. 49–88). New York: Physica-Verlag.
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective functions. New York: Plenum Press.
Boz, O. (2002). Feature subset selection using sorted feature relevance. In Proceedings of ICMLA 2002, international conference of machine learning and applications, Los Angeles, USA (pp. 147–153).
Corcoran, A. L., & Sen, S. (1994). Using real-valued genetic algorithms to evolve rule sets for classification. In Proceedings of the 1st IEEE conference on evolutionary computation (pp. 120–124).
Dorigo, M. (1992). Optimization, learning and natural algorithms (in Italian). Ph.D. Thesis.
Dorigo, M., Birattari, M., & Stützle, T. (2006). Ant colony optimization. IEEE Computational Intelligence Magazine, 1(4), 28–39.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). Wiley–Interscience Publication.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Hu, K., Lu, Y. C., & Shi, C. Y. (2003). Feature ranking in rough sets. AI Communication, 16(1), 41–50.
Ishibuchi, H., Nakashima, T., & Murata, T. (1999). Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Transactions on Systems, Man and Cybernetics, Part B, 29, 601–618.
Ishibuchi, H., & Nojima, Y. (2007). Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. International Journal of Approximate Reasoning, 44, 4–31.
Jensen, R., & Shen, Q. (2003). Finding rough set reducts with ant colony optimization. In Proceedings of the 2003 UK workshop on computational intelligence (pp. 15–22).
Jensen, R., & Shen, Q. (2004). Fuzzy-rough attribute reduction with application to web categorization. Fuzzy Sets and Systems, 141(3), 469–485.
Jensen, R., & Shen, Q. (2005). Fuzzy-rough data reduction with ant colony optimization. Fuzzy Sets and Systems, 149, 5–20.
Klir, G. J., & Yuan, B. (1995). Fuzzy sets and fuzzy logic: Theory and applications. Upper Saddle River: Prentice-Hall.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the international joint conference on artificial intelligence.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), 491–502.
Mendonça, L. F., Vieira, S. M., & Sousa, J. M. C. (2007). Decision tree search methods in fuzzy modeling and classification. International Journal of Approximate Reasoning, 44(2), 106–123.
Motoda, H., & Liu, H. (1998). Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers.
Murphy, P. M., & Pazzani, M. J. (1991). Id2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. In Proceedings of the eighth international workshop on machine learning, Evanston, IL, June 1991 (pp. 183–187).
Pulkkinen, P., & Koivisto, H. (2008). Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. International Journal of Approximate Reasoning, 48(2), 526–543.
Roubos, J. A., Setnes, M., & Abonyi, J. (2001). Learning fuzzy classification rules from data. In Developments in soft computing: Proceedings of the international conference on recent advances in soft computing 2000 (Chapter 13, pp. 108–115). Berlin/Heidelberg: Springer-Verlag.
Roubos, J. A., Setnes, M., & Abonyi, J. (2003). Learning fuzzy classification rules from labeled data. International Journal of Information Sciences, 150(1), 77–93.
Schreyer, G. R., & Raidl, M. (2002). Letting ants labeling point features. In Proceedings of the 2002 congress on evolutionary computation (CEC'02) (Vol. 2, pp. 1564–1569).
Setnes, M., & Roubos, J. A. (2000). GA-fuzzy modeling and classification: Complexity and performance. IEEE Transactions on Fuzzy Systems, 8(5), 509–522.
Silva, C. A., Sousa, J. M. C., & Runkler, T. A. (2007). Rescheduling and optimization of logistic processes using GA and ACO. Engineering Applications of Artificial Intelligence, 21(3), 343–352.
Silva, C. A., Sousa, J. M. C., Runkler, T. A., & Sá da Costa, J. M. G. (2006). Distributed optimization of a logistic system and its suppliers using ant colony optimization. International Journal of Systems Science, 37(8), 503–512.
Silva, C. A., Sousa, J. M. C., Runkler, T. A., & Sá da Costa, J. M. G. (2009). Distributed supply-chain management using ant colony optimization. European Journal of Operations Research, 199, 349–358.
Sivagaminathan, R. K., & Ramakrishnan, S. (2007). A hybrid approach for feature subset selection using neural networks and ant colony optimization. Expert Systems with Applications, 33, 49–60.
Sousa, J. M. C., & Kaymak, U. (2002). Fuzzy decision making in modeling and control. Singapore and UK: World Scientific and Imperial College.
Sugeno, M., & Yasukawa, T. (1993). A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1), 7–31.
Takagi, T., & Sugeno, M. (1985). Fuzzy identification of systems and its applications to modelling and control. IEEE Transactions on Systems, Man and Cybernetics, 15(1), 116–132.
Vieira, S. M., Sousa, J. M. C., & Runkler, T. A. (2008). Fuzzy classification in ant feature selection. In Proceedings of the 2008 IEEE world congress on computational intelligence, WCCI 2008, Hong Kong, China, June 2008 (pp. 1763–1769).
Vieira, S. M., Sousa, J. M. C., & Runkler, T. A. (2007). Ant colony optimization applied to feature selection in fuzzy classifiers. In Melin et al. (Eds.), Lecture notes in artificial intelligence 4529: Foundations of fuzzy logic and soft computing. Proceedings of the 12th international fuzzy systems association world congress, IFSA 2007 (Vol. 4529, pp. 778–788). Cancun, México: Springer.
Wang, G. Y., & Zhao, J. (2004). Theoretical study on attribute reduction of rough set theory: Comparison of algebra and information views. In Proceedings of the third IEEE international conference on cognitive informatics.
Wang, X., Yang, J., Teng, X., Xia, W., & Jensen, R. (2007). Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28, 459–471.