Applied Soft Computing 11 (2011) 239–248
Chaotic maps based on binary particle swarm optimization for feature selection

Li-Yeh Chuang (a), Cheng-Hong Yang (b,c,*), Jung-Chike Li (c)

(a) Department of Chemical Engineering, I-Shou University, Kaohsiung 80041, Taiwan
(b) Department of Network Systems, Toko University, Chiayi 61363, Taiwan
(c) Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80778, Taiwan

* Corresponding author at: Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80778, Taiwan. Tel.: +886 7 381 4526x5639; fax: +886 7 383 4418. E-mail addresses: [email protected] (L.-Y. Chuang), [email protected] (C.-H. Yang), [email protected] (J.-C. Li).

doi:10.1016/j.asoc.2009.11.014
Article info

Article history:
Received 13 September 2008
Received in revised form 8 November 2009
Accepted 16 November 2009
Available online 4 December 2009

Keywords:
Feature selection
Binary particle swarm optimization
Chaotic maps
K-nearest neighbor
Leave-one-out cross-validation
Abstract

Feature selection is a useful pre-processing technique for solving classification problems. The challenge of solving the feature selection problem lies in applying evolutionary algorithms capable of handling the huge number of features typically involved. Generally, given classification data may contain useless, redundant or misleading features. To increase classification accuracy, the primary objective is to remove irrelevant features from the feature space and to correctly identify the relevant ones. Binary particle swarm optimization (BPSO) has been applied successfully to solving feature selection problems. In this paper, two kinds of chaotic maps, logistic maps and tent maps, are embedded in BPSO; the chaotic maps determine the inertia weight of the BPSO. We propose chaotic binary particle swarm optimization (CBPSO) to implement feature selection, in which the K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) serves as the classifier for evaluating classification accuracies. The proposed feature selection method shows promising results with respect to the number of selected features, and its classification accuracy is superior to that of other methods from the literature.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

Feature selection is the process of choosing a subset of features from an original feature set; it can be viewed as a principal pre-processing tool for solving classification problems [1]. Determining an optimal feature subset is a very complex task, and one that strongly influences the resulting classification error rate. The final feature subset is expected to retain high classification accuracy. The goal is to select a relevant subset of d features from the set of D features (d < D) in a given data set [2]. The full set of D features may include noisy, redundant and misleading features, so an exhaustive search over the entire solution space usually takes prohibitively long and often cannot be applied in practice [3]. To resolve the feature selection problem, we therefore aim at retaining only a subset of d relevant features. Irrelevant features are not only useless for classification, but can also reduce classification performance. By deleting irrelevant features, computational efficiency can be improved and classification accuracy increased.
For the identification of relevant features and the removal of irrelevant ones, two different approaches can be employed, namely the filter and wrapper models. The filter model relies on general characteristics of the data to evaluate and select feature subsets without involving any learning algorithm. The wrapper model first applies an optimization algorithm that adds or deletes features to produce various feature subsets, and then employs a classification algorithm to evaluate each subset. In this study, we adopted a wrapper model using an evolutionary algorithm for the feature selection problem. Different evolutionary algorithms have been proposed recently to obtain near-optimal subsets of solutions. These approaches include genetic algorithms (GAs) designed by imitating natural evolution [2], ant colony optimization (ACO) [4], particle swarm optimization (PSO) based on swarm intelligence [1], and tabu search (TS) with intermediate memory [5].

Several common feature selection methods are introduced briefly here. The sequential forward selection (SFS) method begins with an empty feature subset and sequentially adds features until some termination criterion is met; discarded features cannot be re-selected, and selected features cannot be removed later. SFS therefore suffers from a nesting effect, and since SFS algorithms do not examine all possible feature subsets, they are not guaranteed to produce an optimal result [6]. The sequential backward selection (SBS) method is the backward version of SFS. The plus-l-take-away-r process (PTA(l, r)) goes forward l stages (by adding l features via SFS) and then goes backward r stages (by deleting r features via SBS);
this process is repeated several times. PTA was proposed to resolve the problems associated with the nesting effect; however, there is no theoretical guidance for determining appropriate values of l and r [6]. Sequential forward floating selection (SFFS) is the floating version of PTA(l, r). Unlike PTA(l, r), SFFS can reverse an unlimited number of times, as long as each reversal finds a better feature subset than the best subset obtained so far at the same size [6]. SFFS suffers from entrapment in local optima when applied to problems with a large number of features [6]. Simple genetic algorithms (SGA) generally have poor search ability near local optima, and premature convergence in SGA seems difficult to avoid; test parameters in SGA therefore need to be carefully selected [6]. A minimal sketch of the sequential scheme is given after this section's overview.

Particle swarm optimization (PSO) is a search process based on the idea of swarm intelligence in biological populations. In PSO, an information-sharing algorithm randomly generates an initial population for the search process. The position and velocity of each particle are adjusted based on its individual experience and the experience of its neighbors, and the information is updated through social interactions between particles. PSO has been successfully applied in many areas, including data clustering [7], multimodal functions [8], flow-shop scheduling [9], and bioinformatics [10]. Other recently published works on PSO are listed in the reference section [11–15] for the interested reader.

Chaos is the deterministic dynamic behavior of a non-linear system. It has ergodicity, stochasticity and regularity properties, and is very sensitive to its initial conditions and parameters: small differences in the initial solutions result in great differences after many iterations. These characteristics of a chaotic system can enhance the search ability of PSO. Each chaotic map has its own special properties, and using distinct chaotic maps in BPSO yields different results. Chaos theory has been successfully applied to problems in a variety of areas, amongst them communications and numerical simulation [16], engineering design [17], sound and vibration [18], and optimization algorithms [19].

In this paper, a binary version of PSO with chaotic maps (CBPSO) determining the inertia weight values is used. The two kinds of chaotic maps used in binary particle swarm optimization are logistic maps and tent maps. The K-nearest neighbor (K-NN) method based on Euclidean distance calculations serves as a classifier for ten data sets taken from the literature [2]. Experimental results show that CBPSO not only reduces the number of features, but also achieves higher classification accuracies.

This paper is organized as follows. Section 2 introduces the methods used in this study, namely binary particle swarm optimization, chaotic sequences for the inertia weight, the K-nearest neighbor classifier and the CBPSO–K-NN procedure. Section 3 details the experimental results and contains a discussion thereof; results obtained by the proposed method are compared with results of other feature selection methods. Finally, a brief conclusion is offered in Section 4.
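As a concrete illustration of the sequential schemes discussed above, the following is a minimal Python sketch of SFS. The evaluate function (e.g., the classification accuracy of a candidate feature subset) is an assumed placeholder, and all names are illustrative rather than taken from the paper.

def sfs(num_features, evaluate, target_size):
    # Minimal sketch of sequential forward selection (SFS).
    # `evaluate(subset) -> float` is an assumed scoring callback.
    selected = []                          # once added, a feature is never removed (nesting effect)
    remaining = set(range(num_features))
    while len(selected) < target_size and remaining:
        # Score every single-feature extension of the current subset.
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

Because selected features are never re-examined, the sketch makes the nesting effect visible: the greedy choices are locked in, which is exactly why SFS is not guaranteed to find the optimal subset.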
2. Methods
2.1. Binary particle swarm optimization

Particle swarm optimization (PSO) is a population-based evolutionary computation technique developed by Kennedy and Eberhart in 1995 [20]. PSO simulates the social behavior of organisms, e.g., birds in a flock or fish in a school; this behavior can be described by a swarm intelligence system. In PSO, each solution can be considered an individual particle in a given search space, which has its own position and velocity. During movement, each particle adjusts its position by changing its velocity based on its own experience, as well as the experience of its companions, until an optimum position is reached by itself and its companions [21].

All of the particles have fitness values based on the calculation of a fitness function. Particles are updated at each iteration by following two parameters, called pbest and gbest. Each particle is associated with the best solution (fitness) it has achieved so far in the search space; this fitness value is stored and represents the position called pbest. The value gbest is the global optimum for the entire population.

PSO was originally developed to solve real-valued optimization problems. A swarm consists of N particles moving around a d-dimensional search space. The position of the pth particle can be represented by $X_p = (x_{p1}, x_{p2}, \ldots, x_{pd})$, and its velocity by $V_p = (v_{p1}, v_{p2}, \ldots, v_{pd})$. The positions and velocities of the particles are confined within $[X_{min}, X_{max}]^d$ and $[V_{min}, V_{max}]^d$, respectively. Particles coexist and evolve simultaneously based on knowledge shared with neighboring particles; they make use of their own memory and knowledge gained by the swarm as a whole to find the best solution. The best previously visited position of the pth particle is noted as its individual best position $pbest_p = (pbest_{p1}, pbest_{p2}, \ldots, pbest_{pd})$, a value called pbest. The best of all individual $pbest_p$ values is denoted the global best position $gbest = (gbest_1, gbest_2, \ldots, gbest_d)$ and called gbest.

The PSO process is initialized with a population of random particles, and the algorithm then executes a search for optimal solutions by continuously updating generations. At each generation, the position and velocity of the pth particle are updated by $pbest_p$ and gbest in the swarm. The update equations can be formulated as

$v_{pd}^{new} = w \times v_{pd}^{old} + c_1 \times rand_1 \times (pbest_{pd} - x_{pd}^{old}) + c_2 \times rand_2 \times (gbest_d - x_{pd}^{old})$   (1)

$x_{pd}^{new} = x_{pd}^{old} + v_{pd}^{new}$   (2)

Many optimization problems occur in a space featuring discrete, qualitative distinctions between variables and between levels of variables. To extend the real-valued version of PSO to a binary/discrete space, Kennedy and Eberhart proposed a binary version of the PSO method (BPSO). In a binary search space, a particle moves to nearer or farther corners of a hypercube by flipping various numbers of bits; the overall particle velocity may thus be described by the number of bits changed per iteration [21]. The position of each particle is represented in binary string form by $X_p = (x_{p1}, x_{p2}, \ldots, x_{pd})$ and is randomly generated; the bit values 0 and 1 represent a non-selected and a selected feature, respectively. The velocity of each particle is represented by $V_p = (v_{p1}, v_{p2}, \ldots, v_{pd})$, where p indexes the particles and d is the number of dimensions (features) of a given data set. The initial velocities of the particles are probabilities limited to the range [0.0, 1.0]. In BPSO, once the adaptive values pbest and gbest are obtained, the features of the pbest and gbest particles can be tracked with regard to their position and velocity. Each particle is updated according to the following equations [9]:

$v_{pd}^{new} = w \times v_{pd}^{old} + c_1 \times rand_1 \times (pbest_{pd} - x_{pd}^{old}) + c_2 \times rand_2 \times (gbest_d - x_{pd}^{old})$   (3)

if $v_{pd}^{new} \notin (V_{min}, V_{max})$ then $v_{pd}^{new} = \max(\min(V_{max}, v_{pd}^{new}), V_{min})$   (4)

$S(v_{pd}^{new}) = \dfrac{1}{1 + e^{-v_{pd}^{new}}}$   (5)

if $rand < S(v_{pd}^{new})$ then $x_{pd}^{new} = 1$; else $x_{pd}^{new} = 0$   (6)
In Eq. (3), w is the inertia weight, c1 and c2 are acceleration parameters, and rand, rand1 and rand2 are three independent random numbers between [0, 1]. The velocities $v_{pd}^{new}$ and $v_{pd}^{old}$ are those of the updated particle and of the particle before being updated, respectively; $x_{pd}^{old}$ is the original particle position (solution), and $x_{pd}^{new}$ is the updated particle position (solution). In Eq. (4), the particle velocity of each dimension is limited to a maximum velocity $V_{max}$: if the sum of accelerations causes the velocity of a dimension to exceed $V_{max}$, the velocity of that dimension is clamped to $V_{max}$. Both $V_{max}$ and $V_{min}$ are user-specified parameters (in our case $V_{max} = 6$, $V_{min} = -6$). In Eqs. (5) and (6), the updated features are calculated by the function $S(v_{pd}^{new})$, in which $v_{pd}^{new}$ is the updated velocity value. If $S(v_{pd}^{new})$ is larger than a randomly produced number within [0.0, 1.0], then the position value $x_{pd}^{new}$ is set to 1, meaning this feature is selected as a required feature for the next update; otherwise $x_{pd}^{new}$ is set to 0, meaning this feature is not selected (Fig. 1).

Fig. 1. The BPSO flow chart. p: number of particles, d: number of dimensions, D: total number of dimensions, g: number of generations, G: maximum number of generations, and N: population size.
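As an illustration of Eqs. (3)–(6), the following is a minimal NumPy sketch of one BPSO update step. The vectorized layout and the function name bpso_step are our own choices, not part of the original method description; the parameter values follow the text.

import numpy as np

V_MIN, V_MAX = -6.0, 6.0   # user-specified velocity bounds from the text
C1 = C2 = 2.0              # acceleration factors (Section 3.2)

def bpso_step(x, v, pbest, gbest, w, rng):
    """x, v, pbest: (N, d) arrays; gbest: (d,) array; w: inertia weight."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    # Eq. (3): velocity update driven by personal and global best positions.
    v = w * v + C1 * r1 * (pbest - x) + C2 * r2 * (gbest - x)
    # Eq. (4): clamp velocities to [V_MIN, V_MAX].
    v = np.clip(v, V_MIN, V_MAX)
    # Eq. (5): squash each velocity into a selection probability.
    s = 1.0 / (1.0 + np.exp(-v))
    # Eq. (6): the bit becomes 1 (feature selected) if a uniform random number falls below S(v).
    x = (rng.random(x.shape) < s).astype(int)
    return x, v

# Usage: x, v = bpso_step(x, v, pbest, gbest, w=0.48, rng=np.random.default_rng(0))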
2.2. Chaotic sequences for inertia weight

The inertia weight controls the balance between global exploration and local search ability: a large inertia weight facilitates the global search, while a small inertia weight facilitates the local search, so proper adjustment of the inertia weight value is important. The inertia weight w is the key factor influencing convergence; it greatly affects the BPSO search process and, through it, the resulting classification accuracy. The BPSO process often suffers from entrapment of particles in a local optimum, which causes the premature convergence mentioned above. We employed chaotic binary particle swarm optimization (CBPSO) to prevent this early convergence and thus achieve superior classification results.

Chaos is a deterministic dynamic system very sensitive to its initial conditions and parameters. The nature of chaos is apparently random and unpredictable; however, it also possesses an element of regularity [22]. A chaotic map is used to determine the inertia weight value at each iteration. Logistic maps and tent maps are the most frequently used chaotic maps; they are similar
to each other, and both show unstable dynamic behavior. Chaotic sequences have been proven easy and fast to generate and store, as there is no need to store long sequences [22]. In this paper, chaotic sequences are embedded in BPSO, with logistic maps and tent maps determining the inertia weight values. The number of iterations in CBPSO is given by t. The inertia weight value is modified by a logistic map according to

$w(t+1) = 4.0 \times w(t) \times (1 - w(t)), \quad w(t) \in (0, 1)$   (7)

Fig. 2 shows the chaotic inertia weight values obtained with a logistic map over the total number of iterations; w(0) is set to 0.48. The inertia weight value modified by a tent map is given by

$w(t+1) = \begin{cases} w(t)/0.7, & w(t) < 0.7 \\ (10/3) \times w(t) \times (1 - w(t)), & \text{otherwise} \end{cases}, \quad w(t) \in (0, 1)$   (8)

Fig. 3 shows the chaotic inertia weight values obtained with a tent map over the total number of iterations; w(0) is the same as for the logistic map above, i.e., 0.48.

Fig. 2. Chaotic inertia weight using logistic map.
Fig. 3. Chaotic inertia weight using tent map.

When the inertia weight value is close to 1, CBPSO strengthens its global search ability; for inertia weight values close to 0, CBPSO enhances its local search ability.
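The two inertia-weight schedules of Eqs. (7) and (8) are simple to implement. The following sketch (illustrative code, not from the paper) generates both sequences from w(0) = 0.48; its first iterates reproduce the w(1) values reported later in Table 3.

def logistic_map(w):
    # Eq. (7): w(t+1) = 4.0 * w(t) * (1 - w(t)), w in (0, 1)
    return 4.0 * w * (1.0 - w)

def tent_map(w):
    # Eq. (8): piecewise tent map on (0, 1)
    return w / 0.7 if w < 0.7 else (10.0 / 3.0) * w * (1.0 - w)

w_log = w_tent = 0.48              # w(0) = 0.48, as in Figs. 2 and 3
for t in range(3):
    w_log, w_tent = logistic_map(w_log), tent_map(w_tent)
    print(t + 1, round(w_log, 5), round(w_tent, 5))
# First iterates: w(1) = 0.9984 (logistic) and 0.68571 (tent),
# matching the w(1) values reported in Table 3.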
2.3. K-nearest neighbor

The K-nearest neighbor (K-NN) method is a supervised learning algorithm introduced by Fix and Hodges in 1951, and it is still one of the most popular nonparametric methods [23,24]. The K-NN method is easy to implement, since only the parameter K (the number of nearest neighbors) has to be determined; K is an important factor affecting the performance of the classification process. In a multi-dimensional feature space, the data are divided into testing and training samples, and K-NN classifies a new object based on the minimum distance from the test sample to the training samples. The Euclidean distance was used for the calculations in this paper. An object is assigned to the category most common among its K nearest neighbors. In order to increase the classification accuracy, the parameter K has to be adjusted to the characteristics of the data set. In K-NN, a large category tends to have a small classification error, while the classification error for minority classes is usually rather large, a fact that lowers the performance of K-NN under such circumstances [25]. Cross-validation is a useful technique for choosing the parameter K. In this paper, the leave-one-out cross-validation (LOOCV) method was used: when there are n data points to be classified, the data are divided into one testing sample and n − 1 training samples at each iteration of the evaluation process, a classifier is constructed from the n − 1 training samples, and the category of the testing sample is judged by this classifier.

2.4. CBPSO–K-NN procedure

Initially, the position of each particle is represented by a binary (0/1) string S = F1, F2, ..., Fd, where d is the number of features of the test data; a 1 represents a selected feature, while a 0 represents a non-selected feature. The classification accuracy of a 1-nearest neighbor (1-NN) classifier, determined by the leave-one-out cross-validation method, is used to measure the fitness of each particle. For example, let a data set contain seven records A, B, C, D, E, F and G, each having four features. The pth particle then has a binary string of length d = 4, in which Sp is randomly generated as 1010 (shown in Fig. 4). A, B, C, D, E, F and G are in turn used individually as the testing sample, with the other six records as training data, to evaluate whether the selected features allow correct classification. In Fig. 4-1, G is the testing sample and the other six records are the training data. Since Sp = 1010, F1 and F3 are selected. When 1-NN is applied (Fig. 4-2), G is classified as a square, because D has the shortest distance to G. However, since G is labeled as a triangle in the data set, it is a wrongly classified sample. The fitness value of the pth particle is accuracy = (correctly classified samples/total samples) × 100%.

Fig. 4. Example of the CBPSO–K-NN procedure.
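The fitness computation described above can be sketched as follows. This is a minimal illustration of 1-NN with LOOCV over a selected-feature bitmask; the function and variable names are our own choices, and the zero-feature fallback is an assumption for robustness rather than a detail from the paper.

import numpy as np

def fitness(bitmask, X, y):
    """1-NN / LOOCV accuracy (%) of the features selected by `bitmask`."""
    cols = np.flatnonzero(bitmask)        # indices of selected features
    if cols.size == 0:
        return 0.0                        # assumed: no features selected counts as worst fitness
    Z = X[:, cols]
    correct = 0
    for i in range(len(Z)):               # leave-one-out: sample i is the test point
        d = np.linalg.norm(Z - Z[i], axis=1)
        d[i] = np.inf                     # exclude the test point itself
        correct += y[int(np.argmin(d))] == y[i]
    return 100.0 * correct / len(Z)       # accuracy in %, as in the text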
Fig. 5 shows the flowchart of the entire CBPSO process, and Fig. 6 presents a simplified diagram of the three stages of feature selection employed in this study.

Fig. 5. Chaotic BPSO flowchart.
Fig. 6. Three stages of feature selection (simplified).

The CBPSO procedure can be summarized as follows:

Step 1. Randomly generate an initial population for CBPSO.
Step 2. Evaluate the fitness values of all particles.
Step 3. Calculate the inertia weight value with a chaotic map according to Eq. (7) or (8).
Step 4. Update the pbest and gbest values. Each particle updates its position and velocity through Eqs. (3)–(6).
Step 5. Check the termination criterion. If satisfied, output the final solution; otherwise go to Step 2.
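The following is a compact, illustrative assembly of Steps 1–5, reusing the bpso_step, tent_map and fitness sketches above; parameter values follow Section 3.2. This is our reading of the procedure, not the authors' original code.

import numpy as np

def cbpso(X, y, n_particles=20, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    x = rng.integers(0, 2, size=(n_particles, d))       # Step 1: random binary swarm
    v = rng.random((n_particles, d))                    # initial velocities in [0, 1]
    pbest = x.copy()
    pfit = np.array([fitness(p, X, y) for p in x])
    gbest = pbest[pfit.argmax()].copy()
    w = 0.48                                            # w(0), as in the experiments
    for _ in range(n_iter):                             # Step 5: run for t = 100 iterations
        w = tent_map(w)                                 # Step 3: chaotic inertia weight (CBPSO(2))
        x, v = bpso_step(x, v, pbest, gbest, w, rng)    # Eqs. (3)-(6)
        fit = np.array([fitness(p, X, y) for p in x])   # Step 2: evaluate fitness
        better = fit > pfit                             # Step 4: update pbest and gbest
        pbest[better], pfit[better] = x[better], fit[better]
        gbest = pbest[pfit.argmax()].copy()
    return gbest, pfit.max()

Swapping tent_map for logistic_map turns this sketch into CBPSO(1); holding w fixed at 0.48 recovers plain BPSO as configured in Section 3.2.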
3. Results and discussion

3.1. Data sets

The data sets used in this study were obtained from the UCI Repository [26]. Table 1 shows the format of the ten classification problems. The feature selection problems were divided into three scales [27]: if the number of features is between 10 and 19, the problem is considered small scale (the Glass, Vowel, Wine, Letter, Vehicle and Segmentation data sets); if the number of features is between 20 and 49, the problem is medium scale (the WDBC, Ionosphere and Satellite data sets); and if the number of features is greater than 50, the problem is large scale (the Sonar data set). Selected feature subsets were classified by the 1-NN method with leave-one-out cross-validation for all ten data sets. Except for the Glass data set, the values of each feature of all data sets were normalized to [0, 1] according to the normalization procedure given in Eq. (9).
Table 1
Format of classification test problems.

Data sets       Number of samples   Number of classes   Number of features   Classifier method
Glass           214                 7                   10                   1-NN
Vowel           990                 11                  10                   1-NN
Wine            178                 3                   13                   1-NN
Letter          20000               26                  16                   1-NN
Vehicle         846                 4                   18                   1-NN
Segmentation    210/2100            7                   19                   1-NN
WDBC            569                 2                   30                   1-NN
Ionosphere      351                 2                   34                   1-NN
Satellite       4435/2000           6                   36                   1-NN
Sonar           208                 2                   60                   1-NN

x/y indicates the number of testing and training samples, respectively.
Table 2
Classification accuracies for the test data sets.

Data set                 d*   SFS     PTA     SFFS    SGA     HGA(1)  HGA(2)  HGA(3)  HGA(4)
Glass (D = 10)           2    99.07   99.07   99.07   99.07   99.07   99.07   NA      NA
                         4    100     100     100     100     100     100     100     100
                         6    100     100     100     100     100     100     100     100
                         8    100     100     100     100     100     100     100     NA
    BPSO: d** = 3, A = 100;  CBPSO(1): d** = 3, A = 100;  CBPSO(2): d** = 3, A = 100

Vowel (D = 10)           2    62.02   62.02   62.02   62.02   62.02   62.02   NA      NA
                         4    92.63   92.83   92.83   92.83   92.83   92.83   92.83   92.83
                         6    98.28   98.79   98.79   98.79   98.79   98.79   98.79   98.79
                         8    99.70   99.70   99.70   99.70   99.70   99.70   99.70   NA
    BPSO: d** = 9, A = 99.49;  CBPSO(1): d** = 9, A = 99.49;  CBPSO(2): d** = 9, A = 99.49

Wine (D = 13)            3    93.82   93.82   93.82   93.82   93.82   93.82   93.82   NA
                         5    94.38   94.38   94.94   95.51   95.51   95.51   95.51   95.51
                         8    95.51   95.51   95.51   95.51   95.51   95.51   95.51   95.51
                         10   92.13   92.13   92.70   92.70   92.70   92.70   92.70   92.70
    BPSO: d** = 8, A = 98.88;  CBPSO(1): d** = 8, A = 99.44;  CBPSO(2): d** = 8, A = 99.44

Letter (D = 16)          3    47.09   47.09   47.09   47.09   47.09   47.09   47.09   NA
                         6    86.20   87.60   87.60   87.60   87.60   87.60   87.60   87.60
                         10   96.12   96.35   96.35   96.35   96.35   96.35   96.35   96.35
                         13   96.42   96.42   96.42   96.42   96.42   96.42   96.42   96.42
    BPSO: d** = 16, A = 95.38;  CBPSO(1): d** = 15, A = 96.45;  CBPSO(2): d** = 13, A = 96.58

Vehicle (D = 18)         4    62.77   64.78   69.15   69.50   69.74   69.74   69.74   69.74
                         7    69.15   70.09   73.52   72.97   73.52   73.52   73.52   73.52
                         11   69.50   71.75   71.87   71.84   72.46   72.46   72.46   72.46
                         14   68.20   70.80   70.80   70.80   70.80   70.80   70.80   70.80
    BPSO: d** = 11, A = 74.70;  CBPSO(1): d** = 10, A = 74.35;  CBPSO(2): d** = 12, A = 75.06

Segmentation (D = 19)    4    92.81   92.81   92.81   92.81   92.81   92.81   92.81   92.81
                         8    92.95   92.95   92.95   92.95   92.95   92.95   92.95   92.95
                         11   92.95   92.95   92.95   92.95   92.95   92.95   92.95   92.95
                         15   92.57   92.57   92.57   92.57   92.57   92.57   92.57   92.57
    BPSO: d** = 11, A = 97.88;  CBPSO(1): d** = 10, A = 97.92;  CBPSO(2): d** = 10, A = 97.92

WDBC (D = 30)            6    93.15   93.15   94.20   93.67   93.92   94.38   93.99   93.99
                         12   92.62   92.97   94.20   93.95   94.06   94.06   94.06   94.27
                         18   94.02   94.20   94.20   93.85   93.92   93.99   94.13   93.99
                         24   92.44   93.50   93.85   93.85   93.85   93.85   93.85   93.85
    BPSO: d** = 13, A = 97.72;  CBPSO(1): d** = 12, A = 97.54;  CBPSO(2): d** = 15, A = 97.54

Ionosphere (D = 34)      7    93.45   93.45   93.45   94.70   95.38   95.50   95.56   95.50
                         14   90.88   92.59   93.79   94.30   94.93   95.56   95.21   95.21
                         20   90.03   92.02   92.88   93.79   93.90   94.19   93.73   94.13
                         27   89.17   91.17   90.88   91.17   91.45   91.45   91.45   91.45
    BPSO: d** = 10, A = 93.73;  CBPSO(1): d** = 12, A = 93.45;  CBPSO(2): d** = 15, A = 96.02

Satellite (D = 36)       7    86.85   88.20   88.55   87.89   88.25   88.44   88.55   88.46
                         14   89.45   89.85   90.10   90.61   90.88   90.87   90.80   90.82
                         22   90.45   91.10   91.45   91.36   91.37   91.44   91.45   91.41
                         29   90.40   90.70   91.25   91.10   91.21   91.24   91.24   91.18
    BPSO: d** = 14, A = 90.61;  CBPSO(1): d** = 22, A = 91.24;  CBPSO(2): d** = 21, A = 91.45

Sonar (D = 60)           12   87.02   89.42   92.31   92.40   93.65   94.71   94.61   94.81
                         24   89.90   90.87   93.75   95.49   95.86   95.96   96.34   96.15
                         36   88.46   91.83   93.27   95.09   95.67   95.82   95.67   95.67
                         48   91.82   92.31   91.35   92.02   92.60   93.17   93.17   93.08
    BPSO: d** = 32, A = 92.79;  CBPSO(1): d** = 27, A = 93.27;  CBPSO(2): d** = 26, A = 95.75

D: total number of features; d*: selected number of features; d**: optimal selected number of features; A (%): classification accuracy in %; SFS: sequential forward search; PTA: plus and take away; SFFS: sequential forward floating search; SGA: simple genetic algorithm; HGA: hybrid genetic algorithm; BPSO: binary particle swarm optimization; CBPSO(1): chaotic BPSO with logistic map; CBPSO(2): chaotic BPSO with tent map.
$x' = lower + (upper - lower) \times \dfrac{value - value_{min}}{value_{max} - value_{min}}$   (9)

In Eq. (9), $value_{max}$ is the maximum original value of a feature and $value_{min}$ is its minimum original value; upper is set to 1 and lower is set to 0.
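Eq. (9) is a standard min-max normalization, shown below as a short NumPy sketch (illustrative code, assuming a data matrix X with samples in rows and features in columns):

import numpy as np

def min_max_normalize(X, lower=0.0, upper=1.0):
    # Eq. (9), applied column-wise: map each feature to [lower, upper].
    vmin, vmax = X.min(axis=0), X.max(axis=0)
    return lower + (upper - lower) * (X - vmin) / (vmax - vmin)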
3.2. Parameter settings

The parameters of BPSO, CBPSO(1) (logistic map) and CBPSO(2) (tent map) were almost the same. No swarm size between 20 and 100 particles produces results that are clearly superior or inferior to any other value for a majority of the tested problems [28]. In order to reduce computing time and to compare our results to the results of the HGAs [2] under identical conditions, we set the population size of BPSO, CBPSO(1) and CBPSO(2) to 20. The termination criterion is t iterations, with t set to 100. The acceleration factors c1 and c2 were both set to 2; this value is almost ubiquitously adopted in PSO research. Multiplying the random weighting factors by an acceleration constant of 2.0 gives the product a mean of 1, so that agents "overfly" the target about half of the time [21]. These values were taken from Shi and Eberhart [29]. The inertia weight w for BPSO was fixed at 0.48, and the initial inertia weight w(0) for CBPSO(1) and CBPSO(2) was also set to 0.48.

3.3. Discussion
As shown in Table 2, the number of selected features could be decreased for six UCI data sets. When feature selection is employed, only a relatively small number of features needs to be selected. The selected features play an important role in determining the classification performance: our goal is to identify the features that prove most beneficial for classification, since the selected features greatly influence the performance of BPSO, CBPSO(1) and CBPSO(2), and the search ability of the particles depends on the extended search ability of BPSO and CBPSO.

Table 2 compares the experimental results obtained by other methods from the literature [2] with the BPSO, CBPSO(1) and CBPSO(2) methods used here. The classification accuracies of the Glass,
Vowel, Wine, Letter, Vehicle, Segmentation, WDBC, Ionosphere, Satellite and Sonar classification problems obtained by BPSO are 100%, 99.49%, 98.88%, 95.38%, 74.70%, 97.88%, 97.72%, 93.73%, 90.61% and 92.79%, respectively. The corresponding accuracies obtained by CBPSO(1) are 100%, 99.49%, 99.44%, 96.45%, 74.35%, 97.92%, 97.54%, 93.45%, 91.24% and 93.27%; CBPSO(1) thus obtained a higher classification accuracy for the Wine, Letter, Vehicle, Segmentation and WDBC classification problems than any of the methods from the literature. The corresponding accuracies obtained by CBPSO(2) are 100%, 99.49%, 99.44%, 96.58%, 75.06%, 97.92%, 97.54%, 96.02%, 91.45% and 95.75%; CBPSO(2) obtained the highest classification accuracy of any method for the Wine, Letter, Vehicle, Segmentation, Ionosphere and Satellite classification problems.

Figs. 7.1–7.10 show the number of iterations vs. classification accuracy of BPSO, CBPSO(1) and CBPSO(2) for the ten test data sets. These figures also clearly show the difference in classification accuracy between CBPSO with a logistic map and CBPSO with a tent map: at the same iteration, the classification accuracies of CBPSO(1) and CBPSO(2) may differ because of the different inertia weight values. For the majority of the ten test data sets, CBPSO(2) proved superior to CBPSO(1) in classification accuracy, which seems to indicate that a tent map for the inertia weight value is more suitable for the CBPSO method.

Fig. 7.1. Number of iterations vs. classification accuracy in the Glass data set.
Fig. 7.2. Number of iterations vs. classification accuracy in the Vowel data set.
Fig. 7.3. Number of iterations vs. classification accuracy in the Wine data set.
Fig. 7.4. Number of iterations vs. classification accuracy in the Letter data set.
Fig. 7.5. Number of iterations vs. classification accuracy in the Vehicle data set.
Fig. 7.6. Number of iterations vs. classification accuracy in the Segmentation data set.
Fig. 7.7. Number of iterations vs. classification accuracy in the WDBC data set.
Fig. 7.8. Number of iterations vs. classification accuracy in the Ionosphere data set.
Fig. 7.9. Number of iterations vs. classification accuracy in the Satellite data set.
Fig. 7.10. Number of iterations vs. classification accuracy in the Sonar data set.

Classification accuracies of the methods from the literature [2] were measured at four values (D/5, 2D/5, 3D/5 and 4D/5) of the total number of features D. The number of feature subsets with A features (A = D/5, 2D/5, 3D/5 or 4D/5) is at least 2^A; the literature methods evaluate these feature subsets and then report classification accuracies, with the optimal classification accuracy for each number of features obtained by exhaustive search. In fact, the optimal number of features for each test problem is unknown: if the optimal number of features does not fall on D/5, 2D/5, 3D/5 or 4D/5, the literature methods cannot report an applicable classification accuracy. Furthermore, evaluating the four values D/5, 2D/5, 3D/5 and 4D/5 is an extremely cumbersome and time-consuming process compared to the proposed method. For most of the test data sets, the methods from the literature obtained lower classification accuracies, a fact representatively illustrated by the Wine test problem, in which the classification accuracies obtained by CBPSO(1) and CBPSO(2) were higher than those of the literature methods for the same number of selected features. In the case of the Satellite test problem, CBPSO(2) obtained 91.45% classification accuracy, the same as the best method from the literature, but required the selection of only 21 features, fewer than the 22 (3D/5) features selected by the literature methods.
Table 3
Inertia weight with chaotic maps for selected iterations.

Inertia weight       Logistic map   Tent map
w(0)                 0.48           0.48
w(1)                 0.9984         0.68571
w(t/10) = w(10th)    0.99577        0.72826
w(2t/10) = w(20th)   0.41727        0.33007
w(3t/10) = w(30th)   0.90041        0.92483
w(4t/10) = w(40th)   0.78968        0.00503
w(5t/10) = w(50th)   0.90091        0.17793
w(6t/10) = w(60th)   0.86353        0.28359
w(7t/10) = w(70th)   0.75244        0.94465
w(8t/10) = w(80th)   0.50798        0.54505
w(9t/10) = w(90th)   0.90041        0.46981
w(t) = w(100th)      0.78968        0.89087
Table 2 furthermore shows that GAs lead to a better performance than SFS (sequential forward search), PTA (plus and take away) and SFFS (sequential forward floating search) [2]. The results in Table 2 indicate that CBPSO works well for small and medium size problems, although for high-dimensional problems (the Sonar test problem) it only slightly improves the classification accuracy.

A chaos system has certain ergodic and stochastic properties. Driving the inertia weight with chaotic maps causes the inertia weight values to fluctuate within [0, 1]. The changing inertia weight values affect the velocities and positions of the particles at each iteration, and this fluctuation enables the particles to move on to new search regions. Extending the searched region of the solution space is an important task: when inertia weight values are close to 1 or 0, the CBPSO process boosts the search in global or local regions, respectively, so chaotic behavior increases both the local and the global search ability. A group of particles may search new regions of the solution space and congregate toward a global or near-global optimum. This extended search of the solution space is equivalent to trying different subset combinations of features, which leads to superior classification.

In order to compare the performance of logistic maps and tent maps, w(0) was set to the same value of 0.48 for both. Table 3 shows the inertia weight values with chaotic behavior for the logistic map and the tent map, so the inertia weight at each listed iteration can be clearly seen; for example, the inertia weight of CBPSO(1) starts at w(1) = 0.9984, and the inertia weight of CBPSO(2) adopts a value of 0.72826 at the t/10 (10th) iteration. The experimental results in Table 2 indicate that CBPSO(2) is superior to CBPSO(1). According to Figs. 2 and 3, the rate of oscillation for the logistic map is higher than that for the tent map. The more stable state of the tent map is preferable, as indicated by the better performance of CBPSO(2); the greater stability ensures a steadier search capability.

To explain the effect of the inertia weight with the tent map, consider the Vehicle data set in Table 4. When CBPSO(2) executes the 32nd iteration, its search ability is affected by the 31st inertia weight. From the 31st to the 34th iteration, the inertia weight changes from 0.924834 to 0.208548, 0.297926 and 0.425609, respectively. As the inertia weight is reduced from about 0.9 to about 0.2, CBPSO(2) increases its local search ability, and thus by the 34th iteration the classification accuracy of CBPSO(2) is boosted from 74.47 (31st iteration) to 74.7 (34th iteration). The higher classification accuracy indicates that CBPSO(2) finds more classification-relevant features than in the preceding iterations. From the 67th to the 69th iteration, the inertia weight changes from 0.876822 to 0.324017 and 0.462881, respectively. This case is similar to the one above, as CBPSO(2) focuses on the local search in order to obtain higher classification accuracy. The difference is that although the classification accuracies for these iterations are the same (74.7), CBPSO(2) removes one irrelevant feature, and thus the number of features is reduced from 12 to 11.

Table 4
Effect of the inertia weight with tent map on the number of features selected and classification accuracy (shown for the Vehicle data set).

Iteration   w          Accuracy   Features
31st        0.924834   74.47      13
32nd        0.208548   74.47      13
33rd        0.297926   74.47      13
34th        0.425609   74.7       13
35th        0.608013   74.7       13
...         ...        ...        ...
67th        0.876822   74.7       12
68th        0.324017   74.7       12
69th        0.462881   74.7       11
70th        0.661258   74.7       11
71st        0.944655   74.7       11
72nd        0.156847   75.06      12
73rd        0.224067   75.06      12
Table 5 shows the time complexity of the proposed method compared to the time complexities of the literature methods [27]. The number of initial features is denoted D. In Table 5, Θ(·) denotes a tight estimate of complexity (exact except for a multiplicative constant) and O(·) denotes an estimate of complexity for which only an upper bound is known; these time complexities provide only a rough guide when using these algorithms [27]. Search types can be divided into sequential and parallel types. The existing parallel type methods obtain more solutions in the feature space than the sequential type methods, meaning that parallel type methods have a stronger search ability than the sequential type methods.

Table 5
Time complexity and search types of existing methods.

Existing methods   Time complexity   Search type
SFS                Θ(D²)             Sequential
PTA                Θ(D²)             Sequential
SFFS               O(2^D)            Sequential
SGA                Θ(D)              Parallel
HGAs               Θ(D)              Parallel
BPSO               Θ(D)              Parallel
CBPSO(1)           Θ(D)              Parallel
CBPSO(2)           Θ(D)              Parallel

In BPSO, two independent random numbers, rand1 and rand2 in Eq. (1), affect the velocity of each particle. The proper adjustment of the BPSO parameter w (inertia weight) and the acceleration factors c1 and c2 is an important task. The inertia weight w controls the balance between global exploration and local search ability: a large inertia weight facilitates the global search, whereas a small inertia weight facilitates the local search. c1 and c2 control the movement of the particles. To avoid premature BPSO convergence, the adjustment should not be excessive, since this might cause extreme particle movements; an excessive adjustment would make it impossible to obtain optimized features. Hence, suitable parameter adjustment is paramount. The parameters for BPSO were taken from Shi and Eberhart [29], who optimized them.

4. Conclusion

Feature selection is a fundamental technique in many application areas, and different evolutionary algorithms have been developed for different feature selection problems. In this paper, two kinds of chaotic maps, a logistic map and a tent map, are embedded in binary particle swarm optimization. Since the inertia weight value applied to the feature selection process changes at every iteration, the two chaotic maps show different dynamic behavior within [0, 1]. The sequence generated by a chaotic map consists of pseudo-random numbers; however, there are no fixed points, periodic orbits, or quasi-periodic orbits in the behavior of the chaos system, so the system can avoid entrapment in local optima. This behavior affects the search ability of CBPSO. The classification accuracy is calculated by a 1-NN classifier with LOOCV. Experimental results show that CBPSO with a tent map obtained higher classification accuracies than CBPSO with a logistic map, and that CBPSO is generally very competitive when compared to other methods. It could serve as an ideal pre-processing tool to help optimize the feature selection process.

Acknowledgements

This work was partly supported by the National Science Council in Taiwan under grants NSC96-2622-E-151-019-CC3, NSC96-2622-E-214-004-CC3, NSC95-2221-E-151-004-MY3, NSC95-2221-E-214-087, and NSC95-2622-E-214-004.
References

[1] X. Wang, J. Yang, X. Teng, W. Xia, R. Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28 (2007) 459–471.
[2] I.-S. Oh, J.-S. Lee, B.-R. Moon, Hybrid genetic algorithms for feature selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 1424–1437.
[3] T.M. Cover, J.M. Van Campenhout, On the possible orderings in the measurement selection problem, IEEE Transactions on Systems, Man and Cybernetics 7 (1977) 657–661.
[4] R.K. Sivagaminathan, S. Ramakrishnan, A hybrid approach for feature subset selection using neural networks and ant colony optimization, Expert Systems with Applications 33 (2007) 49–60.
[5] M.A. Tahir, A. Bouridane, F. Kurugollu, Simultaneous feature selection and feature weighting using hybrid tabu search/K-nearest neighbor classifier, Pattern Recognition Letters 28 (2007) 438–446.
[6] H. Zhang, G. Sun, Feature selection using tabu search method, Pattern Recognition 35 (2002) 701–711.
[7] Y.-T. Kao, E. Zahara, I.W. Kao, A hybridized approach to data clustering, Expert Systems with Applications 34 (2008) 1754–1762.
[8] Y.-T. Kao, E. Zahara, A hybrid genetic algorithm and particle swarm optimization for multimodal functions, Applied Soft Computing 8 (2008) 849–857.
[9] Z. Lian, X. Gu, B. Jiao, A novel particle swarm optimization algorithm for permutation flow-shop scheduling to minimize makespan, Chaos, Solitons & Fractals 35 (2008) 851–861.
[10] W. Qian, Y. Yang, N. Yang, C. Li, Particle swarm optimization for SNP haplotype reconstruction problem, Applied Mathematics and Computation 196 (2008) 266–272.
[11] A. Cervantes, I.M. Galvan, P. Isasi, AMPSO: a new particle swarm method for nearest neighborhood classification, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 39 (2009) 1082–1091.
[12] H. Shinn-Ying, L. Hung-Sui, L. Weei-Hurng, H. Shinn-Jang, OPSO: orthogonal particle swarm optimization and its application to task assignment problems, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 38 (2008) 288–298.
[13] M.A. Montes de Oca, T. Stutzle, M. Birattari, M. Dorigo, Frankenstein's PSO: a composite particle swarm optimization algorithm, IEEE Transactions on Evolutionary Computation 13 (2009) 1120–1132.
[14] Z.H. Zhan, J. Zhang, Y. Li, H.S. Chung, Adaptive particle swarm optimization, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 39 (2009) 1362–1381.
[15] S. Kiranyaz, T. Ince, A. Yildirim, M. Gabbouj, Fractional particle swarm optimization in multidimensional search space, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics (2009) 1–22.
[16] K. Fallahi, R. Raoufi, H. Khoshbin, An application of Chen system for secure chaotic communication based on extended Kalman filter and multi-shift cipher algorithm, Communications in Nonlinear Science and Numerical Simulation 13 (2008) 763–781.
[17] L.d.S. Coelho, V.C. Mariani, Use of chaotic sequences in a biologically inspired algorithm for engineering design optimization, Expert Systems with Applications 34 (2008) 1905–1913.
[18] S.-Y. Liu, X. Yu, S.-J. Zhu, Study on the chaos anti-control technology in nonlinear vibration isolation system, Journal of Sound and Vibration 310 (2008) 855–864.
[19] D. Yang, G. Li, G. Cheng, On the efficiency of chaos optimization algorithms for global optimization, Chaos, Solitons & Fractals 34 (2007) 1366–1375.
[20] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 4, 1995, pp. 1942–1948.
[21] J. Kennedy, R.C. Eberhart, Y. Shi, Swarm Intelligence, 1st ed., Morgan Kaufmann, San Francisco, 2001.
[22] B. Alatas, E. Akin, A.B. Ozer, Chaos embedded particle swarm optimization algorithms, Chaos, Solitons & Fractals 40 (2009) 1715–1734.
[23] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1967) 21–27.
[24] E. Fix, J. Hodges Jr., Discriminatory analysis. Nonparametric discrimination: consistency properties, International Statistical Review/Revue Internationale de Statistique 57 (1989) 238–247.
[25] S. Tan, An effective refinement strategy for KNN text classifier, Expert Systems with Applications 30 (2006) 290–298.
[26] P. Murphy, D. Aha, UCI Repository of Machine Learning Databases, 1995. URL: http://www.sgi.com/Technology/mlc/db.
[27] M. Kudo, J. Sklansky, Comparison of algorithms that select features for pattern classifiers, Pattern Recognition 33 (2000) 25–41.
[28] D. Bratton, J. Kennedy, Defining a standard for particle swarm optimization, in: IEEE Swarm Intelligence Symposium, 2007, pp. 120–127.
[29] Y. Shi, R.C. Eberhart, A modified particle swarm optimizer, in: Proceedings of the 1998 IEEE International Conference on Evolutionary Computation, 1998, pp. 69–73.