A two-stage feature selection method with its application


Computers and Electrical Engineering 47 (2015) 114–125


Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

Xuehua Zhao a, Daoliang Li b,∗, Bo Yang c, Huiling Chen d, Xinbin Yang a, Chenglong Yu a, Shuangyin Liu e

a School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China
b College of Information and Electrical Engineering, China Agricultural University, P. O. Box 121, 17 Tsinghua East Road, Beijing 100083, China
c College of Computer Science and Technology, Jilin University, Changchun 130012, China
d College of Physics and Electronic Information, Wenzhou University, Wenzhou 325035, China
e College of Information, Guangdong Ocean University, Zhanjiang 524025, China

Article info

Article history: Received 22 April 2014; Revised 17 August 2015; Accepted 24 August 2015; Available online 24 September 2015

Keywords: Foreign fibers; Feature selection; Information gain; Binary particle swarm optimization

Abstract

Foreign fibers in cotton seriously affect the quality of cotton products. Online detection systems for foreign fibers based on machine vision are efficient tools to minimize their harmful effects. An optimum feature set with small size and high accuracy can markedly improve the performance of such online detection systems. To find optimal feature sets, a two-stage feature selection algorithm combining the IG (Information Gain) approach and BPSO (Binary Particle Swarm Optimization) is proposed for foreign fiber data. In the first stage, the IG approach is used to filter out noisy features; in the second stage, BPSO uses classifier accuracy as its fitness function to select the highly discriminating features. The proposed algorithm is tested on a foreign fiber dataset. The experimental results show that it can efficiently find feature subsets with smaller size and higher accuracy than other algorithms.
© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Foreign fibers in cotton refer to non-cotton fibers and dyed fibers, such as hairs, binding ropes, plastic films, candy wrappers, and polypropylene twines. Foreign fibers in cotton, especially in lint, seriously affect the quality of the final cotton textile products even at low content [1,2]. To reduce the harm of foreign fibers to cotton textile products, online detection systems based on machine vision have been studied in recent years for evaluating the quality of cotton [2–5]. In such systems, the classification of foreign fibers in cotton is the basic and key technology, and it is closely related to the system's performance. To improve the classification accuracy and the efficiency of such systems, finding optimum feature sets with small size and high accuracy is essential: such sets not only simplify the design of the classifier but also improve the efficiency of online detection. This is, in nature, a feature selection (FS) problem.

FS is the technique of selecting a subset of relevant features for building robust learning models, and it is commonly used in machine learning. FS aims at simplifying a feature set by reducing its dimensionality and identifying relevant underlying features without sacrificing predictive accuracy [6]. Unfortunately, finding the optimal feature subset has been proved to be NP-hard [7], so a number of FS algorithms have been proposed to find near-optimal solutions in a smaller amount of time. These methods are generally divided into three categories: the filter approach, the wrapper approach and the embedded method.



Corresponding author. Tel.:+86 10 62736764; fax: +86 10 62737741. E-mail address: [email protected], [email protected] (D. Li).

http://dx.doi.org/10.1016/j.compeleceng.2015.08.011 0045-7906/© 2015 Elsevier Ltd. All rights reserved.


In the first category, the filter approaches are first utilized to select subsets of features before the learning algorithms are applied. The wrapper approaches [8], in contrast, utilize the learning algorithm as a fitness function and search for the best subsets of features in the space of all feature subsets. Besides filters and wrappers, the embedded methods incorporate variable selection as a part of the training process, and feature relevance is obtained analytically from the objective of the learning model [9].

1.1. Related work

Currently, several FS approaches have been applied to select feature sets of foreign fibers in cotton for online detection. Yang et al. [10] proposed an improved genetic algorithm by which the optimal feature subset can be selected effectively and efficiently from a multi-character feature set. The algorithm adopts a segmented chromosome management scheme to implement local management of chromosomes, obtaining strong searching ability at the beginning of the evolution and accelerated convergence as the evolution proceeds. Zhao et al. [11] proposed an FS algorithm that uses ant colony optimization to select the feature subsets of foreign fibers in cotton. To improve the efficiency of ant colony optimization, Zhao et al. [12] proposed an improved ant colony optimization for feature selection, whose objective is to find the (near) optimal subsets in multi-character feature sets; a group constraint is adopted to limit the subset construction process and the probability transition, reducing the effect of invalid subsets and improving convergence efficiency. Li et al. [13] used PSO (particle swarm optimization) to select the optimal feature sets of foreign fibers in cotton. These algorithms are all wrapper approaches: their advantages include the interaction between feature subset search and model selection, and the ability to take feature dependencies into account, but a common drawback is that they have a higher risk of overfitting than filter techniques and are very computationally intensive.

1.2. Motivation and contribution

Filter approaches assess the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. Afterwards, the remaining subset of features is presented as input to the classification algorithm. Advantages of filter approaches are that they easily scale to very high-dimensional datasets, owing to their simplicity and speed, and that they are independent of the classification algorithm. As a result, feature selection needs to be performed only once, after which different classifiers can be evaluated. A common disadvantage of filter approaches is that they ignore the interaction with the classifier and that most approaches are univariate: each feature is considered separately, ignoring feature dependencies, which may lead to worse classification performance than other types of feature selection approaches.

Whereas filter approaches are independent of the model hypothesis, wrapper methods embed the model hypothesis search within the feature subset search. In this setup, a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated.
The evaluation of a specific subset of features is obtained by training and testing a specific classification model, which makes this approach tailored to a specific classification algorithm. To search the space of all feature subsets, a search algorithm is "wrapped" around the classification model. However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. Advantages of wrapper approaches include the interaction between feature subset search and model selection, and the ability to take feature dependencies into account. A common drawback of these techniques is that they have a higher risk of overfitting than filter techniques and are very computationally intensive.

In this paper, we propose a two-stage feature selection algorithm that combines the IG (Information Gain) filter approach and the BPSO (Binary Particle Swarm Optimization) wrapper approach for foreign fiber data. The IG method selects the features with the highest scores. BPSO, when combined with a learning algorithm, is a successful wrapper but is computationally expensive. The integration of IG and BPSO thus leads to an effective feature selection scheme. In the first stage of our algorithm, IG is applied to find a candidate feature set of foreign fibers; this filters out many unimportant features and reduces the computational load for BPSO. In the second stage, the BPSO wrapper is applied to select the most discriminative feature subset directly from the candidate set. We perform comprehensive experiments on foreign fiber datasets to validate the efficiency of the algorithm. The experimental results show that the proposed algorithm is very effective for the data of foreign fibers in cotton.

The rest of the paper is organized as follows: Section 2 introduces basic concepts of classification and FS. Section 3 describes the proposed algorithm. Experimental results are presented in Section 4. The last section draws a general conclusion.

2. Preliminaries

In this section, we introduce the basic notions of classification and FS.

2.1. Classification

Machine learning is usually divided into two main types: supervised learning and unsupervised learning. In the supervised learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labeled set of input–output pairs D = {(x_i, y_i)}, i = 1, …, N. Here D is called the training set, and N is the number of training examples.


In the simplest setting, each training input x_i is a D-dimensional vector of numbers, representing, say, the height and weight of a person. These are called features. In general, however, x_i could be a complex structured object, such as an image, a sentence, an email message, a graph, etc. Similarly, the form of the output or response variable can in principle be anything, but most methods assume that y_i is a categorical or nominal variable from some finite set, y_i ∈ {1, …, C} (such as male or female), or that y_i is a real-valued scalar (such as income level). When y_i is categorical, the problem is known as classification or pattern recognition, and when y_i is real-valued, the problem is known as regression.

For classification, the goal is to learn a mapping from inputs x to outputs y, where y_i ∈ {1, …, C} with C being the number of classes. If C = 2, this is called binary classification (in which case we often assume y_i ∈ {0, 1}); if C > 2, this is called multiclass classification. If the class labels are not mutually exclusive (e.g., somebody may be classified as tall and strong), we call it multilabel classification, but this is best viewed as predicting multiple related binary class labels (a so-called multiple output model). One way to formalize the problem is as function approximation. We assume y = f(x) for some unknown function f; the goal of learning is to estimate the function f given a labeled training set, and then to make predictions using ŷ = f̂(x). (We use the hat symbol to denote an estimate.) Our main goal is to make predictions on novel inputs, i.e., ones that we have not seen before (this is called generalization), since predicting the response on the training set is easy.

2.2. Feature selection

The high dimensionality of data poses challenges to learning tasks because of the curse of dimensionality. In the presence of many irrelevant and redundant features, learning models tend to overfit and become less comprehensible. FS is one effective way to identify the relevant and redundant features for dimensionality reduction, and various studies show that many features can be removed without performance deterioration. FS is closely related to two concepts: relevance and redundancy. A popular definition is given as follows. Let F be the full set of features, F_i be a feature, and S_i = F − {F_i}. Let C denote the class label and P denote the conditional probability of the class label C given a feature set. The statistical relevance of a feature can be formalized as:

Definition 1. (Relevance) A feature F_i is relevant iff

∃ S_i* ⊂ S_i, such that P(C | F_i, S_i*) ≠ P(C | S_i*).

Otherwise, the feature F_i is said to be irrelevant.

Definition 2. (Redundancy) A feature F_i is redundant iff

P(C | F_i, S_i) = P(C | S_i), but ∃ S_i* ⊆ S_i, such that P(C | F_i, S_i*) ≠ P(C | S_i*).

Definition 1 states that a feature F_i is statistically relevant if its removal from the feature set decreases the prediction power. Definition 2 states that a feature F_i can become redundant due to the existence of other relevant features that provide prediction power similar to that of F_i. FS thus aims to find the minimum subset S of F that has maximal relevance and minimal redundancy, while the subset S is optimized to improve the performance of the machine learning algorithm. Obviously, the rigorous way to find the optimal subset S is exhaustive search and evaluation of all possible subsets of F, which has been shown to be NP-complete. To circumvent this problem, many heuristic subset search or selection strategies have been proposed [14].

3. Methodology

In this paper, we propose a two-stage FS algorithm combining IG and BPSO for feature selection of foreign fibers in cotton (IG-BPSO). In IG-BPSO, IG, a filter approach, is used to eliminate noisy and irrelevant features in order to reduce the dimensionality of the feature space; BPSO, a wrapper approach, is used to find the best subset for a specific classifier. As a result, IG-BPSO can overcome the shortcoming of filter approaches that ignore feature dependencies, and improve the efficiency of wrapper approaches by eliminating noisy and irrelevant features.

The proposed algorithm consists of two stages. In the first stage, IG is used to filter out noisy and irrelevant features of foreign fibers in cotton. IG looks at each feature in isolation and measures how important it is for the prediction of the correct class label; in this stage, IG can neither eliminate redundant features nor consider dependencies between features. Redundancy and dependency are handled in the second stage, in which BPSO and the classifiers work together to select the highly discriminating features. As a result, IG-BPSO combines the advantages of filter and wrapper approaches and selects high-quality subsets tailored to a specific classifier. The scheme of the proposed algorithm is shown in Fig. 1.

The detailed steps are as follows. In the first stage, the original data are preprocessed by the IG filter: each feature is evaluated and sorted according to the IG criterion, and the first N features are selected to form a new subset. The dataset for the second stage is generated according to the feature set obtained by IG, i.e., each sample used in the second stage contains only the items corresponding to the features retained by IG.

Fig. 1. The general scheme of the IG-BPSO algorithm: the foreign fiber dataset passes through a filter stage (IG feature selection) and a wrapper stage (BPSO feature selection evaluated by classifiers) to produce the optimal subsets.

In the second stage, a wrapper approach combining BPSO and the classifiers is used to accomplish the feature selection in the reduced feature set. The goodness of the subsets is evaluated by the classification accuracy of a given classifier.

3.1. Information gain (IG) filter

IG [15] is a measure of the dependence between a feature and the class label. It is one of the most popular feature selection techniques as it is easy to compute and simple to interpret. The IG of a feature X and the class labels Y is calculated as

IG(X, Y) = H(X) − H(X|Y)    (1)

Entropy (H) is a measure of the uncertainty associated with a random variable. H(X) and H(X|Y) are the entropy of X and the entropy of X after observing Y, respectively:

H(X) = −∑_i P(x_i) log2(P(x_i))    (2)

H(X|Y) = −∑_j P(y_j) ∑_i P(x_i|y_j) log2(P(x_i|y_j))    (3)

The maximum value of information gain is 1. A feature with high information gain is relevant. IG is evaluated independently for each feature, and the features with the top-k values are selected as the relevant features. The IG approach is an effective supervised feature selection method that has been widely applied in many real applications.
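To make Eqs. (1)–(3) concrete, the following minimal sketch estimates the information gain of a single discrete (or discretized) feature from empirical frequencies. The function names and the toy data are illustrative assumptions, not part of the system described in this paper.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy H(X) = -sum_i P(x_i) log2 P(x_i)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature_values, class_labels):
    """IG(X, Y) = H(X) - H(X|Y) for a discrete feature X and class labels Y."""
    n = len(feature_values)
    h_x = entropy(feature_values)
    h_x_given_y = 0.0
    for y, count in Counter(class_labels).items():
        # H(X | Y = y), weighted by the empirical P(y)
        subset = [x for x, lab in zip(feature_values, class_labels) if lab == y]
        h_x_given_y += (count / n) * entropy(subset)
    return h_x - h_x_given_y

# Toy usage: a binary feature that tracks the class perfectly has IG = 1.
x = [0, 0, 1, 1, 0, 1]
y = ['a', 'a', 'b', 'b', 'a', 'b']
print(information_gain(x, y))  # 1.0
```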

3.2. BPSO wrapper

PSO is a population-based search technique first proposed by Kennedy and Eberhart [16], motivated by the social behavior of organisms such as bird flocking and fish schooling. It is based on swarm intelligence and is well suited for combinatorial optimization problems in which the optimization surface possesses many local optima. The underlying idea of PSO is that knowledge is optimized by social interaction: thinking is not only personal but also social. The particles in PSO are similar to the chromosomes in GA. However, PSO is usually easier to implement than GA, as there are neither crossover nor mutation operators in PSO and the movement from one solution to another is achieved through the velocity functions. PSO was first introduced for optimization in continuous spaces; to handle discrete problems, a discrete binary particle swarm optimization was proposed by Kennedy and Eberhart [16]. In this study, the discrete binary particle swarm optimization (BPSO) is adopted for feature selection.

PSO is based on the principle that each solution can be represented as a particle in a swarm. Each particle has a position and a corresponding fitness value evaluated by the fitness function to be optimized. The particles iterate from one position to another according to their most recent velocity vector. This velocity vector is determined by the particle's own experience as well as the experience of other particles, using the best positions encountered by the particle and by the swarm. Specifically, the velocity vector of each particle is calculated by updating the previous velocity with two best values. The first is the particle's personal best (pbest), i.e., the best position it has visited so far, which is tracked by each particle. The other is tracked by the swarm and corresponds to the best position visited by any particle in the population; this value is called the global best (gbest). The effect of the personal best and the global best on the velocity update is controlled by weights called learning factors. Through this joint self- and swarm-based updating, PSO achieves local and global search capabilities, where intensification and diversification are balanced via relative weighting.

Considering a d-dimensional search space, the ith particle is represented as X_i = (x_i,1, x_i,2, …, x_i,d), with velocity V_i = (v_i,1, v_i,2, …, v_i,d). The best previous position of the ith particle is represented as P_i = (p_i,1, p_i,2, …, p_i,d), and the best particle in the population is represented as P_g = (p_g,1, p_g,2, …, p_g,d). Each particle updates its position and velocity according to the two best values at each iteration.

Table 1
The algorithm of BPSO for FS.

Algorithm: BPSO for FS
Input: m: the number of particles; c1, c2: acceleration coefficients; wmax: maximum of the inertia weight; wmin: minimum of the inertia weight; vmax: maximum velocity of the particles; Itermax: maximum number of iterations.
Output: Pgbest: the global best solution.
Begin
1.  pg ← 0;
2.  Generate a random population of m particles;
3.  for each particle i
4.      Calculate fitness(i);
5.      p(i) ← fitness(i);
6.      if fitness(i) is better than pg then
7.          pg ← fitness(i);
8.          Pgbest ← i;
9.      end if
10. end for
11. do until Itermax has been reached
12.     for each particle i
13.         Calculate the particle velocity according to Eq. (5);
14.         Calculate the particle position according to Eq. (6);
15.         Calculate fitness(i);
16.         if fitness(i) is better than p(i) then
17.             p(i) ← fitness(i);
18.         end if
19.         if fitness(i) is better than pg then
20.             pg ← fitness(i);
21.             Pgbest ← i;
22.         end if
23.     end for
24.     Update the inertia weight w;
25. end do
26. Return Pgbest;
End

In binary PSO, each bit of a particle moves in a state space restricted to zero and one, and takes the value one or zero according to a probability. If the velocity is high, the bit is more likely to take the value 1; lower velocities favor 0. A sigmoid function is applied to transform the velocity from the continuous space into a probability:

sig(v_i,j) = 1 / (1 + exp(−v_i,j)),  j = 1, 2, …, d    (4)

After each iteration t, the velocity is updated as follows:

v_i,j^(t+1) = w × v_i,j^t + c1 × r1 × (p_i,j^t − x_i,j^t) + c2 × r2 × (p_g,j^t − x_i,j^t)    (5)

where w is the inertia weight, which is used to balance global exploration and local exploitation: a large inertia weight facilitates global search, while a small inertia weight facilitates local search. The parameters c1 and c2 are acceleration coefficients, which represent the magnitude of the influence on the particle's velocity in the directions of the personal and the global optima, respectively. The parameters r1 and r2 are random numbers in the range [0, 1]. x_i,j, p_i,j and p_g,j belong to {0, 1}. Particles' velocities are limited to a maximum velocity vmax, which determines how large a step through the solution space each particle is allowed to take. If vmax is too small, particles may not explore sufficiently beyond locally good regions and could become trapped in local optima. On the other hand, if vmax is too high, particles might fly past good solutions. The new particle position is updated using the following rule:

x_i,j^(t+1) = 1 if rnd < sig(v_i,j); 0 if rnd ≥ sig(v_i,j),  j = 1, 2, …, d    (6)

where sig(v_i,j) is calculated according to Eq. (4), and rnd is a uniform random number in [0, 1]. Designing the fitness function is important for the BPSO-based wrapper approach, since it determines how the quality of each subset is evaluated. The fitness evaluation involves two aspects: (1) the accuracy on the validation data and (2) the number of features used. In our context, we use the classifier accuracy as the fitness function. The algorithm of BPSO for FS is presented in Table 1; a code-level sketch is given below.
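The following is a minimal sketch of BPSO-based feature selection following Table 1 and Eqs. (4)–(6). The parameter defaults are chosen to match Table 2, while the fitness callable and the constant inertia weight are illustrative assumptions rather than the authors' exact implementation.

```python
import math
import random

def bpso_feature_selection(fitness, d, m=50, iters=30,
                           c1=2.0, c2=1.5, w=1.0, v_max=1.0):
    """Return (best_mask, best_fitness) found by binary PSO.
    fitness: callable mapping a 0/1 list of length d to a score (higher is better);
             a real fitness should also handle the all-zero mask.
    d: number of candidate features. The inertia weight w is kept constant here,
    whereas the paper updates it between wmin and wmax at each iteration."""
    X = [[random.randint(0, 1) for _ in range(d)] for _ in range(m)]   # positions (masks)
    V = [[random.uniform(-v_max, v_max) for _ in range(d)] for _ in range(m)]
    P = [x[:] for x in X]                        # personal best positions
    p_fit = [fitness(x) for x in X]              # personal best fitness values
    g = max(range(m), key=lambda i: p_fit[i])
    G, g_fit = P[g][:], p_fit[g]                 # global best position and fitness

    for _ in range(iters):
        for i in range(m):
            for j in range(d):
                r1, r2 = random.random(), random.random()
                # Eq. (5): velocity update, clamped to [-v_max, v_max]
                V[i][j] = (w * V[i][j] + c1 * r1 * (P[i][j] - X[i][j])
                                       + c2 * r2 * (G[j] - X[i][j]))
                V[i][j] = max(-v_max, min(v_max, V[i][j]))
                # Eqs. (4) and (6): sigmoid transform, then probabilistic bit assignment
                sig = 1.0 / (1.0 + math.exp(-V[i][j]))
                X[i][j] = 1 if random.random() < sig else 0
            f = fitness(X[i])
            if f > p_fit[i]:                     # update personal best
                p_fit[i], P[i] = f, X[i][:]
                if f > g_fit:                    # update global best
                    g_fit, G = f, X[i][:]
    return G, g_fit
```

In the paper, the fitness is the cross-validated accuracy of a classifier (SVM or kNN) restricted to the features selected by the mask.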


3.3. Classifier

To evaluate the feature subsets, two classifiers are adopted: support vector machines (SVM) and the k-nearest neighbor (kNN) classifier.

3.3.1. SVM classifier

SVM is a powerful classification algorithm that has shown state-of-the-art performance in a variety of classification tasks. The first generation of SVMs was designed only for binary classification. However, most real-life diagnostic tasks are not binary, and solving multiclass classification problems is much harder than solving binary ones. Fortunately, several algorithms have been proposed for extending binary SVM to multiclass classification [3]. When the data are linearly separable, SVM computes the hyperplane that maximizes the margin between the training examples and the class boundary. When the data are not linearly separable, the examples are mapped to a high-dimensional space in which such a separating hyperplane can be found; the mechanism that defines this mapping is called the kernel function. The reason why SVM insists on finding the maximum-margin hyperplane is that it offers the best generalization ability: it not only allows the best classification performance (e.g., accuracy) on the training data, but also leaves much room for the correct classification of future data. To ensure that the maximum-margin hyperplane is actually found, an SVM classifier attempts to maximize the following function with respect to w⃗ and b:

L_P = (1/2)‖w⃗‖² − ∑_{i=1}^{t} α_i y_i (w⃗ · x⃗_i + b) + ∑_{i=1}^{t} α_i    (7)

where t is the number of training examples and α_i, i = 1, …, t, are non-negative numbers such that the derivatives of L_P with respect to α_i are zero. The α_i are the Lagrange multipliers, and L_P is called the Lagrangian. In this equation, the vector w⃗ and the constant b define the hyperplane. SVM has already been applied to the detection of foreign fibers in cotton and has shown excellent classification performance [3,4].
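The paper evaluates candidate feature subsets by the cross-validated accuracy of the chosen classifier. A minimal sketch of that evaluation step for SVM is given below; the use of scikit-learn and the RBF-kernel settings are assumptions (the authors' experiments were run in Matlab), and the feature matrix, labels and mask are hypothetical inputs.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_subset_accuracy(X, y, mask, folds=10):
    """Mean cross-validated accuracy of an SVM restricted to the features
    selected by the 0/1 (or boolean) mask over the columns of X (a numpy array)."""
    X_sub = X[:, [bool(b) for b in mask]]          # keep only the selected features
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')  # multiclass handled one-vs-one
    return cross_val_score(clf, X_sub, y, cv=folds, scoring='accuracy').mean()

# Hypothetical usage as a BPSO fitness on a 75-dimensional feature matrix X:
# fitness = lambda mask: svm_subset_accuracy(X, y, mask)
# best_mask, best_acc = bpso_feature_selection(fitness, d=X.shape[1])
```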

3.3.2. kNN classifier

To validate the proposed algorithm, we also use the kNN classifier to evaluate the selected feature subsets. One of the simplest, rather trivial, classifiers is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of the test object match one of the training examples exactly. An obvious drawback of this approach is that many test records will not be classified because they do not exactly match any of the training records. A more sophisticated approach, k-nearest neighbor (kNN) classification [17], finds a group of k objects in the training set that are closest to the test object, and bases the assignment of a label on the predominance of a particular class in this neighborhood. There are three key elements in this approach: a set of labeled objects, e.g., a set of stored records; a distance or similarity metric to compute the distance between objects; and the value of k, the number of nearest neighbors. To classify an unlabeled object, its distance to the labeled objects is computed, its k nearest neighbors are identified, and the class labels of these nearest neighbors are then used to determine the class label of the object. Given a training set D and a test object z = (x', y'), the algorithm computes the distance (or similarity) between z and all the training objects (x, y) ∈ D to determine its nearest-neighbor list D_z. (Here x is the data of a training object and y is its class; likewise, x' is the data of the test object and y' is its class.) Once the nearest-neighbor list is obtained, the test object is classified based on the majority class of its nearest neighbors:

Majority voting:  y' = argmax_v ∑_{(x_i, y_i) ∈ D_z} I(v = y_i)

where v is a class label, y_i is the class label of the ith nearest neighbor, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.
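A minimal sketch of the majority-voting rule above, with Euclidean distance as an assumed metric and toy data for illustration:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs. Returns the majority label
    among the k nearest neighbours of test_point (Euclidean distance)."""
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)   # tallies I(v = y_i)
    return votes.most_common(1)[0][0]

# Toy usage:
train = [([0.0, 0.0], 'A'), ([0.1, 0.2], 'A'), ([1.0, 1.0], 'B'), ([0.9, 1.1], 'B')]
print(knn_predict(train, [0.2, 0.1], k=3))  # -> 'A'
```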

kNN classifiers are lazy learners, that is, models are not built explicitly, unlike eager learners (e.g., decision trees, SVM, etc.). Building the model is therefore cheap, but classifying unknown objects is relatively expensive, since it requires computing the k nearest neighbors of the object to be labeled. This, in general, requires computing the distance of the unlabeled object to all the objects in the labeled set, which can be expensive particularly for large training sets. A number of techniques have been developed for efficient computation.

4. Experimental results and discussions

In this section, we perform comprehensive experiments to validate IG-BPSO and compare it with other FS algorithms on foreign fiber datasets.

4.1. Data preparation

A large number of foreign fiber images were acquired using our experimental platform, and 254 representative images, 4000 pixels wide and 500 pixels high, were selected for the experiments. Fig. 2 shows several examples.


Fig. 2. Example images of foreign fibers in cotton: (a) hair, (b) black plastic film, (c) red cloth, (d) hemp rope, (e) red polypropylene, (f) black feather.

Table 2
BPSO parameters.

Parameter                          Value
The number of particles            50
The number of generations          30
The acceleration coefficient c1    2
The acceleration coefficient c2    1.5
Maximum velocity vmax              1

The 254 representative images consist of 48 images of hair, 42 images of black plastic film, 42 images of red cloth, 42 images of hemp rope, 40 images of polypropylene, and 40 images of black feather. Then, image segmentation is performed; consequently, a total of 702 foreign fiber objects are generated, including 51 hair objects, 102 black plastic film objects, 108 cloth objects, 123 rope objects, 132 polypropylene twine objects, and 186 feather objects. Afterwards, we extract the features from these objects. In the experiments, 75 features in total are extracted to form a 75-dimensional feature vector, including 27 color features, 41 texture features and 7 shape features. Finally, all the samples are generated and normalized to reduce the impact of differing feature scales.

4.2. Experiment setting

All experiments are conducted on an Intel(R) Core(TM)2 Quad processor with a CPU clock rate of 2.66 GHz and 4.0 GB of main memory, running the Windows 7 operating system. All the algorithms are implemented in Matlab 2010b. 10-fold cross-validation is used to estimate the classification performance: the data are first partitioned into 10 nearly equally sized folds, and ten iterations of training and validation are then performed such that, within each iteration, a different fold is held out for validation while the remaining 9 folds are used for learning. The parameters of BPSO are set according to Table 2. These parameters were selected after several test evaluations of each algorithm, until the best configuration in terms of solution quality and computational effort was reached.

4.3. Results and discussion

First, we performed 10-fold cross-validation with the SVM and kNN classifiers on the foreign fiber dataset without feature selection. The results are shown in Table 3: the number of features is 75, and the classification accuracy of SVM and kNN is 86.3% and 85.4%, respectively. Then we use IG to evaluate each feature and compute the classification accuracy of the top-k features with SVM and kNN, respectively. Fig. 3 shows the classification accuracy of the subsets for different values of k. For SVM, the subset consisting of the top 62 features has the highest classification accuracy, 91.2%; for kNN, the subset consisting of the top 66 features has the highest classification accuracy, 89.4%.

Fig. 3. Curves of average accuracy of the top-k feature sets selected by IG (classification accuracy of SVM and of kNN versus the number of features).

Table 3
Classification accuracy of the original feature set.

Set     Average accuracy (%)    Number of features
kNN     85.4                    75
SVM     86.3                    75

Table 4
Comparison of BPSO and IG-BPSO.

Methods        Average accuracy (%)    Average number of features
BPSO-kNN       86.37                   39
IG-BPSO-kNN    90.01                   35
BPSO-SVM       90.62                   42
IG-BPSO-SVM    91.48                   34

For IG-BPSO, the dataset used in the second stage is constructed as follows: (1) each feature is evaluated by IG; (2) all the features are sorted by their scores in descending order; (3) the feature subsets consisting of the top-k features are evaluated by classification accuracy; (4) the feature subset with the highest accuracy is used to generate the dataset for the second stage. In the experiments, the datasets for the second stage are generated from the top 62 features with SVM and the top 66 features with kNN, respectively. Based on these new datasets, the final results of IG-BPSO are shown in Table 4. For comparison, we also run BPSO on the original dataset, and those results are also listed in Table 4. For both SVM and kNN, IG-BPSO finds optimal subsets with smaller size and higher accuracy than BPSO.

The FS processes of the two algorithms with SVM are shown in Fig. 4. IG-BPSO finds the optimal subset in fewer iterations than BPSO, and the optimal subset found by IG-BPSO has higher classification accuracy than the subset found by BPSO. The reason is that some noisy and irrelevant features are eliminated in the first stage.

Fig. 5 compares the performance of the original feature set and of the subsets obtained by the three algorithms. In Fig. 5(a), for SVM, IG-BPSO has the best performance, selecting the subset with the highest classification accuracy, 91.48%; the subset obtained by IG has higher classification accuracy than that obtained by BPSO, and the original set performs worst, at only 86.3%. In Fig. 5(b), for kNN, IG-BPSO also has the best performance, selecting the subset with the highest classification accuracy, 90.01%; again the subset obtained by IG has higher classification accuracy than that obtained by BPSO, and the original set performs worst, at only 85.4%.

Fig. 6 compares the number of features of the original feature set and of the subsets obtained by the three algorithms. In Fig. 6(a), for SVM, the numbers of features of the subsets obtained by IG-BPSO, BPSO and IG are 34, 42 and 62, respectively, all smaller than the 75 features of the original set; IG-BPSO finds the subset with the smallest size. In Fig. 6(b), for kNN, the numbers of features of the subsets obtained by IG-BPSO, BPSO and IG are 35, 39 and 66, respectively; the subset obtained by IG-BPSO again has the smallest size. We can therefore conclude that IG-BPSO finds better subsets, with smaller size and higher accuracy, than IG and BPSO.


Fig. 4. Curves of fitness of BPSO and IG-BPSO with SVM (fitness value versus iterations).

Fig. 5. Comparisons of the classification accuracy of the original set and of the subsets obtained by the three algorithms: (a) with SVM, (b) with kNN.

Fig. 6. Comparisons of the number of features of the original set and of the subsets obtained by the three algorithms: (a) with SVM, (b) with kNN.


Table 5
Parameter settings of GA.

Parameter                Value
Number of population     50
Number of generations    30
Crossover rate           0.9
Mutation rate            0.05

Fig. 7. Curves of average accuracy of the top-k feature sets: (a) Fisher Score, (b) Gini Index.

Then, we test the computational cost of IG, BPSO and IG-BPSO. For BPSO and IG-BPSO, the kNN classifier is used to evaluate the subsets, and their parameters are set according to Table 2. Their average running times are 0.34 s, 107.48 s and 95.95 s, respectively. The results show that the computational cost of the proposed algorithm is lower than that of BPSO. The proposed algorithm does not prolong the detection time; on the contrary, using the feature sets it obtains can shorten the detection time and improve detection efficiency. The reason is that the feature selection approach is not integrated into the online detection system; the online detection system only uses the feature sets obtained by the proposed algorithm.

To further validate the proposed algorithm, we make comparisons with three other algorithms, i.e., Fisher Score [15], Gini Index [15] and GA [18]. Fisher Score selects the features that assign similar values to the samples from the same class and different values to samples from different classes. The evaluation criterion used in Fisher Score can be formulated as:

SC(f_i) = ∑_{j=1}^{c} n_j (μ_i,j − μ_i)² / ∑_{j=1}^{c} n_j σ_i,j²    (8)

where μ_i is the mean of the feature f_i, n_j is the number of samples in the jth class, and μ_i,j and σ_i,j² are the mean and the variance of f_i on class j, respectively. Fisher Score is an effective supervised feature selection algorithm, which has been widely applied in many real applications. Gini Index is a measure for quantifying a feature's ability to distinguish between classes. Given C classes, the Gini Index of a feature f can be calculated as:

GI(f) = 1 − ∑_{i=1}^{C} [p(i|f)]²    (9)

The Gini Index of each feature is calculated independently, and the top k features with the smallest Gini Index are selected.
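A minimal sketch of the two comparison criteria, assuming numpy arrays for the data. Eq. (9) is stated for a feature value; the sketch below averages the purity term over a feature's observed values, which is one common reading and an interpretive assumption on our part.

```python
import numpy as np

def fisher_score(x, y):
    """Eq. (8): between-class scatter over within-class variance for one feature
    x (1-D array) given class labels y."""
    mu = x.mean()
    num = den = 0.0
    for c in np.unique(y):
        xc = x[y == c]                      # samples of class c
        num += xc.size * (xc.mean() - mu) ** 2
        den += xc.size * xc.var()
    return num / den

def gini_index(x, y):
    """Class purity of a discrete feature x: 1 - sum_i p(i|f)^2 per feature value,
    weighted by the frequency of each value (one reading of Eq. (9))."""
    gi = 0.0
    for v in np.unique(x):
        labels = y[x == v]
        _, counts = np.unique(labels, return_counts=True)
        p = counts / labels.size
        gi += (labels.size / y.size) * (1.0 - np.sum(p ** 2))
    return gi
```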

GA is an optimization algorithm with a probabilistic component that provides a means to search poorly understood, irregular spaces. GA works with a population of chromosomes; each chromosome is a vector in hyperspace representing one potential solution to the optimization problem, and a population is an ensemble or set of such hyperspace vectors. As Fisher Score and Gini Index are filter approaches, we select their top-k feature subsets with the highest classification accuracy for comparison with the subsets obtained by the other algorithms. For GA, the parameters are set according to Table 5.

Fig. 7 shows the classification accuracy of the top-k feature sets. As can be seen, (1) for Fisher Score, the subset with 65 features has the highest average accuracy, 90.57%; (2) for Gini Index, the subset with 56 features has the highest average accuracy, 90.60%. Table 6 shows the performance of the four algorithms. The proposed algorithm selects the subset with the smallest size and the highest average accuracy among the four algorithms.


Table 6
Comparison of the results of the four algorithms.

Methods        Average accuracy (%)    Average number of features
Fisher Score   90.57                   65
Gini Index     90.60                   56
GA             91.07                   37
IG-BPSO        91.48                   34

Table 7
Comparison of BPSO and IG-BPSO with different classifiers.

Classifiers    Average accuracy (%)         Average number of features
               BPSO        IG-BPSO          BPSO        IG-BPSO
kNN            86.37       90.01            39          35
SVM            90.62       91.48            42          34
BP             89.02       91.25            44          36
RBF            90.57       91.77            42          35
PNN            86.84       87.28            47          42

The reason is that, in the first stage, the filter approach decreases the effect of the noisy and irrelevant features, which improves the performance of BPSO. Consequently, the proposed algorithm can find subsets with the smallest size and high accuracy for a given classifier.

Finally, we also tested the proposed method with three other classifiers, namely BP (back propagation, a well-known multilayer perceptron algorithm), RBF (radial basis function network) and PNN (probabilistic neural network), and compared the results with those of SVM and kNN. In the experiments, we adopted the implementations of these three algorithms in the Matlab neural network toolbox, and set their parameters empirically to achieve the best performance of each algorithm on the foreign fiber dataset. The results are shown in Table 7. For all of the classifiers, the proposed method performs better than BPSO: the classification accuracy of its subsets is higher and their size is smaller. For RBF, the classification accuracy is the highest among the five classifiers, 91.77%; for SVM, the subset size is the smallest among the five classifiers, 34.

5. Conclusion

An important issue in machine-vision-based online detection systems for foreign fibers in cotton is to find the optimal feature subset. In this study, we have introduced a novel scheme for feature selection of foreign fibers in cotton that combines an IG filter and a BPSO wrapper. The new scheme is a two-stage process: it first applies the IG filter to select a compact yet effective candidate set, and then uses a BPSO wrapper, which combines the BPSO search strategy with a learning algorithm as the fitness function, to accomplish the subset selection of foreign fibers in cotton. The experiments show that the proposed IG-BPSO outperforms the IG filter and the BPSO wrapper, and the experimental comparisons demonstrate the effectiveness of IG-BPSO. The selected optimum feature set is of great significance for machine-vision-based online detection systems of foreign fibers in cotton, because its small size and high accuracy can improve the performance of the online detection system. Our future work will focus on improving the performance of classifiers for better online detection of cotton foreign fibers.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (NSFC) (61303113, 61373053, 61402195, 61471133, 61571444 and 61572226). This research is also funded by the Special Fund for Agro-scientific Research in the Public Interest (201203017), the National Science and Technology Supporting Plan Project (2012BAD35B07), the National Natural Science Foundation Framework Project (61471133), the Guangdong Science and Technology Plan Project (2013B090500127 and 2013B021600014), the Guangdong Natural Science Foundation (S2013010014790), and the Science and Technology Plan Project of Wenzhou of China under Grant No. G20140048.

References

[1] Yang W, Li D, Zhu L, Kang Y, Li F. A new approach for image processing in foreign fiber detection. Comput Electron Agric. 2009;68(1):68–77.
[2] Yang W, Li D, Wei X, Kang Y, Li F. AVI system for classification of foreign fibers in cotton. Trans Chin Soc Agric Mach. 2009;40(12):177–81.
[3] Ji R, Li D, Chen L, Yang W. Classification and identification of foreign fibers in cotton on the basis of a support vector machine. Math Comput Model. 2010;51(11):1433–7.
[4] Li D, Yang W, Wang S. Classification of foreign fibers in cotton lint using machine vision and multi-class support vector machine. Comput Electron Agric. 2010;74(2):274–9.
[5] Yang W, Lu S, Wang S, Li D. Fast recognition of foreign fibers in cotton lint using machine vision. Math Comput Model. 2011;54(3):877–82.
[6] Lin J, Ke H, Chien B, Yang W. Classifier design with feature selection and feature extraction using layered genetic programming. Expert Syst Appl. 2008;34(2):1384–93.


[7] Narendra PM, Fukunaga K. A branch and bound algorithm for feature subset selection. IEEE Trans Comput. 1977;100(9):917–22.
[8] El Akadi A, Amine A, El Ouardighi A, Aboutajdine D. A two-stage gene selection scheme utilizing MRMR filter and GA wrapper. Knowl Inf Syst. 2011;26(3):487–500.
[9] Khalid S, Khalil T, Nasreen S. A survey of feature selection and feature extraction techniques in machine learning. In: Science and Information Conference, IEEE; 2014. p. 372–8.
[10] Yang W, Li D, Zhu L. An improved genetic algorithm for optimal feature subset selection from multi-character feature set. Expert Syst Appl. 2011;38(3):2733–40.
[11] Zhao X, Li D, Yang W, Chen G. Feature selection based on ant colony optimization for cotton foreign fiber. Sensor Lett. 2011;9(3):1242–8.
[12] Zhao X, Li D, Yang B, Ma C, Zhu Y, Chen H. Feature selection based on improved ant colony optimization for online detection of foreign fiber in cotton. Appl Soft Comput. 2014;24:585–96.
[13] Li H, Wang J, Yang W, Liu S, Li Z, Li D. Feature selection for cotton foreign fiber objects based on PSO algorithm. In: Computer and Computing Technologies in Agriculture V; 2012. p. 446–52.
[14] Sun X, Liu Y, Li J, Zhu J, Chen H, Liu X. Feature evaluation and selection with cooperative game theory. Pattern Recognit. 2012;45(8):2992–3002.
[15] Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H. Advancing feature selection research. ASU Feature Selection Repository; 2010.
[16] Kennedy J, Eberhart RC. A discrete binary version of the particle swarm algorithm. In: Proceedings of the IEEE conference on systems, man and cybernetics; 1997. p. 4104–8.
[17] Bijalwan V, Vinay K, Pinki K, Jordan P. KNN based machine learning approach for text and document mining. Int J Database Theory Appl. 2014;7(1):61–70.
[18] Oreski S, Oreski G. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Syst Appl. 2014;41(4):2052–64.

Xuehua Zhao received the Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2014. His main research interests are related to machine learning and data mining.

Daoliang Li received the Ph.D. degree from the College of Information and Electrical Engineering, China Agricultural University, in 1999. He is currently a professor in the College of Information and Electrical Engineering, China Agricultural University. His current research interests are in the areas of data mining and knowledge engineering, with applications to agricultural engineering.

Bo Yang is currently a professor in the College of Computer Science and Technology, Jilin University. He received his Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2003. His current research interests are in the areas of data mining and knowledge engineering.

Huiling Chen received the Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2012. His main research interests are related to machine learning and data mining.

Xinbin Yang received the Ph.D. degree from the College of Information Science and Engineering, East China University of Science and Technology, in 2003. His current research interests are in the areas of data mining and knowledge engineering.

Chenglong Yu received the Ph.D. degree from the College of Computer Science and Technology, Harbin Institute of Technology, in 2013. His current research interests are in the areas of data mining and knowledge engineering.
Shuangyin Liu received the Ph.D. degree from the College of Information and Electrical Engineering, China Agricultural University, in 2014. His current research interests are in the areas of data mining and knowledge engineering, with applications to agricultural engineering.