A two-stage feature selection method with its application


Computers and Electrical Engineering 47 (2015) 114–125


Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

Xuehua Zhao a, Daoliang Li b,∗, Bo Yang c, Huiling Chen d, Xinbin Yang a, Chenglong Yu a, Shuangyin Liu e

a School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China
b College of Information and Electrical Engineering, China Agricultural University, P. O. Box 121, 17 Tsinghua East Road, Beijing 100083, China
c College of Computer Science and Technology, Jilin University, Changchun 130012, China
d College of Physics and Electronic Information, Wenzhou University, Wenzhou 325035, China
e College of Information, Guangdong Ocean University, Zhanjiang 524025, China

Article info

Article history: Received 22 April 2014; Revised 17 August 2015; Accepted 24 August 2015; Available online 24 September 2015

Keywords: Foreign fibers; Feature selection; Information gain; Binary particle swarm optimization

Abstract

Foreign fibers in cotton seriously affect the quality of cotton products. Online detection systems for foreign fibers based on machine vision are efficient tools to minimize their harmful effects. An optimum feature set with small size and high accuracy can markedly improve the performance of such online detection systems. To find optimal feature sets, a two-stage feature selection algorithm combining the IG (Information Gain) approach and BPSO (Binary Particle Swarm Optimization) is proposed for foreign fiber data. In the first stage, the IG approach is used to filter out noisy features; in the second stage, BPSO uses classifier accuracy as its fitness function to select the highly discriminating features. The proposed algorithm is tested on a foreign fiber dataset. The experimental results show that it can efficiently find feature subsets with smaller size and higher accuracy than other algorithms.
© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Foreign fibers in cotton refer to non-cotton fibers and dyed fibers, such as hairs, binding ropes, plastic films, candy wrappers, and polypropylene twines. Foreign fibers in cotton, especially in lint, seriously affect the quality of the final cotton textile products even at low content [1,2]. To reduce the harm of foreign fibers to cotton textile products, online detection systems based on machine vision have been studied in recent years for evaluating the quality of cotton [2–5]. In such systems, the classification of foreign fibers in cotton is the basic and key technology, and it is closely related to the system's performance. To improve the classification accuracy and the efficiency of such systems, finding optimum feature sets with small size and high accuracy is essential: such sets not only simplify the design of the classifier but also improve the efficiency of online detection. This is, in nature, a feature selection (FS) problem.

FS is the technique of selecting a subset of relevant features for building robust learning models, and it is commonly used in machine learning. FS aims at simplifying a feature set by reducing its dimensionality and identifying relevant underlying features without sacrificing predictive accuracy [6]. Unfortunately, finding the optimal feature subset has been proved to be NP-hard [7], so a number of FS algorithms have been proposed to find near-optimal solutions in a smaller amount of time. These methods are generally divided into three categories: the filter approach, the wrapper approach and the embedded method.



Corresponding author. Tel.:+86 10 62736764; fax: +86 10 62737741. E-mail address: [email protected], [email protected] (D. Li).

http://dx.doi.org/10.1016/j.compeleceng.2015.08.011 0045-7906/© 2015 Elsevier Ltd. All rights reserved.


In the first category, the filter approaches are first utilized to select subsets of features before the learning algorithms are applied. The wrapper approaches [8], in contrast, utilize the learning algorithm as a fitness function and search for the best subsets of features in the space of all feature subsets. Besides filters and wrappers, the embedded methods incorporate variable selection as a part of the training process, and feature relevance is obtained analytically from the objective of the learning model [9].

1.1. Related work

Currently, several FS approaches have been applied to select feature sets of foreign fibers in cotton for online detection. Yang et al. [10] proposed an improved genetic algorithm by which the optimal feature subset can be selected effectively and efficiently from a multi-character feature set. The algorithm adopts a segmented chromosome management scheme to implement local management of chromosomes, obtaining strong searching ability at the beginning of the evolution and accelerated convergence as the evolution proceeds. Zhao et al. [11] proposed an FS algorithm that uses ant colony optimization to select the feature subsets of foreign fibers in cotton. To improve the efficiency of ant colony optimization, Zhao et al. [12] proposed an improved ant colony optimization for feature selection, whose objective is to find the (near) optimal subsets in multi-character feature sets; a group constraint is adopted to limit the subset construction process and the probability transition, reducing the effect of invalid subsets and improving convergence efficiency. Li et al. [13] used PSO (particle swarm optimization) to select the optimal feature sets of foreign fibers in cotton. These algorithms are all wrapper approaches: their advantages include the interaction between feature subset search and model selection, and the ability to take feature dependencies into account, but a common drawback is that they have a higher risk of overfitting than filter techniques and are very computationally intensive.

1.2. Motivation and contribution

Filter approaches assess the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. Afterwards, the remaining subset of features is presented as input to the classification algorithm. Advantages of filter approaches are that they easily scale to very high-dimensional datasets, owing to their simplicity and speed, and that they are independent of the classification algorithm. As a result, feature selection needs to be performed only once, after which different classifiers can be evaluated. A common disadvantage of filter approaches is that they ignore the interaction with the classifier and that most approaches are univariate: each feature is considered separately, ignoring feature dependencies, which may lead to worse classification performance than other types of feature selection approaches.

Whereas filter approaches are independent of the model hypothesis, wrapper methods embed the model hypothesis search within the feature subset search. In this setup, a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated.
The evaluation of a specific subset of features is obtained by training and testing a specific classification model, which makes this approach tailored to a specific classification algorithm. To search the space of all feature subsets, a search algorithm is "wrapped" around the classification model. However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. Advantages of wrapper approaches include the interaction between feature subset search and model selection, and the ability to take feature dependencies into account. A common drawback of these techniques is that they have a higher risk of overfitting than filter techniques and are very computationally intensive.

In this paper, we propose a two-stage feature selection algorithm that combines the IG (Information Gain) filter approach and the BPSO (Binary Particle Swarm Optimization) wrapper approach for foreign fiber data. The IG method selects the features with the highest scores. BPSO, when combined with a learning algorithm, is a successful wrapper but is computationally expensive. The integration of IG and BPSO thus leads to an effective feature selection scheme. In the first stage of our algorithm, IG is applied to find a candidate feature set of foreign fibers; this filters out many unimportant features and reduces the computational load for BPSO. In the second stage, the BPSO wrapper is applied to select the most discriminative feature subset directly from the candidate set. We perform comprehensive experiments on foreign fiber datasets to validate the efficiency of the algorithm. The experimental results show that the proposed algorithm is very effective for the data of foreign fibers in cotton.

The rest of the paper is organized as follows: Section 2 introduces basic concepts of classification and FS. Section 3 describes the proposed algorithm. Experimental results are presented in Section 4. The last section draws a general conclusion.

2. Preliminaries

In this section, we introduce the basic notions of classification and FS.

2.1. Classification

Machine learning is usually divided into two main types: supervised learning and unsupervised learning. In the supervised learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labeled set of input–output pairs D = {(x_i, y_i)}, i = 1, …, N. Here D is called the training set, and N is the number of training examples.


In the simplest setting, each training input x_i is a D-dimensional vector of numbers, representing, say, the height and weight of a person. These are called features. In general, however, x_i could be a complex structured object, such as an image, a sentence, an email message, a graph, etc. Similarly, the form of the output or response variable can in principle be anything, but most methods assume that y_i is a categorical or nominal variable from some finite set, y_i ∈ {1, …, C} (such as male or female), or that y_i is a real-valued scalar (such as income level). When y_i is categorical, the problem is known as classification or pattern recognition, and when y_i is real-valued, the problem is known as regression.

For classification, the goal is to learn a mapping from inputs x to outputs y, where y_i ∈ {1, …, C} with C being the number of classes. If C = 2, this is called binary classification (in which case we often assume y_i ∈ {0, 1}); if C > 2, this is called multiclass classification. If the class labels are not mutually exclusive (e.g., somebody may be classified as tall and strong), we call it multilabel classification, but this is best viewed as predicting multiple related binary class labels (a so-called multiple output model). One way to formalize the problem is as function approximation. We assume y = f(x) for some unknown function f; the goal of learning is to estimate the function f given a labeled training set, and then to make predictions using ŷ = f̂(x). (We use the hat symbol to denote an estimate.) Our main goal is to make predictions on novel inputs, i.e., ones that we have not seen before (this is called generalization), since predicting the response on the training set is easy.

2.2. Feature selection

The high dimensionality of data poses challenges to learning tasks because of the curse of dimensionality. In the presence of many irrelevant and redundant features, learning models tend to overfit and become less comprehensible. FS is one effective way to identify the relevant and redundant features for dimensionality reduction, and various studies show that many features can be removed without performance deterioration. FS is closely related to two concepts: relevance and redundancy. A popular definition is given as follows. Let F be the full set of features, F_i be a feature, and S_i = F − {F_i}. Let C denote the class label and P denote the conditional probability of the class label C given a feature set. The statistical relevance of a feature can be formalized as:

Definition 1. (Relevance) A feature F_i is relevant iff

∃ S_i* ⊂ S_i, such that P(C | F_i, S_i*) ≠ P(C | S_i*).

Otherwise, the feature F_i is said to be irrelevant.

Definition 2. (Redundancy) A feature F_i is redundant iff

P(C | F_i, S_i) = P(C | S_i), but ∃ S_i* ⊆ S_i, such that P(C | F_i, S_i*) ≠ P(C | S_i*).

Definition 1 states that a feature F_i is statistically relevant if its removal from the feature set decreases the prediction power. Definition 2 states that a feature F_i can become redundant due to the existence of other relevant features that provide prediction power similar to that of F_i. FS thus aims to find the minimum subset S of F that has maximal relevance and minimal redundancy, while the subset S is optimized to improve the performance of the machine learning algorithm. Obviously, the rigorous way to find the optimal subset S is exhaustive search and evaluation of all possible subsets of F, which has been shown to be NP-complete. To circumvent this problem, many heuristic subset search or selection strategies have been proposed [14].

3. Methodology

In this paper, we propose a two-stage FS algorithm combining IG and BPSO for feature selection of foreign fibers in cotton (IG-BPSO). In IG-BPSO, IG, a filter approach, is used to eliminate noisy and irrelevant features in order to reduce the dimensionality of the feature space; BPSO, a wrapper approach, is used to find the best subset for a specific classifier. As a result, IG-BPSO can overcome the shortcoming of filter approaches that ignore feature dependencies, and improve the efficiency of wrapper approaches by eliminating noisy and irrelevant features.

The proposed algorithm consists of two stages. In the first stage, IG is used to filter out noisy and irrelevant features of foreign fibers in cotton. IG looks at each feature in isolation and measures how important it is for the prediction of the correct class label; in this stage, IG can neither eliminate redundant features nor consider dependencies between features. Redundancy and dependency are handled in the second stage, in which BPSO and the classifiers work together to select the highly discriminating features. As a result, IG-BPSO combines the advantages of filter and wrapper approaches and selects high-quality subsets tailored to a specific classifier. The scheme of the proposed algorithm is shown in Fig. 1.

The detailed steps are as follows. In the first stage, the original data are preprocessed by the IG filter: each feature is evaluated and sorted according to the IG criterion, and the first N features are selected to form a new subset. The dataset for the second stage is generated according to the feature set obtained by IG, i.e., each sample used in the second stage contains only the items corresponding to the features retained by IG.

Fig. 1. The general scheme of the IG-BPSO algorithm: the foreign fiber dataset passes through a filter stage (IG feature selection) and a wrapper stage (BPSO feature selection evaluated by classifiers) to produce the optimal subsets.

In the second stage, a wrapper approach combining BPSO and the classifiers is used to accomplish the feature selection in the reduced feature set. The goodness of the subsets is evaluated by the classification accuracy of a given classifier.

3.1. Information gain (IG) filter

IG [15] is a measure of the dependence between a feature and the class label. It is one of the most popular feature selection techniques as it is easy to compute and simple to interpret. The IG of a feature X and the class labels Y is calculated as

IG(X, Y) = H(X) − H(X|Y)    (1)

Entropy (H) is a measure of the uncertainty associated with a random variable. H(X) and H(X|Y) are the entropy of X and the entropy of X after observing Y, respectively:

H(X) = −∑_i P(x_i) log2(P(x_i))    (2)

H(X|Y) = −∑_j P(y_j) ∑_i P(x_i|y_j) log2(P(x_i|y_j))    (3)

The maximum value of information gain is 1. A feature with high information gain is relevant. IG is evaluated independently for each feature, and the features with the top-k values are selected as the relevant features. The IG approach is an effective supervised feature selection method that has been widely applied in many real applications.
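To make Eqs. (1)–(3) concrete, the following minimal sketch estimates the information gain of a single discrete (or discretized) feature from empirical frequencies. The function names and the toy data are illustrative assumptions, not part of the system described in this paper.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy H(X) = -sum_i P(x_i) log2 P(x_i)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature_values, class_labels):
    """IG(X, Y) = H(X) - H(X|Y) for a discrete feature X and class labels Y."""
    n = len(feature_values)
    h_x = entropy(feature_values)
    h_x_given_y = 0.0
    for y, count in Counter(class_labels).items():
        # H(X | Y = y), weighted by the empirical P(y)
        subset = [x for x, lab in zip(feature_values, class_labels) if lab == y]
        h_x_given_y += (count / n) * entropy(subset)
    return h_x - h_x_given_y

# Toy usage: a binary feature that tracks the class perfectly has IG = 1.
x = [0, 0, 1, 1, 0, 1]
y = ['a', 'a', 'b', 'b', 'a', 'b']
print(information_gain(x, y))  # 1.0
```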

3.2. BPSO wrapper

PSO is a population-based search technique first proposed by Kennedy and Eberhart [16], motivated by the social behavior of organisms such as bird flocking and fish schooling. It is based on swarm intelligence and is well suited for combinatorial optimization problems in which the optimization surface possesses many local optima. The underlying idea of PSO is that knowledge is optimized by social interaction: thinking is not only personal but also social. The particles in PSO are similar to the chromosomes in GA. However, PSO is usually easier to implement than GA, as there are neither crossover nor mutation operators in PSO and the movement from one solution to another is achieved through the velocity functions. PSO was first introduced for optimization in continuous spaces; to handle discrete problems, a discrete binary particle swarm optimization was proposed by Kennedy and Eberhart [16]. In this study, the discrete binary particle swarm optimization (BPSO) is adopted for feature selection.

PSO is based on the principle that each solution can be represented as a particle in a swarm. Each particle has a position and a corresponding fitness value evaluated by the fitness function to be optimized. The particles iterate from one position to another according to their most recent velocity vector. This velocity vector is determined by the particle's own experience as well as the experience of other particles, using the best positions encountered by the particle and by the swarm. Specifically, the velocity vector of each particle is calculated by updating the previous velocity with two best values. The first is the particle's personal best (pbest), i.e., the best position it has visited so far, which is tracked by each particle. The other is tracked by the swarm and corresponds to the best position visited by any particle in the population; this value is called the global best (gbest). The effect of the personal best and the global best on the velocity update is controlled by weights called learning factors. Through this joint self- and swarm-based updating, PSO achieves local and global search capabilities, where intensification and diversification are balanced via relative weighting.

Considering a d-dimensional search space, the ith particle is represented as X_i = (x_i,1, x_i,2, …, x_i,d), with velocity V_i = (v_i,1, v_i,2, …, v_i,d). The best previous position of the ith particle is represented as P_i = (p_i,1, p_i,2, …, p_i,d), and the best particle in the population is represented as P_g = (p_g,1, p_g,2, …, p_g,d). Each particle updates its position and velocity according to the two best values at each iteration.

Table 1
The algorithm of BPSO for FS.

Algorithm: BPSO for FS
Input: m: the number of particles; c1, c2: acceleration coefficients; wmax: maximum of the inertia weight; wmin: minimum of the inertia weight; vmax: maximum velocity of the particles; Itermax: maximum number of iterations.
Output: Pgbest: the global best solution.
Begin
1.  pg ← 0;
2.  Generate a random population of m particles;
3.  for each particle i
4.      Calculate fitness(i);
5.      p(i) ← fitness(i);
6.      if fitness(i) is better than pg then
7.          pg ← fitness(i);
8.          Pgbest ← i;
9.      end if
10. end for
11. do until Itermax has been reached
12.     for each particle i
13.         Calculate the particle velocity according to Eq. (5);
14.         Calculate the particle position according to Eq. (6);
15.         Calculate fitness(i);
16.         if fitness(i) is better than p(i) then
17.             p(i) ← fitness(i);
18.         end if
19.         if fitness(i) is better than pg then
20.             pg ← fitness(i);
21.             Pgbest ← i;
22.         end if
23.     end for
24.     Update the inertia weight w;
25. end do
26. Return Pgbest;
End

In binary PSO, each bit of a particle moves in a state space restricted to zero and one, and takes the value one or zero according to a probability. If the velocity is high, the bit is more likely to take the value 1; lower velocities favor 0. A sigmoid function is applied to transform the velocity from the continuous space into a probability:

sig(v_i,j) = 1 / (1 + exp(−v_i,j)),  j = 1, 2, …, d    (4)

After each iteration t, the velocity is updated as follows:

v_i,j^(t+1) = w × v_i,j^t + c1 × r1 × (p_i,j^t − x_i,j^t) + c2 × r2 × (p_g,j^t − x_i,j^t)    (5)

where w is the inertia weight, which is used to balance global exploration and local exploitation: a large inertia weight facilitates global search, while a small inertia weight facilitates local search. The parameters c1 and c2 are acceleration coefficients, which represent the magnitude of the influence on the particle's velocity in the directions of the personal and the global optima, respectively. The parameters r1 and r2 are random numbers in the range [0, 1]. x_i,j, p_i,j and p_g,j belong to {0, 1}. Particles' velocities are limited to a maximum velocity vmax, which determines how large a step through the solution space each particle is allowed to take. If vmax is too small, particles may not explore sufficiently beyond locally good regions and could become trapped in local optima. On the other hand, if vmax is too high, particles might fly past good solutions. The new particle position is updated using the following rule:

x_i,j^(t+1) = 1 if rnd < sig(v_i,j); 0 if rnd ≥ sig(v_i,j),  j = 1, 2, …, d    (6)

where sig(v_i,j) is calculated according to Eq. (4), and rnd is a uniform random number in [0, 1]. Designing the fitness function is important for the BPSO-based wrapper approach, since it determines how the quality of each subset is evaluated. The fitness evaluation involves two aspects: (1) the accuracy on the validation data and (2) the number of features used. In our context, we use the classifier accuracy as the fitness function. The algorithm of BPSO for FS is presented in Table 1; a code-level sketch is given below.
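The following is a minimal sketch of BPSO-based feature selection following Table 1 and Eqs. (4)–(6). The parameter defaults are chosen to match Table 2, while the fitness callable and the constant inertia weight are illustrative assumptions rather than the authors' exact implementation.

```python
import math
import random

def bpso_feature_selection(fitness, d, m=50, iters=30,
                           c1=2.0, c2=1.5, w=1.0, v_max=1.0):
    """Return (best_mask, best_fitness) found by binary PSO.
    fitness: callable mapping a 0/1 list of length d to a score (higher is better);
             a real fitness should also handle the all-zero mask.
    d: number of candidate features. The inertia weight w is kept constant here,
    whereas the paper updates it between wmin and wmax at each iteration."""
    X = [[random.randint(0, 1) for _ in range(d)] for _ in range(m)]   # positions (masks)
    V = [[random.uniform(-v_max, v_max) for _ in range(d)] for _ in range(m)]
    P = [x[:] for x in X]                        # personal best positions
    p_fit = [fitness(x) for x in X]              # personal best fitness values
    g = max(range(m), key=lambda i: p_fit[i])
    G, g_fit = P[g][:], p_fit[g]                 # global best position and fitness

    for _ in range(iters):
        for i in range(m):
            for j in range(d):
                r1, r2 = random.random(), random.random()
                # Eq. (5): velocity update, clamped to [-v_max, v_max]
                V[i][j] = (w * V[i][j] + c1 * r1 * (P[i][j] - X[i][j])
                                       + c2 * r2 * (G[j] - X[i][j]))
                V[i][j] = max(-v_max, min(v_max, V[i][j]))
                # Eqs. (4) and (6): sigmoid transform, then probabilistic bit assignment
                sig = 1.0 / (1.0 + math.exp(-V[i][j]))
                X[i][j] = 1 if random.random() < sig else 0
            f = fitness(X[i])
            if f > p_fit[i]:                     # update personal best
                p_fit[i], P[i] = f, X[i][:]
                if f > g_fit:                    # update global best
                    g_fit, G = f, X[i][:]
    return G, g_fit
```

In the paper, the fitness is the cross-validated accuracy of a classifier (SVM or kNN) restricted to the features selected by the mask.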


3.3. Classifier

To evaluate the feature subsets, two classifiers are adopted: support vector machines (SVM) and the k-nearest neighbor (kNN) classifier.

3.3.1. SVM classifier

SVM is a powerful classification algorithm that has shown state-of-the-art performance in a variety of classification tasks. The first generation of SVMs was designed only for binary classification. However, most real-life diagnostic tasks are not binary, and solving multiclass classification problems is much harder than solving binary ones. Fortunately, several algorithms have been proposed for extending binary SVM to multiclass classification [3]. When the data are linearly separable, SVM computes the hyperplane that maximizes the margin between the training examples and the class boundary. When the data are not linearly separable, the examples are mapped to a high-dimensional space in which such a separating hyperplane can be found; the mechanism that defines this mapping is called the kernel function. The reason why SVM insists on finding the maximum-margin hyperplane is that it offers the best generalization ability: it not only allows the best classification performance (e.g., accuracy) on the training data, but also leaves much room for the correct classification of future data. To ensure that the maximum-margin hyperplane is actually found, an SVM classifier attempts to maximize the following function with respect to w⃗ and b:

L_P = (1/2)‖w⃗‖² − ∑_{i=1}^{t} α_i y_i (w⃗ · x⃗_i + b) + ∑_{i=1}^{t} α_i    (7)

where t is the number of training examples and α_i, i = 1, …, t, are non-negative numbers such that the derivatives of L_P with respect to α_i are zero. The α_i are the Lagrange multipliers, and L_P is called the Lagrangian. In this equation, the vector w⃗ and the constant b define the hyperplane. SVM has already been applied to the detection of foreign fibers in cotton and has shown excellent classification performance [3,4].
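The paper evaluates candidate feature subsets by the cross-validated accuracy of the chosen classifier. A minimal sketch of that evaluation step for SVM is given below; the use of scikit-learn and the RBF-kernel settings are assumptions (the authors' experiments were run in Matlab), and the feature matrix, labels and mask are hypothetical inputs.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_subset_accuracy(X, y, mask, folds=10):
    """Mean cross-validated accuracy of an SVM restricted to the features
    selected by the 0/1 (or boolean) mask over the columns of X (a numpy array)."""
    X_sub = X[:, [bool(b) for b in mask]]          # keep only the selected features
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')  # multiclass handled one-vs-one
    return cross_val_score(clf, X_sub, y, cv=folds, scoring='accuracy').mean()

# Hypothetical usage as a BPSO fitness on a 75-dimensional feature matrix X:
# fitness = lambda mask: svm_subset_accuracy(X, y, mask)
# best_mask, best_acc = bpso_feature_selection(fitness, d=X.shape[1])
```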

3.3.2. kNN classifier

To validate the proposed algorithm, we also use the kNN classifier to evaluate the selected feature subsets. One of the simplest, rather trivial, classifiers is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of the test object match one of the training examples exactly. An obvious drawback of this approach is that many test records will not be classified because they do not exactly match any of the training records. A more sophisticated approach, k-nearest neighbor (kNN) classification [17], finds a group of k objects in the training set that are closest to the test object, and bases the assignment of a label on the predominance of a particular class in this neighborhood. There are three key elements in this approach: a set of labeled objects, e.g., a set of stored records; a distance or similarity metric to compute the distance between objects; and the value of k, the number of nearest neighbors. To classify an unlabeled object, its distance to the labeled objects is computed, its k nearest neighbors are identified, and the class labels of these nearest neighbors are then used to determine the class label of the object. Given a training set D and a test object z = (x', y'), the algorithm computes the distance (or similarity) between z and all the training objects (x, y) ∈ D to determine its nearest-neighbor list D_z. (Here x is the data of a training object and y is its class; likewise, x' is the data of the test object and y' is its class.) Once the nearest-neighbor list is obtained, the test object is classified based on the majority class of its nearest neighbors:

Majority voting:  y' = argmax_v ∑_{(x_i, y_i) ∈ D_z} I(v = y_i)

where v is a class label, y_i is the class label of the ith nearest neighbor, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.
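A minimal sketch of the majority-voting rule above, with Euclidean distance as an assumed metric and toy data for illustration:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs. Returns the majority label
    among the k nearest neighbours of test_point (Euclidean distance)."""
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)   # tallies I(v = y_i)
    return votes.most_common(1)[0][0]

# Toy usage:
train = [([0.0, 0.0], 'A'), ([0.1, 0.2], 'A'), ([1.0, 1.0], 'B'), ([0.9, 1.1], 'B')]
print(knn_predict(train, [0.2, 0.1], k=3))  # -> 'A'
```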

kNN classifiers are lazy learners, that is, models are not built explicitly, unlike eager learners (e.g., decision trees, SVM, etc.). Building the model is therefore cheap, but classifying unknown objects is relatively expensive, since it requires computing the k nearest neighbors of the object to be labeled. This, in general, requires computing the distance of the unlabeled object to all the objects in the labeled set, which can be expensive particularly for large training sets. A number of techniques have been developed for efficient computation.

4. Experimental results and discussions

In this section, we perform comprehensive experiments to validate IG-BPSO and compare it with other FS algorithms on foreign fiber datasets.

4.1. Data preparation

A large number of foreign fiber images were acquired using our experimental platform, and 254 representative images, 4000 pixels wide and 500 pixels high, were selected for the experiments. Fig. 2 shows several examples.


Fig. 2. Example images of foreign fibers in cotton: (a) hair, (b) black plastic film, (c) red cloth, (d) hemp rope, (e) red polypropylene, (f) black feather.

Table 2
BPSO parameters.

Parameter                          Value
The number of particles            50
The number of generations          30
The acceleration coefficient c1    2
The acceleration coefficient c2    1.5
Maximum velocity vmax              1

The 254 representative images consist of 48 images of hair, 42 images of black plastic film, 42 images of red cloth, 42 images of hemp rope, 40 images of polypropylene, and 40 images of black feather. Then, image segmentation is performed; consequently, a total of 702 foreign fiber objects are generated, including 51 hair objects, 102 black plastic film objects, 108 cloth objects, 123 rope objects, 132 polypropylene twine objects, and 186 feather objects. Afterwards, we extract the features from these objects. In the experiments, 75 features in total are extracted to form a 75-dimensional feature vector, including 27 color features, 41 texture features and 7 shape features. Finally, all the samples are generated and normalized to reduce the impact of differing feature scales.

4.2. Experiment setting

All experiments are conducted on an Intel(R) Core(TM)2 Quad processor with a CPU clock rate of 2.66 GHz and 4.0 GB of main memory, running the Windows 7 operating system. All the algorithms are implemented in Matlab 2010b. 10-fold cross-validation is used to estimate the classification performance: the data are first partitioned into 10 nearly equally sized folds, and ten iterations of training and validation are then performed such that, within each iteration, a different fold is held out for validation while the remaining 9 folds are used for learning. The parameters of BPSO are set according to Table 2. These parameters were selected after several test evaluations of each algorithm, until the best configuration in terms of solution quality and computational effort was reached.

4.3. Results and discussion

First, we performed 10-fold cross-validation with the SVM and kNN classifiers on the foreign fiber dataset without feature selection. The results are shown in Table 3: the number of features is 75, and the classification accuracy of SVM and kNN is 86.3% and 85.4%, respectively. Then we use IG to evaluate each feature and compute the classification accuracy of the top-k features with SVM and kNN, respectively. Fig. 3 shows the classification accuracy of the subsets for different values of k. For SVM, the subset consisting of the top 62 features has the highest classification accuracy, 91.2%; for kNN, the subset consisting of the top 66 features has the highest classification accuracy, 89.4%.

Fig. 3. Curves of average accuracy of the top-k feature sets selected by IG (classification accuracy of SVM and of kNN versus the number of features).

Table 3
Classification accuracy of the original feature set.

Set     Average accuracy (%)    Number of features
kNN     85.4                    75
SVM     86.3                    75

Table 4
Comparison of BPSO and IG-BPSO.

Methods        Average accuracy (%)    Average number of features
BPSO-kNN       86.37                   39
IG-BPSO-kNN    90.01                   35
BPSO-SVM       90.62                   42
IG-BPSO-SVM    91.48                   34

For IG-BPSO, the dataset used in the second stage is constructed as follows: (1) each feature is evaluated by IG; (2) all the features are sorted by their scores in descending order; (3) the feature subsets consisting of the top-k features are evaluated by classification accuracy; (4) the feature subset with the highest accuracy is used to generate the dataset for the second stage. In the experiments, the datasets for the second stage are generated from the top 62 features with SVM and the top 66 features with kNN, respectively. Based on these new datasets, the final results of IG-BPSO are shown in Table 4. For comparison, we also run BPSO on the original dataset, and those results are also listed in Table 4. For both SVM and kNN, IG-BPSO finds optimal subsets with smaller size and higher accuracy than BPSO.

The FS processes of the two algorithms with SVM are shown in Fig. 4. IG-BPSO finds the optimal subset in fewer iterations than BPSO, and the optimal subset found by IG-BPSO has higher classification accuracy than the subset found by BPSO. The reason is that some noisy and irrelevant features are eliminated in the first stage.

Fig. 5 compares the performance of the original feature set and of the subsets obtained by the three algorithms. In Fig. 5(a), for SVM, IG-BPSO has the best performance, selecting the subset with the highest classification accuracy, 91.48%; the subset obtained by IG has higher classification accuracy than that obtained by BPSO, and the original set performs worst, at only 86.3%. In Fig. 5(b), for kNN, IG-BPSO also has the best performance, selecting the subset with the highest classification accuracy, 90.01%; again the subset obtained by IG has higher classification accuracy than that obtained by BPSO, and the original set performs worst, at only 85.4%.

Fig. 6 compares the number of features of the original feature set and of the subsets obtained by the three algorithms. In Fig. 6(a), for SVM, the numbers of features of the subsets obtained by IG-BPSO, BPSO and IG are 34, 42 and 62, respectively, all smaller than the 75 features of the original set; IG-BPSO finds the subset with the smallest size. In Fig. 6(b), for kNN, the numbers of features of the subsets obtained by IG-BPSO, BPSO and IG are 35, 39 and 66, respectively; the subset obtained by IG-BPSO again has the smallest size. We can therefore conclude that IG-BPSO finds better subsets, with smaller size and higher accuracy, than IG and BPSO.


Fig. 4. Curves of fitness of BPSO and IG-BPSO with SVM (fitness value versus iterations).

Fig. 5. Comparisons of the classification accuracy of the original set and of the subsets obtained by the three algorithms: (a) with SVM, (b) with kNN.

Fig. 6. Comparisons of the number of features of the original set and of the subsets obtained by the three algorithms: (a) with SVM, (b) with kNN.


Table 5
Parameter settings of GA.

Parameter                Value
Number of population     50
Number of generations    30
Crossover rate           0.9
Mutation rate            0.05

Fig. 7. Curves of average accuracy of the top-k feature sets: (a) Fisher Score, (b) Gini Index.

Then, we test the computational cost of IG, BPSO and IG-BPSO. For BPSO and IG-BPSO, the kNN classifier is used to evaluate the subsets, and their parameters are set according to Table 2. Their average running times are 0.34 s, 107.48 s and 95.95 s, respectively. The results show that the computational cost of the proposed algorithm is lower than that of BPSO. The proposed algorithm does not prolong the detection time; on the contrary, using the feature sets it obtains can shorten the detection time and improve detection efficiency. The reason is that the feature selection approach is not integrated into the online detection system; the online detection system only uses the feature sets obtained by the proposed algorithm.

To further validate the proposed algorithm, we make comparisons with three other algorithms, i.e., Fisher Score [15], Gini Index [15] and GA [18]. Fisher Score selects the features that assign similar values to the samples from the same class and different values to samples from different classes. The evaluation criterion used in Fisher Score can be formulated as:

SC(f_i) = ∑_{j=1}^{c} n_j (μ_i,j − μ_i)² / ∑_{j=1}^{c} n_j σ_i,j²    (8)

where μ_i is the mean of the feature f_i, n_j is the number of samples in the jth class, and μ_i,j and σ_i,j² are the mean and the variance of f_i on class j, respectively. Fisher Score is an effective supervised feature selection algorithm, which has been widely applied in many real applications. Gini Index is a measure for quantifying a feature's ability to distinguish between classes. Given C classes, the Gini Index of a feature f can be calculated as:

GI(f) = 1 − ∑_{i=1}^{C} [p(i|f)]²    (9)

The Gini Index of each feature is calculated independently, and the top k features with the smallest Gini Index are selected.
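A minimal sketch of the two comparison criteria, assuming numpy arrays for the data. Eq. (9) is stated for a feature value; the sketch below averages the purity term over a feature's observed values, which is one common reading and an interpretive assumption on our part.

```python
import numpy as np

def fisher_score(x, y):
    """Eq. (8): between-class scatter over within-class variance for one feature
    x (1-D array) given class labels y."""
    mu = x.mean()
    num = den = 0.0
    for c in np.unique(y):
        xc = x[y == c]                      # samples of class c
        num += xc.size * (xc.mean() - mu) ** 2
        den += xc.size * xc.var()
    return num / den

def gini_index(x, y):
    """Class purity of a discrete feature x: 1 - sum_i p(i|f)^2 per feature value,
    weighted by the frequency of each value (one reading of Eq. (9))."""
    gi = 0.0
    for v in np.unique(x):
        labels = y[x == v]
        _, counts = np.unique(labels, return_counts=True)
        p = counts / labels.size
        gi += (labels.size / y.size) * (1.0 - np.sum(p ** 2))
    return gi
```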

GA is an optimization algorithm with a probabilistic component that provides a means to search poorly understood, irregular spaces. GA works with a population of chromosomes; each chromosome is a vector in hyperspace representing one potential solution to the optimization problem, and a population is an ensemble or set of such hyperspace vectors. As Fisher Score and Gini Index are filter approaches, we select their top-k feature subsets with the highest classification accuracy for comparison with the subsets obtained by the other algorithms. For GA, the parameters are set according to Table 5.

Fig. 7 shows the classification accuracy of the top-k feature sets. As can be seen, (1) for Fisher Score, the subset with 65 features has the highest average accuracy, 90.57%; (2) for Gini Index, the subset with 56 features has the highest average accuracy, 90.60%. Table 6 shows the performance of the four algorithms. The proposed algorithm selects the subset with the smallest size and the highest average accuracy among the four algorithms.


Table 6
Comparison of the results of the four algorithms.

Methods        Average accuracy (%)    Average number of features
Fisher Score   90.57                   65
Gini Index     90.60                   56
GA             91.07                   37
IG-BPSO        91.48                   34

Table 7
Comparison of BPSO and IG-BPSO with different classifiers.

Classifiers    Average accuracy (%)         Average number of features
               BPSO        IG-BPSO          BPSO        IG-BPSO
kNN            86.37       90.01            39          35
SVM            90.62       91.48            42          34
BP             89.02       91.25            44          36
RBF            90.57       91.77            42          35
PNN            86.84       87.28            47          42

The reason is that, in the first stage, the filter approach decreases the effect of the noisy and irrelevant features, which improves the performance of BPSO. Consequently, the proposed algorithm can find subsets with the smallest size and high accuracy for a given classifier.

Finally, we also tested the proposed method with three other classifiers, namely BP (back propagation, a well-known multilayer perceptron algorithm), RBF (radial basis function network) and PNN (probabilistic neural network), and compared the results with those of SVM and kNN. In the experiments, we adopted the implementations of these three algorithms in the Matlab neural network toolbox, and set their parameters empirically to achieve the best performance of each algorithm on the foreign fiber dataset. The results are shown in Table 7. For all of the classifiers, the proposed method performs better than BPSO: the classification accuracy of its subsets is higher and their size is smaller. For RBF, the classification accuracy is the highest among the five classifiers, 91.77%; for SVM, the subset size is the smallest among the five classifiers, 34.

5. Conclusion

An important issue in machine-vision-based online detection systems for foreign fibers in cotton is to find the optimal feature subset. In this study, we have introduced a novel scheme for feature selection of foreign fibers in cotton that combines an IG filter and a BPSO wrapper. The new scheme is a two-stage process: it first applies the IG filter to select a compact yet effective candidate set, and then uses a BPSO wrapper, which combines the BPSO search strategy with a learning algorithm as the fitness function, to accomplish the subset selection of foreign fibers in cotton. The experiments show that the proposed IG-BPSO outperforms the IG filter and the BPSO wrapper, and the experimental comparisons demonstrate the effectiveness of IG-BPSO. The selected optimum feature set is of great significance for machine-vision-based online detection systems of foreign fibers in cotton, because its small size and high accuracy can improve the performance of the online detection system. Our future work will focus on improving the performance of classifiers for better online detection of cotton foreign fibers.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (NSFC) (61303113, 61373053, 61402195, 61471133, 61571444 and 61572226). This research is also funded by the Special Fund for Agro-scientific Research in the Public Interest (201203017), the National Science and Technology Supporting Plan Project (2012BAD35B07), the National Natural Science Foundation Framework Project (61471133), the Guangdong Science and Technology Plan Project (2013B090500127 and 2013B021600014), the Guangdong Natural Science Foundation (S2013010014790), and the Science and Technology Plan Project of Wenzhou of China under Grant No. G20140048.

References

[1] Yang W, Li D, Zhu L, Kang Y, Li F. A new approach for image processing in foreign fiber detection. Comput Electron Agric. 2009;68(1):68–77.
[2] Yang W, Li D, Wei X, Kang Y, Li F. AVI system for classification of foreign fibers in cotton. Trans Chin Soc Agric Mach. 2009;40(12):177–81.
[3] Ji R, Li D, Chen L, Yang W. Classification and identification of foreign fibers in cotton on the basis of a support vector machine. Math Comput Model. 2010;51(11):1433–7.
[4] Li D, Yang W, Wang S. Classification of foreign fibers in cotton lint using machine vision and multi-class support vector machine. Comput Electron Agric. 2010;74(2):274–9.
[5] Yang W, Lu S, Wang S, Li D. Fast recognition of foreign fibers in cotton lint using machine vision. Math Comput Model. 2011;54(3):877–82.
[6] Lin J, Ke H, Chien B, Yang W. Classifier design with feature selection and feature extraction using layered genetic programming. Expert Syst Appl. 2008;34(2):1384–93.


[7] Narendra PM, Fukunaga K. A branch and bound algorithm for feature subset selection. IEEE Trans Comput. 1977;100(9):917–22.
[8] El Akadi A, Amine A, El Ouardighi A, Aboutajdine D. A two-stage gene selection scheme utilizing MRMR filter and GA wrapper. Knowl Inf Syst. 2011;26(3):487–500.
[9] Khalid S, Khalil T, Nasreen S. A survey of feature selection and feature extraction techniques in machine learning. In: Science and Information Conference, IEEE; 2014. p. 372–8.
[10] Yang W, Li D, Zhu L. An improved genetic algorithm for optimal feature subset selection from multi-character feature set. Expert Syst Appl. 2011;38(3):2733–40.
[11] Zhao X, Li D, Yang W, Chen G. Feature selection based on ant colony optimization for cotton foreign fiber. Sensor Lett. 2011;9(3):1242–8.
[12] Zhao X, Li D, Yang B, Ma C, Zhu Y, Chen H. Feature selection based on improved ant colony optimization for online detection of foreign fiber in cotton. Appl Soft Comput. 2014;24:585–96.
[13] Li H, Wang J, Yang W, Liu S, Li Z, Li D. Feature selection for cotton foreign fiber objects based on PSO algorithm. In: Computer and Computing Technologies in Agriculture V; 2012. p. 446–52.
[14] Sun X, Liu Y, Li J, Zhu J, Chen H, Liu X. Feature evaluation and selection with cooperative game theory. Pattern Recognit. 2012;45(8):2992–3002.
[15] Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H. Advancing feature selection research. ASU Feature Selection Repository; 2010.
[16] Kennedy J, Eberhart RC. A discrete binary version of the particle swarm algorithm. In: Proceedings of the IEEE conference on systems, man and cybernetics; 1997. p. 4104–8.
[17] Bijalwan V, Vinay K, Pinki K, Jordan P. KNN based machine learning approach for text and document mining. Int J Database Theory Appl. 2014;7(1):61–70.
[18] Oreski S, Oreski G. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Syst Appl. 2014;41(4):2052–64.

Xuehua Zhao received the Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2014. His main research interests are related to machine learning and data mining.

Daoliang Li received the Ph.D. degree from the College of Information and Electrical Engineering, China Agricultural University, in 1999. He is currently a professor in the College of Information and Electrical Engineering, China Agricultural University. His current research interests are in the areas of data mining and knowledge engineering, with applications to agricultural engineering.

Bo Yang is currently a professor in the College of Computer Science and Technology, Jilin University. He received his Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2003. His current research interests are in the areas of data mining and knowledge engineering.

Huiling Chen received the Ph.D. degree from the College of Computer Science and Technology, Jilin University, in 2012. His main research interests are related to machine learning and data mining.

Xinbin Yang received the Ph.D. degree from the College of Information Science and Engineering, East China University of Science and Technology, in 2003. His current research interests are in the areas of data mining and knowledge engineering.

Chenglong Yu received the Ph.D. degree from the College of Computer Science and Technology, Harbin Institute of Technology, in 2013. His current research interests are in the areas of data mining and knowledge engineering.
Shuangyin Liu received the Ph.D. degree from the College of Information and Electrical Engineering, China Agricultural University, in 2014. His current research interests are in the areas of data mining and knowledge engineering, with applications to agricultural engineering.