Pattern Recognition 45 (2012) 3676–3686
Contents lists available at SciVerse ScienceDirect
Pattern Recognition journal homepage: www.elsevier.com/locate/pr
Application of global optimization methods to model and feature selection Abderrahmane Boubezoul a, Se´bastien Paris b,n a b
Paris-Est University, IFSTTAR, IM, LEPSIS, F-75732 Paris, France Laboratory of Sciences of Information’s and of System, Aix-Marseille University, LSIS/DYNI UMR CNRS 7296, Av Escadrille de Normandie Niemen, 13397 Marseille Cedex 20, France
a r t i c l e i n f o
a b s t r a c t
Article history: Received 2 November 2011 Received in revised form 30 March 2012 Accepted 13 April 2012 Available online 23 April 2012
Many data mining applications involve the task of building a model for predictive classification. The goal of this model is to classify data instances into classes or categories of the same type. The use of variables not related to the classes can reduce the accuracy and reliability of classification or prediction model. Superfluous variables can also increase the costs of building a model particularly on large datasets. The feature selection and hyper-parameters optimization problem can be solved by either an exhaustive search over all parameter values or an optimization procedure that explores only a finite subset of the possible values. The objective of this research is to simultaneously optimize the hyperparameters and feature subset without degrading the generalization performances of the induction algorithm. We present a global optimization approach based on the use of Cross-Entropy Method to solve this kind of problem. & 2012 Elsevier Ltd. All rights reserved.
Keywords: Cross-Entropy Method Feature selection Particle swarm optimization Hyper-parameters optimization Support vector machines
1. Introduction In most cases, a dataset consisting of record measurements [8] (also referred to as attributes, features) is too large to be just presented to machine learning algorithm and in some cases the number of attributes is much larger than the number of examples (bioinformatics). Therefore feature selection is a fundamental crucial processing step in the analysis of this type of data. The main goal of feature selection is to reduce the number of features by eliminating redundant features or by selecting the most informative features which capture the relevant properties of data. Features selection is important for several benefits: reducing the measurement and storage requirements of data, reducing computation time, facilitating data visualization and data understanding. In supervised learning, feature selection can improve the generalization capability of the classifiers. Feature selection is particularly important for datasets with large numbers of features (microarray DNA, Webmining, etc.) and in some cases where the number of features is greater than observations (biology). Feature selection has been widely studied in the context of supervised learning (see [2,24,19]), where the ultimate goal is to select subset of features that can improve prediction performance of the classifier. Features selection algorithms can be divided into two categories [2,24]: filters and wrappers. The filter methods attempt to select features based on simple auxiliary criteria, such as (correlation, mutual information, etc.), to remove redundant features.
n
Corresponding author. Tel.: þ33 4 91 95 60 66. E-mail address:
[email protected] (S. Paris).
0031-3203/$ - see front matter & 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2012.04.015
Such approaches are considered as preprocessing step because they decouple (there is no interaction between) the bias of feature selection (and the bias of the classification algorithm) for the performance evaluation of the classifiers. The filter approaches evaluate the relevance of each feature using the dataset alone, the two well-known of filter-based methods are RELIEF [23] and its variant RELIEFF [26], where the basic idea is to assign a score relevance to each feature, and select the features with high relevance scores. Some drawbacks are related to such techniques, the selection of features individually may lead to filtering some variables which can be useful if we take them with other variables. Selecting relevant features regardless of the correlation in between does not guaranty to filtering redundant features. To overcome these weaknesses wrapper methods were proposed. Wrapper methods attempt to evaluate the relevance of features subset by estimating the learning accuracy, in other words, the learning algorithm (e.g. a nearest neighbor classifier, a neural network, a naive Bayes method, and support vector machines) is used to qualify the selected subset by estimating its generalization capability on unseen data. Wrappers are often criticized for their computationally demanding, but they can be superior in accuracy when compared with filters, because they couple the bias of feature selection and the bias of the learning algorithm [24]. Subset feature selection involves combinatorial searches through the feature space; an exhaustive search can conceivably be performed, if the number of variables is not too large. However, the problem is known to be NP-hard and the search becomes quickly computationally intractable [2]. Different types of heuristics, such as sequential forward or backward searches, and genetic search have been proposed in the literature
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
[2,4,35,48]. In other words, the number of possible subsets is restricted by the greedy selection procedure, a variety of greedy algorithms have been suggested to select features sets sequentially, the two well-known of them (Forward selection, Backward elimination). Forward selection method starts from an empty set of features and at each step add a single feature which produces the greatest increase in performance. Conversely Sequential Backward elimination starts from full set of features and at each step eliminates the least promising ones. These two methods show that they are robust against over-fitting. In order to solve the problem of feature selection globally, the stochastic approaches like PSO, Simulated Annealing, Genetic Algorithm were recently proposed (see [15,6,44,12,30]). In this paper we will introduce one of these methods which belongs to the Rare events simulation framework, called Cross-Entropy Method. This method derives its name from the cross-entropy (or Kullback–Leibler) distance, a well known measure of ‘‘information’’. Many feature selection algorithms used the Kullback– Leibler (KL) divergence to select a subset of features by assigning the degree of interclass separability associated with each subset considered (see [25,32]). Using the Kullback–Leibler divergence as a feature selection criterion have been successfully applied in many research areas, such a text categorization problem, see Zhen et al. [49]. The feature selection method presented in this paper is different from the feature selection algorithms mentioned above. In our approach we are not assessing the features individually using the (KL) divergence as interclass distance measure. The objective of our research is to simultaneously optimize the feature subset and the parameters of the inductive algorithm, using a stochastic optimization method (CEM). A similar idea of using the Cross-Entropy (CE) Method in order to implement a feature selection algorithm was recently presented in Peters et al. [34]. The authors applied the CEM for the gene selection in a microarray data. They defined an objective function based on the number of the selected genes and associated that feature selection to Linear Discriminant Analysis (LDA) to test the interclass separability of the subset of features. As a future work the authors proposed to find the most effective combination of an objective function, for class discrimination, and optimization algorithm, for use in the construction of robust classifiers for diagnostic use. This is our proposal in this work. The reminder of this paper is organized as follows. In Section 2 we formulate the problem. In Section 3, we present the CE, PSO methods as global optimization procedures and we reformulate them to the hyper-parameters and feature selection problem. In Section 4, we present the results of numerical experiments using our new method on some benchmark datasets and compare results with up to date top scoring algorithms. Finally, we will conclude and propose perspectives in Section 5.
2. Problem formulation To achieve the goal of optimizing simultaneously the hyperparameters and the feature subset of the induction algorithm, we propose to formulate this problem as a task of finding the best set of parameters that minimizes a given objective function, i.e. minh A H fFðhÞg, where h A H represents the vector of input variables, FðhÞ is the scalar objective function and H is the constraint set. Many approaches exist to solve this kind of problem (Gradient based procedures, random search, Meta heuristics, modelbased methods, etc.) (see [17]). We define the collected data by S9fðx1 ,y1 Þ,ðx2 ,y2 Þ, . . . ,ðxN ,yN Þg, where xi 9½xi1 ,xi2 , . . . ,xiv T and y9fyi gi ¼ 1,...,N , yi A Y. N, L denote the number of records in the dataset and the number of classes, respectively. In real applications xi A X Rv is a multi-dimensional
3677
vector and Y ¼ f1, . . . ,Lg. Moreover, we suppose to have a function f : X /Y predicting label class such that y^ ¼ f ðxÞ for a new input vector x A X . The choice of the optimization method is important in many learning problems. We would like to use optimization methods that can handle a large number of features and converging fast. As an alternative to the stochastic gradient descent algorithm we consider the global optimization methods CEM and PSO approach because they offer good results for multi-extremal functional optimization. These methods require no special form of the objective function and neither its derivatives nor heuristic assumptions. We consider this optimization problem as a hybrid combination of continuous and discrete minimization problem with constraints. Hereafter, we will define the objective function to minimize. 2.1. Fitness function In statistical machine learning framework, we ideally want to minimize the total risk defined by Z Rðf Þ ¼ lðy,f ðxÞÞpðx,yÞ dx dy, x,y A X Y
where pðx,yÞ is a joint pdf unknown in practice and lðy,y0 Þ is a specific loss function. Only N i.i.d samples drawn from pðx,yÞ are available from the dataset and consequently only the empirical P risk is tractable, i.e. Remp ðf Þ ¼ ð1=NÞ N i ¼ 1 lðyi ,f ðxi ÞÞ. Under PAC assumptions, Vapnik [43] showed that the following inequality holds with a probability equal to 1Z: sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi h log 2N þ 1 log Z4 h , ð1Þ Rðf Þ r Remp ðf Þ þ N where h designs Vapnik’s dimension. One important remark is to notice that the second right term of Eq. (1) (generalization error) is increasing when h increases. h is closely linked with v (for example for a linear kernel h ¼ v þ 1). So with a given number of training sample N, decreasing v (by a judicious feature selection subset) indirectly decreases h and limits the generalization error term, however, empirical risk may (the first right term of Eq. (1)) increase by this feature selection. So our fitness function must optimize the tradeoff between decreasing the generalization error and increasing the empirical risk. To evaluate classifier performance the most common evaluation metric in machine-learning research is classification accuracy. This accuracy measures the proportion of correctly classified examples and is defined by 1 Remp(f) with a 0–1 loss function i.e. P Acc ¼ ð1=NÞ N i ¼ 1 1fyi ¼ y^ i g where 1fxg is the indicator function equal to 1, when x is true, 0 otherwise. In evaluation of classifier performance we can talk about accuracy (Acc ) or error rate (ER), which is none other than (1Acc ). Our goal in this work is to find the smallest feature set that induce the lowest error rate of the induction algorithm in order to jointly minimize the misclassification rate and the generalization error. For this purpose, the problem can be treated as a monoobjective optimization problem, in which the search is constrained by the size of the feature set and by the specified error rate. We developed a formulation in order to provide control of the tradeoff between the two constraints, necessary when dealing with these kinds of problems. The classification accuracy (or the error rate) and the number of selected features are the two criteria used to design a fitness function. Thus, minimum error rate with simultaneously a small number of features should produce a low fitness value. We solve this problem by creating a single objective fitness function that
3678
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
combines the two goals into one as defined by 8 v X > > < wa U þ wf lj , Fðh; X ,wa ,wf Þ9 j¼1 > > : wa þ wf ¼ 1,
ð2Þ
U represents the mean error rate of the classifier (1Acc )
evaluated by K cross-validation with a specific classifier, see Fig. 1, wa means predefined weight associated to the mean error rate, wf means predefined weight associated to the summation of the selected feature (with nonzero lj Þ. lj indicates the position value in k ¼ ½l1 , l2 , . . . , lv T and is equal to 1 if the jth feature is selected; 0 otherwise.
As an induction algorithm we choose the well known k-NN and support vector machines (SVMs) algorithms [43], SVM is one of the most powerful supervised machine learning algorithms. SVM algorithm often achieves superior classification performance compared to other learning algorithms across the majority of domains and tasks thanks to margin maximization principle. This algorithm is fairly insensitive to high dimensionality; however, the training time is OðvN 2 Þ in the worst case.
2.2. Support vector machines (SVMs) Support vector machines (SVMs), originally designed for binary classification, have been applied for multi-class classification, where an effective fusion scheme is required for combining outputs (from them) and producing a final result. Multi-class extensions have been proposed by many authors (e.g. [43,46,18,11]). The most common approach consists in building a set of binary classifiers, each one either trained to separate one class from the others (one-against-all) or only to distinguish between two classes (one-against-one).
Data set
Test set
Training set
Training set
Validation set
Select features & hyper-parameters Train the algorithm
yi ð/w,xi S þbÞ Z1,
8i ¼ 1, . . . ,N:
Termination Criterion
ð3Þ
w and b are coefficients which are determined by minimizing the regularized risk function: R½f ðxÞ ¼
N 1 1X JwJ2 þ C ‘ðyi ,f ðxi ÞÞ: 2 Ni¼1
ð4Þ
It is known that the classification estimation function is the one that minimizes Eq. (4) with the Hinge loss function ‘ðy,f ðxÞÞ9 maxf0; 1yf ðxÞg. In the case of non-separable data the above algorithm will find no feasible solutions. This problem can be solved by introducing positive slack variables xi , i¼1,y,L, the objective function then becomes 8 ( ) N X > > < min 1JwJ2 þ C xi 2 w,b i¼1 > > : y ð/w,x SÞ þ b Z1x , i
i
i
ð5Þ 8i ¼ 1, . . . ,N,
where C is a penalty parameter on the training error. This problem can be written in dual formulation by introducing Lagrange multipliers a ¼ ½a1 , . . . , aN T , one ai for each of the inequality constraints (3). This dual form gives the following maximization problem: 8 8 9 > N N
1X > > > max ai ai aj yi yj /xi ,xj S > > a : ; 2 > > i¼1 i,j ¼ 1 < 0 r ai r C, 8i ¼ 1, . . . ,N, > > > N >X > > > ai yi ¼ 0: > > :
ð6Þ
i¼1
P n n The prediction class of the vector: f ðxÞ ¼ signð N i ¼ 1 ai yi /xi ,xSþ b Þ n T where an ¼ ½an1 , . . . , anN and b are the solutions of the above problem. In practice, a small number of fani gi ¼ 1,...,N are different from zero, the respective training vectors xi having nonzero ani are called support vectors. In the case, where data is non-linearly separable data, Vapnik proposes to map training data in higher dimensional space H by the function jðÞ through dot products / , S, i.e. on functions of the form Kðx1 ,x2 Þ ¼ /jðx1 Þ, jðx2 ÞS. The dual problem is easily extended: 8 8 9 > N N 1X > > > max ai ai aj yi yj Kðxi ,xj Þ > > a : ; 2 > > i¼1 i,j ¼ 1 < 0 r ai r C, 8i ¼ 1, . . . ,N, > > > N > X > > > ai yi ¼ 0: > > :
Fitness evaluation
No
The binary support vector classification model is based on the structured risk minimization principle. The SVM model seeks to minimize an upper bound of the generalization error instead of the empirical error in the other neural network models. In the case of linear separable data, the basic idea of SVM is to find the optimal hyperplane /w,xSþ b ¼ 0 which separates positive yi ¼ þ 1 and negative yi ¼ 1 training samples and maximizes the margin (2=JwJ) between samples and the optimal hyperplane. /x,yS denotes the scalar product, i.e. /x,yS ¼ xT y. Thus, we can find the hyperplane by minimizing JwJ subject to the following constraints:
ð7Þ
i¼1
Yes
Optimized hyperparameters and features
Performances evaluation
Fig. 1. Architecture of feature and hyper-parameters selection model.
The prediction class of the vector is reformulated by f ðxÞ ¼ sign P n ð N i ¼ 1 ai Kðxi ,xÞ þ bÞ. Compared to the primal formulation, the dual form was introduced in order to handle more easily the constraints and take into account non-linear kernel thanks to the kernel trick.
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
Any function that satisfies Mercer’s condition can serve as the kernel function. Some kernel functions include polynomial, radial basis function (RBF) and sigmoid kernel [9]: (1) Polynomial kernel:
3679
The particle’s velocity and its new position are updated as follows: 8 h ðt þ1Þ ¼ hi ðtÞ þ m i ðt þ 1Þ, i ¼ 1, . . . ,V, > > < i m i ðt þ 1Þ ¼ m i ðtÞ þc1 r1,i ðpi ðtÞhi ðtÞÞm i ðtÞ þc2 r 2,i ðpg ðtÞhi ðtÞÞ, > > m ¼ ½n1 , n2 , . . . , nv T : : i
i
i
ð9Þ
i
Kðx,zÞ ¼ ð/x,zSþ 1Þn , where
where n is the polynomial degree. (2) RBF (radial basis function) kernel:
(1) c1 ,c2 are learning factors, normally set as c1 ¼ c2 ¼ 2, (2) ðr 1,i ,r 2,i Þ U½0; 1.
Kðx,zÞ ¼ expðbJxzJ2 Þ, where b is the kernel width. (3) Sigmoid kernel: Kðx,zÞ ¼ tanhðA/x,zS þBÞ, where fA,Bg A R are two parameters that should be carefully set because Mercer’s condition are not verified for any values of these parameters. Many researches indicate that SVM with RBF kernel has stronger ability for classification [21,28,20]. When using SVM, two problems are confronted: how to choose the optimal input feature subset for SVM, and how to set the best kernel parameters. These two problems are crucial, because the feature subset choice influences the appropriate kernel parameters and vice versa [16]. Therefore, obtaining the optimal feature subset and SVM parameters may occur simultaneously.
At each step, the velocity of all particles is adjusted as a sum of its local best value, global best value and its present velocity, multiplied by the two constants fc1 ,c2 g respectively; the position of each particle is also modified by adding its velocity to the current position. However, even though PSO is a good and fast search algorithm, it has premature convergence, especially in complex search problems. Many variants were proposed to deal with this problem. Shi and Eberhart in [39,40] introduced an inertia weight oðtÞ and velocity clamping to improve the performance of the original PSO algorithm. In this version of the PSO algorithm, the particle’s velocity and its new position are updated as follows: 8 hi ðt þ1Þ ¼ hi ðtÞ þ m i ðt þ 1Þ, i ¼ 1, . . . ,V, > > < mbi ðt þ 1Þ ¼ oðtÞm i ðtÞ þc1 r 1,i ðpi ðtÞhi ðtÞÞm i ðtÞ þ c2 r 2,i ðpg ðtÞhi ðtÞÞ, > > j : ni ðt þ 1Þ ¼ signðnbji ðt þ 1ÞÞminf9nbji ðt þ1Þ9, njmax g, j ¼ 1, . . . ,v, ð10Þ
3. Global optimization methods
where
3.1. Particle swarm optimization (PSO) Particle swarm optimization (PSO) algorithm is relatively recent optimization technique in the global search methods family. Similar to Genetic and Evolutionary algorithms (GAs, EAs), PSO is a population-based probabilistic optimization tool, which searches for optima by updating generations. However, unlike GAs, PSO has no evolution operators such as crossover and mutation. The PSO algorithm has received increasing attention from the optimization community due to its easy implementation and it has few parameters to adjust. The PSO algorithm was originally developed by Kennedy and Eberhart [13]. This algorithm is based on the birds flocking or fish schooling paradigm. The PSO algorithm is a derivative-free method, it is particularly suited to continuous variable problems. Each potential solution is called a particle, and the set of potential solutions in each iteration step is called a population. Each particle flies through the multidimensional search space with a velocity, which is updated by the particle’s previous best performance and by the previous best performance of the particle’s neighbors. The first population is typically initialized using a random number generator to spread the particles uniformly in a user defined hypercube. Suppose that the search space is v-dimensional. The position of the ith particle can be represented by a v-dimensional vector hi ðtÞ and the velocity of this particle is m i ðtÞ. The best previously visited position (up to time t) of the ith particle is represented by pi and the global best position of the swarm found so far is denoted by pg . The fitness of each particle can be evaluated through putting its position into a designated objective function FðhÞ. 8 pi ðtÞ9 arg min fFðhÞg, > > > < h A fhi ðjÞgt j ¼ 0
pg ðtÞ9 arg min fFðhÞg: > > > : h A ffhi ðjÞgtj ¼ 0 gVi ¼ 1
ð8Þ
(1) njmax is the maximum velocity, a user defined parameter, which can be calculated as: njmax ¼ Zðaj bj Þ denote lower and upper bounds for each dimension j, the common value of Z ¼ 12. (2) oðtÞ represents the inertia weight at step t, this parameter can be computed as t T
oðtÞ ¼ o0 ðo0 o1 Þ,
ð11Þ
where o0 A R is the initial inertia weight, o1 A R is the inertia weight for the last iteration, with 0 r o0 r o1 , T is the maximum number of iterations. o0 ¼ 1:2 and o1 ¼ 0 can be considered as good choices (see [33]).
Many other variants were proposed, the most well-known of them is the version of PSO algorithm with a constriction coefficient that reduces the velocity (see [7]). In this paper, we will apply first the continuous PSO to hyperparameters optimization purpose (HCPSO) and we will extend it to a hybrid version to integrate the features selection problem. We will call it (HFSPSO).
3.2. Binary particle swarm optimization (PSO) As the basic particle swarm optimization operates in continuous and real number space, it cannot be used to optimize the pure discrete binary combinational problem. To tackle this drawback, Kennedy and Eberhart introduced a binary PSO (KBPSO) algorithm [22], where the particles take the values of binary vectors of length V and the velocity defines the probability of bit yji to take the value 1. The particle changes its bit value by below
3680
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
equations: 8 mbi ðt þ 1Þ ¼ m i ðtÞ þc1 r1,i ðpi ðtÞhi ðtÞÞm i ðtÞ þc2 r2,i ðpg ðtÞhi ðtÞÞ, > > > > > > bji ðt þ1ÞÞminf9n bji ðt þ1Þ9, njmax g, j ¼ 1, . . . ,v < nji ðt þ 1Þ ¼ signðn > yji ðt þ 1Þ91frj o Sðnj ðt þ 1ÞÞg , > > i i > > > : hi ¼ ½y1i , y2i , . . . , yvi T ,
3.3. Cross-Entropy Method
ð12Þ
where (1) SðÞ is a sigmoid function defined as SðxÞ ¼ 1=ð1 þ ex Þ, (2) rji U½0; 1 with j¼1,y,v and i¼1,y,V, (3) njmax is often set to 4 to prevent a saturation of the sigmoid function, and c1 ,c2 are often chosen such that c1 þ c2 ¼ 4 (see [14]). In this work, the binary PSO (BPSO) was used since the position of each individual particle can be given in binary form (0 or 1), which adequately reflects the choice of whether a feature needs to be selected or not. The changes in particle velocity can be interpreted as a change in the probability of finding the particle in one state or another. The feature after updating is calculated by the function Sðnji ðt þ1ÞÞ, in which the velocity value is nji ðt þ 1Þ. If Sðnji ðt þ1ÞÞ is larger than a randomly produced disorder number that is within ½01, then its position value lj in k ¼ ½l1 , l2 , . . . , lv T is represented as 1 (meaning that feature is selected as a required feature for the next iteration); 0 otherwise. In order to make the features selection and the hyper-parameters jointly with the PSO algorithm, we will define the parameter h. For instance, in the case of C-SVM algorithm with RBF kernel, the parameter h will be defined as h ¼ ½b,C, kT . The algorithm is summarized as follows: Algorithm 1. HFSPSO for features selection and hyper-parameters optimization FBPSO. 8
Require: Define initial values fC min ,C max g, say f2
8
,2 g, and
fbmin , bmax g, say f24 ,24 g, V the number of particles, K the number of cross-validation, T denotes number of iterations. 1: Initialize m ð0Þl U½0; 1v , kl 9½ll , ll , . . . , ll T A f0; 1gv with 1
2
v
lli ¼ Berð0:5Þ, bl U½bmin , bmax , C l U½C min ,C max and hl ¼ ½hl1 , hl2 T with hl1 ¼ ½bl ,C l T and hl2 ¼ kl for the continuous and binary parts respectively and l¼1,y,V. 2: while t o T do 3: Determine the local and the global best particles (see Eq. (2)) 8 > pi ðtÞ9 arg min fFðh; X ,wa ,wf Þg, > > < h A fhi ðjÞgtj ¼ 0 pg ðtÞ9 arg min fFðh; X ,wa ,wf Þg: > > > : h A fhi ðjÞgtj ¼ 0 gVi ¼ 1
j
j
j
j
i
ði,2Þ
5: Set t ¼ t þ 1 and reiterate from step 2. 6: end while
Consider the following general cost function minimization problem. Let H be a constraint set and h A H is a vector containing all parameters to be tuned. Moreover, FðhÞ is a fitness function which determines how good a solution is (see Eq. (2) for example). Let us denote the wanted minimum of the function FðhÞ by g . The cost function minimization problem can be formulated by %
g ¼ minfFðhÞg:
ð13Þ
%
hAH
In order to apply the CEM to resolve this specific problem, first we have to transform it into a stochastic variant. For doing that, let us define h as a random vector taking values in some space H and pð; uÞ as a probability density function on H parameterized by arbitrary u A V. Suppose now that we are interested in the probability of occurrence of the event fFðhÞ r gg. Specifically, we are interested in the probability lðgÞ so that lðgÞ9Pu ðFðhÞ r gÞ ¼ Epð:;uÞ ½1fFðhÞ r gg :
ð14Þ
Minimizing FðhÞ in Eq. (13) is equivalent to sample h conditioned on fFðhÞ r gg for an increasing g-g . If we draw sample h from pð; uÞ, the event fFðhÞ r gg with g-g will be rarer and rarer especially when the prior parameter u is not well adapted. A direct estimator of lðgÞ is relying on a crude Monte-Carlo (CMC) formula given by %
%
ð15Þ
where fhi gi ¼ 1,...,V pð:; uÞ. When g-g and u is arbitrary, it is clear that lðgÞ-0 and the relative variance of blðgÞ, i.e. blðgÞ=lðgÞ2 ¼ 1=lðgÞ1. This relative variance increases when lðgÞ-0 so when g-g . An alternative to Crude Monte-Carlo (CMC) is based on importance sampling (IS) framework, when consists of drawing samples fhi gi ¼ 1,...,V from an importance sampling density g and the probability of occurrence of the event is estimated via the following unbiased estimator: Z pðh; uÞ pðh; uÞ lðgÞ ¼ Epð:;uÞ 1fFðhÞ r gg gðhÞ ¼ gðhÞ dh: 1fFðhÞ r gg gðhÞ gðhÞ hAH ð16Þ %
> > > njði,1Þ ðt þ 1Þ ¼ signðnbjði,1Þ ðt þ 1ÞÞminf9nbjði,1Þ ðt þ 1Þ9, njð1,maxÞ g, > > > > > : m ði,1Þ ¼ ½n1ði,1Þ , n2ði,1Þ T ,
nði,2Þ ðt þ1Þ ¼ signðnbði,2Þ ðt þ1ÞÞminf9nbði,2Þ ðt þ1Þ9, nð2,maxÞ g, > > > > > > m ði,2Þ ¼ ½n1ði,2Þ , n2ði,2Þ , . . . , nvði,2Þ T , > > > > > : yj ðt þ1Þ ¼ 1 j : ði,2Þ fr o Sðnj ðt þ 1ÞÞg
(1) Generating samples according to pð; vÞ and choosing the elite of these samples. (2) Use the elite samples to update the parameter v of the distribution family, in order to produce better samples in the next iteration.
V X blðgÞ ¼ 1 1 i , V i ¼ 1 fFðh Þ r gg
4: Update the particle’s locations and velocities 8 > > hði,1Þ ðt þ 1Þ ¼ hði,1Þ ðtÞ þ m ði,1Þ ðt þ 1Þ, > > > mbði,1Þ ðt þ 1Þ ¼ oðtÞmði,1Þ ðtÞ þc1 r1,i ðpi ðtÞhði,1Þ ðtÞÞm ði,1Þ ðtÞ > > > < þ c2 r 2,i ðpg ðtÞhði,1Þ ðtÞÞ,
8 hði,2Þ ðt þ1Þ ¼ hði,2Þ ðtÞ þ m ði,2Þ ðt þ1Þ, > > > > > mb ði,2Þ ðt þ1Þ ¼ m ði,2Þ ðtÞ þ c1 r1,i ðpi ðtÞhði,2Þ ðtÞÞm ði,2Þ ðtÞ > > > > > > < þ c2 r 2,i ðpg ðtÞhði,2Þ ðtÞÞ,
The Cross-Entropy (CE) Method is a population based optimization method, where members of the population are sampled from a parameterized probability distribution. In each generation, the parameters of the distribution are updated so that its cross-entropy distance from a desired distribution is minimized. This method was introduced by Rubinstein (see [36]) as an efficient method of adaptively estimating probabilities of rare events in complex stochastic networks and has been successfully applied to a great variety of research areas (see for example [47,29,3,5]). The main concepts of the CEM are described in Rubinstein and Kroese [38] and Kroese et al. [27]. For this reason, we briefly review this methodology and we will only present features of CEM that are relevant to the problem hereby studied. The CEM in its basic form, can be viewed as an iterative method that involves two major steps until convergence is reached:
j ¼ f1; 2g,
j ¼ 1, . . . ,v, :
%
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
In other words, we will replace the expectation of u by the expectation of g, i.e. pðh; uÞ : ð17Þ lðgÞ ¼ Eg 1fFðhÞ r gg gðhÞ Hence the unbiased estimator of lðgÞ can be rewritten as follows: V X ^lðgÞ ¼ 1 1 i Wðhi ; uÞ, V i ¼ 1 fFðh Þ r gg
ð18Þ
with Wðh; uÞ9pðh; uÞ=gðhÞ is the likelihood ratio (LR). Ideally, a judicious choice of g would significantly reduce the variance of blðgÞ, even this variance will be equal to zero, because in this case almost all the samples will meet this criterion fFðhi Þ r gg. The difficulty with the IS approach is obtaining gn, as this density depends on the desired unknown probability lðgÞ itself [36,37]: g n ðhÞ ¼
1fFðhÞ r gg pðh; uÞ lðgÞ
:
ð19Þ
The main idea of the CE method for rare event simulation is to approximate the IS pdf gn with pð; vÞ where v is judiciously chosen. A particularly convenient measure of distance between gn and pð; vÞ is the Kullback–Leibler (KL) distance, which is also termed the cross-entropy. The Kullback–Leibler distance between two densities g1 and g2 is defined as follows: Z g ðhÞ ¼ g 1 ðhÞln g 1 ðhÞ dh KLðg 1 ,g 2 Þ9Eg 1 ln 1 g 2 ðhÞ hAH Z g 1 ðhÞln g 2 ðhÞ dh : ð20Þ hAH |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} DðvÞ
If g 1 ðhÞ ¼ g ðhÞ and g 2 ðhÞ ¼ f ðh; vÞ, minimizing the (KL) distance is equivalent to maximizing the second term of Eq. (20). The optimal parameter v is obtained by maximizing: Z Dv ¼ arg maxfDðvÞg ¼ arg max g n ðhÞln f ðh; vÞ : ð21Þ n
%
%
v
v
hAH
Using Eq. (19), this last equation can be rewritten as Z 1fFðhÞ r gg pðh; uÞ ln pðh; vÞg v ¼ arg max v lðgÞ hAH Z ¼ argmax 1fFðhÞ r gg ln pðh; vÞpðh; uÞg %
v
hAH
¼ argmaxfEpð;uÞ ½1fFðhÞ r gg ln f ðh; vÞg:
ð22Þ
v
The value v can be estimated by solving the stochastic problem: ( V 1X b b v ¼ arg maxfDðvÞg ¼ arg max 1 i ln pðhi ; vÞg: ð23Þ v v V i ¼ 1 fFðh Þ r gg %
b is optimized by sampling fhi gi ¼ 1,...,V from pð; uÞ and In Eq. (21), v evaluating the event 1fFðhi Þ r gg . As g-g , this event becomes a b becomes quite rarer and rarer event and direct estimation of v untractable. The idea of the CEM method is to decompose this direct b0 , g b0 Þ, difficult task by finding the sequence of couples fðv b1 , g bT1 , g b T1 -v and g b1 Þ, . . . ,ðv bT1 Þg so that v b0 Z g b1 Z Z ðv gbT1 -g . At each iteration t of the CEM algorithm, random b t1 Þ, then the new value samples are drawn on the basis of pð; v b t is updated according to v bt1 and the best elite samples (in of v bt1 as the r-quantile of the score distribution). order to estimate g The updated formula is particulary simple if pð; vÞ belongs to Natural Exponential Function (NEF) (Gaussian, Truncated Gaussian, Bernoulli, etc.) (see examples in the next sections).
3681
More details about the general CEM algorithm are given in Kroese et al. [27]. We present a proto-typical version of the CEM: Algorithm 2. Proto-typical version of the CE Method. Require: Given r the fraction of elite samples; T denotes the b 0 A V. Set t¼ 1 (level counter). final iteration. Choose some v 1: repeat n o 2: Generate samples hl from the density pð; v b Þ and t1
bt1 of the sample scores compute the r-quantile g Fðhl Þ, l ¼ 1, . . . ,V. 3: Use the same samples to solve the stochastic program ( ) V P l 1 b byarg maxfDðvÞg ¼ arg max V 1fFðhl Þ r bg g ln pðh ; vÞ :ð24Þ v v t1 l¼1
bt . Denote the solution by v 4: set t ¼ t þ 1 and reiterate from step 2. 5: until Convergence is reached or t ¼ T b t1 to v b t , the Instead of updating the parameter vector v smoothed updating procedure is often used: b t1 , b t ¼ av~ t þ ð1aÞv v
ð25Þ
with 0 r a r1 and v~ t is the solution of Eq. (21). This smoothed adaptation of Eq. (25) is used to reduce the probability that some b t will be zero or one at the first few iterations. b t,i of v component v Doing this may lead to premature convergence of the algorithm. As a result, the algorithm could converge to a wrong solution and get stuck in a local optimum. 3.4. Cross-Entropy Method for feature selection Now, we present a Cross-Entropy Method for solving the feature selection optimization problem by viewing it as a continuous optimization problem. As pointed out above, to apply the Cross-Entropy Method to such a problem we have to specify the sampling distribution and the updating rules for its parameters. The choice of the sampling distribution is quite arbitrary. In the case of a continuous optimization problem, some authors suggest to generate samples using the Gaussian, double-exponential, or beta distributions, for the simplicity of their parameter updates (see [27]). A further simplification of the general CEM algorithm can be obtained for f021g combinatorial optimization problems, if we consider the solution vector k9½l1 , l2 , . . . , lv T A f0; 1gv to be i.i.d. Bernoulli random variables. For the feature selection case only, we have h ¼ k and the multivariate pdf parameterized by d ¼ ½d1 , . . . , dv T for sampling fkl g, l ¼ 1, . . . ,V is given by
kl Berðdbt Þ9
v Y
lt,i
dbt,i ð1dbt,i Þð1lt,i Þ , l ¼ 1, . . . ,V:
ð26Þ
i¼1
Partial derivative of the logarithm of the last equation versus one element of d is given by
%
lj 1lj lj dj @ ln pðk; dÞ ¼ ¼ : @dj dj 1dj ð1dj Þdj
ð27Þ
Using this result, we can calculate the maximum in Eq. (21) by setting the first derivative with respect to dj to zero:
%
%
V V X @ X 1 1fFðkl Þ r bg g ln pðkl ; dÞ ¼ 1 l g g ðllj dj Þ, @dj l ¼ 1 ð1dj Þdj l ¼ 1 fFðk Þ r b
ð28Þ
and finally we obtain the following expression for d~ j :
d~ j ¼
XV
1fFðkl Þ r bg g llj
PV l¼1
l¼1
1fFðkl Þ r bg g :
ð29Þ
3682
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
The update parameter formula is easy to compute which is nothing else that the mean of elite samples. The following algorithm block summarizes the procedure for feature selection.
Set t ¼1. 1: repeat 2: Generate samples 2
0
db0 ¼ ½12 , 12 , . . . , 12 T . Set t¼1. 1: repeat b Þ with success 2: Generate samples fkl g Berðd t1 b probability d for l¼1,y,V.
3:
Evaluate associated scores fSl ¼ Fðhl ; X ,wa ,wf Þg.
4:
Sort fSl g in ascending order and denote by I the set of corresponding indices. Let us denote I ðlÞ
3:
Evaluate scores fSl ¼ Fðkl ; X ,wa ,wf Þg via Eq. (2).
4:
Sort fSl g in ascending order and denote by I the set of
5: 6:
b I ðlÞ g the set of the corresponding indices. Let us denote { d t1 drVe best selected features, l ¼ 1, . . . ,drVe. 5: bt1 ¼ SI ðdrVeÞ . Update g b , where d b b t ¼ ad~ t þ ð1aÞd 6: Update pdf parameter: d t1 t1 is computed with Eq. (29). 7: Set t ¼ t þ 1 and reiterate from step 2. 8: until Convergence is reached or t ¼ T
7: 8:
l¼1
~ C Þ2 1fFðhl Þ r bg g ðC l m l¼1
1fFðhl Þ r bg g
C,t1
~ b,t1 , m ~ C,t1 , s~ b,t1 and s where m are computed using Eq. (30) set t ¼ t þ 1 and reiterate from step 2. until Convergence is reached or t ¼ T
In order to make the features selection and the hyper-parameters jointly with the CE algorithm, we will define the parameter h. For instance, in the case of C-SVM algorithm with RBF kernel, the parameter h will be defined as h ¼ ½b,C, kT . The algorithm is summarized in Algorithm 5: Algorithm 5. CEM for hyper-parameters optimization and feature selection HFSCEM. Require: Given r the fraction of elite samples, a the smoothing parameter, T denotes the final iteration. Define initial values fC min ,C max g, say f28 ,28 g and b , say fb , b g, say f24 ,24 g and d min
b b,0 ¼ > > :s b 2b,0 ¼
Require: Given r the fraction of elite samples, a the smoothing parameter, T denotes the final iteration. 8
fbmin , bmax g, say f24 ,24 g. T denotes the final iteration. 8 b þ bmax > b b,0 ¼ min > , 2 > :s b b,0 ¼ max 2 and 8 C þ C max > b C,0 ¼ min > , > b 2C,0 ¼ max :s , 2
2 b b,0 bmax m
,
2 and 8 C þ C max > b C,0 ¼ min > ,
Algorithm 4. CEM for hyper-parameters selection optimization HCEM.
,2 g and
0
max
db0 ¼ ½12 , 12 , . . . , 12 T . Set t ¼1. 8 bmin þ bmax >
:
The algorithm is summarized in Algorithm 4:
Define initial values fC min ,C max g, say f2
~ 2C,t1
3.6. Cross-Entropy Method for hyper-parameters optimization and feature selection
ð30Þ
8
ð31Þ
C,t1
2
fFðh Þ r g g
PV
IðlÞ
hyper-parameters, l ¼ 1, . . . ,drVe. bt1 ¼ SI ðdrVeÞ . Update g Update pdf parameters: 8 b b,t1 , ~ b,t1 þð1aÞm b b,t ¼ am m > > > > > b C,t1 , ~ C,t1 þ ð1aÞm b C,t ¼ am
C,t
In order to link the hyper-parameters selection with the CEM theory, we will define the parameter h. This parameter is sampled from the sampling distribution which belongs to the NEF family in order to simplify parameters update (similar to Eq. (29)). For instance, in the case of C-SVM algorithm with RBF kernel, the parameter h will be defined as h ¼ ½b,CT . That is, for each hl 9½bl ,C l T with l¼1,y,V. Each components bl are drawn from l b b,t , s b 2b,t , bmin , bmax Þ, where Truncated Gaussian pdf such b N tðm b b,t and s b 2b,t denote the mean and the variance at iteration t. The m hyper-parameter C is also drawn from Truncated Gaussian pdf b C,t , s b C,t and s b 2C,t ,C min ,C max Þ, where m b 2C,t denote the such C l N tðm mean and the variance at iteration t. Similar to Eq. (29), since truncated Gaussian pdf belongs to NEF family, parameter update equations for means and variance are given by 8 > 1fFðhl Þ r bg g bl 1fFðhl Þ r bg g C l XV XV > > > ~ ~ > m ¼ , m ¼ , P P C b > V V l¼1 l¼1 > < l l l¼11 l¼11 b b XV
I ðlÞ
b 2b,t ¼ as~ 2b,t1 þð1aÞs b 2b,t1 , s > > > > > :s b2 b 2 ¼ as~ 2 þ ð1aÞs ,
3.5. Cross-Entropy Method for hyper-parameters optimization
s~ 2C ¼
I ðlÞ
b b,t1 ,ðs b C,t1 ,ðs b b,t1 Þ2 , m b C,t1 Þ2 g the set of drVe best fm
t1
l > ~ b Þ2 > ðb m XV 1fFðhl Þ r b > gg > 2 > ~ s ¼ , > P > b V l¼1 : l ¼ 1 1fFðhl Þ r b gg
l
b C,t1 , s b C,t1 ,C min ,C max Þ, hl ¼ ½b ,C l T with fC l g N tðm l¼1,y,V.
Require: Given r the fraction of elite samples, a the smoothing b , say parameter, T the final iteration. Define initial value d
fFðh Þ r g g
2
l
b b,t1 , s b b,t1 , bmin , bmax Þ and fb g N tðm
Algorithm 3. CEM for feature selection optimization FCEM.
1: 2:
b C,0 C m > > b 2C,0 ¼ max :s : 2 repeat Generate samples l
ð32Þ
ð33Þ
2
b b,t1 , s b b,t1 , bmin , bmax Þ, fb g N tðm 2
b C,t1 , s b C,t1 ,C min ,C max Þ and fC l g N tðm b Þ. hl ¼ ½hl , hl T with hl ¼ ½bl ,C l T and fkl g Berðd t1 1 2 1
hl2 ¼ kl for the continuous and binary parts respectively with l ¼1,y,V. 3:
Evaluate associated scores fSl ¼ Fðhl ; X ,wa ,wf Þg.
4:
Sort fSl g in ascending order and denote by I the set of corresponding indices. Let us denote
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
ðlÞ I ðlÞ ðlÞ ðlÞ b Ig,t1 b C,t1 b Ig,t1 b IC,t1 fm ,ðs Þ2 , m ,ðs Þ2 g the set of the drVe
5: 6:
7: 8:
best hyper-parameters, l ¼ 1, . . . ,drVe. bt1 ¼ SI ðdrVeÞ . Update g Update pdf parameters: 8 ~ b,t1 þ ð1aÞm b b,t1 , b b,t ¼ am m > > > > >m ~ b C,t1 , b m ¼ a þð1 a Þ m > C,t C,t1 > > < 2 2 b b ~ s b,t ¼ as b,t1 þ ð1aÞs 2b,t1 > > > b 2C,t1 , b 2C,t ¼ as~ 2C,t1 þ ð1aÞs > s > > > > :d b t ¼ ad~ t þ ð1aÞd b , t1
ð34Þ
~ b,t1 , m ~ C,t1 , s~ 2b,t1 and s~ 2C,t1 are where m computed using Eqs. (30) and (29). set t ¼ t þ 1 and reiterate from step 2. until Convergence is reached or t ¼ T
3683
Table 2 Accuracy rates on real datasets. The format of the numbers in this table is ðRÞ 7 ðstdÞ, where R is the average and std is the standard deviation of the accuracy rate on the validation dataset. For each dataset, the best method is marked by the boldface. Datasets
CEMFS-1-NN
PSOFS-1-NN
1-NN
Iris Thyroid Liver Pima Glass Vowel WBC Wine Heart Image Segment Twonorm
(92.88)7 (4.16) (95.46) 7 (2.70) (59.02) 7 (3.29) (69.73) 7 (2.90) (74.92) 7 (4.92) (97.60) 7 (0.73) (70.25)7 (5.91) (96.79) 7 (2.52) (76.17)7 (6.32) (96.78) 7 (0.69) (95.12)7 (3.12) (94.84) 7 (0.27)
(92.66) 7 (3.48) (93.12) 7(4.37) (59.12) 7(3.92) (69.08) 7(3.76) (73.23) 7(7.91) (93.36) 7(5.19) (70) 7 (5.16) (94.71) 7(3.05) (76.17) 7(5.07) (94.83) 7(2.41) (97.22) 7 (0.57) (92.19) 7(1.33)
(92.88)7 (4.16) (95.46)7 (2.70) (58.54)7 (3.30) (69.73)7 (2.90) (71.84)7 (4.58) (97.60)7 (0.73) (67.84)7 (4.59) (96.79)7 (2.67) (77.40)7 (4.80) (96.78)7 (0.69) (95.78)7 (1.03) (94.84)7 (0.27)
4. Experiment results To measure performances of the developed algorithms, the following datasets from the UCI machine learning repository [31] were used. The dimensionality, the size and the number of the classes of each dataset is given in Table 1. Scaling was applied to prevent feature values in greater numeric ranges from dominating those in smaller numeric ranges, and to prevent numerical difficulties in the calculation. Experimental results obtained in this study demonstrate that scaling the feature value improves the classification accuracy of SVM. In general, the range of each feature value can be linearly scaled to the range [ 1, þ1]. In the first step, the data is divided into two parts (training and testing sets). We build our model (determine its parameters) on the training set and then test its performance on the test set (unseen data) to measure its performance (holding the parameters constant). The K-fold method presented by Stone [41] was employed in the experiments. In this study, the value of K is set to 10. Thus, the dataset was split into 10 parts, with each part of the data sharing the same proportion of each class of data. Nine data parts were applied in the training process (to build models’ hyper-parameters and feature selection), while the remaining one was utilized in the validation process (to evaluate the fitness value). Fig. 1 shows the architecture of the developed approach to parameter determination and feature selection for SVM. The parameters in the algorithms were set as follows:
PSO: V ¼ 5N, where N denote the number of train samples, c1 ¼ c2 ¼ 2 are learning rates, njmax ¼ 4, 8j ¼ 1, . . . ,v, o0 ¼ 1:2 and o1 ¼ 0. Table 1 List of real world datasets used for comparison between algorithms. Datasets indicated by y contain features with only noise, while indicated by n contain intrinsic within-class multi-modal structure. Name
# Features
# Patterns
# Classes
Iris Thyroid Liver Pima Glass Vowel Wisconsin Breast Cancer (WBC) Wine Heart Image Segment Twonorm
4 5 6 8 9 10 10 13 13 18 19 20
150 215 345 768 214 528 699 175 270 2086 2310 7400
3 2 2 2 6 11 2 3 2 2 7 2
For the CEM: V ¼ 5N where N denote the number of train samples, r ¼ 3:10e3, b0 ¼ 0:95, h ¼ 5:10e6, q¼10, E ¼ 5:10e9 and c¼ 10. It is found empirically that when a is between 0.6 and 0.9 it gives the best results. In our case we choose a ¼ 0:7.
4.1. Feature selection evaluation At first, we will study the effectiveness of the feature selection without hyper-parameters optimization. For doing that, we will use the k-nearest neighbor (k-NN) as an evaluator [10]. The k-NN method has been successfully applied in many research fields: statistical estimation, pattern recognition and artificial intelligence. The advantage of the k-NN is that it is simple nonparametric method and easy to implement. To assess the feature selection without hyper-parameters optimization, the nearest neighbor (1-NN) will be used as an evaluator. Neighbors are calculated using their Euclidean distance. In our experiments we used 12 datasets from UCI to further compare the results of the developed (CEMFS-1-NN, PSOFS-1-NN) algorithms and (1-NN). We considered the function defined by Eq. (2) as an objective fitness function that combines the error rate and the number of selected features. According to the fitness function of Eq. (2), the classification error rate and the number of selected features are the two criteria used to design a fitness function. Thus, a small classification error rate, a small number of features produce a low fitness value. The contribution of these two parameters to the fitness function is controlled by weight factors (wa and wf). These weight factors are necessary input parameters but generally they are unknown. So the user should appropriately define these settings according to his/her requirements. We define wa ¼ 0:8 and wf ¼0.2 for all experiments related to hyper-parameters optimization and feature selection evaluation. In this comparison, we can notice in general that all solutions provided by our CE algorithm present the highest classification rate (Table 2). The classification accuracy obtained by the proposed method was the highest seven out of the 12 studied datasets, and for two datasets (Heart, Twonorm) the results of the CE algorithm are the same as the 1-NN. In these two cases, we can notice that the number of selected features is the same as the original number of each dataset (see Table 3), which is not the same for the PSO algorithm. This algorithm leads to the selection of less features; however, this selection leads to less performances. This indicates that good results are also possible to obtain from models with few features, clearly revealing that certain features are redundant or insignificant in relation to
3684
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
Table 3 The number of the selected features for each database. The format of the numbers in this table is ðRÞ 7 ðstdÞ, where R is the average and std is the standard deviation of the selected features.
Table 5 The number of selected features for each database using Eq. (2) as an objective fitness function, with (wa ¼ 1 and wf ¼ 0). The format of the numbers in this table is (the average) 7(the standard deviation) of the selected features.
Datasets
CEMFS-1-NN
PSOFS-1-NN
Datasets
CEMSVM
PSOSVM
Iris Thyroid Liver Pima Glass Vowel WBC Wine Heart Image Segment Twonorm
(3.9000) 7(0.3162) (5) 7(0) (5.9000) 7(0.3162) (8) 7(0) (7.4000) 7(1.1738) (10) 7(0) (6.1000) 7(0.9944) (11.9000) 7(0.8756) (11)7(1.1547) (18)7(0) (19)7(0) (20) 7(0)
(3.5000) 7(0.7071) (4.2000) 7(0.9189) (5) 7(0.9428) (7.1000) 7(1.3703) (6.7000) 7(1.0593) (7.2000) 7(1.5492) (5.6000) 7(1.1738) (10.3000) 7(0.8233) (9.8000) 7(0.9189) (14.6000) 7(1.6465) (14.7000) 7(1.2517) (15.7000) 7(1.8288)
Iris Thyroid Liver Diabets Glass Vowel WBC Wine Heart Image Segment Twonorm
(4) 7 (0) (5) 7 (0) (5.90) 7 (0.31) (8) 7 (0) (8.20) 7 (0.91) (10) 7 (0) (7.10) 7 (1.44) (13)7 (0) (11.80)7 (1.13) (8.3) 7 (1.33) (9.4) 7 (1.64) (19.9)7 (0.31)
(4)7 (0) (57 )(0) (5.60) 7(0.84) (7)7 (0.66) (7.80) 7(0.9189) (9.60) 7(0.69) (7)7 (1.49) (11.50)7 (0.84) (10.10) 7(0.87) (9.6) 7(1.07) (9.4) 7(1.34) (16.6)7 (1.89)
Table 4 Accuracy rates on real datasets using Eq. (2) as an objective fitness function, with (wa ¼ 1 and wf ¼ 0). The format of the numbers in this table is (R) 7 (std), where R is the average and std is the standard deviation of the accuracy rate of the validation dataset. For each dataset, the best method is marked by the boldface.
Table 6 Accuracy rates on real datasets using Eq. (2) as an objective fitness function, with (wa ¼ 0:8 and wf ¼ 0.2). The format of the numbers in this table is ðRÞ 7 ðstdÞ, where R is the average and std is the standard deviation of the accuracy rate of the validation dataset. For each dataset, the best method is marked by the boldface.
Datasets
CEMSVM
PSOSVM
Grid Search
Datasets
CEMSVM
PSOSVM
Grid Search
Iris Thyroid Liver Diabetes Glass Vowel WBC Wine Heart Image Segment Twonorm
(95.11) 7 (2.73) (94.84)7 (2.85) (70.38) 7 (2.20) (76.00) 7 (2.67) (69.38) 7 (7.15) (98.18) 7 (0.91) (72.53) 7 (3.72) (98.11) 7 (1.98) (81.35) 7 (4.13) (97.72) 7 (0.52) (97.44) 7 (0.62) (97.70) 7 (0.21)
(93.33) 7(3.47) (94.68) 7(2.35) (65.82) 7(3.90) (70.78) 7(4.96) (67.23) 7(5.18) (97.67) 7(1.03) (68.35) 7(4.77) (97.73) 7(2.63) (75.67) 7(3.63) (97.45) (70.42) (97.08) ( 70.54) (95.93) (70.85)
(92.88)7 (3.74) (94.68)7 (2.57) (65.04)7(4.16) (64.69)7 (1.47) (62.30)7(3.98) (91.14)7 (2.05) (65.18)7 (1.81) (40.00)7 (0.79) (57.65)7 (1.54) (94.88) (7 0.66) (92.19) (7 0.81) (50.14) (7 0.04)
Iris Thyroid Liver Diabets Glass Vowel WBC Wine Heart Image Segment Twonorm
(97.55) 7 (3.54) (95.15) 7 (2.26) (72.23) 7 (3.52) (76.95) 7 (2.39) (69.23) (7 4.91) (98.58) 7 (0.75) (70.37) 7 (2.74) (96.60) 7 (1.48) (81.97) 7 (3.30) (96.83) 7 (0.46) (96.17) 7 (0.76) (97.81) 7 (0.33)
(93.77) (7 3.27) (95.15) (7 2.59) (68.05) (7 6.21) (72.43) (7 4.47) (69.23) 7(3.24) (98.18) (7 0.78) (70.00) (73.21) (95.47) (7 2.69) (78.39) (7 5.01) (96.17) (7 1.29) (95.65) (7 1.03) (96.29) (7 0.92)
(94.89) (95.15) (64.95) (64.95) (62.92) (90.77) (65.94) (39.62) (57.03) (94.72) (91.80) (50.03)
particular classification problems. The proposed method can serve as a pre-processing tool to help optimize the feature selection process, because it helps us to find small feature subsets with a high classification accuracy. 4.2. Comparison between global optimization methods and Grid Search The Grid Search algorithm is a common method for searching the best C and b. In the Grid algorithm, pairs of ðC, bÞ are tried and the one with the lowest fitness value is chosen [45]. The grid search method is simply an exhaustive search to determine the global optimum among those at each point on the grid of parameter values. Clearly, it is not very efficient, but is deterministic and reliable. For the purposes of comparison with the developed algorithms, we performed a standard grid search over the same range of parameters and with the same increments; i.e. C ¼ 2½4:2:4 and g ¼ 2½15:2:1 . Firstly, we studied the effectiveness of the hyperparameters selection without hyper-parameters optimization. For doing that, we considered the error rate as an objective fitness. Secondly, we considered the objective fitness function that combines the error rate and the number of selected features equation (2) with (wa ¼ 1 and wf ¼0) (Tables 4 and 5). In the following experiment, we used Eq. (2) as an objective fitness function, with (wa ¼ 0:8 and wf ¼0.2) in order to control that the accuracy value takes priority over the subset size, since
(73.15) (72.92) (74.47) (72.13) (75.15) (7 2.60) (72.01) (70.00) (7 1.40) (71.11) (7 0.76) (7 1.50)
the high accuracies are preferred when guiding the search process. The objective here is to maximize the accuracy and to minimize the number of selected features. The results obtained were compared with those of the grid search, as shown in Table 6 and 7. Results obtained using the developed (CEM þ SVM) and (PSO þ SVM) approaches were better than those of grid search in almost all cases. Thus, it is clear that the global optimization framework considered here is capable of searching more efficiently than the standard grid method on the objective function under a limited computational effort evaluations. We can also notice that the grid search fails in finding the best optimized hyper-parameters for three datasets (Wine, Heart, Twonorm). This can be explained by the chosen granularity of the grid search area. This granularity is not adapted for these databases. However, tuning this granularity and making it more fine will have consequences, such a time consumption. The Grid Search method is more demanding on computational cost of running than (CEM þ SVM) and (PSO þ SVM). The theoretical complexity of CE is an open problem under investigation. This complexity is partially dependent on the studied problem. It is very hard to characterize correctly the complexity of the proposed algorithm. First, there are two complexities involved: CEM complexity and the inductive algorithm complexity. For the feature selection algorithm associated to the 1-NN algorithm, the computational complexity of Algorithm 3 is in OðTN 2 nVÞ, where T is the total number of iterations needed before Algorithm 3 stops; N is the sample size of the training set and V is the number of generated
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
Table 7 The number of selected features for each database. The format of the numbers in this table is (the average)7 (the standard deviation) of the selected features. Datasets
CEMSVM
PSOSVM
Iris Thyroid Liver Diabets Glass Vowel WBC Wine Heart Image Segment Twonorm
(4) 7 (0) (5) 7 (0) (6) 7 (0) (8) 7 (0) (8) 7 (0) (10) 7 (0) (6.9) 7 (1.41) (13)7 (0) (12)7 (0.81) (18)7 (0) (19)7 (0) (20) 7 (0)
(4) 7 (0) (5) 7 (0) (5) 7 (0) (5.5) 7 (0.70) (7.1) 7 (1.10) (9.7) 7 (0.48) (6.2) 7 (2.25) (10.60) 7 (0.51) (10.80) 7 (1.13) (16.5)7 (0.97) (16.9)7 (0.73) (17.4)7 (0.84)
samples. The architecture of the CE method algorithm being inherently parallel, it requires to evaluate a cost function, consequently the CE method can be accelerated by parallelizing all these evaluations.
5. Conclusion In this paper we have considered the hyper-parameters and feature selection problem; we formulated it as an stochastic global optimization problem and applied the CE method to solve it. The suggested method was shown to be apt to deal well with such problems. It produced recognition rates outperforming results from other state-of-the-art methods. The main benefit of the proposed method is its insensitivity to the choice of its hyperparameters related to the CE algorithm itself, which is not the case of the other stochastic methods such as PSO. This research has one innovative aspect, which lies on the design of an hybrid structure between the continuous and discrete CE for feature selection and SVM hyper-parameters selection. We can admit that the stochastic methods, such as CE method, can be heavily time consuming for resolving such problems, especially for a large datasets. To overcome the long training time when dealing with large datasets, the CESVM can be implemented with a distributed parallel architecture which permits to solve such problems in a reasonable time burden. In future work we will concentrate on the investigation of the optimization of the entire kernel (not the hyper-parameters only). Another perspective is to solve the hyper-parameters optimization and feature selection with a multiobjective optimization framework [42,1]. References [1] J. Bekker, C. Aldrich, The cross-entropy method in multi-objective optimisation: an assessment, European Journal of Operational Research 211 (May (1)) (2011) 112–121. [2] A. Blum, P. Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97 (1997) 245–271. [3] P.D. Boer, D. Kroese, S. Mannor, R.Y. Rubinstein, A tutorial on the crossentropy method, Annals of Operations Research 134 (2005) 19–67. [4] R. Caruana, D. Freitag, Greedy attribute selection, in: International Conference on Machine Learning, 1994, pp. 28–36. [5] K. Chepuri, T.H. de Mello, Solving the vehicle routing problem with stochastic demands using the cross entropy method, Annals of Operations Research 134 (2005) 153–181. [6] L.-Y. Chuang, H.-W. Chang, C.-J. Tu, C.-H. Yang, Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry 32 (1) (2008) 29–38. [7] M. Clerc, J. Kennedy, The particle swarm-explosion, stability, and convergence in a multidimensional complex space, IEEE Transaction on Evolutionary Computation 6 (1) (2002) 58–73. [8] E.S. Correa, M.T. Steiner, A.A. Freitas, C. Carnieri, Using a genetic algorithm for solving a capacity p-median problem, Numerical Algorithms 35 (2004) 373–388.
3685
[9] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297, http://dx.doi.org/10.1023/A:1022627411411. [10] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13 (1) (1967) 21–27, http://dx.doi.org/10.1109/TIT.1967. 1053964. [11] K. Crammer, Y. Singer, On the learnability and design of output codes for multiclass problems, in: The Thirteenth Annual Conference on Computational Learning Theory, 2000, pp. 35–46. [12] G.V. Dijck, M.M.V. Hulle, M. Wevers, Genetic algorithm for feature subset selection with exploitation of feature correlations from continuous wavelet transform: a real-case application, in: A. Okatan (Ed.), International Conference on Computational Intelligence, International Computational Intelligence Society, 2004, pp. 34–38. [13] R. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in: Proceedings of the Sixth International Symposium on Micromachine and Human Science, 1995, pp. 39–43. [14] R.C. Eberhart, Y. Shi, J. Kennedy, Swarm Intelligence, The Morgan Kaufmann Series in Artificial Intelligence, Morgan Kaufmann, 2001. [15] A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer-Verlag, October 2002. [16] H. Frohlich, O. Chapelle, Feature selection for support vector machines by means of genetic algorithms, in: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, vol. 12, 2003, pp. 142–148. [17] M.C. Fu, F.W. Glover, J. April, Simulation optimization: a review, new developments, and applications, in: The 37th Winter Simulation Conference (WSC2005), 2005, pp. 83–95. [18] Y. Guermeur, A. Eliseeff, H. PaugamMoisy, A new multi-class SVM based on a uniform convergence result, in: The IEEE-INNS-ENNS International Joint Conference on Neural Networks IJCNN 2000, vol. IV, 2000, pp. 183–188. [19] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182. [20] B. Hammer, K. Gersmann, A note on the universal approximation capability of support vector machines, Neural Processing Letters 17 (1) (2003) 43–53. [21] S.S. Keerthi, C.-J. Lin, Asymptotic behaviors of support vector machines with Gaussian kernel, Neural Computation 15 (2003) 1667–1689. [22] J. Kennedy, R.C. Eberhart, A discrete binary version of the particle swarm algorithm, in: Proceedings of the 1997 Conference on Systems, Man, and Cybernetics, 1997, pp. 4104–4109. [23] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, in: Proceedings of AAAI-92, 1992, pp. 129–134. [24] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1997) 273–324. [25] D. Koller, M. Sahami, Toward optimal feature selection, in: ICML, 1996, pp. 284–292. [26] I. Kononenko, Estimating attributes: analysis and extensions of relief, in: European Conference on Machine Learning, 1994, pp. 171–182. [27] D. Kroese, S. Porotsky, R. Rubinstein, The cross-entropy method for continuous multi-extremal optimization, Methodology and Computing in Applied Probability 8 (2006) 383–407. [28] H.-T. Lin, C.-J. Lin, A Study on Sigmoid Kernels for SVM and the Training of Non-PSD Kernels by SMO-type Methods, Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003. [29] Z. Liu, A. Doucet, S.S. Singh, The cross-entropy method for blind multiuser detection, in: IEEE International Symposium on Information Theory, 2004. [30] R. Meiri, J. Zahavi, Using simulated annealing to optimize the feature selection problem in marketing applications, European Journal of Operational Research 171 (June (3)) (2006) 842–858. [31] D.J. Newman, S. Hettich, C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases. /http://www.ics.uci.edu/ mlearn/MLRepository.htmlS, 1998. [32] J. Novovicova´, P. Pudil, J. Kittler, Divergence based feature selection for multimodal class densities, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (February) (1996) 218–223. [33] K.E. Parsopoulos, M.N. Vrahatis, Recent approaches to global optimization problems through particle swarm optimization, Natural Computing 1 (June (2)) (2002) 235–306. [34] T. Peters, D.W. Bulger, T.H. Loi, J.Y.H. Yang, D. Ma, Two-step cross-entropy feature selection for microarrays—power through complementarity, IEEE/ ACM Transactions on Computational Biology and Bioinformatics 8 (July (4)) (2011) 1148–1151. [35] P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature-selection, Pattern Recognition Letters 15 (November (11)) (1994) 1119–1125. [36] R. Rubinstein, Optimization of computer simulation models with rare events, European Journal of Operational Research 99 (1997) 89–112. [37] R. Rubinstein, The cross-entropy method for combinatorial and continuous optimization, Methodology and Computing in Applied Probability 1 (1999) 127–190. [38] R.Y. Rubinstein, D.P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Method, Monte-Carlo Simulation, Randomized Optimization and Machine Learning, Springer Verlag, 2004. [39] Y. Shi, R.C. Eberhart, A modified particle swarm optimizer, in: Proceedings of the IEEE International Conference on Evolutionary Computation, 1998, IEEE Press, Piscataway, NJ, pp. 69–73.
3686
A. Boubezoul, S. Paris / Pattern Recognition 45 (2012) 3676–3686
[40] Y. Shi, R.C. Eberhart, shi99-cec.pdf, Empirical study of particle swarm optimization, in: Proceedings of 1999 Congress on Evolutionary Computation, July 6–9 1999, pp. 1945–1950. [41] M. Stone, Cross validation choice and assessment of statistical predictions, Journal of the Royal Statistical Society B 36 (1974) 111–147. ¨ [42] A. Unveren, A. Acan, Multi-objective optimization with cross entropy method: stochastic learning with clustered Pareto fronts, In: IEEE Congress on Evolutionary Computation, 2007, pp. 3065–3071. [43] V.N. Vapnik, Statistical Learning Theory, Wiley, 1998. [44] X. Wang, J. Yang, X. Teng, W. Xia, R. Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28 (4) (2007) 459–471. [45] C.W. Hsu, C.C. Chang, C.J. Lin, A practical Guide to Support Vector Classification, Technical Report, Department of Computer Science, National Taiwan University, 2010.
[46] J. Weston, C. Watkins, Multi-class Support Vector Machines, Technical Report, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 2000. [47] O. Wittner, B.E. Helvik, Cross entropy guided ant-like agents finding dependable primary/backup path patterns in networks, in: Congress on Evolutionary Computation (CEC2002), 2002. [48] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems 13 (1998) 44–49. [49] Z. Zhen, X. Zeng, H. Wang, L. Han, A global evaluation criterion for feature selection in text categorization using Kullback–Leibler divergence, in: 2011 International Conference of Soft Computing and Pattern Recognition (SoCPaR), 2011, pp. 440–445.
Abderrahmane Boubezoul was born in Algeria on July 28, 1976, received his Master in Re´alite´ Virtuelle et Maitrise des Syste mes Complexes from the Evry Val d’Essone University (France) in 2004. He has been at the Universite´ d’Aix-Marseille 3 where he is a Ph.D. Student in Computer Sciences and Mathematics. Since 2009, he has been at IFSTTAR, Paris-est University. His current work is about statistical signal processing and machine learning.
Se´bastien Paris was born in France on December 12, 1973, received his MSc in Telecommunication, Propagation and Teledetection from the Nice University in 1996. He received his PhD about frequency lines tracking in sonar system using HMM in 2000 from Toulon University. From May 2000 to July 2001, he worked on tracking and vehicle path’s planning problems with particle filter, PCRB and cross-entropy methods at INRIA/IRISA. From 2002 to 2005, he joined SOPRAGROUP. Since February 2005, he has been at the Aix-Marseille University where he teaches Computer Sciences and Telecommunications. His current work is about statistical signal processing and machine learning.