Neurocomputing
Feature selection and multiple kernel boosting framework based on PSO with mutation mechanism for hyperspectral classification

Chengming Qi a,b, Zhangbing Zhou a,c,*, Yunchuan Sun d, Houbing Song e, Lishuan Hu a,b, Qun Wang a

a School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, China
b College of Automation, Beijing Union University, Beijing 100101, China
c Computer Science Department, TELECOM SudParis, Evry 91001, France
d Business School, Beijing Normal University, Beijing 100875, China
e Security and Optimization for Networked Globe Laboratory, West Virginia University, Montgomery, WV 25136-2437, USA
Article info
Article history: Received 30 January 2016; Received in revised form 28 March 2016; Accepted 9 May 2016
Keywords: Ensemble learning; Feature selection; Hyperspectral remote sensing image; Multiple kernel boosting

Abstract
Hyperspectral remote sensing sensors can capture hundreds of contiguous spectral images and provide plenty of valuable information. Feature selection and classification play a key role in the field of HyperSpectral Image (HSI) analysis. This paper addresses the problem of HSI classification from the following three aspects. First, we present a novel criterion based on standard deviation, Kullback–Leibler distance, and correlation coefficient for feature selection. Second, we optimize the SVM classifier design by searching for the most appropriate values of its parameters using particle swarm optimization (PSO) with a mutation mechanism. Finally, we propose an ensemble learning framework, which applies the boosting technique to learn multiple kernel classifiers for classification problems. Experiments are conducted on benchmark HSI classification data sets. The evaluation results show that the proposed approach can achieve better accuracy and efficiency than state-of-the-art methods.
© 2016 Elsevier B.V. All rights reserved.
* Corresponding author at: School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, China.
E-mail addresses: [email protected] (C. Qi), [email protected] (Z. Zhou), [email protected] (H. Song), [email protected] (L. Hu), [email protected] (Q. Wang).

1. Introduction

HyperSpectral Image (HSI) analysis has been an emerging research topic in recent years; hyperspectral data provide continuous coverage of the solar reflective wavelengths at a high spectral resolution. Hyperspectral sensors divide the electromagnetic spectrum into hundreds of spectral bands, which provide the potential for detailed land-cover distinction and identification [1]. Classification of HSI consists of six sequential steps: pre-processing, feature extraction, feature selection, segmentation, classification, and post-processing. Several hundred spectral bands lead to theoretical and practical problems [2,3]. Most applications [4,5] and classification algorithms encounter the "Hughes phenomenon" [6]. Therefore, feature extraction or feature selection techniques are of core importance for HSI processing [7]. Feature selection is to select a subset of bands from the data cube. The selected subset should consist of the most informative and least correlated bands. Since no transformation is involved, band selection results are easier to interpret with traditional image-processing methods. Several feature selection techniques have been presented in the literature for supporting the analysis of HSI [8–14] and bioinformatics data [15]. In [8], Patra et al. proposed a rough-set-based supervised method to select informative bands from HSI. In [9], leveraging the covariance matrix, Yang et al. presented a fast supervised band selection method for HSI classification. Martinez-Uso et al. [10] grouped similar bands into clusters by a clustering technique and selected the most informative bands by applying either a mutual information criterion or a Kullback–Leibler (KL) divergence criterion. In [11], Guo et al. proposed a GA-based feature selection method and optimized the parameters of a linear support higher-order tensor machine. In [12], Shen et al. proposed a discriminative Gabor method for feature selection. In [13], Das et al. adopted partitioned band image correlation to eliminate bands of HSI. In [14], Wang et al. proposed a band selection method based on column subset selection for HSI. A distance measure or a mutual information measure is adopted as the criterion for selecting bands which show greater agreement with the ground truth [16,17]. In [18], Chavez et al. proposed the Optimum Index Factor (OIF), which can be calculated to obtain multivariate statistical information on a data set. In [19], Patel et al. employed features selected through OIF from both the individual years' and stacked images to classify satellite images. Inspired
by the OIF for feature selection, we propose a novel feature selection scheme which uses standard deviation, KL divergence, and correlation coefficients to select the most informative and the least correlated bands for classification. Recently, several criteria have been proposed to serve as measures of the similarity between distributions, such as the KL divergence, the KL distance, the Bhattacharyya measure, the Chernoff measure, the (h, φ)-divergence [20,21] and the semantic-based structural similarity [22]. Among them, as a foundation of information theory, statistics, and machine learning, the KL distance is a popular distribution separability measure and has been widely applied. In [23], a region-based classifier, rather than an individual-pixel classifier, is proposed for SAR images. In this algorithm, each region is assigned to the class that minimizes a criterion based on the KL distance of the gamma distribution for SAR images. In [24], Zeng et al. developed a statistical method that employs KL divergence to detect anomalous system behavior. In [25], Ferracuti et al. applied the KL divergence as an index for the automatic identification of electric motor defects. Support vector machine (SVM), which depends on the principle of structural risk minimization [26], has a promising generalization performance when applied to HSI classification [27]. The standard SVM only utilizes a single kernel function with fixed parameters, which necessitates model selection for satisfactory classification performance. SVM has been widely used to solve machine learning problems in the past several decades [28,29]. Single kernel learning usually needs to choose proper kernel parameters, while multiple kernel learning (MKL) is usually required to search for a linear/nonlinear combination of predefined base kernels by margin maximization. Generally, MKL provides more flexibility in modeling similarities of data sources than single kernel learning. Rakotomamonjy et al. [30] proposed SimpleMKL, where the kernel weights are obtained by a reduced gradient descent method. Furthermore, semi-infinite linear programming (SILP) [31], sparse MKL [32], and SpicyMKL [33] were proposed to solve the MKL problem. Recently, Cortes et al. [34,35] and Wang et al. [36] proposed two-stage procedures to address the MKL problem. The first stage is to find the optimal weights to combine the kernels, which makes use of the information from the complete training data and can be computed efficiently. The second stage trains a standard SVM by means of the combined kernel. Pastor López-Monroy et al. [37] proposed discriminative visual n-grams and MKL strategies to improve the Bag-of-Visual-Words model. Recently, ensemble methods have been proposed for MKL. Ensemble methods consider the result of the misclassified data in the training phase and collect several classifiers to classify test examples. Several algorithms use ensemble learning to solve MKL. Xia et al. [38] proposed a framework that adopts boosting to solve the MKL problem. Since the support vector coefficients cannot be obtained, Sun et al. [39] used a selective MKL method to approximate support vectors. Cai et al. [40] proposed a computational framework for constructing an influenza antigenic cartography from an incomplete matrix. Gu et al. [41] employed a boosting strategy for screening the limited training samples under the MKL framework. Ayerdi et al.
[42] used an ensemble of extreme learning machine classifiers for HSI classification and segmentation. Zhang et al. [43] showed that ensemble methods combining spectral and spatial information outperform traditional single kernel approaches for HSI classification. However, these methods have to solve a complicated optimization task when learning classifiers using boosting methods. In addition, some approaches adopt the boosting technique with SVM to improve kernel methods, such as BoostSVM [44] and AdaBoost with SVM [45–47]. The representative boosting algorithm is
the AdaBoost algorithm [48]. Various simulation results for hyperspectral remote sensing data show that an SVM ensemble with bagging or boosting significantly outperforms a single SVM in terms of classification accuracy [49]. On the other hand, these methods can hardly deal with multiple kernels that originate from multiple sources. In 1995, Eberhart and Kennedy [50] proposed the particle swarm optimization (PSO) algorithm. PSO is a swarm intelligence optimization method, which can find solutions quickly in a high-dimensional space owing to its stochastic and multi-point searching ability. PSO has been adapted to select bands in HSI and to optimize the penalty parameter C and the kernel parameter γ for SVM, which leads to improved classification performance. For example, in [51], Melgani et al. used PSO to enhance the classification performance of SVM in electrocardiogram signal classification. In [52], Monteiro et al. proposed the use of particle swarms to perform feature extraction from hyperspectral data. However, PSO suffers from the shortcoming of premature convergence. To address this issue, extensive studies have been conducted in HSI analysis. For example, in [53], Zhang et al. used adaptive chaotic PSO to find the optimal parameters of a forward neural network. Couceiro et al. [54] proposed fractional-order Darwinian PSO (FODPSO). Ghamisi et al. [55–57] applied FODPSO to hyperspectral data. Following a similar strategy, we employ a mutation mechanism to prevent particles from converging to a local optimum and losing diversity. To improve the efficiency of the multiple kernel boosting framework for classification, in this paper, we propose a strategy for feature selection and employ PSO to optimize the SVM classifier parameters; the resulting method is named OIMKB. In comparison with the other data analysis approaches applied to HSI, this approach has three specific contributions summarized as follows: (i) a new feature selection scheme has been introduced, which uses standard deviation, KL distance, and correlation coefficients for the selection of the most informative and the least correlated bands for supporting the classification, (ii) the SVM classifier design has been optimized through searching for the most appropriate values of the parameters using PSO with a mutation mechanism, and a boosting framework is constructed for improving MKL, and (iii) extensive experiments have been conducted on hyperspectral images for validating the applicability and performance of our approach by comparing with various state-of-the-art kernel-based algorithms, and for evaluating various settings of the parameters of OIMKB to provide a tradeoff between accuracy and efficiency. The remainder of this paper is organized as follows. In Section 2, we review SVM and MKL. In Section 3, we formulate the proposed OIMKB framework. We present the experimental results and evaluate various parameters of OIMKB in Section 4, and finally, conclude this work in Section 5.
2. Preliminaries: SVM and MKL

In this section, we briefly review SVM and MKL, which are widely used for supporting HSI classification.

2.1. SVM

SVM, introduced by Vapnik [26], is one of the most successful kernel methods. The standard SVM uses a hypothesis space of linear functions in a high-dimensional feature space by means of kernel theory. Since it performs well with a small training data set, it has been an appropriate candidate for remote sensing data classification during the last decade [58,59]. Nowadays, SVM has been regarded as a promising method for hyperspectral remote sensing data processing and image classification. SVM is a discriminative classifier based on a single kernel.
Given a sample of independent and identically distributed training instances $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1, +1\}$ is its class label, SVM finds the linear discriminant with the maximum margin in the feature space induced by the mapping function $\Phi(\cdot)$. The discriminant function is defined as follows:

$f(x) = \langle w, \Phi(x) \rangle + b$   (1)

whose parameters can be learned by solving the following quadratic optimization problem:

$\min \ \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i$
$\text{w.r.t.} \ w \in \mathbb{R}^S,\ \xi \in \mathbb{R}_+^N,\ b \in \mathbb{R}$
$\text{s.t.} \ y_i(\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i \quad \forall i$   (2)

where $w$ is the vector of weight coefficients, $S$ is the dimensionality of the feature space obtained by $\Phi(\cdot)$, $C$ is a predefined positive trade-off parameter between model simplicity and classification error, $\xi$ is the vector of slack variables, and $b$ is the bias term of the separating hyperplane. Instead of solving this optimization problem directly, the Lagrangian dual function enables us to obtain the following dual formulation:

$\max \ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
$\text{w.r.t.} \ \alpha \in [0, C]^N$
$\text{s.t.} \ \sum_{i=1}^{N} \alpha_i y_i = 0$

where $\alpha$ is the vector of dual variables corresponding to each separation constraint and the kernel matrix obtained from $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ is positive semidefinite. Thus, setting $w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$, the discriminant function can be written as follows:

$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b$   (3)

Generally, a cross-validation procedure is applied to choose the most appropriate kernel function $k(\cdot,\cdot)$ and its parameters (e.g., the polynomial degree $q$ or the Gaussian width $\sigma$) among a set of kernel functions on a separate validation set, which is different from the training set.

2.2. MKL

Instead of having a single kernel $k$, MKL has a set of $n$ base kernels $k_1, \ldots, k_n$, with the corresponding feature maps $\Phi_1, \ldots, \Phi_n$. After explicitly modeling the weights $(\mu_1, \ldots, \mu_n)^T$ of the given kernels through a variational argument, an MKL formulation was developed in [60]:

$\min_{\mu, w, b, \xi} \ \tfrac{1}{2} \bigl( \sum_{k=1}^{n} \mu_k \|w_k\| \bigr)^2 + C \sum_{i=1}^{l} \xi_i$
$\text{s.t.} \ y_i \bigl( \sum_{k=1}^{n} \mu_k w_k^T \Phi_k(x_i) + b \bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l$
$\phantom{\text{s.t.}} \ \sum_{k=1}^{n} \mu_k = 1, \quad \mu_k \ge 0, \quad k = 1, \ldots, n$   (4)

Solving the MKL problem as presented in Eq. (4) is more challenging than solving the standard SVM problem as presented in Eq. (1). Several techniques have been proposed to solve the MKL optimization problem, and a comparative evaluation of these methods can be found in the literature [61]. The weight matrix $w$ and the bias $b$ can be determined according to the learned $\alpha = (\alpha_1, \ldots, \alpha_l)^T$. Finally, the resultant decision function can be written as follows:

$f(x) = \operatorname{sgn} \bigl( \sum_{i=1}^{l} \alpha_i y_i \sum_{k=1}^{n} \mu_k K_k(x_i, x) + b \bigr) = \operatorname{sgn} \bigl( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \bigr)$   (5)

where the resultant $K$ is a convex combination of the base kernels $K_1, \ldots, K_n$:

$K = \sum_{k=1}^{n} \mu_k K_k$

In real applications, each base kernel $K_k$ may either use the full set of variables describing $x$, or subsets of variables stemming from different data sources. Alternatively, base kernels can simply be classical kernels (such as Gaussian kernels) with different parameters. Within this framework, the problem of data representation and fusion through the kernel is then transferred to the choice of the weights $\mu$.
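To make the kernel combination in Eqs. (4) and (5) concrete, the following minimal sketch (our illustration, not the authors' implementation) builds a few Gaussian base kernels, combines them with fixed convex weights μ, and trains a standard SVM on the resulting precomputed kernel matrix; the toy data, the bandwidths, and the weights are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
# Toy two-class data standing in for pixel spectra (assumption).
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Base kernels K_1..K_n: Gaussian kernels with different bandwidths.
gammas = [2.0 ** p for p in (-2, 0, 2)]
base_kernels = [rbf_kernel(X, X, gamma=g) for g in gammas]

# Fixed convex combination weights mu (sum to 1), cf. K = sum_k mu_k K_k.
mu = np.array([0.2, 0.5, 0.3])
K = sum(m * Kk for m, Kk in zip(mu, base_kernels))

# A standard SVM trained on the combined (precomputed) kernel matrix.
clf = SVC(C=10.0, kernel="precomputed").fit(K, y)

# Decision values sgn(sum_i alpha_i y_i K(x_i, x) + b) for new samples, cf. Eq. (5).
X_new = rng.normal(size=(5, 10))
K_new = sum(m * rbf_kernel(X_new, X, gamma=g) for m, g in zip(mu, gammas))
print(np.sign(clf.decision_function(K_new)))
```

In a full MKL solver the weights μ would themselves be optimized (e.g., by SimpleMKL), whereas here they are fixed for illustration.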
3. Multiple kernel boosting framework

In this section, we address the problem of HSI classification in the following three steps. First, we present a novel criterion based on standard deviation, KL divergence, and correlation coefficient for feature selection. Second, we optimize the SVM classifier design by searching for the most appropriate values of its parameters. Finally, we use the AdaBoost algorithm [62] and propose an ensemble learning framework, which applies the boosting technique to learn multiple kernel classifiers for solving the classification problem.

3.1. Feature selection optimum index

HSI data can be represented as an $N \times M$ matrix $X$, where $N$ is the number of pixels in a single band and $M$ is the number of bands. Suppose that we choose $k$ bands from $X$. The feature selection task is then to select $k$ columns that represent the matrix $X$ efficiently. The OIF is an unsupervised method which can select the most informative features. The OIF value is determined with respect to the variance and the correlation among the different bands, and it benefits the selection of a suitable three-band combination. A large value of OIF indicates the optimum combination of bands, namely the one with the largest amount of standard deviation and the least amount of correlation among band pairs. The OIF is determined by the following formula:

$\mathrm{OIF} = \max \left( \frac{\sum_{i=1}^{3} \sigma(i)}{\sum_{j=1}^{3} |r(j)|} \right)$   (6)

where $\sigma(i)$ is the standard deviation of the $i$-th band, and $r(j)$ is the correlation coefficient between any two bands of the combination. Inspired by the OIF for feature selection, in this paper, we propose a novel feature selection optimum index, named OI, which incorporates the KL divergence. The KL divergence is a popular distribution separability measure applied in many research domains. We recall the KL entropy between two discrete probability distributions. Generally, the KL divergence is defined as follows:

$D_{KL}(P \parallel Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$   (7)
where $p_i$ and $q_i$ denote the probability densities of $P$ and $Q$ at feature $i$. $D_{KL}$ has the following properties: (1) $D_{KL}(p \parallel q) \ge 0$; (2) $D_{KL}(p \parallel q) = 0$ if $p = q$. The KL divergence is not a true distance metric, since $D_{KL}$ is non-symmetric and does not satisfy the triangle inequality. In this paper, we apply the following symmetric KL divergence [63] to evaluate features:
$D_{SKL}(P \parallel Q) = D_{KL}(P \parallel Q) + D_{KL}(Q \parallel P)$   (8)
The symmetric KL divergence, also called the KL distance, is nonnegative. This quantity has been used for feature selection in classification problems. To increase the amount of information in the selected features, we propose to employ standard deviation, KL distance, and correlation coefficient to define the feature selection optimum index as follows:
$\mathrm{OI} = \max \left( \frac{\sum_{i=1}^{3} \bigl( \alpha \cdot \sigma(i) + (1-\alpha) \cdot D_{SKL}(p \parallel q) \bigr)}{\sum_{j=1}^{3} |r(j)|} \right)$   (9)

where $\sum_{i=1}^{3} D_{SKL}(p \parallel q)$ is the sum of the pairwise KL distances in a three-band combination, and $\alpha$ is a factor which weighs between standard deviation and KL distance. The larger $\alpha$ is, the more important the standard deviation is. When $\alpha = 0.5$, standard deviation and KL distance are assumed to be equally important. The OI can be employed to select the band combination with the largest amount of standard deviation and KL distance (maximum information) and the least amount of correlation among band pairs for classification. To calculate the OI, we create a map list that contains the multispectral bands, calculate a correlation matrix for the maps, calculate the standard deviation and KL distance of each three-band combination, and rank the OI values. Finally, the combination with the largest OI is selected for band composition.
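As an illustration of how Eq. (9) might be evaluated, the sketch below scores every three-band combination of a toy data cube by combining per-band standard deviation, pairwise symmetric KL distance (estimated from band histograms), and pairwise correlation; the histogram binning, the value of α, and the random data are assumptions, not part of the authors' implementation.

```python
import numpy as np
from itertools import combinations

def sym_kl(p, q, bins=64, eps=1e-12):
    """Symmetric KL divergence (Eq. (8)) between two bands, via histograms."""
    lo, hi = min(p.min(), q.min()), max(p.max(), q.max())
    P, _ = np.histogram(p, bins=bins, range=(lo, hi))
    Q, _ = np.histogram(q, bins=bins, range=(lo, hi))
    P = P / P.sum() + eps
    Q = Q / Q.sum() + eps
    return np.sum(P * np.log2(P / Q)) + np.sum(Q * np.log2(Q / P))

def oi_score(X, bands, alpha=0.25):
    """OI of one three-band combination (Eq. (9)); X is (n_pixels, n_bands)."""
    sub = X[:, list(bands)]
    sigma = sub.std(axis=0)                       # per-band standard deviation
    pairs = list(combinations(range(3), 2))
    dskl = sum(sym_kl(sub[:, i], sub[:, j]) for i, j in pairs)
    corr = sum(abs(np.corrcoef(sub[:, i], sub[:, j])[0, 1]) for i, j in pairs)
    return (alpha * sigma.sum() + (1.0 - alpha) * dskl) / max(corr, 1e-12)

# Toy cube: 1000 pixels x 20 bands (assumption); rank all three-band combinations.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20)) * np.linspace(0.5, 2.0, 20)  # unequal band variances
ranked = sorted(((oi_score(X, c), c) for c in combinations(range(20), 3)), reverse=True)
print("highest-OI three-band combination:", ranked[0][1])
```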
3.2. PSO

PSO is a biologically inspired technique derived from the collective behavior of a flock of birds. By following the current optimal particle, all particles search the solution space until an optimal solution is found. Suppose that the search space is $N$-dimensional and the number of particles is $n$; the $i$-th particle of the swarm is represented by the $N$-dimensional vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$. The best previous position of the $i$-th particle is recorded and represented as $p_i = (p_{i1}, p_{i2}, \ldots, p_{iN})$, which gives the best fitness value, called pbest. The particle with the lowest function value is denoted as gbest or $P_g$. The position change (velocity) of the $i$-th particle is $V_i = (v_{i1}, v_{i2}, \ldots, v_{iN})$. The particles are manipulated according to the following equations (the superscripts denote the iteration number):

$v_{id}^{k+1} = w \times v_{id}^{k} + c_1 \times \mathrm{rand}_1() \times (p_{id}^{k} - x_{id}^{k}) + c_2 \times \mathrm{rand}_2() \times (p_{gd}^{k} - x_{id}^{k})$   (10)

$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}$   (11)

where $1 \le i \le n$, $1 \le d \le N$, $w$ is the inertia weight, $c_1$ and $c_2$ are two positive constants, called the cognitive and social parameters respectively, and $\mathrm{rand}_1()$ and $\mathrm{rand}_2()$ are two random numbers uniformly distributed within the range [0, 1]. Some variants of PSO impose a maximum allowed velocity $V_{max}$ to prevent the swarm from exploding (i.e., if $v_{id}^{k+1} > V_{max}$, then $v_{id}^{k+1} = V_{max}$) [64]. Eq. (10) is used to calculate the $i$-th particle's velocity at each iteration: $w \times v_{id}^{k}$ is the previous speed of the $i$-th particle, $c_1 \times \mathrm{rand}_1()(p_{id}^{k} - x_{id}^{k})$ is the distance between the $i$-th particle and its personal best position, and $c_2 \times \mathrm{rand}_2()(p_{gd}^{k} - x_{id}^{k})$ is the distance between the $i$-th particle and the global best position. The parameters $c_1$, $c_2$, $\mathrm{rand}_1()$ and $\mathrm{rand}_2()$ provide randomness that makes the technique less predictable yet more flexible [65]. Eq. (11) provides the new position of the $i$-th particle by adding its new velocity to its current position. The inertia weight $w$ is employed to control the impact of the previous history of velocities on the current velocity. In this way, the parameter $w$ regulates the trade-off between the global and local exploration abilities of the swarm and influences the PSO convergence behavior. A small inertia weight facilitates local exploration, while a large one tends to facilitate global exploration. Parameter selection of the kernel function is a critical factor for SVM. PSO is used to search for the punishment factor $C$ and the parameter $\gamma$ of the kernel function (such as the radial basis function). In this paper, we use the grid search method to determine a limited range of the parameters in order to reduce the search time. Furthermore, in order to avoid PSO being trapped in a local optimum, a mutation mechanism, which can increase the randomness of individuals [66], is adopted in the PSO model. Specifically, at each particle update, the mutation is triggered when the fitness value of particle $X_i$ is equal to the global optimum, i.e., $P_i = P_g$.
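Under stated assumptions, the sketch below illustrates such a PSO loop with a simple mutation step for tuning the RBF-SVM parameters (C, γ): the fitness is 3-fold cross-validated accuracy, and a particle whose personal best coincides with the global best may be re-initialized, in the spirit of the mutation mechanism described above. The swarm size, search bounds, and data set are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def fitness(pos):
    """Cross-validated accuracy of an RBF-SVM parameterized by (log2 C, log2 gamma)."""
    C, gamma = 2.0 ** pos[0], 2.0 ** pos[1]
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

rng = np.random.default_rng(0)
n_particles, n_iter, w, c1, c2, mut_rate = 20, 15, 0.8, 1.6, 1.6, 0.02
lo, hi = np.array([-5.0, -6.0]), np.array([14.0, 6.0])   # assumed search box

pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
g = int(pbest_fit.argmax())
gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq. (10)
    pos = np.clip(pos + vel, lo, hi)                                   # Eq. (11)
    for i in range(n_particles):
        # Mutation: occasionally re-randomize a particle stuck on the global best.
        if np.allclose(pbest[i], gbest) and rng.random() < mut_rate:
            pos[i] = rng.uniform(lo, hi)
        f = fitness(pos[i])
        if f > pbest_fit[i]:
            pbest[i], pbest_fit[i] = pos[i].copy(), f
            if f > gbest_fit:
                gbest, gbest_fit = pos[i].copy(), f

print("best C = %.3g, gamma = %.3g, CV accuracy = %.3f"
      % (2.0 ** gbest[0], 2.0 ** gbest[1], gbest_fit))
```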
3.3. Multiple kernel boosting framework

In this section we present a multiple kernel boosting framework based on SVM, whose kernel parameters are optimized through PSO. Following the procedure of the popular and successful boosting algorithm, i.e., Adaptive Boosting (AdaBoost) [48], we formulate OIMKB by applying the boosting technique to learn a classifier using multiple kernels. In particular, our algorithm maintains a probability distribution $D_t$ over the training examples. At each boosting trial $t$ ($t = 1, \ldots, T$), where $T$ denotes the total number of boosting trials, we learn kernel classifiers with multiple kernels $f_t^j(x)$ iteratively. The misclassification rate $\epsilon_t^j$ of each kernel classifier on the training examples is computed and used to adjust the probability distribution on the training examples:

$\epsilon_t^j = \epsilon\bigl(f_t^j(x)\bigr) = \sum_{i=1}^{N} D_t(i)\,\bigl(f_t^j(x_i) \ne y_i\bigr)$   (12)

We learn the SVM classifiers $f_t^j(x)$, whose kernel parameters have been optimized by PSO with the mutation mechanism, from these training data. For the $t$-th boosting trial, we build the classifier $f_t(x)$ by choosing the best classifier with the smallest error rate, i.e.,

$f_t(x) = \arg\min_{f_t^j(x),\ j \in \{1, \ldots, M\}} \epsilon\bigl(f_t^j(x)\bigr)$   (13)

The misclassification rate $\epsilon_t$ of the combined classifier $f_t(x)$ over the distribution $D_t$ on the training data is computed as shown in Eq. (14):

$\epsilon_t = \sum_{i=1}^{N} D_t(i)\,\bigl(f_t(x_i) \ne y_i\bigr)$   (14)

The next step of each boosting trial is to update the weight $D_{t+1}(i)$ of each training example, following a procedure similar to AdaBoost:

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & \text{if } f_t(x_i) = y_i, \\ 1 & \text{otherwise} \end{cases}$   (15)
where $\beta_t = \epsilon_t/(1 - \epsilon_t)$ and $Z_t$ is a normalization factor that makes $D_{t+1}$ a distribution. After all $T$ boosting trials are finished, the final classifier $f(x)$ is constructed by a weighted vote of the individual classifiers as follows:
$f(x) = \operatorname{sign} \bigl( \sum_{t=1}^{T} \alpha_t f_t(x) \bigr)$   (16)
The OIMKB approach is shown in Algorithm 1.

Algorithm 1. OIMKB.
Input:
  training data: $D = (x_1, y_1), \ldots, (x_N, y_N)$;
  kernel functions: $\kappa_j(\cdot,\cdot): X \times X \to \mathbb{R}$, $j = 1, \ldots, M$;
  initial weight distribution $D_1(i) = 1/N$ for all $i$;
  integer $T$ specifying the number of iterations;
  initial PSO parameters: inertia weight $\omega$, constants $c_1, c_2$, mutation rate $\rho$;
Output: the final hypothesis $f(x) = \operatorname{sign}\bigl(\sum_{t=1}^{T} \alpha_t f_t(x)\bigr)$
1: for $t = 1, \ldots, T$ do
2:   select a subset of $S_N$ bands according to Eq. (9)
3:   sample $S_n = n$ examples with distribution $D_t$
4:   for $j = 1, \ldots, M$ do
5:     optimize the SVM parameters by PSO
6:     train a weak classifier with kernel $\kappa_j$
7:     get back a hypothesis $f_t^j = \mathrm{SVM}(D, D_t)$
8:     calculate the training error of $f_t^j$ over $D_t$:
9:       $\epsilon_t^j = \sum_{i=1}^{N} D_t(i)\,(f_t^j(x_i) \ne y_i)$
10:  end for
11:  choose the best classifier with the smallest error rate:
       $f_t(x) = \arg\min_{f_t^j(x),\ j \in \{1,\ldots,M\}} \epsilon(f_t^j(x))$
12:  compute the training error over $D_t$:
       $\epsilon_t = \sum_{i=1}^{N} D_t(i)\,(f_t(x_i) \ne y_i)$
13:  choose the weight of $f_t$: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
14:  update the distribution: $D_{t+1}(i) \leftarrow \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i f_t(x_i))$,
       where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i f_t(x_i))$ is a normalization constant (chosen so that $D_{t+1}$ is a distribution)
15: end for
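To make the control flow of Algorithm 1 concrete, the following stripped-down sketch implements a multiple kernel boosting loop in the same spirit: at each trial it resamples the training set according to $D_t$, trains one SVM per candidate kernel, keeps the one with the lowest weighted error, and updates $D_t$ as in AdaBoost. The band selection step and the PSO parameter search are omitted for brevity, and the data set, kernel list, and sampling ratio are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
y = np.where(y == 1, 1, -1)
rng = np.random.default_rng(0)

# Candidate base kernels (assumed): a few RBF bandwidths and polynomial degrees.
kernels = [dict(kernel="rbf", gamma=g) for g in (0.01, 0.1, 1.0)] + \
          [dict(kernel="poly", degree=d) for d in (1, 2, 3)]

N, T, sample_ratio = len(y), 10, 0.3
D = np.full(N, 1.0 / N)                  # weight distribution D_t over examples
learners, alphas = [], []

for t in range(T):
    idx = rng.choice(N, size=int(sample_ratio * N), replace=True, p=D)
    best_clf, best_err = None, np.inf
    for params in kernels:               # one weak SVM per candidate kernel
        clf = SVC(C=10.0, **params).fit(X[idx], y[idx])
        err = np.sum(D * (clf.predict(X) != y))        # weighted error, cf. Eq. (12)
        if err < best_err:
            best_clf, best_err = clf, err              # smallest error, cf. Eq. (13)
    if best_err <= 0.0 or best_err >= 0.5:             # skip degenerate trials
        continue
    alpha = 0.5 * np.log((1.0 - best_err) / best_err)  # classifier weight (step 13)
    D *= np.exp(-alpha * y * best_clf.predict(X))      # AdaBoost-style update (step 14)
    D /= D.sum()
    learners.append(best_clf)
    alphas.append(alpha)

# Final hypothesis: weighted vote of the selected kernel classifiers, cf. Eq. (16).
F = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
print("training accuracy:", np.mean(np.sign(F) == y))
```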
An illustrative example of how to apply Algorithm 1 is given in the following. Example. Given three classes of the Indian Pines dataset (see Section 4), Corn-notill (C1), Corn-mintill (C2), and Corn (C3), 6 kernel functions, initialized PSO parameters, and the number of iterations T = 10, we select a subset of $S_N$ = 9 bands from all 200 bands according to Formula (9) in the first of the 10 iterations. We sample $S_n$ = 500 (0.2 × 2502) examples as training data. In the first of the 6 iterations of weak classifier construction, we sample n = 100 examples with the distribution $D_1$ and select kernels $\kappa_1$, $\kappa_2$, $\kappa_3$ as the SVM kernel functions. After optimizing the punishment factor C = 673 and the parameter γ = 0.8 of the SVM, Algorithm 1 trains 3 weak classifiers $\{f_1^1, f_1^2, f_1^3\}$. We measure the training errors $\{\epsilon_1^1 = 1/3, \epsilon_1^2 = 1/3, \epsilon_1^3 = 1/3\}$ over $D_1$ and repeat the loop of weak classifier construction. We then choose the best classifier $f_1$ of all 6 iterations with the smallest error rate (suppose $f_1^1$ is picked as the first base learner), compute the training error $\epsilon_1$ over $D_1$ according to Formula (14), and determine the weight $\alpha_1 = 0.5 \ln 2 \approx 0.35$ of $f_1$. Finally, we update the distribution $D_2$ according to Formula (15) and enter the next cycle. The final hypothesis f(x) is obtained according to Formula (16) after completing all 10 iterations.
4. Experimental results

In this section, we carry out several experiments to verify the performance of our method and compare it with previous kernel-based methods for HSI classification. We then evaluate various settings of the parameters of OIMKB.

4.1. Data set description

Two hyperspectral images have been used in our experiments. The first was acquired by the AVIRIS sensor over the Indian Pines region, Northwestern Indiana, USA, in 1992, and is widely used to verify the performance of classification algorithms. The Indian Pines data set comprises 220 bands with a spatial size of 145 × 145 pixels in the wavelength range from 0.4 to 2.5 μm. After removing the noisy bands, 200 bands remain. The ground truth has 10,062 labeled pixels covering 16 land cover classes. The data set is publicly available at https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html. Fig. 1(a) shows a
Fig. 1. Classification maps of the AVIRIS Indian Pine dataset: (a) false color composite (bands 17, 27, and 50 for RGB), and (b) ground truth.
false color composite (bands 17, 27, and 50 for RGB) and (b) the ground truth. The second image is an urban scene acquired by the ROSIS sensor during a flight over the University of Pavia, northern Italy. The image size is 610 × 340 pixels, with a very high spatial resolution of 1.3 m per pixel. It consists of 115 bands in the range from 0.43 to 0.86 μm. 103 bands remain after removing 12 noisy ones. The ground truth data consist of 9 classes of interest and 42,776 labeled pixels. The ground truth map is shown in Fig. 2(a).

4.2. Experimental setup

For evaluation, we compare the experimental results of our algorithm with those of several previous competitive ones, including SVM-based single kernel (SVM for short), OIF feature selection with SVM (OIF for short), SimpleMKL (SMKL for short), and AdaBoost with SVM (AdaBoost for short). For the SVM classifier, we employ the polynomial and Gaussian radial basis function kernels. We performed a 10-fold cross-validation procedure using a single SVM to find the optimal SVM parameters σ ∈ {10⁻², …, 10²}, C ∈ {10¹, …, 10⁴}. SimpleMKL is one of the algorithms used to solve the MKL problem. To implement the SimpleMKL algorithm, we adopt the SimpleMKL toolbox [30] and the default settings suggested by this toolbox. AdaBoostSVM is an algorithm applying AdaBoost to improve the SVM learning accuracy [44]. For AdaBoostSVM, 10-fold cross-validation is adopted to select the best kernel; the other settings are the same as those in our OIMKB algorithm. In all cases, the one-versus-one multiclass scheme implemented in LibSVM [67] was used. For OIMKB, we follow the typical approach used in previous MKL and AdaBoost studies. In particular, 16 base kernels are used initially in the ensemble, including 13 Gaussian kernels with different bandwidth parameters from {2⁻⁶, 2⁻⁵, …, 2⁶} and 3 polynomial kernels with degrees 1, 2, and 3 respectively. Before classification, preprocessing is performed on these data: data sources should be scaled to the range [−1, 1], which eases the tuning of the SVM kernel parameters [67]. We set the total number of boosting trials T to 100, the boosting sampling ratio to 0.3, and the classifier sampling ratio ρ to 0.3. For SVM, we adopt the popular LIBSVM toolbox [67] as the solver, but the parameters are optimized by PSO with the mutation mechanism. The configuration for PSO is set as follows: the swarm size is fixed to 20, the maximum iteration number to 50, the inertia weight w = 0.8, c1 = c2 = 1.6, and the mutation probability ρ = 0.02. In the feature selection index OI (Eq. (9)), we set α = 0.25. We implement all experiments in a MATLAB environment on a computer with a 2.9 GHz Intel CPU and 16 GB RAM. Three evaluation metrics, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, are widely used to measure the statistical significance of hyperspectral image classification results [68]. OA (Eq. (17)) is the sum of the pixels correctly classified divided by the total number of samples, AA (Eq. (18)) is the average of the individual class producers' accuracies, and the Kappa coefficient (Eq. (19)) is the percentage of agreement [69,70]:
$\mathrm{OA} = \frac{\sum_{i=1}^{r} x_{ii}}{N} \times 100$   (17)

$\mathrm{AA} = \frac{1}{r} \sum_{i=1}^{r} \frac{x_{ii}}{x_{i+}} \times 100$   (18)

Fig. 2. Classification maps of the PaviaU dataset: (a) ground truth, (b) SimpleMKL, (c) SVM, (d) OIF, (e) AdaBoostSVM, and (f) our OIMKB.
Table 1. The number of training and testing samples of the two data sets.

Indian Pines        Training  Test   | Pavia University  Training  Test
Corn-notill         267       1161   | Asphalt           636       5995
Corn-mintill        171       659    | Meadows           1829      16,820
Corn                43        194    | Gravel            212       1887
Grass/Pasture       84        399    | Trees             356       2708
Grass/Trees         155       575    | Metal sheets      143       1202
Hay-windrowed       113       365    | Bare soil         492       4537
Soybeans-notill     181       791    | Bitumen           142       1188
Soybeans-mintill    485       1970   | Bricks            369       3313
Soybeans-clean      131       462    | Shadows           97        850
Wheat               37        168    |
Woods               266       999    |
Bldg–Grass–Tree     79        307    |
$\mathrm{Kappa} = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} x_{i+} \times x_{+i}}{N^2 - \sum_{i=1}^{r} x_{i+} \times x_{+i}}$   (19)
where $r$ is the number of rows of the confusion matrix, $N = \sum_{i=1}^{r} \sum_{j=1}^{r} x_{ij}$ is the total number of observations, $x_{ii}$ is entry $(i, i)$ of the confusion matrix, and $x_{i+}$ and $x_{+i}$ are the marginal totals of row $i$ and column $i$, respectively.
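As a quick reference, the sketch below computes OA, AA, and the Kappa coefficient of Eqs. (17)–(19) from a confusion matrix; the small 3-class matrix is made up purely for illustration, and rows are assumed to hold the reference classes.

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, AA, and Kappa (Eqs. (17)-(19)) from an r x r confusion matrix cm,
    with rows = reference (ground truth) classes and columns = predictions."""
    cm = np.asarray(cm, dtype=float)
    N = cm.sum()
    diag = np.diag(cm)
    row_tot = cm.sum(axis=1)                # x_{i+}
    col_tot = cm.sum(axis=0)                # x_{+i}
    oa = diag.sum() / N * 100.0
    aa = np.mean(diag / row_tot) * 100.0    # mean per-class accuracy
    chance = np.dot(row_tot, col_tot)
    kappa = (N * diag.sum() - chance) / (N ** 2 - chance)
    return oa, aa, kappa

# Toy 3-class confusion matrix (illustrative only).
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 60]]
print("OA = %.2f%%, AA = %.2f%%, Kappa = %.3f" % accuracy_metrics(cm))
```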
4.3. Comparisons

In the Indian Pines data set, after removing four classes with small sample sizes, only 12 classes are considered, including Corn-notill (C1), Corn-mintill (C2), Corn (C3), Grass/Pasture (C4), Grass/Trees (C5), Hay-windrowed (C6), Soybeans-notill (C7), Soybeans-mintill (C8), Soybeans-clean (C9), Wheat (C10), Woods (C11) and Bldg–Grass–Tree (C12). About 20% of the available labeled samples are randomly selected for training, and the remaining ones are used for testing in each run. In each run, we select the 50 three-band combinations with the highest OI. For the Pavia University data set, about 10% of the labeled samples are randomly selected for training, and the remaining ones are used for testing in each run; in each run, we select the 22 three-band combinations with the highest OI. Table 1 summarizes the number of training and testing samples for Indian Pines and Pavia University. We repeat each algorithm 10 times on every data set and report the average to avoid unstable results. Table 2 (for the Indian Pines data set) and Table 3 (for the Pavia University data set) show the OAs, AAs, individual classification accuracies (in percent), and the Kappa statistic obtained with the different kernel-based classification methods. The processing times are also shown in the two tables. The highest scores for each class are highlighted in boldface font. From Tables 2 and 3, the proposed method performs better than previous works in terms of overall accuracy and Kappa statistic. Furthermore, it is noticeable that the other types of kernels defined in the generalized approach also produce competitive results. From Table 2, it can be seen that our method achieves the highest OA, above 88%. In Table 3, our method has the highest OA, 95.8%. The tables also show that our method presents higher performance, especially for classes with a small number of training samples such as Corn and Wheat. As observed, OIF with SVM surpasses SVM slightly, and our method performs best and clearly
Table 2. Comparison of classification results using five algorithms for Indian Pines.

Class  Class name          SVM           OIF           SMKL          AdaBoost      OIMKB
C1     Corn-notill         79.31±3.2     80.27±1.7     76.53±0.7     82.56±1.1     85.46±1.4
C2     Corn-mintill        73.59±3.1     75.67±3.1     70.80±1.1     79.62±2.9     82.32±0.8
C3     Corn                65.05±2.5     64.96±2.5     58.46±2.1     75.01±2.6     77.85±2.4
C4     Grass/Pasture       91.44±3.2     91.84±3.7     90.08±0.4     92.64±2.6     94.89±0.9
C5     Grass/Trees         96.42±2.6     96.72±3.2     92.74±2.1     96.85±0.1     96.60±1.0
C6     Hay-windrowed       96.74±1.4     96.74±1.9     96.55±0.1     97.55±2.8     99.16±1.9
C7     Soybeans-notill     77.15±2.7     79.54±2.7     72.13±0.4     82.13±1.7     81.80±0.4
C8     Soybeans-mintill    83.87±2.1     83.67±3.1     81.57±4.3     86.66±2.1     87.02±2.5
C9     Soybeans-clean      81.23±3.8     83.43±3.2     75.86±2.4     85.36±1.3     85.07±3.1
C10    Wheat               93.97±3.8     94.07±4.8     92.25±3.7     94.51±1.6     97.35±1.1
C11    Woods               94.03±1.8     94.13±1.2     93.91±4.1     94.54±0.2     95.37±1.2
C12    Bldg–Grass–Tree     64.76±3.1     67.03±2.4     64.73±2.3     71.14±1.8     73.38±1.1
       Kappa               0.8086        0.8109        0.7746        0.8385        0.8579
       OA                  82.14         82.92         80.08         85.22         88.02
       AA                  80.97         81.71         78.64         84.27         87.68
       Time (s)            428           403           760           686           773
Table 3. Comparison of classification results using five algorithms for Pavia University.

Class  Class name      SVM           OIF           SMKL          AdaBoost      OIMKB
C1     Asphalt         89.31±2.2     92.34±1.2     86.69±1.1     94.82±1.4     95.66±1.6
C2     Meadows         96.59±3.7     96.97±2.1     95.97±1.4     97.69±3.2     97.83±1.8
C3     Gravel          82.32±2.0     81.46±2.2     79.98±2.3     84.93±2.9     86.74±2.8
C4     Trees           93.44±2.2     93.82±2.7     92.98±1.8     95.20±3.6     97.89±1.4
C5     Metal sheets    96.23±2.1     98.86±1.1     97.49±2.0     99.53±2.2     99.41±1.9
C6     Bare soil       88.94±1.1     91.74±1.2     88.68±1.1     92.44±2.2     93.22±1.7
C7     Bitumen         80.15±1.7     84.54±3.7     77.69±1.8     88.24±1.4     89.87±2.4
C8     Bricks          85.67±2.7     87.67±4.1     85.84±3.3     89.40±3.1     90.22±2.9
C9     Shadows         99.23±3.1     99.43±3.4     99.13±2.4     99.91±2.3     99.86±3.5
       Kappa           0.9051        0.9189        0.8865        0.9337        0.9403
       OA              92.32         93.16         90.23         94.93         95.81
       AA              91.44         92.32         88.96         93.72         94.97
       Time (s)        268           294           176           661           702
Fig. 3. Relationship between scales selection and overall accuracy (overall accuracy (%) versus the number of training samples (%) on the Pavia University data set for OIMKB, SVM, OIF, SimpleMKL, and AdaBoostSVM).
outperforms the SimpleMKL method. The values of the Kappa coefficient show that OIMKB yields a clear improvement over AdaBoostSVM. Classification maps obtained for the University of Pavia scene using a fixed training configuration are shown in Fig. 2. From Fig. 2, we can easily see that our proposed method, OIMKB (f), achieves a higher accuracy than SMKL (b), SVM (c), OIF (d), and AdaBoostSVM (e), where (a) is the ground truth. From the "Meadows" region in the lower part of the image, it is clear that OIMKB obtains better performance and is closer to the actual spatial distribution of the material than all the other methods.

4.4. Evaluation of scales selection

The effect of the training sample size on classification performance is analyzed here. We fix the band number to 66. Evaluation experiments on scales selection are performed with different numbers of training samples (5–50% of all labeled samples in each class) in all cases. Fig. 3 shows the evolution of the overall accuracy as a function of the percentage of training samples used for the five evaluated approaches in the Pavia University scene. It can be seen from Fig. 3 that our method is superior to SVM, OIF with SVM, and SimpleMKL, and is close to AdaBoostSVM for the Pavia University data set. The performance of SVM can be improved by OIF feature selection. The OA curves indicate that SimpleMKL obtains lower performance than all the other methods in each experiment. This result is similar to what Tuia et al. reported in [71]. Fig. 3 illustrates that our method has a better discriminative ability when dealing with a smaller number of labeled samples, for example, a training sample size of 5% per class. As the number of training samples increases, the classification performance of each kernel generally increases, because more of the complexity of the data structure is captured. However, training fractions of more than 10% do not have much impact on the OA. In particular, the advantage of our method is very obvious for training sample sizes of 10% and 20%, with gains of 1.7% and 1.4% in OA over AdaBoostSVM, respectively. However, this advantage becomes less obvious when the training sample size is 15% or 30%. SVM and OIF with SVM also have the closest OA at 5% and 30% of training samples.

4.5. Evaluation of the numbers of selected bands

Fig. 4. Kappa coefficients varying with the number of bands (error bars on the Indian Pines data set for OIMKB, AdaBoostSVM, OIF, and SimpleMKL as the number of bands varies from 15 to 200).

This experiment examines the impact of the varying number of bands on the Kappa coefficients of the OIMKB, AdaBoostSVM, OIF with SVM, and SimpleMKL algorithms (SVM is absent because it uses all bands). Fig. 4 shows the error-bar results of the Kappa coefficient versus the number of bands selected with the feature selection index OI, varying the number from 15 to 200 for the different methods. First of all, we observe that, in terms of classification accuracy, our method and AdaBoostSVM are more accurate than SVM with OIF and SimpleMKL, particularly when the number of bands is relatively small, although the degree of superiority over SVM and SimpleMKL varies. Increasing the band number improves the performance of our method consistently. The improvement in classification accuracy usually becomes very small when the band number is large (e.g., 120). The main reason is that, when the number of bands is too large, the base kernel classifiers trained in the boosting process may suffer from band correlation and information redundancy. The results in Fig. 4 show that our method is relatively insensitive to the precise choice of band number, since the classification accuracy tends to saturate when the number of bands is large enough (e.g., larger than 120). Fig. 4 thus indicates that an appropriate number of bands is essentially a tradeoff between classification accuracy and efficiency.
5. Conclusion

In this paper, a multiple kernel ensemble learning approach (OIMKB) has been applied to hyperspectral remote sensing image classification, leveraging feature selection and PSO. Our approach introduces a novel feature selection criterion based on standard deviation, KL divergence, and correlation coefficient, with which we obtain more accurate results than the single kernel and multiple base kernel baselines. Furthermore, we optimized the SVM classifier by PSO with the mutation mechanism to search for the optimal values of its parameters. These techniques have been evaluated using two standard hyperspectral datasets recorded by different sensors. Experimental results show that, compared with state-of-the-art algorithms, our algorithm has a promising performance on HSI classification. Generally, our ensemble framework is faster than the mixture kernel, but slower than the single kernel. Further research will focus on reducing the computational cost and developing more efficient schemes.
References [1] C.I. Chang, Hyperspectral Data Exploitation: Theory and Applications, Wiley, Hoboken, NJ, USA, 2007. [2] L. Bruzzone, C. Persello, A novel context-sensitive semisupervised SVM classifier robust to mislabeled training samples, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2142–2154. [3] B.C. Kuo, C.H. Li, J.M. Yang, Kernel nonparametric weighted feature extraction for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 47 (2009) 1139–1155. [4] Z. Lv, et al., Managing big city information based on WebVRGIS, IEEE Access 4 (2016) 407–415. [5] J. Yang, J. Zhou, Z. Lv, et al., A real-time monitoring system of industry carbon monoxide based on wireless sensor networks, Sensors 15 (2015) 29535–29546. [6] G.F. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Trans. Inf. Theory IT-14 (1968) 55–63. [7] C.J.C. Burges, Dimension reduction: a guided tour, Found. Trends Mach. Learn. 2 (2010) 275–365. [8] S. Patra, P. Modi, L. Bruzzone, Hyperspectral band selection based on rough set, IEEE Trans. Geosci. Remote Sens. 53 (2015) 5495–5503. [9] H. Yang, Q. Du, H. Su, Y. Sheng, An efficient method for supervised hyperspectral band selection, IEEE Geosci. Remote Sens. Lett. 8 (2011) 138–142. [10] A. Martinez-Uso, F. Pla, J.M. Sotoca, P. Garcia-Sevilla, Clustering-based hyperspectral band selection using information measures, IEEE Trans. Geosci. Remote Sens. 45 (2007) 4158–4171. [11] T. Guo, L. Han, L. He, X. Yang, A GA-based feature selection and parameter optimization for linear support higher-order tensor machine, Neurocomputing 144 (2014) 408–416. [12] L. Shen, Z. Zhu, S. Jia, J. Zhu, Y. Sun, Discriminative Gabor feature selection for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 10 (2013) 29–33. [13] A. Das, S. Ghosh, A. Ghosh, Band elimination of hyperspectral imagery using partitioned band image correlation and capacitory discrimination, Int. J. Remote Sens. 35 (2014) 554–577. [14] C. Wang, M. Gong, M. Zhang, Y. Chan, Unsupervised hyperspectral image band selection via column subset selection, IEEE Geosci. Remote Sens. Lett. 12 (2015) 1411–1415. [15] Z. Cai, R. Goebel, M. Salavatipour, G. Lin, Selecting genes with dissimilar discrimination strength for class prediction, BMC Bioinform. 8 (2007) 206. [16] J. Feng, L.C. Jiao, X. Zhang, T. Sun, Hyperspectral band selection based on trivariate mutual information and clonal selection, IEEE Trans. Geosci. Remote Sens. 52 (2014) 4092–4105. [17] X. Geng, K. Sun, L. Ji, Y. Zhao, A fast volume-gradient-based band selection method for hyperspectral image, IEEE Trans. Geosci. Remote Sens. 52 (2014) 7111–7119. [18] P. Chavez, G. Berlin, L. Sowers, Statistical method for selecting landsat MSS ratios, J. Appl. Photogr. Eng. 1 (1982) 23–30. [19] N. Patel, B. Kaushal, Classification of features selected through optimum index factor (OIF) for improving classification accuracy, J. For. Res. 22 (2011) 99–105. [20] C.M. Bachmann, T.L. Ainsworth, R.A. Fusina, Exploiting manifold geometry in hyperspectral imagery, IEEE Trans. Geosci. Remote Sens. 43 (2005) 441–454. [21] B. Wu, L. Zhang, Y. Zhao, Feature selection via Cramer's V-test discretization for remotesensing image classification, IEEE Trans. Geosci. Remote Sens. 52 (2014) 2593–2606. [22] Y. Sun, R. Bie, J. Zhang, Measuring semantic-based structural similarity in multi-relational networks, Int. J. Data Warehous. Min. 12 (2016) 20–33. [23] X. Qin, H. Zou, S. Zhou, K. 
Ji, Region-based classification of SAR images using Kullback–Leibler distance between generalized gamma distributions, IEEE Geosci. Remote Sens. Lett. 12 (2015) 1655–1659. [24] J. Zeng, U. Kruger, J. Geluk, X. Wang, L. Xie, Detecting abnormal situations using the Kullback–Leibler divergence, Automatica 50 (2014) 2777–2786. [25] F. Ferracuti, A. Giantomassi, S. Iarlori, G. Ippoliti, S. Longhi, Electric motor defects diagnosis based on kernel density estimation and Kullback–Leibler divergence in quality control scenario, Eng. Appl. Artif. Intell. 44 (2015) 25–32. [26] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998. [27] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, W.J. Emery, SVM active learning approach for image classification using spatial information, IEEE Trans. Geosci. Remote Sens. 52 (2014) 2217–2233. [28] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, 2004. [29] L. Gao, J. Li, M. Khodadadzadeh, A. Plaza, B. Zhang, Z. He, H. Yan, Subspacebased support vector machines for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 12 (2015) 349–353. [30] A. Rakotomamonjy, F.R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (2008) 2491–2521. [31] S. Sonnenburg, G. Rätsch, C. Schäfer, B. Schölkopf, Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (2006) 1531–1565. [32] N. Subrahmanya, Y.C. Shin, Sparse multiple kernel learning for signal processing applications, IEEE Trans. Pattern Anal. Mach. Int. 32 (2010) 788–798. [33] T. Suzuki, R. Tomioka, SpicyMKL: a fast algorithm for multiple kernel learning with thousands of kernels, Mach. Learn. 85 (2011) 1–32. [34] C. Cortes, M. Mohri, A. Rostamizadeh, Two-stage learning kernel algorithms, in: E.H. Zarantonello, Author 2 (Eds.), Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010, pp. 239–246. [35] C. Cortes, M. Mohri, A. Rostamizadeh, Algorithms for learning kernels based
9
on centered alignment, J. Mach. Learn. Res. 13 (2012) 795–828. [36] T. Wang, D. Zhao, Y. Feng, Two-stage multiple kernel learning with multiclass kernel polarization, Knowl.-Based Syst. 48 (2013) 10–16. [37] A. Pastor López-Monroy, et al., Improving the BoVW via discriminative visual n-grams and MKL strategies, Neurocomputing 175 (Part A) (2016) 768–781. [38] H. Xia, C.H.H. Steven, MKBoost: a framework of multiple kernel boosting, IEEE Trans. Knowl. Data Eng. 25 (2013) 1574–1586. [39] T. Sun, L. Jiao, S. Wang, J. Feng, Selective multiple kernel learning for classification with ensemble strategy, Pattern Recognit. 46 (2013) 3081–3090. [40] Z. Cai, T. Zhang, X. Wan, A computational framework for influenza antigenic cartography, PLoS Comput. Biol. 6 (10) (2010) e1000949. [41] Y. Gu, H. Liu, Sample-screening MKL method via boosting strategy for hyperspectral image classification, Neurocomputing 173 (2016) 1630–1639, Part 3. [42] B. Ayerdi, I. Marqués, M. Graña, Spatially regularized semisupervised ensembles of extreme learning machines for hyperspectral image segmentation, Neurocomputing 149 (2015) 373–386, Part A. [43] Y. Zhang, H.L. Yang, S. Prasad, E. Pasolli, J. Jung, M. Crawford, Ensemble multiple kernel active learning for classification of multisource remote sensing data, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 8 (2015) 845–858. [44] X. Zhang, F. Ren, Improving SVM learning accuracy with Adaboost, in: Proceedings of International Conference on Natural Computation, 2008, pp. 221–225. [45] D. Pavlov, J. Mao, B. Dom, Scaling-up support vector machines using boosting algorithm, in: Proceedings of 15th International Conference on Pattern Recognition (ICPR), 2000, pp. 2219–2222. [46] X. Li, L. Wang, E. Sung, Adaboost with SVM-based component classifiers, Eng. Appl. Artif. Intell. 21 (2008) 785–795. [47] S.M. Valiollahzadeh, A. Sayadiyan, M. Nazari, Face Detection Using Adaboosted SVM-Based Component Classifier, CoRR, abs/0812.2575, 2008. [48] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139. [49] T. Sun, L. Jiao, J. Feng, F. Liu, X. Zhang, Imbalanced hyperspectral image classification based on maximum margin, IEEE Geosci. Remote Sens. Lett. 12 (2015) 522–526. [50] J. Kennedy, R. Eberhart, A new optimizer using particle swarm theory, in: Proceedings of IEEE 6th International Symposium on Micro Machine and Human Science, 1995, pp. 39–43. [51] F. Melgani, Y. Bazi, Classification of electrocardiogram signals with support vector machines and particle swarm optimization, IEEE Trans. Inf. Technol. Biomed. 12 (2008) 667–677. [52] S.T. Monteiro, Y. Kosugi, Particle swarms for feature extraction of hyperspectral data, IEEE Trans. Remote Sens. Geosci. 90 (2007) 1038–1046. [53] Y. Zhang, L. Wu, Crop classification by forward neural network with adaptive chaotic particle swarm optimization, Sensors 11 (2011) 4721–4743. [54] M.S. Couceiro, R.P. Rocha, N.M.F. Ferreira, J.A.T. Machado, Introducing the fractional-order Darwinian PSO, Signal Image Video Process. 6 (2012) 343–350. [55] P. Ghamisi, M.S. Couceiro, F.M. Martins, J.A. Benediktsson, Multilevel image segmentation approach for remote sensing images based on fractional-order Darwinian particle swarm optimization, IEEE Trans. Remote Sens. Geosci. 52 (2014) 2382–2394. [56] P. Ghamisi, M. Couceiro, M. Fauvel, J.A. Benediktsson, Integration of segmentation techniques for classification of hyperspectral images, IEEE Geosci. Remote Sens. Lett. 
11 (2014) 342–346. [57] P. Ghamisi, M. Couceiro, J.A. Benediktsson, A novel feature selection approach based on FODPSO and SVM, IEEE Trans. Remote Sens. Geosci. 53 (2015) 2935–2947. [58] G. Camps-Valls, L. Bruzzone, Kernel-based methods for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 43 (2005) 1351–1362. [59] J. Li, H. Zhang, Y. Huang, L. Zhang, Hyperspectral image classification by nonlocal joint collaborative representation with a locally adaptive dictionary, IEEE Trans. Geosci. Remote Sens. 52 (2014) 3707–3719. [60] A. Zien, C.S. Ong, Multiclass multiple kernel learning, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, 2007, pp. 1191–1198. [61] M. Gönen, E. AlpayIn, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268. [62] B. Schölkopf, A.J. Smola, Learning With Kernels, MIT Press, Cambridge, MA, 2002. [63] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in: Proceedings of the IEEE, vol. 77, 1989. [64] J. Kennedy, R.C. Eberhart, Swarm Intelligence, Morgan Kaufmann Publishers, San Francisco, California, 2001. [65] J. Kennedy, The behavior of particles, Evolut. Program. 7 (1998) 581–587. [66] D. Beasley, D. Bull, R. Martin, An overview of genetic algorithms, Univ. Comput. 15 (1993) 58–69. [67] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27. [68] R.G. Congalton, R.G. Oderwald, R.A. Mead, Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques, Photogramm. Eng. Remote Sens. 49 (1983) 1671–1678. [69] G.M. Foody, Classification accuracy comparison: hypothesis tests and the use of confidence intervals in evaluations of difference, equivalence and non-inferiority, Remote Sens. Environ. 113 (2009) 1658–1663. [70] P. Ramzi, F. Samadzadegan, P. Reinartz, Classification of hyperspectral data using an AdaBoostSVM technique applied on band clusters, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 7 (2014) 2066–2079. [71] D. Tuia, G. Camps-Valls, G. Matasci, M. Kanevski, Learning relevant image features with multiple-kernel classification, IEEE Trans. Geosci. Remote Sens. 48 (2010) 3780–3791.
C. Qi et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ Chengming Qi is a faculty at the College of Automation, Beijing Union University, China. He received his Master of Engineering from the School of Information Engineering, China University of Geosciences (Beijing), 2004. He is currently pursuing the Ph.D. degree with the School of Information Engineering, China University of Geosciences (Beijing). His research interests include computational intelligence, machine learning. He has published around 40 research papers.
Lishuan Hu is a faculty at the College of Automation, Beijing Union University, China. He received his Master of Engineering from the School of Information Engineering, China University of Geosciences (Beijing), 2005. He is currently pursuing the Ph.D. degree with the School of Information Engineering, China University of Geosciences (Beijing). His research interests include computational intelligence, machine learning. He has published around 5 research papers.
Zhangbing Zhou is a professor at China University of Geosciences (Beijing), China, and an adjunct associate professor at TELECOM SudParis, France. His interests include services computing and business process management. He has published over 100 referred papers.
Qun Wang is currently a professor at the School of Information Engineering, China University of Geosciences (Beijing), China. He has published over 100 scientific papers. His research interests include high performance computing and visualization techniques and geoengineering, geospatial data mining, networking engineering and security.
Yunchuan Sun is currently an associate professor at Beijing Normal University, China, an IEEE senior member, and a CCF member. He received his Ph.D. from the Institute of Computing Technology, Chinese Academy of Sciences, China, in 2009. He has acted as the Secretary of the IEEE Communications Society Technical Subcommittee for IoT since 2013. He has been an associate editor of the Springer journal Personal and Ubiquitous Computing since 2012. His research interests include Big Data, Event-linked Network, Internet of Things, Semantic Technologies, and Information Security. He has published 60+ papers in international journals and conferences. As founder and program co-chair, he successfully organized the series of international events IIKI2012, IIKI2013, IIKI2014, and IIKI2015. He has organized several special issues in journals such as Knowledge-Based Systems, Personal and Ubiquitous Computing, Journal of Network and Computer Applications, International Journal of Electronic Commerce, and Electronic Commerce Research. He also holds or participates in several research projects from NSFC, the 863 Program of China, etc.
Houbing Song (M'12–SM'14) received the Ph.D. degree in electrical engineering from the University of Virginia, Charlottesville, VA, in August 2012. In August 2012, he joined the Department of Electrical and Computer Engineering, West Virginia University, Montgomery, WV, where he is currently an Assistant Professor and the founding director of the Security and Optimization for Networked Globe Laboratory (SONG Lab). His research interests lie in the areas of cyber-physical systems, internet of things, cloud computing, big data, connected vehicles, wireless communications and networking, and optical communications and networking. Dr. Song's research has been supported by the West Virginia Higher Education Policy Commission. Dr. Song is a senior member of IEEE and a member of ACM. Dr. Song is an associate editor for several international journals, including IEEE Access, KSII Transactions on Internet and Information Systems, and SpringerPlus, and a guest editor of several special issues. Dr. Song was the general chair of 4 international workshops, including the first IEEE International Workshop on Security and Privacy for Internet of Things and Cyber-Physical Systems (IOT/CPS-Security), held in London, UK, the first/second/third IEEE ICCC International Workshop on Internet of Things (IOT 2013/2014/2015), held in Xi'an/Shanghai/Shenzhen, China, and the first IEEE International Workshop on Big Data Analytics for Smart and Connected Health, to be held in Washington D.C., USA. Dr. Song also served as the technical program committee chair of the fourth IEEE International Workshop on Cloud Computing Systems, Networks, and Applications (CCSNA), held in San Diego, USA. Dr. Song has served on the technical program committee for numerous international conferences, including ICC, GLOBECOM, INFOCOM, WCNC, and so on. Dr. Song has published more than 80 academic papers in peer-reviewed international journals and conferences.