Neurocomputing
Feature selection and multiple kernel boosting framework based on PSO with mutation mechanism for hyperspectral classification

Chengming Qi a,b, Zhangbing Zhou a,c,*, Yunchuan Sun d, Houbing Song e, Lishuan Hu a,b, Qun Wang a

a School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, China
b College of Automation, Beijing Union University, Beijing 100101, China
c Computer Science Department, TELECOM SudParis, Evry 91001, France
d Business School, Beijing Normal University, Beijing 100875, China
e Security and Optimization for Networked Globe Laboratory, West Virginia University, Montgomery, WV 25136-2437, USA
Article info
Article history: Received 30 January 2016; Received in revised form 28 March 2016; Accepted 9 May 2016
Keywords: Ensemble learning; Feature selection; Hyperspectral remote sensing image; Multiple kernel boosting

Abstract
Hyperspectral remote sensing sensors can capture hundreds of contiguous spectral images and provide plenty of valuable information. Feature selection and classification play a key role in the field of HyperSpectral Image (HSI) analysis. This paper addresses the problem of HSI classification from the following three aspects. First, we present a novel criterion based on standard deviation, Kullback–Leibler distance, and correlation coefficient for feature selection. Second, we optimize the SVM classifier design by searching for the most appropriate values of its parameters using particle swarm optimization (PSO) with a mutation mechanism. Finally, we propose an ensemble learning framework, which applies the boosting technique to learn multiple kernel classifiers for classification problems. Experiments are conducted on benchmark HSI classification data sets. The evaluation results show that the proposed approach can achieve better accuracy and efficiency than state-of-the-art methods.
© 2016 Elsevier B.V. All rights reserved.
* Corresponding author at: School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, China.
E-mail addresses: [email protected] (C. Qi), [email protected] (Z. Zhou), [email protected] (H. Song), [email protected] (L. Hu), [email protected] (Q. Wang).

1. Introduction

HyperSpectral Image (HSI) analysis has been an emerging research topic in recent years; hyperspectral data provide continuous coverage of the solar reflective wavelengths at a high spectral resolution. Hyperspectral sensors divide the electromagnetic spectrum into hundreds of spectral bands, which provide the potential for detailed land-cover distinction and identification [1]. Classification of HSI consists of six sequential steps: pre-processing, feature extraction, feature selection, segmentation, classification, and post-processing. Several hundred spectral bands lead to theoretical and practical problems [2,3]. Most applications [4,5] and classification algorithms encounter the "Hughes phenomenon" [6]. Therefore, feature extraction or feature selection techniques are of core importance for HSI processing [7]. Feature selection is to select a subset of bands from the data cube. The selected subset should consist of the most informative and least correlated bands. Since no transformation is involved, band selection results are easier to interpret with traditional image-processing methods. Several feature selection techniques have been presented in the literature for supporting the analysis of HSI [8–14] and bioinformatics data [15]. In [8], Patra et al. proposed a rough-set-based supervised method to select informative bands from HSI. In [9], leveraging the covariance matrix, Yang et al. presented a fast supervised band selection method for HSI classification. Martinez-Uso et al. [10] grouped similar bands into clusters by a clustering technique and selected the most informative bands by applying either a mutual information criterion or a Kullback–Leibler (KL) divergence criterion. In [11], Guo et al. proposed a GA-based feature selection method and optimized the parameters of a linear support higher-order tensor machine. In [12], Shen et al. proposed a discriminative Gabor method for feature selection. In [13], Das et al. adopted partitioned band image correlation to eliminate bands of HSI. In [14], Wang et al. proposed a band selection method based on column subset selection for HSI. A distance measure or a mutual information measure is adopted as the criterion for selecting bands which show greater agreement with the ground truth [16,17]. In [18], Chavez et al. proposed the Optimum Index Factor (OIF), which can be calculated to obtain multivariate statistical information on a data set. In [19], Patel et al. employed features selected through OIF from both the individual years' and stacked images to classify satellite images. Inspired
by the OIF for feature selection, we propose a novel feature selection scheme which uses standard deviation, KL divergence, and correlation coefficients to select the most informative and the least correlated bands for classification. Recently, several criteria have been proposed to serve as measures of the similarity between distributions, such as the KL divergence, the KL distance, the Bhattacharyya measure, the Chernoff measure, the (h, φ)-divergence [20,21] and the semantic-based structural similarity [22]. Among them, as a foundation of information theory, statistics, and machine learning, the KL distance is a popular distribution separability measure and has been widely applied. In [23], a region-based classifier, rather than an individual-pixel classifier, is proposed for SAR images. In this algorithm, each region is assigned to the class that minimizes a criterion based on the KL distance of the gamma distribution for SAR images. In [24], Zeng et al. developed a statistical method that employs KL divergence to detect anomalous system behavior. In [25], Ferracuti et al. applied the KL divergence as an index for the automatic identification of electric motor defects. Support vector machine (SVM), which depends on the principle of structural risk minimization [26], has a promising generalization performance when applied to HSI classification [27]. The standard SVM only utilizes a single kernel function with fixed parameters, which necessitates model selection for satisfactory classification performance. SVM has been widely used to solve machine learning problems in the past several decades [28,29]. Single kernel learning usually needs to choose proper kernel parameters, while multiple kernel learning (MKL) is usually required to search for a linear/nonlinear combination of predefined base kernels by margin maximization. Generally, MKL provides more flexibility in modeling similarities of data sources than single kernel learning. Rakotomamonjy et al. [30] proposed SimpleMKL, where the kernel weights are obtained by a reduced gradient descent method. Furthermore, semi-infinite linear programming (SILP) [31], sparse MKL [32], and SpicyMKL [33] were proposed to solve the MKL problem. Recently, Cortes et al. [34,35] and Wang et al. [36] proposed two-stage procedures to address the MKL problem. The first stage is to find the optimal weights to combine the kernels, which makes use of the information from the complete training data and can be computed efficiently. The second stage trains a standard SVM by means of the combined kernel. Pastor López-Monroy et al. [37] proposed discriminative visual n-grams and MKL strategies to improve the Bag-of-Visual-Words model. Recently, ensemble methods have been proposed for MKL. Ensemble methods consider the result of the misclassified data in the training phase and collect several classifiers to classify test examples. Several algorithms use ensemble learning to solve MKL. Xia et al. [38] proposed a framework that adopts boosting to solve the MKL problem. Since the support vector coefficients cannot be obtained, Sun et al. [39] used a selective MKL method to approximate support vectors. Cai et al. [40] proposed a computational framework for constructing an influenza antigenic cartography from an incomplete matrix. Gu et al. [41] employed a boosting strategy for screening the limited training samples under the MKL framework. Ayerdi et al.
[42] used an ensemble of extreme learning machine classifiers for HSI classification and segmentation. Zhang et al. [43] showed that ensemble methods combining spectral and spatial information outperform traditional single kernel approaches for HSI classification. However, these methods have to solve a complicated optimization task when learning classifiers using boosting methods. In addition, some approaches adopt the boosting technique with SVM to improve kernel methods, such as BoostSVM [44] and AdaBoost with SVM [45–47]. The representative boosting algorithm is
the AdaBoost algorithm [48]. Various simulation results for hyperspectral remote sensing data show that an SVM ensemble with bagging or boosting significantly outperforms a single SVM in terms of classification accuracy [49]. On the other hand, these methods can hardly deal with multiple kernels that originate from multiple sources. In 1995, Eberhart and Kennedy [50] proposed the particle swarm optimization (PSO) algorithm. PSO is a swarm intelligence optimization method, which can find solutions quickly in a high-dimensional space owing to its stochastic and multi-point searching ability. PSO has been adapted to select bands in HSI and to optimize the penalty parameter C and the kernel parameter γ for SVM, which leads to improved classification performance. For example, in [51], Melgani et al. used PSO to enhance the classification performance of SVM in electrocardiogram signal classification. In [52], Monteiro et al. proposed the use of particle swarms to perform feature extraction from hyperspectral data. However, PSO suffers from the shortcoming of premature convergence. To address this issue, extensive studies have been conducted in HSI analysis. For example, in [53], Zhang et al. used adaptive chaotic PSO to find the optimal parameters of a forward neural network. Couceiro et al. [54] proposed fractional-order Darwinian PSO (FODPSO). Ghamisi et al. [55–57] applied FODPSO to hyperspectral data. Following a similar strategy, we employ a mutation mechanism to prevent particles from converging to a local optimum and losing diversity. To improve the efficiency of the multiple kernel boosting framework for classification, in this paper, we propose a strategy for feature selection and employ PSO to optimize the SVM classifier parameters; the resulting method is named OIMKB. In comparison with the other data analysis approaches applied to HSI, this approach has three specific contributions summarized as follows: (i) a new feature selection scheme has been introduced, which uses standard deviation, KL distance, and correlation coefficients for the selection of the most informative and the least correlated bands for supporting the classification, (ii) the SVM classifier design has been optimized through searching for the most appropriate values of the parameters using PSO with a mutation mechanism, and a boosting framework is constructed for improving MKL, and (iii) extensive experiments have been conducted on hyperspectral images for validating the applicability and performance of our approach by comparing with various state-of-the-art kernel-based algorithms, and for evaluating various settings of the parameters of OIMKB to provide a tradeoff between accuracy and efficiency. The remainder of this paper is organized as follows. In Section 2, we review SVM and MKL. In Section 3, we formulate the proposed OIMKB framework. We present the experimental results and evaluate various parameters of OIMKB in Section 4, and finally, conclude this work in Section 5.
2. Preliminaries: SVM and MKL

In this section, we briefly review SVM and MKL, which are widely used for supporting HSI classification.

2.1. SVM

SVM, introduced by Vapnik [26], is one of the most successful kernel methods. The standard SVM uses a hypothesis space of linear functions in a high-dimensional feature space by means of kernel theory. Since it performs well with a small training data set, it has been an appropriate candidate for remote sensing data classification during the last decade [58,59]. Nowadays, SVM has been regarded as a promising method for hyperspectral remote sensing data processing and image classification. SVM is a discriminative classifier based on a single kernel.
Given a sample of independent and identically distributed training instances $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1, +1\}$ is its class label, SVM finds the linear discriminant with the maximum margin in the feature space induced by the mapping function $\Phi(\cdot)$. The discriminant function is defined as follows:

$f(x) = \langle w, \Phi(x) \rangle + b$   (1)

whose parameters can be learned by solving the following quadratic optimization problem:

$\min \ \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i$
$\text{w.r.t.} \ w \in \mathbb{R}^S,\ \xi \in \mathbb{R}_+^N,\ b \in \mathbb{R}$
$\text{s.t.} \ y_i(\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i \quad \forall i$   (2)

where $w$ is the vector of weight coefficients, $S$ is the dimensionality of the feature space obtained by $\Phi(\cdot)$, $C$ is a predefined positive trade-off parameter between model simplicity and classification error, $\xi$ is the vector of slack variables, and $b$ is the bias term of the separating hyperplane. Instead of solving this optimization problem directly, the Lagrangian dual function enables us to obtain the following dual formulation:

$\max \ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
$\text{w.r.t.} \ \alpha \in [0, C]^N$
$\text{s.t.} \ \sum_{i=1}^{N} \alpha_i y_i = 0$

where $\alpha$ is the vector of dual variables corresponding to each separation constraint and the kernel matrix obtained from $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ is positive semidefinite. Thus, setting $w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$, the discriminant function can be written as follows:

$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b$   (3)

Generally, a cross-validation procedure is applied to choose the most appropriate kernel function $k(\cdot,\cdot)$ and its parameters (e.g., the polynomial degree $q$ or the Gaussian width $\sigma$) among a set of kernel functions on a separate validation set, which is different from the training set.

2.2. MKL

Instead of having a single kernel $k$, MKL has a set of $n$ base kernels $k_1, \ldots, k_n$, with the corresponding feature maps $\Phi_1, \ldots, \Phi_n$. After explicitly modeling the weights $(\mu_1, \ldots, \mu_n)^T$ of the given kernels through a variational argument, an MKL formulation was developed in [60]:

$\min_{\mu, w, b, \xi} \ \tfrac{1}{2} \bigl( \sum_{k=1}^{n} \mu_k \|w_k\| \bigr)^2 + C \sum_{i=1}^{l} \xi_i$
$\text{s.t.} \ y_i \bigl( \sum_{k=1}^{n} \mu_k w_k^T \Phi_k(x_i) + b \bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l$
$\phantom{\text{s.t.}} \ \sum_{k=1}^{n} \mu_k = 1, \quad \mu_k \ge 0, \quad k = 1, \ldots, n$   (4)

Solving the MKL problem as presented in Eq. (4) is more challenging than solving the standard SVM problem as presented in Eq. (1). Several techniques have been proposed to solve the MKL optimization problem, and a comparative evaluation of these methods can be found in the literature [61]. The weight matrix $w$ and the bias $b$ can be determined according to the learned $\alpha = (\alpha_1, \ldots, \alpha_l)^T$. Finally, the resultant decision function can be written as follows:

$f(x) = \operatorname{sgn} \bigl( \sum_{i=1}^{l} \alpha_i y_i \sum_{k=1}^{n} \mu_k K_k(x_i, x) + b \bigr) = \operatorname{sgn} \bigl( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \bigr)$   (5)

where the resultant $K$ is a convex combination of the base kernels $K_1, \ldots, K_n$:

$K = \sum_{k=1}^{n} \mu_k K_k$

In real applications, each base kernel $K_k$ may either use the full set of variables describing $x$, or subsets of variables stemming from different data sources. Alternatively, base kernels can simply be classical kernels (such as Gaussian kernels) with different parameters. Within this framework, the problem of data representation and fusion through the kernel is then transferred to the choice of the weights $\mu$.
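To make the kernel combination in Eqs. (4) and (5) concrete, the following minimal sketch (our illustration, not the authors' implementation) builds a few Gaussian base kernels, combines them with fixed convex weights μ, and trains a standard SVM on the resulting precomputed kernel matrix; the toy data, the bandwidths, and the weights are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
# Toy two-class data standing in for pixel spectra (assumption).
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Base kernels K_1..K_n: Gaussian kernels with different bandwidths.
gammas = [2.0 ** p for p in (-2, 0, 2)]
base_kernels = [rbf_kernel(X, X, gamma=g) for g in gammas]

# Fixed convex combination weights mu (sum to 1), cf. K = sum_k mu_k K_k.
mu = np.array([0.2, 0.5, 0.3])
K = sum(m * Kk for m, Kk in zip(mu, base_kernels))

# A standard SVM trained on the combined (precomputed) kernel matrix.
clf = SVC(C=10.0, kernel="precomputed").fit(K, y)

# Decision values sgn(sum_i alpha_i y_i K(x_i, x) + b) for new samples, cf. Eq. (5).
X_new = rng.normal(size=(5, 10))
K_new = sum(m * rbf_kernel(X_new, X, gamma=g) for m, g in zip(mu, gammas))
print(np.sign(clf.decision_function(K_new)))
```

In a full MKL solver the weights μ would themselves be optimized (e.g., by SimpleMKL), whereas here they are fixed for illustration.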
3. Multiple kernel boosting framework

In this section, we address the problem of HSI classification in the following three steps. First, we present a novel criterion based on standard deviation, KL divergence, and correlation coefficient for feature selection. Second, we optimize the SVM classifier design by searching for the most appropriate values of its parameters. Finally, we use the AdaBoost algorithm [62] and propose an ensemble learning framework, which applies the boosting technique to learn multiple kernel classifiers for solving the classification problem.

3.1. Feature selection optimum index

HSI data can be represented as an $N \times M$ matrix $X$, where $N$ is the number of pixels in a single band and $M$ is the number of bands. Suppose that we choose $k$ bands from $X$. The feature selection task is then to select $k$ columns that represent the matrix $X$ efficiently. The OIF is an unsupervised method which can select the most informative features. The OIF value is determined with respect to the variance and the correlation among the different bands, and it benefits the selection of a suitable three-band combination. A large value of OIF indicates the optimum combination of bands, namely the one with the largest amount of standard deviation and the least amount of correlation among band pairs. The OIF is determined by the following formula:

$\mathrm{OIF} = \max \left( \frac{\sum_{i=1}^{3} \sigma(i)}{\sum_{j=1}^{3} |r(j)|} \right)$   (6)

where $\sigma(i)$ is the standard deviation of the $i$-th band, and $r(j)$ is the correlation coefficient between any two bands of the combination. Inspired by the OIF for feature selection, in this paper, we propose a novel feature selection optimum index, named OI, which incorporates the KL divergence. The KL divergence is a popular distribution separability measure applied in many research domains. We recall the KL entropy between two discrete probability distributions. Generally, the KL divergence is defined as follows:

$D_{KL}(P \parallel Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$   (7)
where $p_i$ and $q_i$ denote the probability densities of $P$ and $Q$ at feature $i$. $D_{KL}$ has the following properties: (1) $D_{KL}(p \parallel q) \ge 0$; (2) $D_{KL}(p \parallel q) = 0$ if $p = q$. The KL divergence is not a true distance metric, since $D_{KL}$ is non-symmetric and does not satisfy the triangle inequality. In this paper, we apply the following symmetric KL divergence [63] to evaluate features:
$D_{SKL}(P \parallel Q) = D_{KL}(P \parallel Q) + D_{KL}(Q \parallel P)$   (8)
The symmetric KL divergence, also called the KL distance, is nonnegative. This quantity has been used for feature selection in classification problems. To increase the amount of information in the selected features, we propose to employ standard deviation, KL distance, and correlation coefficient to define the feature selection optimum index as follows:
$\mathrm{OI} = \max \left( \frac{\sum_{i=1}^{3} \bigl( \alpha \cdot \sigma(i) + (1-\alpha) \cdot D_{SKL}(p \parallel q) \bigr)}{\sum_{j=1}^{3} |r(j)|} \right)$   (9)

where $\sum_{i=1}^{3} D_{SKL}(p \parallel q)$ is the sum of the pairwise KL distances in a three-band combination, and $\alpha$ is a factor which weighs between standard deviation and KL distance. The larger $\alpha$ is, the more important the standard deviation is. When $\alpha = 0.5$, standard deviation and KL distance are assumed to be equally important. The OI can be employed to select the band combination with the largest amount of standard deviation and KL distance (maximum information) and the least amount of correlation among band pairs for classification. To calculate the OI, we create a map list that contains the multispectral bands, calculate a correlation matrix for the maps, calculate the standard deviation and KL distance of each three-band combination, and rank the OI values. Finally, the combination with the largest OI is selected for band composition.
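As an illustration of how Eq. (9) might be evaluated, the sketch below scores every three-band combination of a toy data cube by combining per-band standard deviation, pairwise symmetric KL distance (estimated from band histograms), and pairwise correlation; the histogram binning, the value of α, and the random data are assumptions, not part of the authors' implementation.

```python
import numpy as np
from itertools import combinations

def sym_kl(p, q, bins=64, eps=1e-12):
    """Symmetric KL divergence (Eq. (8)) between two bands, via histograms."""
    lo, hi = min(p.min(), q.min()), max(p.max(), q.max())
    P, _ = np.histogram(p, bins=bins, range=(lo, hi))
    Q, _ = np.histogram(q, bins=bins, range=(lo, hi))
    P = P / P.sum() + eps
    Q = Q / Q.sum() + eps
    return np.sum(P * np.log2(P / Q)) + np.sum(Q * np.log2(Q / P))

def oi_score(X, bands, alpha=0.25):
    """OI of one three-band combination (Eq. (9)); X is (n_pixels, n_bands)."""
    sub = X[:, list(bands)]
    sigma = sub.std(axis=0)                       # per-band standard deviation
    pairs = list(combinations(range(3), 2))
    dskl = sum(sym_kl(sub[:, i], sub[:, j]) for i, j in pairs)
    corr = sum(abs(np.corrcoef(sub[:, i], sub[:, j])[0, 1]) for i, j in pairs)
    return (alpha * sigma.sum() + (1.0 - alpha) * dskl) / max(corr, 1e-12)

# Toy cube: 1000 pixels x 20 bands (assumption); rank all three-band combinations.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20)) * np.linspace(0.5, 2.0, 20)  # unequal band variances
ranked = sorted(((oi_score(X, c), c) for c in combinations(range(20), 3)), reverse=True)
print("highest-OI three-band combination:", ranked[0][1])
```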
3.2. PSO

PSO is a biologically inspired technique derived from the collective behavior of a flock of birds. By following the current optimal particle, all particles search the solution space until an optimal solution is found. Suppose that the search space is $N$-dimensional and the number of particles is $n$; the $i$-th particle of the swarm is represented by the $N$-dimensional vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$. The best previous position of the $i$-th particle is recorded and represented as $p_i = (p_{i1}, p_{i2}, \ldots, p_{iN})$, which gives the best fitness value, called pbest. The particle with the lowest function value is denoted as gbest or $P_g$. The position change (velocity) of the $i$-th particle is $V_i = (v_{i1}, v_{i2}, \ldots, v_{iN})$. The particles are manipulated according to the following equations (the superscripts denote the iteration number):

$v_{id}^{k+1} = w \times v_{id}^{k} + c_1 \times \mathrm{rand}_1() \times (p_{id}^{k} - x_{id}^{k}) + c_2 \times \mathrm{rand}_2() \times (p_{gd}^{k} - x_{id}^{k})$   (10)

$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}$   (11)

where $1 \le i \le n$, $1 \le d \le N$, $w$ is the inertia weight, $c_1$ and $c_2$ are two positive constants, called the cognitive and social parameters respectively, and $\mathrm{rand}_1()$ and $\mathrm{rand}_2()$ are two random numbers uniformly distributed within the range [0, 1]. Some variants of PSO impose a maximum allowed velocity $V_{max}$ to prevent the swarm from exploding (i.e., if $v_{id}^{k+1} > V_{max}$, then $v_{id}^{k+1} = V_{max}$) [64]. Eq. (10) is used to calculate the $i$-th particle's velocity at each iteration: $w \times v_{id}^{k}$ is the previous speed of the $i$-th particle, $c_1 \times \mathrm{rand}_1()(p_{id}^{k} - x_{id}^{k})$ is the distance between the $i$-th particle and its personal best position, and $c_2 \times \mathrm{rand}_2()(p_{gd}^{k} - x_{id}^{k})$ is the distance between the $i$-th particle and the global best position. The parameters $c_1$, $c_2$, $\mathrm{rand}_1()$ and $\mathrm{rand}_2()$ provide randomness that makes the technique less predictable yet more flexible [65]. Eq. (11) provides the new position of the $i$-th particle by adding its new velocity to its current position. The inertia weight $w$ is employed to control the impact of the previous history of velocities on the current velocity. In this way, the parameter $w$ regulates the trade-off between the global and local exploration abilities of the swarm and influences the PSO convergence behavior. A small inertia weight facilitates local exploration, while a large one tends to facilitate global exploration. Parameter selection of the kernel function is a critical factor for SVM. PSO is used to search for the punishment factor $C$ and the parameter $\gamma$ of the kernel function (such as the radial basis function). In this paper, we use the grid search method to determine a limited range of the parameters in order to reduce the search time. Furthermore, in order to avoid PSO being trapped in a local optimum, a mutation mechanism, which can increase the randomness of individuals [66], is adopted in the PSO model. Specifically, at each particle update, the mutation is triggered when the fitness value of particle $X_i$ is equal to the global optimum, i.e., $P_i = P_g$.
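Under stated assumptions, the sketch below illustrates such a PSO loop with a simple mutation step for tuning the RBF-SVM parameters (C, γ): the fitness is 3-fold cross-validated accuracy, and a particle whose personal best coincides with the global best may be re-initialized, in the spirit of the mutation mechanism described above. The swarm size, search bounds, and data set are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def fitness(pos):
    """Cross-validated accuracy of an RBF-SVM parameterized by (log2 C, log2 gamma)."""
    C, gamma = 2.0 ** pos[0], 2.0 ** pos[1]
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

rng = np.random.default_rng(0)
n_particles, n_iter, w, c1, c2, mut_rate = 20, 15, 0.8, 1.6, 1.6, 0.02
lo, hi = np.array([-5.0, -6.0]), np.array([14.0, 6.0])   # assumed search box

pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
g = int(pbest_fit.argmax())
gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq. (10)
    pos = np.clip(pos + vel, lo, hi)                                   # Eq. (11)
    for i in range(n_particles):
        # Mutation: occasionally re-randomize a particle stuck on the global best.
        if np.allclose(pbest[i], gbest) and rng.random() < mut_rate:
            pos[i] = rng.uniform(lo, hi)
        f = fitness(pos[i])
        if f > pbest_fit[i]:
            pbest[i], pbest_fit[i] = pos[i].copy(), f
            if f > gbest_fit:
                gbest, gbest_fit = pos[i].copy(), f

print("best C = %.3g, gamma = %.3g, CV accuracy = %.3f"
      % (2.0 ** gbest[0], 2.0 ** gbest[1], gbest_fit))
```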
3.3. Multiple kernel boosting framework

In this section we present a multiple kernel boosting framework based on SVM, whose kernel parameters are optimized through PSO. Following the procedure of the popular and successful boosting algorithm, i.e., Adaptive Boosting (AdaBoost) [48], we formulate OIMKB by applying the boosting technique to learn a classifier using multiple kernels. In particular, our algorithm maintains a probability distribution $D_t$ over the training examples. At each boosting trial $t$ ($t = 1, \ldots, T$), where $T$ denotes the total number of boosting trials, we learn kernel classifiers with multiple kernels $f_t^j(x)$ iteratively. The misclassification rate $\epsilon_t^j$ of each kernel classifier on the training examples is computed and used to adjust the probability distribution on the training examples:

$\epsilon_t^j = \epsilon\bigl(f_t^j(x)\bigr) = \sum_{i=1}^{N} D_t(i)\,\bigl(f_t^j(x_i) \ne y_i\bigr)$   (12)

We learn the SVM classifiers $f_t^j(x)$, whose kernel parameters have been optimized by PSO with the mutation mechanism, from these training data. For the $t$-th boosting trial, we build the classifier $f_t(x)$ by choosing the best classifier with the smallest error rate, i.e.,

$f_t(x) = \arg\min_{f_t^j(x),\ j \in \{1, \ldots, M\}} \epsilon\bigl(f_t^j(x)\bigr)$   (13)

The misclassification rate $\epsilon_t$ of the combined classifier $f_t(x)$ over the distribution $D_t$ on the training data is computed as shown in Eq. (14):

$\epsilon_t = \sum_{i=1}^{N} D_t(i)\,\bigl(f_t(x_i) \ne y_i\bigr)$   (14)

The next step of each boosting trial is to update the weight $D_{t+1}(i)$ of each training example, following a procedure similar to AdaBoost:

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & \text{if } f_t(x_i) = y_i, \\ 1 & \text{otherwise} \end{cases}$   (15)
where $\beta_t = \epsilon_t/(1 - \epsilon_t)$ and $Z_t$ is a normalization factor that makes $D_{t+1}$ a distribution. After all $T$ boosting trials are finished, the final classifier $f(x)$ is constructed by a weighted vote of the individual classifiers as follows:
$f(x) = \operatorname{sign} \bigl( \sum_{t=1}^{T} \alpha_t f_t(x) \bigr)$   (16)
The OIMKB approach is shown in Algorithm 1.

Algorithm 1. OIMKB.
Input:
  training data: $D = (x_1, y_1), \ldots, (x_N, y_N)$;
  kernel functions: $\kappa_j(\cdot,\cdot): X \times X \to \mathbb{R}$, $j = 1, \ldots, M$;
  initial weight distribution $D_1(i) = 1/N$ for all $i$;
  integer $T$ specifying the number of iterations;
  initial PSO parameters: inertia weight $\omega$, constants $c_1, c_2$, mutation rate $\rho$;
Output: the final hypothesis $f(x) = \operatorname{sign}\bigl(\sum_{t=1}^{T} \alpha_t f_t(x)\bigr)$
1: for $t = 1, \ldots, T$ do
2:   select a subset of $S_N$ bands according to Eq. (9)
3:   sample $S_n = n$ examples with distribution $D_t$
4:   for $j = 1, \ldots, M$ do
5:     optimize the SVM parameters by PSO
6:     train a weak classifier with kernel $\kappa_j$
7:     get back a hypothesis $f_t^j = \mathrm{SVM}(D, D_t)$
8:     calculate the training error of $f_t^j$ over $D_t$:
9:       $\epsilon_t^j = \sum_{i=1}^{N} D_t(i)\,(f_t^j(x_i) \ne y_i)$
10:  end for
11:  choose the best classifier with the smallest error rate:
       $f_t(x) = \arg\min_{f_t^j(x),\ j \in \{1,\ldots,M\}} \epsilon(f_t^j(x))$
12:  compute the training error over $D_t$:
       $\epsilon_t = \sum_{i=1}^{N} D_t(i)\,(f_t(x_i) \ne y_i)$
13:  choose the weight of $f_t$: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
14:  update the distribution: $D_{t+1}(i) \leftarrow \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i f_t(x_i))$,
       where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i f_t(x_i))$ is a normalization constant (chosen so that $D_{t+1}$ is a distribution)
15: end for
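To make the control flow of Algorithm 1 concrete, the following stripped-down sketch implements a multiple kernel boosting loop in the same spirit: at each trial it resamples the training set according to $D_t$, trains one SVM per candidate kernel, keeps the one with the lowest weighted error, and updates $D_t$ as in AdaBoost. The band selection step and the PSO parameter search are omitted for brevity, and the data set, kernel list, and sampling ratio are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
y = np.where(y == 1, 1, -1)
rng = np.random.default_rng(0)

# Candidate base kernels (assumed): a few RBF bandwidths and polynomial degrees.
kernels = [dict(kernel="rbf", gamma=g) for g in (0.01, 0.1, 1.0)] + \
          [dict(kernel="poly", degree=d) for d in (1, 2, 3)]

N, T, sample_ratio = len(y), 10, 0.3
D = np.full(N, 1.0 / N)                  # weight distribution D_t over examples
learners, alphas = [], []

for t in range(T):
    idx = rng.choice(N, size=int(sample_ratio * N), replace=True, p=D)
    best_clf, best_err = None, np.inf
    for params in kernels:               # one weak SVM per candidate kernel
        clf = SVC(C=10.0, **params).fit(X[idx], y[idx])
        err = np.sum(D * (clf.predict(X) != y))        # weighted error, cf. Eq. (12)
        if err < best_err:
            best_clf, best_err = clf, err              # smallest error, cf. Eq. (13)
    if best_err <= 0.0 or best_err >= 0.5:             # skip degenerate trials
        continue
    alpha = 0.5 * np.log((1.0 - best_err) / best_err)  # classifier weight (step 13)
    D *= np.exp(-alpha * y * best_clf.predict(X))      # AdaBoost-style update (step 14)
    D /= D.sum()
    learners.append(best_clf)
    alphas.append(alpha)

# Final hypothesis: weighted vote of the selected kernel classifiers, cf. Eq. (16).
F = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
print("training accuracy:", np.mean(np.sign(F) == y))
```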
An illustrative example of how to apply Algorithm 1 is given in the following. Example. Given three classes of the Indian Pines dataset (see Section 4), Corn-notill (C1), Corn-mintill (C2), and Corn (C3), 6 kernel functions, initialized PSO parameters, and the number of iterations T = 10, we select a subset of $S_N$ = 9 bands from all 200 bands according to Formula (9) in the first of the 10 iterations. We sample $S_n$ = 500 (0.2 × 2502) examples as training data. In the first of the 6 iterations of weak classifier construction, we sample n = 100 examples with the distribution $D_1$ and select kernels $\kappa_1$, $\kappa_2$, $\kappa_3$ as the SVM kernel functions. After optimizing the punishment factor C = 673 and the parameter γ = 0.8 of the SVM, Algorithm 1 trains 3 weak classifiers $\{f_1^1, f_1^2, f_1^3\}$. We measure the training errors $\{\epsilon_1^1 = 1/3, \epsilon_1^2 = 1/3, \epsilon_1^3 = 1/3\}$ over $D_1$ and repeat the loop of weak classifier construction. We then choose the best classifier $f_1$ of all 6 iterations with the smallest error rate (suppose $f_1^1$ is picked as the first base learner), compute the training error $\epsilon_1$ over $D_1$ according to Formula (14), and determine the weight $\alpha_1 = 0.5 \ln 2 \approx 0.35$ of $f_1$. Finally, we update the distribution $D_2$ according to Formula (15) and enter the next cycle. The final hypothesis f(x) is obtained according to Formula (16) after completing all 10 iterations.
4. Experimental results

In this section, we carry out several experiments to verify the performance of our method and compare it with previous kernel-based methods for HSI classification. We then evaluate various settings of the parameters of OIMKB.

4.1. Data set description

Two hyperspectral images have been used in our experiments. The first was acquired by the AVIRIS sensor over the Indian Pines region, Northwestern Indiana, USA, in 1992, and is widely used to verify the performance of classification algorithms. The Indian Pines data set comprises 220 bands with a spatial size of 145 × 145 pixels in the wavelength range from 0.4 to 2.5 μm. After removing the noisy bands, 200 bands remain. The ground truth has 10,062 labeled pixels covering 16 land cover classes. The data set is publicly available at https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html. Fig. 1(a) shows a
Fig. 1. Classification maps of the AVIRIS Indian Pine dataset: (a) false color composite (bands 17, 27, and 50 for RGB), and (b) ground truth.
false color composite (bands 17, 27, and 50 for RGB) and (b) the ground truth. The second image is an urban scene acquired by the ROSIS sensor during a flight over the University of Pavia, northern Italy. The image size is 610 × 340 pixels, with a very high spatial resolution of 1.3 m per pixel. It consists of 115 bands in the range from 0.43 to 0.86 μm. 103 bands remain after removing 12 noisy ones. The ground truth data consist of 9 classes of interest and 42,776 labeled pixels. The ground truth map is shown in Fig. 2(a).

4.2. Experimental setup

For evaluation, we compare the experimental results of our algorithm with those of several previous competitive ones, including SVM-based single kernel (SVM for short), OIF feature selection with SVM (OIF for short), SimpleMKL (SMKL for short), and AdaBoost with SVM (AdaBoost for short). For the SVM classifier, we employ the polynomial and Gaussian radial basis function kernels. We performed a 10-fold cross-validation procedure using a single SVM to find the optimal SVM parameters σ ∈ {10⁻², …, 10²}, C ∈ {10¹, …, 10⁴}. SimpleMKL is one of the algorithms used to solve the MKL problem. To implement the SimpleMKL algorithm, we adopt the SimpleMKL toolbox [30] and the default settings suggested by this toolbox. AdaBoostSVM is an algorithm applying AdaBoost to improve the SVM learning accuracy [44]. For AdaBoostSVM, 10-fold cross-validation is adopted to select the best kernel; the other settings are the same as those in our OIMKB algorithm. In all cases, the one-versus-one multiclass scheme implemented in LibSVM [67] was used. For OIMKB, we follow the typical approach used in previous MKL and AdaBoost studies. In particular, 16 base kernels are used initially in the ensemble, including 13 Gaussian kernels with different bandwidth parameters from {2⁻⁶, 2⁻⁵, …, 2⁶} and 3 polynomial kernels with degrees 1, 2, and 3 respectively. Before classification, preprocessing is performed on these data: data sources should be scaled to the range [−1, 1], which eases the tuning of the SVM kernel parameters [67]. We set the total number of boosting trials T to 100, the boosting sampling ratio to 0.3, and the classifier sampling ratio ρ to 0.3. For SVM, we adopt the popular LIBSVM toolbox [67] as the solver, but the parameters are optimized by PSO with the mutation mechanism. The configuration for PSO is set as follows: the swarm size is fixed to 20, the maximum iteration number to 50, the inertia weight w = 0.8, c1 = c2 = 1.6, and the mutation probability ρ = 0.02. In the feature selection index OI (Eq. (9)), we set α = 0.25. We implement all experiments in a MATLAB environment on a computer with a 2.9 GHz Intel CPU and 16 GB RAM. Three evaluation metrics, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, are widely used to measure the statistical significance of hyperspectral image classification results [68]. OA (Eq. (17)) is the sum of the pixels correctly classified divided by the total number of samples, AA (Eq. (18)) is the average of the individual class producers' accuracies, and the Kappa coefficient (Eq. (19)) is the percentage of agreement [69,70]:
$\mathrm{OA} = \frac{\sum_{i=1}^{r} x_{ii}}{N} \times 100$   (17)

$\mathrm{AA} = \frac{1}{r} \sum_{i=1}^{r} \frac{x_{ii}}{x_{i+}} \times 100$   (18)

Fig. 2. Classification maps of the PaviaU dataset: (a) ground truth, (b) SimpleMKL, (c) SVM, (d) OIF, (e) AdaBoostSVM, and (f) our OIMKB.
Table 1. The number of training and testing samples of the two data sets.

Indian Pines        Training  Test   | Pavia University  Training  Test
Corn-notill         267       1161   | Asphalt           636       5995
Corn-mintill        171       659    | Meadows           1829      16,820
Corn                43        194    | Gravel            212       1887
Grass/Pasture       84        399    | Trees             356       2708
Grass/Trees         155       575    | Metal sheets      143       1202
Hay-windrowed       113       365    | Bare soil         492       4537
Soybeans-notill     181       791    | Bitumen           142       1188
Soybeans-mintill    485       1970   | Bricks            369       3313
Soybeans-clean      131       462    | Shadows           97        850
Wheat               37        168    |
Woods               266       999    |
Bldg–Grass–Tree     79        307    |
$\mathrm{Kappa} = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} x_{i+} \times x_{+i}}{N^2 - \sum_{i=1}^{r} x_{i+} \times x_{+i}}$   (19)
where $r$ is the number of rows of the confusion matrix, $N = \sum_{i=1}^{r} \sum_{j=1}^{r} x_{ij}$ is the total number of observations, $x_{ii}$ is entry $(i, i)$ of the confusion matrix, and $x_{i+}$ and $x_{+i}$ are the marginal totals of row $i$ and column $i$, respectively.
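As a quick reference, the sketch below computes OA, AA, and the Kappa coefficient of Eqs. (17)–(19) from a confusion matrix; the small 3-class matrix is made up purely for illustration, and rows are assumed to hold the reference classes.

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, AA, and Kappa (Eqs. (17)-(19)) from an r x r confusion matrix cm,
    with rows = reference (ground truth) classes and columns = predictions."""
    cm = np.asarray(cm, dtype=float)
    N = cm.sum()
    diag = np.diag(cm)
    row_tot = cm.sum(axis=1)                # x_{i+}
    col_tot = cm.sum(axis=0)                # x_{+i}
    oa = diag.sum() / N * 100.0
    aa = np.mean(diag / row_tot) * 100.0    # mean per-class accuracy
    chance = np.dot(row_tot, col_tot)
    kappa = (N * diag.sum() - chance) / (N ** 2 - chance)
    return oa, aa, kappa

# Toy 3-class confusion matrix (illustrative only).
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 60]]
print("OA = %.2f%%, AA = %.2f%%, Kappa = %.3f" % accuracy_metrics(cm))
```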
4.3. Comparisons

In the Indian Pines data set, after removing four classes with small sample sizes, only 12 classes are considered, including Corn-notill (C1), Corn-mintill (C2), Corn (C3), Grass/Pasture (C4), Grass/Trees (C5), Hay-windrowed (C6), Soybeans-notill (C7), Soybeans-mintill (C8), Soybeans-clean (C9), Wheat (C10), Woods (C11) and Bldg–Grass–Tree (C12). About 20% of the available labeled samples are randomly selected for training, and the remaining ones are used for testing in each run. In each run, we select the 50 three-band combinations with the highest OI. For the Pavia University data set, about 10% of the labeled samples are randomly selected for training, and the remaining ones are used for testing in each run; in each run, we select the 22 three-band combinations with the highest OI. Table 1 summarizes the number of training and testing samples for Indian Pines and Pavia University. We repeat each algorithm 10 times on every data set and report the average to avoid unstable results. Table 2 (for the Indian Pines data set) and Table 3 (for the Pavia University data set) show the OAs, AAs, individual classification accuracies (in percent), and the Kappa statistic obtained with the different kernel-based classification methods. The processing times are also shown in the two tables. The highest scores for each class are highlighted in boldface font. From Tables 2 and 3, the proposed method performs better than previous works in terms of overall accuracy and Kappa statistic. Furthermore, it is noticeable that the other types of kernels defined in the generalized approach also produce competitive results. From Table 2, it can be seen that our method achieves the highest OA, above 88%. In Table 3, our method has the highest OA, 95.8%. The tables also show that our method presents higher performance, especially for classes with a small number of training samples such as Corn and Wheat. As observed, OIF with SVM surpasses SVM slightly, and our method performs best and clearly
Table 2. Comparison of classification results using five algorithms for Indian Pines.

Class  Class name          SVM           OIF           SMKL          AdaBoost      OIMKB
C1     Corn-notill         79.31±3.2     80.27±1.7     76.53±0.7     82.56±1.1     85.46±1.4
C2     Corn-mintill        73.59±3.1     75.67±3.1     70.80±1.1     79.62±2.9     82.32±0.8
C3     Corn                65.05±2.5     64.96±2.5     58.46±2.1     75.01±2.6     77.85±2.4
C4     Grass/Pasture       91.44±3.2     91.84±3.7     90.08±0.4     92.64±2.6     94.89±0.9
C5     Grass/Trees         96.42±2.6     96.72±3.2     92.74±2.1     96.85±0.1     96.60±1.0
C6     Hay-windrowed       96.74±1.4     96.74±1.9     96.55±0.1     97.55±2.8     99.16±1.9
C7     Soybeans-notill     77.15±2.7     79.54±2.7     72.13±0.4     82.13±1.7     81.80±0.4
C8     Soybeans-mintill    83.87±2.1     83.67±3.1     81.57±4.3     86.66±2.1     87.02±2.5
C9     Soybeans-clean      81.23±3.8     83.43±3.2     75.86±2.4     85.36±1.3     85.07±3.1
C10    Wheat               93.97±3.8     94.07±4.8     92.25±3.7     94.51±1.6     97.35±1.1
C11    Woods               94.03±1.8     94.13±1.2     93.91±4.1     94.54±0.2     95.37±1.2
C12    Bldg–Grass–Tree     64.76±3.1     67.03±2.4     64.73±2.3     71.14±1.8     73.38±1.1
       Kappa               0.8086        0.8109        0.7746        0.8385        0.8579
       OA                  82.14         82.92         80.08         85.22         88.02
       AA                  80.97         81.71         78.64         84.27         87.68
       Time (s)            428           403           760           686           773
Table 3. Comparison of classification results using five algorithms for Pavia University.

Class  Class name      SVM           OIF           SMKL          AdaBoost      OIMKB
C1     Asphalt         89.31±2.2     92.34±1.2     86.69±1.1     94.82±1.4     95.66±1.6
C2     Meadows         96.59±3.7     96.97±2.1     95.97±1.4     97.69±3.2     97.83±1.8
C3     Gravel          82.32±2.0     81.46±2.2     79.98±2.3     84.93±2.9     86.74±2.8
C4     Trees           93.44±2.2     93.82±2.7     92.98±1.8     95.20±3.6     97.89±1.4
C5     Metal sheets    96.23±2.1     98.86±1.1     97.49±2.0     99.53±2.2     99.41±1.9
C6     Bare soil       88.94±1.1     91.74±1.2     88.68±1.1     92.44±2.2     93.22±1.7
C7     Bitumen         80.15±1.7     84.54±3.7     77.69±1.8     88.24±1.4     89.87±2.4
C8     Bricks          85.67±2.7     87.67±4.1     85.84±3.3     89.40±3.1     90.22±2.9
C9     Shadows         99.23±3.1     99.43±3.4     99.13±2.4     99.91±2.3     99.86±3.5
       Kappa           0.9051        0.9189        0.8865        0.9337        0.9403
       OA              92.32         93.16         90.23         94.93         95.81
       AA              91.44         92.32         88.96         93.72         94.97
       Time (s)        268           294           176           661           702
Fig. 3. Relationship between scales selection and overall accuracy (overall accuracy (%) versus the number of training samples (%) on the Pavia University data set for OIMKB, SVM, OIF, SimpleMKL, and AdaBoostSVM).
outperforms the SimpleMKL method. The values of the Kappa coefficient show that OIMKB yields a clear improvement over AdaBoostSVM. Classification maps obtained for the University of Pavia scene using a fixed training configuration are shown in Fig. 2. From Fig. 2, we can easily see that our proposed method, OIMKB (f), achieves a higher accuracy than SMKL (b), SVM (c), OIF (d), and AdaBoostSVM (e), where (a) is the ground truth. From the "Meadows" region in the lower part of the image, it is clear that OIMKB obtains better performance and is closer to the actual spatial distribution of the material than all the other methods.

4.4. Evaluation of scales selection

The effect of the training sample size on classification performance is analyzed here. We fix the band number to 66. Evaluation experiments on scales selection are performed with different numbers of training samples (5–50% of all labeled samples in each class) in all cases. Fig. 3 shows the evolution of the overall accuracy as a function of the percentage of training samples used for the five evaluated approaches in the Pavia University scene. It can be seen from Fig. 3 that our method is superior to SVM, OIF with SVM, and SimpleMKL, and is close to AdaBoostSVM for the Pavia University data set. The performance of SVM can be improved by OIF feature selection. The OA curves indicate that SimpleMKL obtains lower performance than all the other methods in each experiment. This result is similar to what Tuia et al. reported in [71]. Fig. 3 illustrates that our method has a better discriminative ability when dealing with a smaller number of labeled samples, for example, a training sample size of 5% per class. As the number of training samples increases, the classification performance of each kernel generally increases, because more of the complexity of the data structure is captured. However, training fractions of more than 10% do not have much impact on the OA. In particular, the advantage of our method is very obvious for training sample sizes of 10% and 20%, with gains of 1.7% and 1.4% in OA over AdaBoostSVM, respectively. However, this advantage becomes less obvious when the training sample size is 15% or 30%. SVM and OIF with SVM also have the closest OA at 5% and 30% of training samples.

4.5. Evaluation of the numbers of selected bands

Fig. 4. Kappa coefficients varying with the number of bands (error bars on the Indian Pines data set for OIMKB, AdaBoostSVM, OIF, and SimpleMKL as the number of bands varies from 15 to 200).

This experiment examines the impact of the varying number of bands on the Kappa coefficients of the OIMKB, AdaBoostSVM, OIF with SVM, and SimpleMKL algorithms (SVM is absent because it uses all bands). Fig. 4 shows the error-bar results of the Kappa coefficient versus the number of bands selected with the feature selection index OI, varying the number from 15 to 200 for the different methods. First of all, we observe that, in terms of classification accuracy, our method and AdaBoostSVM are more accurate than SVM with OIF and SimpleMKL, particularly when the number of bands is relatively small, although the degree of superiority over SVM and SimpleMKL varies. Increasing the band number improves the performance of our method consistently. The improvement in classification accuracy usually becomes very small when the band number is large (e.g., 120). The main reason is that, when the number of bands is too large, the base kernel classifiers trained in the boosting process may suffer from band correlation and information redundancy. The results in Fig. 4 show that our method is relatively insensitive to the precise choice of band number, since the classification accuracy tends to saturate when the number of bands is large enough (e.g., larger than 120). Fig. 4 thus indicates that an appropriate number of bands is essentially a tradeoff between classification accuracy and efficiency.
5. Conclusion

In this paper, a multiple kernel ensemble learning approach (OIMKB) has been applied to hyperspectral remote sensing image classification, leveraging feature selection and PSO. Our approach introduces a novel feature selection criterion based on standard deviation, KL divergence, and correlation coefficient, with which we obtain more accurate results than the single kernel and multiple base kernel baselines. Furthermore, we optimized the SVM classifier by PSO with the mutation mechanism to search for the optimal values of its parameters. These techniques have been evaluated using two standard hyperspectral datasets recorded by different sensors. Experimental results show that, compared with state-of-the-art algorithms, our algorithm has a promising performance on HSI classification. Generally, our ensemble framework is faster than the mixture kernel, but slower than the single kernel. Further research will focus on reducing the computational cost and developing more efficient schemes.
References [1] C.I. Chang, Hyperspectral Data Exploitation: Theory and Applications, Wiley, Hoboken, NJ, USA, 2007. [2] L. Bruzzone, C. Persello, A novel context-sensitive semisupervised SVM classifier robust to mislabeled training samples, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2142–2154. [3] B.C. Kuo, C.H. Li, J.M. Yang, Kernel nonparametric weighted feature extraction for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 47 (2009) 1139–1155. [4] Z. Lv, et al., Managing big city information based on WebVRGIS, IEEE Access 4 (2016) 407–415. [5] J. Yang, J. Zhou, Z. Lv, et al., A real-time monitoring system of industry carbon monoxide based on wireless sensor networks, Sensors 15 (2015) 29535–29546. [6] G.F. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Trans. Inf. Theory IT-14 (1968) 55–63. [7] C.J.C. Burges, Dimension reduction: a guided tour, Found. Trends Mach. Learn. 2 (2010) 275–365. [8] S. Patra, P. Modi, L. Bruzzone, Hyperspectral band selection based on rough set, IEEE Trans. Geosci. Remote Sens. 53 (2015) 5495–5503. [9] H. Yang, Q. Du, H. Su, Y. Sheng, An efficient method for supervised hyperspectral band selection, IEEE Geosci. Remote Sens. Lett. 8 (2011) 138–142. [10] A. Martinez-Uso, F. Pla, J.M. Sotoca, P. Garcia-Sevilla, Clustering-based hyperspectral band selection using information measures, IEEE Trans. Geosci. Remote Sens. 45 (2007) 4158–4171. [11] T. Guo, L. Han, L. He, X. Yang, A GA-based feature selection and parameter optimization for linear support higher-order tensor machine, Neurocomputing 144 (2014) 408–416. [12] L. Shen, Z. Zhu, S. Jia, J. Zhu, Y. Sun, Discriminative Gabor feature selection for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 10 (2013) 29–33. [13] A. Das, S. Ghosh, A. Ghosh, Band elimination of hyperspectral imagery using partitioned band image correlation and capacitory discrimination, Int. J. Remote Sens. 35 (2014) 554–577. [14] C. Wang, M. Gong, M. Zhang, Y. Chan, Unsupervised hyperspectral image band selection via column subset selection, IEEE Geosci. Remote Sens. Lett. 12 (2015) 1411–1415. [15] Z. Cai, R. Goebel, M. Salavatipour, G. Lin, Selecting genes with dissimilar discrimination strength for class prediction, BMC Bioinform. 8 (2007) 206. [16] J. Feng, L.C. Jiao, X. Zhang, T. Sun, Hyperspectral band selection based on trivariate mutual information and clonal selection, IEEE Trans. Geosci. Remote Sens. 52 (2014) 4092–4105. [17] X. Geng, K. Sun, L. Ji, Y. Zhao, A fast volume-gradient-based band selection method for hyperspectral image, IEEE Trans. Geosci. Remote Sens. 52 (2014) 7111–7119. [18] P. Chavez, G. Berlin, L. Sowers, Statistical method for selecting landsat MSS ratios, J. Appl. Photogr. Eng. 1 (1982) 23–30. [19] N. Patel, B. Kaushal, Classification of features selected through optimum index factor (OIF) for improving classification accuracy, J. For. Res. 22 (2011) 99–105. [20] C.M. Bachmann, T.L. Ainsworth, R.A. Fusina, Exploiting manifold geometry in hyperspectral imagery, IEEE Trans. Geosci. Remote Sens. 43 (2005) 441–454. [21] B. Wu, L. Zhang, Y. Zhao, Feature selection via Cramer's V-test discretization for remotesensing image classification, IEEE Trans. Geosci. Remote Sens. 52 (2014) 2593–2606. [22] Y. Sun, R. Bie, J. Zhang, Measuring semantic-based structural similarity in multi-relational networks, Int. J. Data Warehous. Min. 12 (2016) 20–33. [23] X. Qin, H. Zou, S. Zhou, K. 
Ji, Region-based classification of SAR images using Kullback–Leibler distance between generalized gamma distributions, IEEE Geosci. Remote Sens. Lett. 12 (2015) 1655–1659. [24] J. Zeng, U. Kruger, J. Geluk, X. Wang, L. Xie, Detecting abnormal situations using the Kullback–Leibler divergence, Automatica 50 (2014) 2777–2786. [25] F. Ferracuti, A. Giantomassi, S. Iarlori, G. Ippoliti, S. Longhi, Electric motor defects diagnosis based on kernel density estimation and Kullback–Leibler divergence in quality control scenario, Eng. Appl. Artif. Intell. 44 (2015) 25–32. [26] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998. [27] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, W.J. Emery, SVM active learning approach for image classification using spatial information, IEEE Trans. Geosci. Remote Sens. 52 (2014) 2217–2233. [28] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, 2004. [29] L. Gao, J. Li, M. Khodadadzadeh, A. Plaza, B. Zhang, Z. He, H. Yan, Subspacebased support vector machines for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 12 (2015) 349–353. [30] A. Rakotomamonjy, F.R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (2008) 2491–2521. [31] S. Sonnenburg, G. Rätsch, C. Schäfer, B. Schölkopf, Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (2006) 1531–1565. [32] N. Subrahmanya, Y.C. Shin, Sparse multiple kernel learning for signal processing applications, IEEE Trans. Pattern Anal. Mach. Int. 32 (2010) 788–798. [33] T. Suzuki, R. Tomioka, SpicyMKL: a fast algorithm for multiple kernel learning with thousands of kernels, Mach. Learn. 85 (2011) 1–32. [34] C. Cortes, M. Mohri, A. Rostamizadeh, Two-stage learning kernel algorithms, in: E.H. Zarantonello, Author 2 (Eds.), Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010, pp. 239–246. [35] C. Cortes, M. Mohri, A. Rostamizadeh, Algorithms for learning kernels based
9
on centered alignment, J. Mach. Learn. Res. 13 (2012) 795–828. [36] T. Wang, D. Zhao, Y. Feng, Two-stage multiple kernel learning with multiclass kernel polarization, Knowl.-Based Syst. 48 (2013) 10–16. [37] A. Pastor López-Monroy, et al., Improving the BoVW via discriminative visual n-grams and MKL strategies, Neurocomputing 175 (Part A) (2016) 768–781. [38] H. Xia, C.H.H. Steven, MKBoost: a framework of multiple kernel boosting, IEEE Trans. Knowl. Data Eng. 25 (2013) 1574–1586. [39] T. Sun, L. Jiao, S. Wang, J. Feng, Selective multiple kernel learning for classification with ensemble strategy, Pattern Recognit. 46 (2013) 3081–3090. [40] Z. Cai, T. Zhang, X. Wan, A computational framework for influenza antigenic cartography, PLoS Comput. Biol. 6 (10) (2010) e1000949. [41] Y. Gu, H. Liu, Sample-screening MKL method via boosting strategy for hyperspectral image classification, Neurocomputing 173 (2016) 1630–1639, Part 3. [42] B. Ayerdi, I. Marqués, M. Graña, Spatially regularized semisupervised ensembles of extreme learning machines for hyperspectral image segmentation, Neurocomputing 149 (2015) 373–386, Part A. [43] Y. Zhang, H.L. Yang, S. Prasad, E. Pasolli, J. Jung, M. Crawford, Ensemble multiple kernel active learning for classification of multisource remote sensing data, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 8 (2015) 845–858. [44] X. Zhang, F. Ren, Improving SVM learning accuracy with Adaboost, in: Proceedings of International Conference on Natural Computation, 2008, pp. 221–225. [45] D. Pavlov, J. Mao, B. Dom, Scaling-up support vector machines using boosting algorithm, in: Proceedings of 15th International Conference on Pattern Recognition (ICPR), 2000, pp. 2219–2222. [46] X. Li, L. Wang, E. Sung, Adaboost with SVM-based component classifiers, Eng. Appl. Artif. Intell. 21 (2008) 785–795. [47] S.M. Valiollahzadeh, A. Sayadiyan, M. Nazari, Face Detection Using Adaboosted SVM-Based Component Classifier, CoRR, abs/0812.2575, 2008. [48] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139. [49] T. Sun, L. Jiao, J. Feng, F. Liu, X. Zhang, Imbalanced hyperspectral image classification based on maximum margin, IEEE Geosci. Remote Sens. Lett. 12 (2015) 522–526. [50] J. Kennedy, R. Eberhart, A new optimizer using particle swarm theory, in: Proceedings of IEEE 6th International Symposium on Micro Machine and Human Science, 1995, pp. 39–43. [51] F. Melgani, Y. Bazi, Classification of electrocardiogram signals with support vector machines and particle swarm optimization, IEEE Trans. Inf. Technol. Biomed. 12 (2008) 667–677. [52] S.T. Monteiro, Y. Kosugi, Particle swarms for feature extraction of hyperspectral data, IEEE Trans. Remote Sens. Geosci. 90 (2007) 1038–1046. [53] Y. Zhang, L. Wu, Crop classification by forward neural network with adaptive chaotic particle swarm optimization, Sensors 11 (2011) 4721–4743. [54] M.S. Couceiro, R.P. Rocha, N.M.F. Ferreira, J.A.T. Machado, Introducing the fractional-order Darwinian PSO, Signal Image Video Process. 6 (2012) 343–350. [55] P. Ghamisi, M.S. Couceiro, F.M. Martins, J.A. Benediktsson, Multilevel image segmentation approach for remote sensing images based on fractional-order Darwinian particle swarm optimization, IEEE Trans. Remote Sens. Geosci. 52 (2014) 2382–2394. [56] P. Ghamisi, M. Couceiro, M. Fauvel, J.A. Benediktsson, Integration of segmentation techniques for classification of hyperspectral images, IEEE Geosci. Remote Sens. Lett. 
11 (2014) 342–346. [57] P. Ghamisi, M. Couceiro, J.A. Benediktsson, A novel feature selection approach based on FODPSO and SVM, IEEE Trans. Remote Sens. Geosci. 53 (2015) 2935–2947. [58] G. Camps-Valls, L. Bruzzone, Kernel-based methods for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 43 (2005) 1351–1362. [59] J. Li, H. Zhang, Y. Huang, L. Zhang, Hyperspectral image classification by nonlocal joint collaborative representation with a locally adaptive dictionary, IEEE Trans. Geosci. Remote Sens. 52 (2014) 3707–3719. [60] A. Zien, C.S. Ong, Multiclass multiple kernel learning, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, 2007, pp. 1191–1198. [61] M. Gönen, E. AlpayIn, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268. [62] B. Schölkopf, A.J. Smola, Learning With Kernels, MIT Press, Cambridge, MA, 2002. [63] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in: Proceedings of the IEEE, vol. 77, 1989. [64] J. Kennedy, R.C. Eberhart, Swarm Intelligence, Morgan Kaufmann Publishers, San Francisco, California, 2001. [65] J. Kennedy, The behavior of particles, Evolut. Program. 7 (1998) 581–587. [66] D. Beasley, D. Bull, R. Martin, An overview of genetic algorithms, Univ. Comput. 15 (1993) 58–69. [67] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27. [68] R.G. Congalton, R.G. Oderwald, R.A. Mead, Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques, Photogramm. Eng. Remote Sens. 49 (1983) 1671–1678. [69] G.M. Foody, Classification accuracy comparison: hypothesis tests and the use of confidence intervals in evaluations of difference, equivalence and non-inferiority, Remote Sens. Environ. 113 (2009) 1658–1663. [70] P. Ramzi, F. Samadzadegan, P. Reinartz, Classification of hyperspectral data using an AdaBoostSVM technique applied on band clusters, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 7 (2014) 2066–2079. [71] D. Tuia, G. Camps-Valls, G. Matasci, M. Kanevski, Learning relevant image features with multiple-kernel classification, IEEE Trans. Geosci. Remote Sens. 48 (2010) 3780–3791.
C. Qi et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ Chengming Qi is a faculty at the College of Automation, Beijing Union University, China. He received his Master of Engineering from the School of Information Engineering, China University of Geosciences (Beijing), 2004. He is currently pursuing the Ph.D. degree with the School of Information Engineering, China University of Geosciences (Beijing). His research interests include computational intelligence, machine learning. He has published around 40 research papers.
Lishuan Hu is a faculty at the College of Automation, Beijing Union University, China. He received his Master of Engineering from the School of Information Engineering, China University of Geosciences (Beijing), 2005. He is currently pursuing the Ph.D. degree with the School of Information Engineering, China University of Geosciences (Beijing). His research interests include computational intelligence, machine learning. He has published around 5 research papers.
Zhangbing Zhou is a professor at China University of Geosciences (Beijing), China, and an adjunct associate professor at TELECOM SudParis, France. His interests include services computing and business process management. He has published over 100 referred papers.
Qun Wang is currently a professor at the School of Information Engineering, China University of Geosciences (Beijing), China. He has published over 100 scientific papers. His research interests include high performance computing and visualization techniques and geoengineering, geospatial data mining, networking engineering and security.
Yunchuan Sun is currently an associate professor at Beijing Normal University, China, an IEEE senior member, and a CCF member. He received his Ph.D. from the Institute of Computing Technology, Chinese Academy of Sciences, China, in 2009. He has acted as the Secretary of the IEEE Communications Society Technical Subcommittee for IoT since 2013. He has been an associate editor of the Springer journal Personal and Ubiquitous Computing since 2012. His research interests include Big Data, Event-linked Network, Internet of Things, Semantic Technologies, and Information Security. He has published 60+ papers in international journals and conferences. As founder and program co-chair, he successfully organized the series of international events IIKI2012, IIKI2013, IIKI2014, and IIKI2015. He has organized several special issues in journals such as Knowledge-Based Systems, Personal and Ubiquitous Computing, Journal of Network and Computer Applications, International Journal of Electronic Commerce, and Electronic Commerce Research. He also holds or participates in several research projects from NSFC, the 863 Program of China, etc.
Houbing Song (M'12–SM'14) received the Ph.D. degree in electrical engineering from the University of Virginia, Charlottesville, VA, in August 2012. In August 2012, he joined the Department of Electrical and Computer Engineering, West Virginia University, Montgomery, WV, where he is currently an Assistant Professor and the founding director of the Security and Optimization for Networked Globe Laboratory (SONG Lab). His research interests lie in the areas of cyber-physical systems, internet of things, cloud computing, big data, connected vehicles, wireless communications and networking, and optical communications and networking. Dr. Song's research has been supported by the West Virginia Higher Education Policy Commission. Dr. Song is a senior member of IEEE and a member of ACM. Dr. Song is an associate editor for several international journals, including IEEE Access, KSII Transactions on Internet and Information Systems, and SpringerPlus, and a guest editor of several special issues. Dr. Song was the general chair of 4 international workshops, including the first IEEE International Workshop on Security and Privacy for Internet of Things and Cyber-Physical Systems (IOT/CPS-Security), held in London, UK, the first/second/third IEEE ICCC International Workshop on Internet of Things (IOT 2013/2014/2015), held in Xi'an/Shanghai/Shenzhen, China, and the first IEEE International Workshop on Big Data Analytics for Smart and Connected Health, to be held in Washington D.C., USA. Dr. Song also served as the technical program committee chair of the fourth IEEE International Workshop on Cloud Computing Systems, Networks, and Applications (CCSNA), held in San Diego, USA. Dr. Song has served on the technical program committee for numerous international conferences, including ICC, GLOBECOM, INFOCOM, WCNC, and so on. Dr. Song has published more than 80 academic papers in peer-reviewed international journals and conferences.