Pattern Recognition 60 (2016) 1041–1056
Learning group-based sparse and low-rank representation for hyperspectral image classification

Zhi He a,b,*, Lin Liu a,c, Suhong Zhou a, Yi Shen b

a School of Geography and Planning, Guangdong Provincial Key Laboratory of Urbanization and Geo-simulation, Center of Integrated Geographic Information Analysis, Sun Yat-sen University (SYSU), Guangzhou 510275, China
b Department of Control Science and Engineering, Harbin Institute of Technology (HIT), Harbin 150001, China
c Department of Geography, University of Cincinnati (UC), Cincinnati 45221, USA

* Corresponding author at: School of Geography and Planning, Guangdong Provincial Key Laboratory of Urbanization and Geo-simulation, Center of Integrated Geographic Information Analysis, Sun Yat-sen University (SYSU), Guangzhou 510275, China. Tel.: +86 20 84113057; fax: +86 20 84112593. E-mail address: [email protected] (Z. He).

http://dx.doi.org/10.1016/j.patcog.2016.04.009
Article info

Article history: Received 16 June 2015; Received in revised form 22 February 2016; Accepted 13 April 2016; Available online 22 April 2016

Keywords: Classification; Hyperspectral image (HSI); Dictionary learning; Sparse representation; Low-rank representation

Abstract

Previous studies have demonstrated that structured sparse representation can yield significant improvements in spectral–spatial hyperspectral classification. However, the dictionary used by sparsity-aware methods, which contains all of the training samples, is ineffective in capturing class-discriminative information. This paper makes the first attempt to learn a group-based sparse and low-rank representation for improving the dictionary. First, super-pixel segmentation is applied to obtain homogeneous regions that act as spatial groups. A dictionary is then learned with group-based sparse and low-rank regularizations so that the pixels of each spatial group share a common representation matrix. These group-based sparse and low-rank regularizations facilitate identifying both the local and the global structure of the hyperspectral image (HSI). Finally, the representation matrices of the test samples are fed to a linear support vector machine (SVM) to determine the class labels. Experimental results on two benchmark HSIs show that the proposed method achieves better performance than state-of-the-art methods, even with small sample sizes.
1. Introduction

Hyperspectral imaging sensors have emerged as powerful tools in remote sensing for capturing the radiance of materials from hundreds of contiguous spectral bands and, therefore, provide rich information on various land covers. Owing to its ability to reflect both the spectral and the spatial distributions of distinct materials, the hyperspectral image (HSI) has attracted considerable interest over the last few decades [1–3]. One of the fundamental procedures in HSI analysis is classification [4–6], which aims at labeling each pixel with one of the classes. Various classification techniques have been developed in the literature to effectively identify the discriminative information hidden within HSI data sets. Among the available classifiers, the support vector machine (SVM) [7,8] is a state-of-the-art approach that has achieved great success in high-dimensional scenarios. The effectiveness of SVM largely depends on the choice of kernel function, among which the radial basis function (RBF) is the most widely used one. To refine the classification performance, several improved versions of SVM have been proposed, among which the SVM with composite kernel (SVMCK) [9] is dedicated to integrating the spectral–spatial information of HSI, while multiple kernel learning (MKL) [10–12] improves the flexibility of kernels in machine learning. However, SVM-based methods are sensitive to parameter settings: for instance, if the RBF kernel κ(x_i, x_j) = exp(−γ‖x_i − x_j‖²), γ ∈ ℝ+, is used in the SVM, the classification accuracy will vary considerably with the kernel parameter γ.

Motivated by the rapid development of compressed sensing, the sparse representation classifier (SRC) has delivered impressive results and opened up new avenues for the classification of HSI. Noting that hyperspectral pixels of the same class generally belong to the same low-dimensional subspace, an unknown test sample can be sparsely represented by a combination of a few training samples from the entire dictionary, while the corresponding sparse representation vector encodes the class information implicitly. The superiority of SRC for HSI classification has been demonstrated in [13–16]. More recently, low-rank representation (LRR) has been proposed by Liu et al. [17,18] to solve subspace clustering problems. Unlike the sparse representation, which finds the sparsest representation of
each datum individually, LRR seeks a representation of all the data under the lowest-rank constraint and provides a robust tool for capturing the correlation among data belonging to several subspaces (a sparse representation with a strict sparsity constraint seeks the sparsest representation of each datum individually and lacks global restrictions on the solutions; LRR is therefore more robust to noise, and the aliasing problem caused by the strict sparsity regularization can be mitigated by LRR [19]). In particular, the high spatial similarity of an HSI implies the low-rank characteristic of the data. As such, LRR can capture the global structure of the HSI, offering an effective way to divide the HSI into various classes. However, a dictionary constructed from all of the training samples in SRC or LRR-based methods is ineffective at detecting the crucial class-discriminative information for classification. Therefore, one needs to learn a proper dictionary that can represent the given samples well [20–24].

In this paper, we propose a structured dictionary-based model to learn a group-based sparse and low-rank (GSLR) representation for classification of HSI. Taking advantage of the inherent group-structured property of HSI, neighboring pixels are grouped together to achieve a common representation matrix. The motivation for using the low-rank regularization in this paper is to capture the global structure of the HSI. Although the low-rank regularization is capable of identifying the global structure of the data, it does not necessarily ensure the sparsity of the representations. In this regard, sparse and low-rank criteria are combined in GSLR to achieve sparsity both within and across groups. In addition, square windows [13,25] are commonly used in SRC and LRR-based methods to integrate spatial information; however, they produce artificial shapes that ignore the dissimilarity of neighboring pixels, especially around the edges of different classes. Noting that image segmentation [6,26–29] can partition the pixels into homogeneous regions, we adopt super-pixel segmentation [26] to obtain spatial groups in natural forms. Compared to the existing literature, the main contributions of this paper lie in the following two aspects: (1) a fast super-pixel segmentation method is applied to exploit a number of spatial groups in more natural forms than artificial square shapes; (2) a GSLR method is proposed to learn a structured dictionary for capturing both local and global structures of the HSI.

The layout of this paper is as follows. The related work is introduced in Section 2. Details of the proposed GSLR method are described in Section 3. Experimental results on two benchmark HSIs are reported in Section 4, and conclusions are drawn in Section 5.
2. Related work

The major purpose of this paper is to develop a novel structured dictionary learning model for hyperspectral classification. In this section, we provide an overview of dictionary learning, SRC and LRR, which are closely related to the research problem.

2.1. Dictionary learning

Since SRC or LRR-based methods represent a data sample as a linear combination of a few atoms from a given dictionary, the choice of dictionary is of particular importance for the success of those methods. In general, dictionaries come from two sources: (1) constructing a dictionary via mathematical model-based methods. Many traditional dictionaries [30] proposed by early works, such as Fourier, wavelet, and discrete cosine transform (DCT) based ones, belong to this category. Those methods are usually realized by analytic formulations and fast implementations; however, such dictionaries are fixed in advance and thus cannot represent the high-dimensional HSI adaptively. (2) Learning a dictionary to have optimal performance on the training samples. A recent trend is to learn compact and discriminative dictionaries for the classification problem. General methods include the method of optimal directions (MOD) [31], K-Singular Value Decomposition (K-SVD) [32] and online dictionary learning [33]. Specifically, some low-rank-based dictionary learning methods can be found in [22–24], where a discriminative low-rank dictionary [22] is learned for SRC-based face recognition, a structured LRR [23] is learned for image classification, and a bilinear low-rank coding method [24] is proposed for image recovery. As for HSI, a spectral–spatial dictionary is specifically designed for patch-based SRC in [20], whereas spatial-aware dictionary learning (SADL) is proposed in [21] by dividing the pixels of the HSI into several square patches that act as contextual groups and representing each atom in the HSI as a linear combination of a few learned basic elements. This type of dictionary is flexible enough to adapt to specific data and has demonstrated promising results in recent years. The proposed GSLR also belongs to the second type of dictionary, which can be learned from the training samples of the input HSI data set.

2.2. Sparse representation

Although SRC has been successfully applied to HSI classification, its sparse coefficients suffer from instability or non-uniqueness due to the coherency [34] of the dictionary, i.e., the maximum absolute value of the cosine of the angle between any two columns of the dictionary. To alleviate this problem, much effort has been made to exploit more robust representations by incorporating structured priors into SRC. Typical methods include the joint sparsity model (JSM) [35], the Laplacian regularized Lasso [36] and the collaborative group/hierarchical Lasso [25]. The JSM assumes that the sparse vectors of all neighboring pixels share a common sparsity support; hence, the representation matrix of JSM has a row-sparsity pattern. The JSM solved by the simultaneous version of orthogonal matching pursuit (SOMP) [13] has attracted extensive attention recently due to its ability to make pixels in a small neighborhood share a common sparsity support. It is notable that SOMP classifies an unknown test sample by making full use of the complete patch in its required neighborhood. The Laplacian regularized Lasso is more flexible than the JSM since it introduces a weighting matrix to characterize the similarity among neighboring pixels. The collaborative group Lasso assumes that the representation matrix has a group-wise sparsity pattern, whereas the collaborative hierarchical Lasso further enforces sparsity within each group.

2.3. Low-rank representation

LRR is dedicated to finding the lowest-rank representation of the data with respect to a given dictionary. It can effectively recover data drawn from multiple subspaces and capture the global structure of the data. The representation matrix of LRR can be solved by the inexact augmented Lagrange multiplier (IALM) method [18,37,38]. Since its initial introduction by Liu et al. [17,18], LRR has provided successful results in various applications, including subspace clustering [18], object detection [39] and image classification [40]. In particular, LRR has also attracted much attention in the hyperspectral community over the past few years. For instance, spatial information of neighboring pixels is integrated into LRR by adding a group prior [25]. A generalized LRR is proposed to solve the abundance estimation problem in [19]. The maximum a posteriori (MAP) model is built upon LRR for hyperspectral segmentation in [41].
Low-rank regularization is also added into an HSI denoising model to deal with the global redundancy and correlation (RAC) in the spectral domain [42].

2.4. Summary

As stated previously, integrating spatial information (i.e. structured priors, contextual information, etc.) provides significant performance gains in HSI classification and has drawn extensive attention in recent years. In particular, neighboring pixels grouped by square windows [13,21,25] are commonly used in SRC and LRR-based methods. For instance, spatial correlation between neighboring pixels is incorporated in the JSM [13,25], while square patches are taken as contextual groups in SADL [21]. However, square patches ignore the dissimilarity of neighboring pixels. To provide much more flexible contextual groups than the artificial square shapes, super-pixel segmentation [26] is employed to integrate the spatial information of neighboring pixels by partitioning them into homogeneous regions. Typical segmentation methods include watershed (WS) [27], mean-shift (MS) [28], simple linear iterative clustering (SLIC) [29] and entropy rate super-pixel segmentation (ERS) [26]. However, WS and MS may generate small local regions with irregular shapes, which provide unfavorable boundary adherence. In this paper, we adopt ERS [26] due to its promising performance in both efficiency and effectiveness [43]; other state-of-the-art methods such as SLIC [29] can also be used in place of ERS. Moreover, the dictionaries of SRC and LRR-based methods are conventionally constructed from all of the training samples, which is ineffective in capturing the important class-discriminative information. A good dictionary is a key factor in attaining high classification accuracy; therefore, learning proper dictionaries instead of directly taking the training samples as the dictionary is crucial for obtaining significant results in HSI classification. As mentioned earlier, the high spatial similarity of an HSI leads to the low-rank property of the data. In this regard, a low-rank criterion is adopted in the proposed GSLR method. On the other hand, the low-rank constraint cannot ensure the sparsity of the representations. As such, we learn the dictionary with group-based sparse and low-rank regularizations to achieve a tradeoff between the low-rank and sparsity constraints.
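To make the spatial-group generation described above concrete, the following sketch projects an HSI cube onto its first principal component and over-segments it into super-pixels. It is only an illustration: the paper uses ERS, for which no off-the-shelf Python routine is assumed here, so SLIC from scikit-image stands in, and the function and parameter names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.segmentation import slic


def spatial_groups(hsi_cube, n_segments=361):
    """Build spatial groups from the first principal component of an HSI cube.

    hsi_cube: array of shape (rows, cols, bands).
    Returns a list of 1-D index arrays; each array holds the flattened pixel
    indices of one super-pixel (one spatial group G_i).
    """
    rows, cols, bands = hsi_cube.shape
    pixels = hsi_cube.reshape(-1, bands).astype(float)
    pc1 = PCA(n_components=1).fit_transform(pixels).reshape(rows, cols)
    # Rescale PC1 to [0, 1] so that SLIC's compactness behaves predictably.
    pc1 = (pc1 - pc1.min()) / (pc1.max() - pc1.min() + 1e-12)
    # SLIC stands in for ERS; channel_axis=None marks a gray-scale image
    # (use multichannel=False on older scikit-image versions).
    labels = slic(pc1, n_segments=n_segments, compactness=0.1, channel_axis=None)
    return [np.flatnonzero(labels.ravel() == k) for k in np.unique(labels)]
```

Each returned index array plays the role of one spatial group G_i in the formulation of Section 3.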
3. The proposed GSLR method

3.1. Problem statement

To integrate the spatial context of neighboring pixels, a fast super-pixel segmentation method [26] is applied to the first principal component (PC1) of the HSI to partition the pixels into a number of homogeneous regions. Since pixels within a small neighborhood usually contain similar land covers, pixels in the same homogeneous region are assumed to belong to the same spatial group. Subsequently, let $\mathcal{X} \subset \mathbb{R}^b$ denote the set of pixels of the HSI, with $b$ the number of spectral bands, and let $\{G_1, G_2, \ldots, G_n\}$ represent the set of spatial groups obtained from the super-pixel segmentation of the HSI. The objective function of GSLR can be formulated as follows:

$$\min_{D,Z}\ \frac{1}{2}\|X - DZ\|_F^2 + \lambda_1 \sum_{i=1}^{n} w_{G_i}\|Z_{G_i}\|_* + \lambda_2 \sum_{i=1}^{n} w_{G_i}\|Z_{G_i}\|_1 \quad \text{s.t.}\ Z \ge 0,\ \|d_i\|_2 \le 1,\ d_i \ge 0\ \ \forall i \tag{1}$$

where $X = [x_1, x_2, \ldots, x_N]$ is a matrix consisting of the members of $\mathcal{X}$ ($x_i \in \mathcal{X}$, $i = 1, 2, \ldots, N$), $Z$ and $D$ denote the representation matrix with group-based sparse and low-rank regularizations and the dictionary to be learned, respectively, $Z_{G_i}$ refers to the $i$th group of $Z$, $d_i$ is the $i$th column of $D$, $\lambda_1$ and $\lambda_2$ are parameters that balance the three terms in the objective function, $w_{G_i}$ represents the regularization parameter for the $i$th group, which can be set as $w_{G_i} = 10/\sqrt{|G_i|}$, $i = 1, 2, \ldots, n$, with $|G_i|$ the number of pixels in the $i$th spatial group $G_i$, $\|\cdot\|_F$ and $\|\cdot\|_1$ represent the Frobenius norm and the $l_1$-norm of a matrix, respectively, and $\|\cdot\|_*$ refers to the nuclear norm of a matrix, i.e., the sum of its singular values. Moreover, to adhere to the physical meaning of spectral reflectance, $Z$ and $D$ are constrained to be nonnegative in Eq. (1).

A common two-step strategy, which updates $Z$ and $D$ alternately, is adopted to solve the optimization problem in Eq. (1). As will be shown later, the first step computes the optimal $Z$ for a given dictionary $D$, while the dictionary $D$ is solved with fixed $Z$ in the second step.

3.2. Optimize Z by fixing D

In this step, the representation matrix $Z$ is solved with the current $D$. Accordingly, the optimization problem can be modeled by

$$\min_{Z}\ \frac{1}{2}\|X - DZ\|_F^2 + \lambda_1 \sum_{i=1}^{n} w_{G_i}\|Z_{G_i}\|_* + \lambda_2 \sum_{i=1}^{n} w_{G_i}\|Z_{G_i}\|_1 \quad \text{s.t.}\ Z \ge 0 \tag{2}$$

The objective function in Eq. (2) is separable over the spatial groups $\{G_1, G_2, \ldots, G_n\}$ and, therefore, $Z$ can be solved for each $Z_{G_i}$ independently. Without loss of generality, we simply write $Z_{G_i}$ as $Z_g$ with $g \in \{G_1, G_2, \ldots, G_n\}$. The subproblem then reads

$$\min_{Z_g}\ \frac{1}{2}\|X_g - DZ_g\|_F^2 + \beta_1\|Z_g\|_* + \beta_2\|Z_g\|_1 \quad \text{s.t.}\ Z_g \ge 0 \tag{3}$$

where $X_g$ collects the pixels of group $g$ from $X$. The objective of Eq. (3) contains three parts, namely the reconstruction error, the low-rank regularization and the sparsity of $Z_g$, and $\beta_1 = \lambda_1 w_{G_i}$ and $\beta_2 = \lambda_2 w_{G_i}$ reflect the tradeoff among those three parts. The problem in Eq. (3) is convex and can be efficiently solved by IALM [18,37]. Introducing two auxiliary matrices $E$ and $W$, the problem in Eq. (3) can be reformulated as

$$\min_{Z_g, W, E}\ \frac{1}{2}\|E\|_F^2 + \beta_1\|Z_g\|_* + \beta_2\|W\|_1 \quad \text{s.t.}\ E = X_g - DZ_g,\ W = Z_g,\ W \ge 0 \tag{4}$$

Hence, the augmented Lagrangian function can be expressed as

$$\mathcal{L}(Z_g, W, E, Y_1, Y_2, \mu) = \frac{1}{2}\|E\|_F^2 + \beta_1\|Z_g\|_* + \beta_2\|W\|_1 + \mathrm{tr}\big[Y_1^T(X_g - DZ_g - E)\big] + \mathrm{tr}\big[Y_2^T(Z_g - W)\big] + \frac{\mu}{2}\big(\|X_g - DZ_g - E\|_F^2 + \|Z_g - W\|_F^2\big)$$
$$= \frac{1}{2}\|E\|_F^2 + \beta_1\|Z_g\|_* + \beta_2\|W\|_1 + Q(Z_g, W, E, Y_1, Y_2, \mu) - \frac{1}{2\mu}\big(\|Y_1\|_F^2 + \|Y_2\|_F^2\big) \tag{5}$$

where $\mu > 0$ denotes the penalty parameter, $Y_1$ and $Y_2$ are the matrices of Lagrangian multipliers, $\mathrm{tr}(\cdot)$ refers to the trace of a matrix, and

$$Q(Z_g, W, E, Y_1, Y_2, \mu) = \frac{\mu}{2}\left(\Big\|X_g - DZ_g - E + \frac{Y_1}{\mu}\Big\|_F^2 + \Big\|Z_g - W + \frac{Y_2}{\mu}\Big\|_F^2\right).$$

The term $Q$ in Eq. (5) is quadratic and can be approximated by its first-order expansion plus a proximal term [44]. Moreover, Eq. (5) is unconstrained and can be minimized by updating the variables $Z_g$, $W$ and $E$ one at a time.
Fig. 1. Outline of the proposed HSI classification method.
Fig. 2. Indian Pines image. (a) Three-band false color composite and (b) reference data with 16 classes.
At this point, the $(k+1)$th $Z_g$, $W$ and $E$ can be evaluated respectively as

$$Z_g^{k+1} = \arg\min_{Z_g} \mathcal{L}(Z_g, W^k, E^k, Y_1^k, Y_2^k, \mu^k) = \arg\min_{Z_g} \beta_1\|Z_g\|_* + Q(Z_g, W^k, E^k, Y_1^k, Y_2^k, \mu^k)$$
$$= \arg\min_{Z_g} \beta_1\|Z_g\|_* + \frac{\mu^k\eta_1}{2}\|Z_g - Z_g^k\|_F^2 + \big\langle \nabla_{Z_g} Q(Z_g^k, W^k, E^k, Y_1^k, Y_2^k, \mu^k),\ Z_g - Z_g^k \big\rangle$$
$$= \arg\min_{Z_g} \frac{\beta_1}{\mu^k\eta_1}\|Z_g\|_* + \frac{1}{2}\Big\|Z_g - Z_g^k - \frac{1}{\eta_1}\Big[D^T\Big(X_g - DZ_g^k - E^k + \frac{Y_1^k}{\mu^k}\Big) - \Big(Z_g^k - W^k + \frac{Y_2^k}{\mu^k}\Big)\Big]\Big\|_F^2$$
$$= \mathcal{D}_{\frac{\beta_1}{\mu^k\eta_1}}\Big(Z_g^k + \frac{1}{\eta_1}\Big[D^T\Big(X_g - DZ_g^k - E^k + \frac{Y_1^k}{\mu^k}\Big) - \Big(Z_g^k - W^k + \frac{Y_2^k}{\mu^k}\Big)\Big]\Big) \tag{6}$$

$$W^{k+1} = \arg\min_{W \ge 0} \mathcal{L}(Z_g^{k+1}, W, E^k, Y_1^k, Y_2^k, \mu^k) = \arg\min_{W \ge 0} \beta_2\|W\|_1 + \frac{\mu^k}{2}\Big\|Z_g^{k+1} - W + \frac{Y_2^k}{\mu^k}\Big\|_F^2 = \max\Big[\mathcal{S}_{\frac{\beta_2}{\mu^k}}\Big(Z_g^{k+1} + \frac{Y_2^k}{\mu^k}\Big),\ 0\Big] \tag{7}$$

$$E^{k+1} = \arg\min_{E} \mathcal{L}(Z_g^{k+1}, W^{k+1}, E, Y_1^k, Y_2^k, \mu^k) = \arg\min_{E} \frac{1}{2}\|E\|_F^2 + \frac{\mu^k}{2}\Big\|E - \Big(X_g - DZ_g^{k+1} + \frac{Y_1^k}{\mu^k}\Big)\Big\|_F^2 = \frac{\mu^k}{1+\mu^k}\Big(X_g - DZ_g^{k+1} + \frac{Y_1^k}{\mu^k}\Big) \tag{8}$$

where $\nabla_{Z_g} Q$ in Eq. (6) denotes the partial derivative of $Q$ with respect to $Z_g$, $\eta_1 = \|D\|_2^2$, and $\mathcal{D}_{\frac{\beta_1}{\mu^k\eta_1}}$ refers to the singular value thresholding operator. That is, let

$$C_1 = Z_g^k + \frac{1}{\eta_1}\Big[D^T\Big(X_g - DZ_g^k - E^k + \frac{Y_1^k}{\mu^k}\Big) - \Big(Z_g^k - W^k + \frac{Y_2^k}{\mu^k}\Big)\Big],$$

let the rank of the matrix $C_1$ be $r$, $(\cdot)_+ = \max(\cdot, 0)$, and let the singular value decomposition of $C_1$ be

$$C_1 = U\Sigma V^T,\qquad \Sigma = \mathrm{diag}\big(\{\pi_i\}_{1 \le i \le r}\big) \tag{9}$$
Table 1 Number of samples (NoS) and available training samples (NoATS) used in the experiments.
and then $Z_g^{k+1}$ satisfies

$$Z_g^{k+1} = \mathcal{D}_{\frac{\beta_1}{\mu^k\eta_1}}(C_1) = U\,\mathcal{D}_{\frac{\beta_1}{\mu^k\eta_1}}(\Sigma)\,V^T = U\,\mathrm{diag}\Big(\Big\{\Big(\pi_i - \frac{\beta_1}{\mu^k\eta_1}\Big)_+\Big\}_{1 \le i \le r}\Big)V^T \tag{10}$$

Moreover, $\mathcal{S}_{\frac{\beta_2}{\mu^k}}$ in Eq. (7) is the shrinkage operator defined as

$$\mathcal{S}_{\frac{\beta_2}{\mu^k}}(\cdot) = \mathrm{sign}(\cdot)\,\max\Big(0,\ |\cdot| - \frac{\beta_2}{\mu^k}\Big) \tag{11}$$

According to the above analysis, we can summarize the procedure for solving Eq. (4) in Algorithm 1, which is considered converged when the relative error between two successive $Z_g$ is less than the tolerance (e.g. $10^{-3}$).

Algorithm 1. Solving problem Eq. (4) by IALM.
Require: The matrix $X_g$, the dictionary $D$, and the parameters $\beta_1$, $\beta_2$, $\eta_1$ and $\mu^0$
Ensure: $Z_g$, $W$ and $E$
1: Initialize: $Z_g^0 = W^0 = E^0 = Y_1^0 = Y_2^0 = 0$, $\rho = 1.1$, $\mu_{\max} = 10^6$ and $k = 0$
2: while convergence is not attained do
3:   Fix $W$, $E$ and update $Z_g$ according to Eq. (6)
4:   Fix $Z_g$, $E$ and update $W$ according to Eq. (7)
5:   Fix $Z_g$, $W$ and update $E$ according to Eq. (8)
6:   Update the Lagrange multipliers $Y_1$, $Y_2$ by $Y_1^{k+1} = Y_1^k + \mu^k(X_g - DZ_g^k - E^k)$ and $Y_2^{k+1} = Y_2^k + \mu^k(Z_g^k - W^k)$
7:   Update the parameter $\mu$ by $\mu^{k+1} = \min(\mu_{\max}, \rho\mu^k)$
8:   Update $k$ by $k = k + 1$
9: end while
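As a concrete rendering of Algorithm 1, the sketch below implements the singular value thresholding and shrinkage operators of Eqs. (10)–(11) and the update loop of Eqs. (6)–(8) in NumPy. It follows the reconstruction above, but the default value of μ^0, the iteration cap and the function names are assumptions rather than the authors' settings.

```python
import numpy as np


def svt(C, tau):
    """Singular value thresholding operator D_tau of Eq. (10)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt


def shrink(C, tau):
    """Soft-thresholding (shrinkage) operator S_tau of Eq. (11)."""
    return np.sign(C) * np.maximum(np.abs(C) - tau, 0.0)


def solve_group(X_g, D, beta1, beta2, mu=1e-2, rho=1.1, mu_max=1e6,
                tol=1e-3, max_iter=500):
    """IALM sketch for the per-group subproblem of Eq. (4) (Algorithm 1)."""
    Z = np.zeros((D.shape[1], X_g.shape[1]))
    W = np.zeros_like(Z)
    E = np.zeros_like(X_g)
    Y1 = np.zeros_like(X_g)
    Y2 = np.zeros_like(Z)
    eta1 = np.linalg.norm(D, 2) ** 2          # eta_1 = ||D||_2^2
    for _ in range(max_iter):
        Z_old = Z
        # Eq. (6): linearised proximal step followed by singular value thresholding.
        C1 = Z + (D.T @ (X_g - D @ Z - E + Y1 / mu) - (Z - W + Y2 / mu)) / eta1
        Z = svt(C1, beta1 / (mu * eta1))
        # Eq. (7): shrinkage plus projection onto the nonnegative orthant.
        W = np.maximum(shrink(Z + Y2 / mu, beta2 / mu), 0.0)
        # Eq. (8): closed-form update of the residual variable E.
        E = (mu / (1.0 + mu)) * (X_g - D @ Z + Y1 / mu)
        # Steps 6-7 of Algorithm 1: multiplier and penalty updates.
        Y1 = Y1 + mu * (X_g - D @ Z - E)
        Y2 = Y2 + mu * (Z - W)
        mu = min(mu_max, rho * mu)
        # Stop when two successive Z_g differ by less than the tolerance.
        if np.linalg.norm(Z - Z_old) <= tol * max(np.linalg.norm(Z_old), 1.0):
            break
    return Z, W, E
```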
Fig. 3. University of Pavia image. (a) Three-band false color composite, (b) reference data with 9 classes, and (c) available training samples.
Fig. 4. The effect of η (SVMCK), C (SVM, SVMCK, SADL1, SADL2, GSLR1, GSLR2) and p (SOMP) on overall accuracy (%). Left column: Indian Pines data, right column: University of Pavia data.
Fig. 5. The effect of λ1 and λ2 on overall accuracy (%) in (a) Indian Pines data and (b) University of Pavia data.

Fig. 6. The effect of patch sizes on the super-pixel-based methods (overall accuracy versus the number of super-pixels for SADL2 and GSLR2 on the Indian Pines and University of Pavia data).
Remark. Although the convergence of IALM has been well studied for the case in which the number of blocks (i.e. the unknown matrix variables) is at most two, it is still not easy to prove its convergence with three or more blocks in theory. As stated in [18], two sufficient conditions have been presented to guarantee the convergence of IALM with three blocks. In practice, the IALM has exhibited significant performance [38]; moreover, it also performs well with four blocks [19]. Therefore, we can expect that Algorithm 1 with three blocks has good convergence properties.

3.3. Optimize D by fixing Z

In the dictionary update step, $Z$ is fixed and the dictionary $D$ can be optimized by solving

$$\min_{D}\ \frac{1}{2}\|X - DZ\|_F^2 \quad \text{s.t.}\ \|d_i\|_2 \le 1,\ d_i \ge 0\ \ \forall i \tag{12}$$

which is quadratic with respect to $D$. The constraints of Eq. (12) project dictionary atoms $d_i$ with larger than unit norm onto the nonnegative orthant of the unit $l_2$ ball. Setting the gradient of the objective function to zero gives

$$D = XZ^T(ZZ^T)^{-1} \tag{13}$$

Since Eq. (13) may suffer from computing the inverse of $ZZ^T$, a block coordinate descent (BCD) strategy [45], which updates the dictionary atoms $d_i$ of $D$ iteratively, is adopted to solve Eq. (12). Let $R_i = X - \sum_{j \ne i} d_j (Z_T^j)^T$, where $Z_T^j$ is the $j$th row of $Z$ in column form; the objective function for the $i$th atom then becomes $\frac{1}{2}\|R_i - d_i (Z_T^i)^T\|_F^2$. Maintaining only the terms in $d_i$ and rearranging the objective function, Eq. (12) leads to

$$\min_{d_i}\ \frac{1}{2}\|Z_T^i\|_2^2\,\Big\|d_i - \frac{R_i Z_T^i}{\|Z_T^i\|_2^2}\Big\|_2^2 \quad \text{s.t.}\ \|d_i\|_2 \le 1,\ d_i \ge 0 \tag{14}$$

whose solution is

$$d_i = \mathrm{proj}_{l_2^+}\Big(\frac{R_i Z_T^i}{\|Z_T^i\|_2^2}\Big) \tag{15}$$

where $\mathrm{proj}_{l_2^+}(\cdot)$ is the projection onto the nonnegative orthant of the unit $l_2$ ball [21]. The dictionary learning process is outlined in Algorithm 2, where the representation matrix $Z$ and the dictionary $D$ are updated alternately. The iteration of Algorithm 2 stops when the relative error between two successive values of the objective function in Eq. (1) is less than the tolerance (e.g. $10^{-3}$). Moreover, the schematic procedure of HSI classification is summarized in Fig. 1, which is composed of three main steps: (1) spatial-group generation by super-pixel segmentation; (2) dictionary and representation matrix determination by Algorithm 2; and (3) classification by a linear SVM.

Algorithm 2. The proposed dictionary learning method.
Require: The matrix $X$ and the parameters $\lambda_1$, $\lambda_2$ and $w_{G_i}$
Ensure: $Z$ and $D$
1: Initialize: $D^0$ and $k = 0$
2: while convergence is not attained do
3:   Solve $Z^{k+1}$, $W^{k+1}$, $E^{k+1}$ with respect to $D^k$ by Algorithm 1
4:   Update $D^{k+1}$ with fixed $Z^{k+1}$, $W^{k+1}$, $E^{k+1}$ by solving Eq. (12)
5:   Update $k$ by $k = k + 1$
6: end while
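Under the same assumptions as the earlier Algorithm 1 sketch (whose solve_group routine is reused here), the dictionary step of Eqs. (14)–(15) and the outer alternation of Algorithm 2 can be illustrated as follows. Group-dependent weights w_{G_i} and the stopping rule on the objective of Eq. (1) are simplified to fixed parameters and a fixed sweep count, so this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np


def project_atom(v):
    """Projection onto the nonnegative orthant of the unit l2 ball (Eq. (15))."""
    v = np.maximum(v, 0.0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 1.0 else v


def update_dictionary(X, Z, D):
    """One BCD sweep over the atoms d_i (Eqs. (14)-(15))."""
    for i in range(D.shape[1]):
        zi = Z[i, :]                                  # ith row of Z
        denom = zi @ zi
        if denom < 1e-12:                             # skip unused atoms
            continue
        Ri = X - D @ Z + np.outer(D[:, i], zi)        # residual without atom i
        D[:, i] = project_atom(Ri @ zi / denom)
    return D


def gslr_dictionary_learning(X, groups, D0, beta1, beta2, n_outer=20):
    """Alternate the Z step (solve_group, from the Algorithm 1 sketch) and the D step."""
    D = D0.copy()
    Z = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_outer):                          # fixed sweeps stand in for the
        for g in groups:                              # objective-based stopping rule
            Zg, _, _ = solve_group(X[:, g], D, beta1, beta2)
            Z[:, g] = Zg
        D = update_dictionary(X, Z, D)
    return D, Z
```

The learned test representations would then be fed, column by column, to a linear SVM as in step 3 of Fig. 1 (see the short classification sketch in Section 4.2).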
4. Experiments In this section, experiments are conducted on two benchmark data sets to validate the effectiveness of our proposed GSLR method by comparing with several state-of-the-art methods, including the SVM, SVMCK, SOMP and SADL.
Fig. 7. Segmentation results of the Indian Pines data (the first row) and University of Pavia data (the second row). The first two and last two columns are obtained by square patches and super-pixel segmentation, respectively. (a), (c), (e) and (g) are boundary maps, and the rest are randomly colored maps. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
Fig. 8. Schematic illustration about the effects of partition methods on classification results.
4.1. Data sets Two real-world hyperspectral data sets involved in the experiments are briefly described as follows. 1. Indian Pines Data: the first data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the
agricultural Indian Pine test site in northwest Indiana, USA, on June 12, 1992. The original data set consists of 224 spectral bands ranging from 0.4 to 2.5 μm. After removing 4 bands full of zeros and 20 noisy bands affected by water-vapor absorption, the remaining 200 bands are used in the experiments. The whole scene contains 145 × 145 pixels with a spatial resolution of 20 m per pixel. There are 16 classes of land covers, and the numbers of
pixels per class are unbalanced, ranging from 20 to 2468, which makes the classification task challenging. Fig. 2 shows the three-band false color composite image along with the ground truth. The number of samples for each class is displayed in Table 1, whose background colors correspond to the different classes of land covers.
Fig. 9. Computation time (s) of different methods to obtain spatial groups.
2. University of Pavia Data: the second data set was collected by the Reflective Optics System Imaging Spectrometer (ROSIS-03) optical sensor over an urban area surrounding the University of Pavia, Italy, on July 8, 2002. The original data set was captured by 115 spectral channels ranging from 0.43 to 0.86 μm. After removing the 12 most noisy channels, 103 spectral bands remained for the experiments. The whole scene consists of 610 × 340 pixels with a spatial resolution of 1.3 m per pixel. There are 9 classes of land covers and each class contains more than 900 pixels; however, the available training samples of each class number fewer than 600. Table 1 lists the number of samples together with the available training samples for each class, while Fig. 3 depicts the false color composite image, the ground truth map and the available training samples.

Remark. It is not the normal case in remote sensing to work on unseen images for which no training samples are available. Different from traditional single-band images, an HSI is a three-dimensional data cube consisting of one spectral dimension and two spatial dimensions. Each spectral signature with hundreds of data points is one sample of the HSI. For instance, the Indian Pines data
Fig. 10. Clustering stability of different methods towards rotation, shearing and Gaussian noise. Left column: Indian Pines data, right column: University of Pavia data.
contains 10,366 samples (see Table 1), which can be randomly chosen as training and test samples. The size of each sample in the Indian Pines data is 200 × 1.

4.2. Experimental setup

To assess the effectiveness of our proposed GSLR method, we compare it with several state-of-the-art SVM and SRC-based methods, including the SVM, SVMCK, SOMP and SADL. As stated in [21], a linear SVM is used in SADL. For consistency, we also adopt the linear SVM in SVM, SVMCK, SADL and GSLR. It is worth stressing that if an RBF-based SVM is used in those methods, better classification performance can be obtained, but the proposed GSLR still yields higher classification accuracy than the other methods. On the other hand, two powerful spectral–spatial techniques (i.e. SOMP and SADL) are considered among the SRC-based methods. To demonstrate the advantage of super-pixel segmentation, spatial groups obtained from both square patches and super-pixel segmentation are considered in SADL and GSLR. For simplicity, the square patches-based ones are abbreviated as
SADL1 and GSLR1, while SADL2 and GSLR2 refer to the super-pixel segmentation-based ones. Three experiments are designed in this section: (1) detailed comparisons are performed on both data sets with fixed ratios of labeled samples; (2) additional comparisons are also conducted to evaluate the validity of GSLR with small number of labeled samples; (3) comparisons are performed on the Indian Pines data under other special situations, i.e. balanced training sets and noisy scenario. In experiment 1, around 5% of the Indian Pines data (see 3rd column of Table 1) are randomly selected for training and the remaining 95% are taken as testing samples. As to University of Pavia data, only 5% of the available training samples (see 7th column of Table 1) are randomly chosen for training. In experiment 2, 0.5%, 1%, 2%, 3%, 4% and 5% labeled samples of each class from the Indian Pines data and 0.5%, 1%, 2%, 3%, 4% and 5% of the available training samples from the University of Pavia data are successively selected as training samples, whereas the rest are taken as test samples. In experiment 3, two special cases are considered: (1) balanced training sets; (2) noisy scenario. In the first scenario, 9 classes (i.e. class 2, 3, 5, 6, 8, 10, 11, 12, 14) of land covers with the
Fig. 11. Representation matrices corresponding to 50 samples inside two 5 × 5 patches (see blue dotted boxes in (a)) of the Indian Pines data for (b) SOMP, (c) SADL1, (d) SADL2, (e) GSLR1, and (f) GSLR2. The x-axis labels the sample number and the y-axis labels the representation number. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
Fig. 12. Classification maps of the Indian Pines data obtained by (a) SVM, (b) SVMCK, (c) SOMP, (d) SADL1, (e) SADL2, (f) GSLR1 and (g) GSLR2 in experiment 1.
highest numbers of pixels are involved, while the remaining 7 classes, which have small numbers of pixels and may cause serious imbalance, are ignored. 50 samples from each class are randomly selected as training samples and the remaining ones are used for testing. In the second scenario, we adopt all of the original 224 spectral bands (i.e. the 200 spectral bands used in the former two experiments, together with the 24 removed zero and noisy bands) and select the same training and testing samples as in experiment 1.

All of the above-mentioned methods are compared numerically (overall accuracy (OA) and average accuracy (AA)) and statistically (kappa coefficient (κ)). The overall accuracy (OA) denotes the number of correctly classified samples divided by the total number of test samples, while the average accuracy (AA) denotes the average of the individual class accuracies. Both OA and AA involve only commission errors; the kappa coefficient (κ) involves not only commission errors but also omission errors. The definition of κ can be expressed as

$$\kappa = \frac{P_0 - P_e}{1 - P_e} \tag{16}$$

where $P_0 = \sum_{i}^{c} P_{ii} = \sum_{i}^{c} \frac{M_{ii}}{\tilde{N}}$ refers to the "observed" agreement, $c$ indicates the number of classes, $M_{ii}$ is the $i$th diagonal element of the confusion matrix, $\tilde{N}$ is the total number of test samples, $P_e = \sum_{i}^{c} (P_i^{\tilde{C}} P_i^{\tilde{R}})$ is the "expected" agreement, $P_i^{\tilde{R}} = \frac{\tilde{R}_i}{\tilde{N}}$, $P_i^{\tilde{C}} = \frac{\tilde{C}_i}{\tilde{N}}$, and $\tilde{R}_i$ and $\tilde{C}_i$ represent the sums of the $i$th row and the $i$th column of the confusion matrix, respectively. Substituting $P_0$ and $P_e$ into (16) yields

$$\kappa = \frac{\sum_{i}^{c} P_{ii} - \sum_{i}^{c} (P_i^{\tilde{R}} P_i^{\tilde{C}})}{1 - \sum_{i}^{c} (P_i^{\tilde{R}} P_i^{\tilde{C}})} = \frac{\tilde{N}\sum_{i}^{c} M_{ii} - \sum_{i}^{c} \tilde{R}_i \tilde{C}_i}{\tilde{N}^2 - \sum_{i}^{c} \tilde{R}_i \tilde{C}_i} \tag{17}$$

The experimental results reported in this section are averaged over ten runs to overcome individual variations.
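For reference, OA, AA and the κ of Eqs. (16)–(17) can be computed directly from a confusion matrix; the small sketch below assumes rows index the reference classes and columns the predicted classes.

```python
import numpy as np


def accuracy_measures(confusion):
    """OA, AA and kappa (Eqs. (16)-(17)) from a c x c confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    n_total = confusion.sum()                                   # total number of test samples
    oa = np.trace(confusion) / n_total                          # overall accuracy = P_0
    aa = np.mean(np.diag(confusion) / confusion.sum(axis=1))    # mean of per-class accuracies
    pe = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / n_total ** 2  # expected agreement
    kappa = (oa - pe) / (1.0 - pe)                              # Eq. (16)
    return oa, aa, kappa
```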
Moreover, several parameters need to be tuned in the experiments. For the SVM-based methods, the composite weight η in SVMCK is tuned through fivefold cross validation (η ∈ {0, 0.1, ..., 1}). Regarding the penalty term C, which does not seriously affect the classification results when C > 10, we simply set C = 60 for all experiments. For SOMP, p = 1 is used in the lp-norm since it leads to impressive results; the dictionary consists of all of the training samples, and the window size and sparsity level are chosen following [36]. We examine how the parameters affect the performance of the classifiers. In the figures concerning parameter tuning, 5% of the samples are randomly chosen from all labeled samples (or available training samples) for training in the Indian Pines data (or University of Pavia data). Specifically, the effect of η (SVMCK), C (SVM, SVMCK, SADL1, SADL2, GSLR1, GSLR2) and p (SOMP) on OA (%) is plotted in Fig. 4, which supports the parameter selections quantitatively. In addition, both λ1 and λ2 in GSLR are set to 1. The effect of λ1 and λ2 on OA (%) is depicted in Fig. 5, from which one can observe that the classification performance is much better with λ1, λ2 ∈ {0.8, 0.9, 1} than with λ1, λ2 ∈ {0, 0.1, ..., 0.7}. For SADL and GSLR, 1/8 of (or all of) the training samples are selected to initialize the dictionary in experiment 1 (or 2). For SADL1 and GSLR1, since the University of Pavia data has a higher spatial resolution than the Indian Pines data, an 8 × 8 patch size is used for the Indian Pines data, while a 16 × 16 patch size is used for the University of Pavia data. To be fair, we produce the same number of homogeneous regions in all of the square patches-based and super-pixel segmentation-based methods. That is, for SADL2 and GSLR2, the PC1 of the two data sets is segmented into 361 and 858 non-overlapping spatial groups, respectively. Fig. 6 displays the effect of patch sizes on the super-pixel-based methods. As shown in Fig. 6, SADL2/GSLR2 yield comparable OA when the number of patches changes from 300 to 700 (or 600 to 1000) for the Indian Pines data (or University of Pavia data). In this regard, it is acceptable to segment those two data sets into 361 and 858 patches, respectively.
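To illustrate the final classification step with the settings above (linear SVM, penalty C = 60), the learned representation vectors can be classified as follows; scikit-learn's LinearSVC stands in for the SVM implementation and the variable names are placeholders.

```python
from sklearn.svm import LinearSVC


def classify_representations(Z_train, y_train, Z_test, C=60.0):
    """Linear SVM on column-wise representation vectors (step 3 of Fig. 1)."""
    clf = LinearSVC(C=C)
    clf.fit(Z_train.T, y_train)      # scikit-learn expects samples as rows
    return clf.predict(Z_test.T)
```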
Fig. 13. Classification maps of the University of Pavia data obtained by (a) SVM, (b) SVMCK, (c) SOMP, (d) SADL1, (e) SADL2, (f) GSLR1 and (g) GSLR2 in experiment 1.
4.3. Experimental results

We first compare the spatial groups obtained by square patches and by super-pixel segmentation (i.e. ERS) in Fig. 7. For comparison convenience, both methods are performed on the PC1, which retains 72.49% and 58.31% of the energy of the Indian Pines data and the University of Pavia data, respectively. It is observed that the square patches-based method forcibly divides the HSI into square contextual groups of a predetermined size, ignoring the heterogeneous land covers of various pixels, especially those at the border of different classes. On the contrary, the super-pixel segmentation-based method defines more intelligent spatial regions of varying sizes and shapes, and thus yields more precise results than the square patches-based one. Fig. 8 schematically illustrates the effects of the partition methods on classification results. One can clearly see that the square patches-based methods (i.e. SADL1 and GSLR1) generate more square-shaped errors than the super-pixel segmentation-based ones (i.e. SADL2 and GSLR2), which validates the effectiveness of super-pixel segmentation. In addition, Fig. 9 depicts the computation time of those two methods. It is notable that the square patches-based method is much faster than the super-pixel segmentation-based one, as the former is much simpler than the latter. Even so, the super-pixel segmentation-based method is still quite fast, since the spatial groups can be obtained within only a few seconds. Moreover, the clustering stability of WS, MS, SLIC and ERS is compared in Fig. 10. Stability is tested by checking whether a segmentation method obtains the same regions independently of image changes. In this paper, we perform three typical changes (i.e. rotation, shearing and Gaussian noise addition) on the PC1 and depict the boundary recall of each method. It is observed from Fig. 10 that ERS and SLIC yield comparable stability under rotation, shearing and Gaussian noise, whereas the performances of WS and MS fall behind those of ERS and SLIC. The results of clustering stability confirm the effectiveness of ERS. Having obtained the spatial groups, we perform dictionary learning with the given training samples, and the representation matrices of the test samples are subsequently identified. To provide
Table 2
Classification accuracy (%) and standard deviation (in bracket) of different methods for the Indian Pines data in experiment 1.

Class  SVM            SVMCK          SOMP           SADL1          SADL2          GSLR1          GSLR2
1      0.65           19.61          62.75          68.63          88.24          84.31          96.08
2      42.41          75.33          89.50          87.74          90.60          90.60          95.30
3      8.00           61.49          81.69          86.62          95.96          89.90          96.72
4      4.35           62.61          82.88          75.68          78.38          81.08          95.05
5      42.30          89.62          73.52          86.44          95.55          97.46          90.25
6      92.13          98.73          98.59          99.15          97.60          98.87          97.74
7      0              4.17           83.33          100            95.83          95.83          91.67
8      97.99          99.14          100            100            98.92          99.57          98.92
9      0              42.11          0              73.68          78.95          89.47          84.21
10     39.50          67.03          74.10          93.15          89.66          94.02          90.64
11     93.54          90.66          98.00          96.93          91.04          93.94          95.22
12     9.49           64.67          84.22          72.04          91.60          92.62          91.60
13     79.60          99.50          98.01          98.51          99.01          99.50          100
14     97.71          97.56          99.02          94.14          97.72          97.64          97.80
15     8.82           58.17          96.68          98.06          85.60          79.50          83.10
16     86.11          94.44          81.11          98.89          96.67          97.78          92.22
OA     61.37 (0.75)   81.96 (0.60)   90.60 (0.52)   91.88 (0.51)   92.88 (0.50)   93.61 (0.47)   94.75 (0.46)
AA     43.91 (2.71)   70.30 (1.99)   81.46 (1.97)   89.35 (1.62)   91.96 (1.60)   92.63 (1.55)   93.53 (1.52)
κ      53.81 (0.015)  79.23 (0.011)  89.20 (0.007)  90.74 (0.007)  91.90 (0.006)  92.72 (0.006)  94.02 (0.005)
Table 3
Classification accuracy (%) and standard deviation (in bracket) of different methods for the University of Pavia data in experiment 1.

Class  SVM            SVMCK          SOMP           SADL1          SADL2          GSLR1          GSLR2
1      69.80          75.10          48.11          82.47          83.50          82.07          84.47
2      65.93          65.96          60.09          88.39          87.83          93.06          93.02
3      42.06          59.17          26.23          70.74          68.82          83.36          75.65
4      92.45          93.51          72.18          95.23          96.33          88.56          97.05
5      90.01          99.28          99.37          99.10          99.82          98.92          99.73
6      54.08          75.46          70.08          76.38          87.20          82.74          92.32
7      85.13          87.77          36.80          79.00          94.50          89.19          94.80
8      91.10          91.65          21.20          91.56          88.79          88.61          92.95
9      90.23          96.48          69.43          99.50          99.25          96.10          99.37
OA     69.77 (0.48)   74.41 (0.40)   56.12 (1.32)   86.34 (0.30)   87.64 (0.28)   89.14 (0.27)   91.45 (0.25)
AA     75.64 (0.40)   82.71 (0.37)   55.94 (1.23)   86.93 (0.29)   89.56 (0.26)   89.18 (0.25)   92.15 (0.22)
κ      61.23 (0.013)  67.56 (0.009)  44.10 (0.013)  81.78 (0.006)  83.67 (0.005)  85.41 (0.004)  88.61 (0.004)
Table 4
Wilcoxon test between GSLR2 and other methods.

Methods            Indian Pines data (p-value)   University of Pavia data (p-value)
GSLR2 vs. SVM      0.0002                        0.0002
GSLR2 vs. SVMCK    0.0002                        0.0002
GSLR2 vs. SOMP     0.0002                        0.0002
GSLR2 vs. SADL1    0.0002                        0.0002
GSLR2 vs. SADL2    0.0002                        0.0002
GSLR2 vs. GSLR1    0.0006                        0.0002
intuitive insight, we depict in Fig. 11 the sparse representations generated for two 5 × 5 patches belonging to different land covers of the Indian Pines data. Representation matrices of the central pixels, rather than the full neighborhood, are plotted in Fig. 11(b) for simplicity. It is notable that (1) all of the SRC-based methods except SOMP exhibit clear sparsity patterns; (2) there is
no obvious distinction between the representation matrices of two different land covers in Fig. 11(c); (3) the proposed GSLR method can alleviate the aliasing problem and achieve sparsity both within and across groups. In experiment 1, around 5% of the Indian Pines data and 5% of the available training samples from the University of Pavia data are respectively selected for training. Classification maps for all of the methods on both data sets are reported in Figs. 12 and 13, and the accuracies (i.e. the classification accuracy of each class, OA, AA and κ) are displayed and compared in Tables 2 and 3. There are several observations to be made in this experiment. First, SVM provides poor classification performance compared with the other methods. One can roughly find from Fig. 12 that SVM leads to more salt and pepper-like errors than the other classifiers. As displayed in Table 2, the OA of SVM is at least 20% lower than the other ones. That is because only spectral information is adopted in SVM, while the other methods can significantly integrate both spectral and spatial information. Second, in the Indian Pines data, SOMP offers comparable classification accuracies to SVMCK, which implies that the training samples inside the window surrounding a central pixel are considered as part of the best atoms. However, SOMP brings the worst results in the University of Pavia. As observed from Table 3, the OA, AA and κ of SOMP are as low as 56.12%, 55.94% and 44.10%, respectively. The reason why SOMP cannot perform well may lie in that the available training samples (see Fig. 3(c)) consist of small patches, and thus the window surrounding a pixel may contain no training samples. Third, with the same dictionary learning methods, spatial groups determined by super-pixel segmentation yield better performance than square patches-based ones. As shown in Table 2, the OA of SADL2 (or GSLR2) is 1.00% (or 1.14%) higher than that of SADL1 (or GSLR1). Table 3 provides similar results in this regard. Moreover, as depicted in Figs. 12 and 13, the classification maps of SADL1 (or GSLR1) exhibit more square errors than SADL2 (or GSLR2). Fourth, among all algorithms, the dictionary learningbased approaches (i.e. SADL1/SADL2 and GSLR1/GSLR2) achieve higher classification accuracies than the other ones. It is observed from Table 3 that the OA of SADL1, SADL2, GSLR1 and GSLR2 are 11.93%, 13.23%, 14.73%, 17.04% higher than SVMCK, respectively. Similar results can also be derived from Table 2. Those phenomena indicate that a good dictionary is of key importance for SRC to gain high classification accuracy. Fifth, with regard to the dictionary learning methods, GSLR1/GSLR2 yield better results than SADL1/ SADL2. For instance, the OA, AA, and κ of GSLR1 (see Table 2) are 1.73%, 3.28% and 1.98% higher than those of SADL1, respectively. Moreover, GSLR2 achieves the best performance among all algorithms. As shown in Figs. 12 and 13, the classification maps of GSLR2 are more closely related to the reference data. Finally, the aforementioned pros and cons are based on the overall performance of various methods. The classification accuracy of each class obtained by GSLR is not always better than other methods. For instance, GSLR1/GSLR2 lead to worse results for class 8 (haywindrowed) and class 15 (bldg-grass-tree-drives) than SOMP and SADL1. Similar phenomenon can also be observed between the square-patches-based methods and super-pixel segmentation. 
As shown in Table 2, GSLR1 yields better classification results for class 5 (grass/pasture) and class 10 (soybean-no till) than GSLR2. That is because different methods may be suited to classifying different types of classes: GSLR1, which is based on square patches, can classify class 5 and class 10, whose regions have regular shapes, quite well. In addition, the classification accuracy of a particular class is also related to the training samples (e.g. which and how many samples are chosen for training). As will be seen shortly, the classification performance of GSLR2 for class 5 is improved in experiment 3 with balanced training sets. As such, GSLR2 provides a tradeoff among different classes: it does not guarantee the best classification
Fig. 14. Overall accuracy (%) of different methods with a small number of labeled samples for (a) Indian Pines data and (b) University of Pavia data in experiment 2.
accuracy for each class but can achieve the best overall performance. A more in-depth theoretical analysis needs further investigation.

We also validate the classification results by the Kruskal–Wallis test and the Wilcoxon test [46]. Both are non-parametric statistical hypothesis tests, which make the hypothesis decision based on the p-value. The Wilcoxon test can determine whether the differences in classification results between two methods are statistically significant, while the Kruskal–Wallis test can deal with three or more methods. We first adopt the Kruskal–Wallis test to evaluate whether the 7 methods (i.e. SVM, SVMCK, SOMP, SADL1, SADL2, GSLR1 and GSLR2) differ significantly. The p-values for the two HSI data sets are 2.2852 × 10^{-12} and 1.2653 × 10^{-12}, respectively, which demonstrates that the compared methods have remarkable differences. In the Wilcoxon test, if the p-value is smaller than 0.05, the difference between classification accuracies is statistically significant with 95% confidence. The results of the Wilcoxon test are shown in Table 4, including comparisons between GSLR2 and all of the other methods. The statistical differences in Table 4 indicate that GSLR2 outperforms (p-value < 0.05) the other classification methods.

In experiment 2, we evaluate the performance of GSLR in scenarios with a small number of labeled samples. Various ratios of training samples are chosen and the OA of each method is reported in Fig. 14, from which two main results can be highlighted: (1) GSLR1/GSLR2 attain a large improvement with very few labeled samples, with SADL1/SADL2 trailing slightly behind. As shown in Fig. 14(a), although only 1% of the labeled samples are selected in the Indian Pines data, the OA of GSLR1/GSLR2 almost reaches 80%, while the OA of SADL1/SADL2 is slightly lower than that of GSLR1/GSLR2. Similar properties can also be observed in Fig. 14(b); (2) the OA of SOMP in the Indian Pines data is comparable to that of SVMCK, while SOMP exhibits the worst performance in the University of Pavia data. All these results are in accordance with the conclusions drawn from experiment 1.
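Returning to the significance tests described above, both are available in SciPy; the per-run overall accuracies below are hypothetical placeholders rather than the values reported in the tables.

```python
from scipy.stats import kruskal, wilcoxon

# Hypothetical per-run OA (%) over ten repetitions for two methods (placeholders).
oa_gslr2 = [94.6, 94.9, 94.7, 95.1, 94.3, 94.8, 95.0, 94.5, 94.9, 94.7]
oa_sadl2 = [92.7, 93.1, 92.5, 93.0, 92.9, 92.6, 93.2, 92.8, 92.4, 93.0]

# Kruskal-Wallis test across several methods (extend the argument list as needed).
_, p_kruskal = kruskal(oa_gslr2, oa_sadl2)

# Paired Wilcoxon signed-rank test between GSLR2 and one competitor.
_, p_wilcoxon = wilcoxon(oa_gslr2, oa_sadl2)

print(p_kruskal, p_wilcoxon)
```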
In experiment 3, we assess the performance of GSLR under other special situations, i.e. balanced training sets and a noisy scenario. For the first situation, 50 samples from each of the 9 classes are selected, and the classification map of each method is plotted in Fig. 15, from which we can find that GSLR1/GSLR2 show smaller differences from the reference data than the other methods. The classification accuracy and standard deviation of the various techniques are displayed in Table 5, which also indicates the effectiveness of the proposed GSLR1/GSLR2. For the noisy scenario, the OA of the various methods is compared in Fig. 16. It is notable that, although the classification performance of GSLR1/GSLR2 is worse than in experiment 1, GSLR1/GSLR2 still achieve higher OA than the other methods with the zero and noisy bands included. In a nutshell, the experimental results on the two HSI data sets confirm the superiority of GSLR over the other state-of-the-art classification methods (e.g. SVM, SVMCK, SOMP and SADL).

5. Conclusion

In this paper, we have proposed a spectral–spatial dictionary learning approach for the classification of HSI. Our approach starts by partitioning the pixels of an HSI into homogeneous regions by super-pixel segmentation. It is followed by dictionary learning with group-based sparse and low-rank regularizations, which force pixels that belong to the same group to share a common active set of atoms from the dictionary. The group-based sparse and low-rank regularizations are capable of capturing both local and global structures of the HSI and thus facilitate the integration of spatial information into the classification task. The class labels of the test samples are determined by a linear SVM trained on the representation matrices. The proposed GSLR method has been compared against several state-of-the-art SVM and SRC-based techniques. Experiments conducted on two benchmark data sets reveal that GSLR
Fig. 15. Classification maps of the Indian Pines data obtained by (a) SVM, (b) SVMCK, (c) SOMP, (d) SADL1, (e) SADL2, (f) GSLR1, (g) GSLR2 and (h) reference data with 9 classes in experiment 3 with balanced training sets.
Table 5
Classification accuracy (%) and standard deviation (in bracket) of different methods for the Indian Pines data in experiment 3 with balanced training sets.

Class  SVM            SVMCK          SOMP           SADL1          SADL2          GSLR1          GSLR2
2      52.67          72.18          79.84          81.94          84.83          90.17          91.11
3      23.60          80.23          95.03          86.35          97.83          95.79          98.98
5      28.86          97.76          95.75          97.99          93.74          99.78          99.33
6      79.48          94.84          98.71          99.28          99.71          99.14          99.14
8      99.32          98.86          100            100            99.09          99.77          99.32
10     51.42          71.13          88.02          93.25          94.01          90.63          90.85
11     56.95          64.43          65.72          84.53          85.69          85.86          89.37
12     42.55          78.55          99.47          92.91          93.79          94.50          95.57
14     98.15          97.99          99.84          99.68          99.60          99.20          99.28
OA     60.07 (0.93)   79.07 (0.71)   85.50 (0.69)   90.43 (0.58)   92.11 (0.53)   92.74 (0.49)   94.18 (0.47)
AA     59.22 (2.04)   84.00 (1.27)   91.37 (1.25)   92.88 (1.22)   94.25 (1.20)   94.98 (1.14)   95.88 (1.12)
κ      53.47 (0.018)  75.86 (0.016)  83.26 (0.013)  88.80 (0.010)  90.77 (0.009)  91.50 (0.008)  93.16 (0.006)
possesses the best classification performance, even in scenarios with a small number of labeled samples. In the future, we will investigate the theoretical guarantee on the convergence of IALM. Moreover, additional future work will be devoted to an in-depth analysis of the pros and cons of GSLR for classifying different types of classes, and to applying image denoising strategies to post-process the classification maps.
Fig. 16. Overall accuracy (%) of different methods in experiment 3 without removing zero and noisy bands for the Indian Pines data.
Acknowledgments This research is supported in part by the following funds: National High Technology Research and Development Program of China (863 Program) under Grant no. 2013AA122302, National Natural Science Foundation of China under Grant nos. 41501368, 41522104 and 41531178, and Fundamental Research Funds for the Central Universities under Grant nos. 16lgpy04 and 15lgjc24. Moreover, the authors would like to thank Prof. D. Landgrebe from Purdue University, West Lafayette, IN, USA, for providing the AVIRIS image of Indian Pines, Prof. Gamba from the University of Pavia for providing the ROSIS data set, and the associate editor and the anonymous reviewers for their careful reading and helpful remarks.
Conflict of interest
None declared.

References
[1] J.M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N.M. Nasrabadi, J. Chanussot, Hyperspectral remote sensing data analysis and future challenges, IEEE Geosci. Remote Sens. Mag. 1 (2) (2013) 6–36.
[2] G. Lixin, X. Weixin, P. Jihong, Segmented minimum noise fraction transformation for efficient feature extraction of hyperspectral images, Pattern Recognit. 48 (10) (2015) 3216–3226.
[3] A. Villa, J. Chanussot, J.A. Benediktsson, C. Jutten, R. Dambreville, Unsupervised methods for the classification of hyperspectral images with low spatial resolution, Pattern Recognit. 46 (6) (2013) 1556–1568.
[4] X. Jia, B.C. Kuo, M.M. Crawford, Feature mining for hyperspectral image classification, Proc. IEEE 101 (3) (2013) 676–697.
[5] M. Fauvel, Y. Tarabalka, J.A. Benediktsson, J. Chanussot, J.C. Tilton, Advances in spectral–spatial classification of hyperspectral images, Proc. IEEE 101 (3) (2013) 652–675.
[6] Y. Tarabalka, J. Chanussot, J.A. Benediktsson, Segmentation and classification of hyperspectral images using watershed transformation, Pattern Recognit. 43 (7) (2010) 2367–2379.
[7] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[8] P. Ramzi, F. Samadzadegan, P. Reinartz, Classification of hyperspectral data using an AdaBoostSVM technique applied on band clusters, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2066–2079.
[9] G. Camps-Valls, L. Gomez-Chova, J. Muñoz-Marí, J. Vila-Francés, J. Calpe-Maravilla, Composite kernels for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 3 (1) (2006) 93–97.
[10] K.H. Liu, Y.Y. Lin, C.S. Chen, Linear spectral mixture analysis via multiple-kernel learning for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 53 (4) (2015) 2254–2269.
[11] Y. Gu, Q. Wang, H. Wang, D. You, Y. Zhang, Multiple kernel learning via low-rank nonnegative matrix factorization for classification of hyperspectral imagery, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 8 (6) (2015) 2739–2751.
[12] Z. He, J. Li, Multiple data-dependent kernel for classification of hyperspectral images, Expert Syst. Appl. 42 (3) (2015) 1118–1135.
[13] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification via kernel sparse representation, IEEE Trans. Geosci. Remote Sens. 51 (1) (2013) 217–231.
[14] Z. He, Q. Wang, Y. Shen, M. Sun, Kernel sparse multitask learning for hyperspectral image classification with empirical mode decomposition and morphological wavelet-based features, IEEE Trans. Geosci. Remote Sens. 52 (8) (2014) 5150–5163.
[15] M. Cui, S. Prasad, Class-dependent sparse representation classifier for robust hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 53 (5) (2015) 2683–2695.
[16] P. Du, Z. Xue, J. Li, A. Plaza, Learning discriminative sparse representations for hyperspectral image classification, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9 (6) (2015) 1089–1104.
[17] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 663–670.
[18] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 171–184.
[19] Q. Qu, N.M. Nasrabadi, T.D. Tran, Abundance estimation for bilinear mixture models via joint sparse and low-rank representation, IEEE Trans. Geosci. Remote Sens. 52 (7) (2014) 4404–4423.
[20] Z. Wang, N.M. Nasrabadi, T.S. Huang, Spatial–spectral classification of hyperspectral images using discriminative dictionary designed by learning vector quantization, IEEE Trans. Geosci. Remote Sens. 52 (8) (2014) 4808–4822.
[21] A. Soltani-Farani, H.R. Rabiee, S.A. Hosseini, Spatial-aware dictionary learning for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 53 (1) (2015) 527–541.
[22] L. Ma, C. Wang, B. Xiao, W. Zhou, Sparse representation for face recognition based on discriminative low-rank dictionary learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2586–2593.
[23] Y. Zhang, Z. Jiang, L.S. Davis, Learning structured low-rank representations for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 676–683.
[24] Z. Zhang, S. Yan, M. Zhao, F.Z. Li, Bilinear low-rank coding framework and extension for robust image recovery and feature representation, Knowl. Based Syst. 86 (2015) 143–157.
[25] X. Sun, Q. Qu, N.M. Nasrabadi, T.D. Tran, Structured priors for sparse-representation-based hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 11 (7) (2014) 1235–1239.
[26] M.Y. Liu, O. Tuzel, S. Ramalingam, R. Chellappa, Entropy rate superpixel segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2097–2104.
[27] L. Vincent, P. Soille, Watersheds in digital spaces: an efficient algorithm based on immersion simulations, IEEE Trans. Pattern Anal. Mach. Intell. 13 (6) (1991) 583–598.
[28] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[29] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Susstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2274–2282.
[30] R. Rubinstein, A.M. Bruckstein, M. Elad, Dictionaries for sparse representation modeling, Proc. IEEE 98 (6) (2010) 1045–1057.
[31] K. Engan, B.D. Rao, K. Kreutz-Delgado, Frame design using FOCUSS with method of optimal directions (MOD), in: Proceedings of the Nordic Signal Processing Symposium (NORSIG), 1999, pp. 65–69.
[32] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[33] L. Wang, H. Geng, P. Liu, K. Lu, J. Kolodziej, R. Ranjan, A.Y. Zomaya, Particle swarm optimization based dictionary learning for remote sensing big data, Knowl. Based Syst. 79 (2015) 43–50.
[34] M.D. Iordache, J.M. Bioucas-Dias, A. Plaza, Sparse unmixing of hyperspectral data, IEEE Trans. Geosci. Remote Sens. 49 (6) (2011) 2014–2039.
[35] H. Zhang, J. Li, Y. Huang, L. Zhang, A nonlocal weighted joint sparse representation classification method for hyperspectral imagery, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2056–2065.
[36] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification using dictionary-based sparse representation, IEEE Trans. Geosci. Remote Sens. 49 (10) (2011) 3973–3985.
[37] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, UIUC Technical Report UILU-ENG-09-2215, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 2009, pp. 1–20.
[38] Y. Zhang, Recent advances in alternating direction methods: practice and theory, presented at the IPAM Workshop, Los Angeles, CA, USA, October 11, 2010, Tutorial [Online]. Available: 〈https://www.ipam.ucla.edu/publications/opws2/opws2_9066.pdf〉.
[39] X. Zhou, C. Yang, W. Yu, Moving object detection by detecting contiguous outliers in the low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 597–610.
[40] L. Li, S. Li, Y. Fu, Learning low-rank and discriminative dictionary for image classification, Image Vis. Comput. 32 (10) (2014) 814–823.
[41] Y. Yuan, M. Fu, X. Lu, Low-rank representation for 3D hyperspectral images analysis from map perspective, Signal Process. 112 (2015) 27–33.
[42] Y.Q. Zhao, J. Yang, Hyperspectral image denoising via sparse representation and low-rank constraint, IEEE Trans. Geosci. Remote Sens. 53 (1) (2015) 296–308.
[43] P. Neubert, P. Protzel, Superpixel benchmark and comparison, in: Proceedings of Forum Bildverarbeitung, 2012, pp. 1–12.
[44] Z. Lin, R. Liu, Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation, in: Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2011, pp. 612–620.
[45] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res. 11 (2010) 19–60.
[46] J.D. Gibbons, S. Chakraborti, Nonparametric Statistical Inference, Springer, Berlin, Heidelberg, 2011.
Zhi He received the B.S. degree in Mathematics from Harbin Institute of Technology (HIT), Harbin, China, in 2009, and the M.S. and Ph.D. degrees in Control Science and Engineering from HIT, Harbin, China, in 2011 and 2015, respectively. Currently, she is a lecturer in the School of Geography and Planning, Sun Yat-sen University (SYSU). Her research interests include the Hilbert–Huang transform, sparse representation, and their applications to hyperspectral classification.
Lin Liu received the B.S. degree in Geography from Peking University (PKU), Beijing, China, in 1984, the M.S. degree in Remote Sensing and Cartography from PKU, Beijing, China, in 1987, and the Ph.D. degree in Geography with a specialization in Geographic Information Science (GIS) from Ohio State University (OSU), Ohio, USA, in 1994. Currently, he is a professor of Geography at the University of Cincinnati and SYSU. His research interests include GIS, remote sensing, and their applications. He has published over 100 articles and led multiple national and international research projects.
Suhong Zhou received the B.S. and M.A. degrees in Urban Planning from Wuhan University (WHU), Wuhan, China, in 1997 and 2000, respectively, and the Ph.D. degree in Geography from SYSU, China, in 2003. Currently, she is a professor in the School of Geography and Planning, SYSU. Her research interests include urban transportation and geographic information science applications to urban affairs. She has published close to 100 papers and led three national-level research projects.
Yi Shen received the B.S., M.S., and Ph.D. degrees from HIT, Harbin, China, in 1985, 1988 and 1995, respectively. Since 1997, he has been a professor in the Department of Control Science and Engineering, HIT. He has published more than 100 journal and conference papers. His current research interests include instrumentation and measurement, ultrasound signal processing, and pattern recognition.