Hyperspectral classification based on spectral–spatial convolutional neural networks




Engineering Applications of Artificial Intelligence 68 (2018) 165–171



Congcong Chen a,b, Feng Jiang b, Chifu Yang b, Seungmin Rho c, Weizheng Shen a,*, Shaohui Liu b, Zhiguo Liu d

a School of Electrical Engineering and Information, Northeast Agricultural University, Harbin 150030, China
b Harbin Institute of Technology, Harbin 150001, China
c Department of Media Software, Sungkyul University, Republic of Korea
d College of Information Technology, Beijing Union University, Beijing 100101, China

Keywords: Hyperspectral classification; Convolutional neural network; Support vector machine; Spectral–spatial convolutional neural network; Adaptive window

Abstract: Hyperspectral image classification is an important task in remote sensing image analysis. Traditional machine learning techniques have difficulty dealing with hyperspectral images directly, because hyperspectral images contain many redundant spectral channels. In this paper we propose a novel method for hyperspectral image classification in which spectral and spatial features are jointly exploited. Firstly, considering the local similarity in the spatial domain, we employ a large spatial window to extract image blocks from the hyperspectral image. Secondly, each spectral channel of the image block is filtered to extract spatial and spectral features, after which the features are merged by convolutional layers. Finally, fully-connected layers are used to obtain the classification result. Compared with other state-of-the-art techniques, the proposed method pays more attention to the correlation of the spatial neighborhood by using a large spatial window in the network. In addition, we combine the proposed network with the traditional support vector machine (SVM) classifier to improve the performance of hyperspectral image classification. Moreover, an adaptive method for selecting the spatial window size is proposed. Experimental results on the AVIRIS and ROSIS datasets demonstrate that the proposed method outperforms the state-of-the-art techniques. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

With the development of sensing technology, hyperspectral remote sensing has drawn increasing attention in recent years. Accordingly, classification of hyperspectral images has gradually become a highly investigated field (Landgrebe, 2002). A hyperspectral image includes tens or hundreds of grayscale channels, each of which represents an image under a specific spectral frequency (Lacar et al., 2001). The difference between hyperspectral images and common RGB images is that hyperspectral images capture more detailed information across different spectral bands (Li et al., 2016). This information is helpful for detection and classification tasks. At present, hyperspectral images are widely used in agricultural development, military reconnaissance, geological exploration, weather prediction, etc. (Van Der Meer, 2004; Yuen and Richardson, 2010; Hege et al., 2004). However, a hyperspectral image not only provides more spectral details, but also adds redundant information and increases the computational complexity. This is known

as the Hughes phenomenon (Gowen et al., 2007). Therefore, reducing the redundancy in the spectral domain and extracting representative spectral–spatial features has become the focus of hyperspectral image classification (Bioucas-Dias et al., 2013; Ambikapathi et al., 2012). In the early stages of hyperspectral image classification, the spectral line was the major feature for classification. To reduce spectral redundancy, principal component analysis (PCA) (Halldorsson et al., 2004; Jiang et al., 2016b; Liu et al., 2016) and compressed sensing (Halldorsson et al., 2004) were used to obtain compact features. A support vector machine with radial basis function kernel (RBF-SVM) (Chen et al., 2013; Chen and Ji, 2016) was then trained for classification (Halldorsson et al., 2004). However, a hyperspectral image is a three-dimensional matrix that is abundant in the third dimension, as shown in Fig. 1: the x-axis and y-axis represent the spatial location, and the z-axis represents the spectral lines in the hyperspectral

* Corresponding author.

E-mail address: [email protected] (W. Shen). https://doi.org/10.1016/j.engappai.2017.10.015 Received 14 January 2017; Received in revised form 12 October 2017; Accepted 19 October 2017 0952-1976/© 2017 Elsevier Ltd. All rights reserved.


are still too rough. The morphological method (Jiang et al., 2015b) mentioned above extracts features only from several important spectral channels, ignoring the others. The sparse representation based methods (Chen et al., 2011; Song et al., 2016) discard the context information of the spatial location among different spectral lines. The DBN (Jiang et al., 2014; Chen et al., 2015), which flattens the image block into a 1-D vector, also weakens the spatial context information. For the 9-layer FCN (Lee and Kwon, 2016), the authors focus on designing a deeper network, and the size of the spatial window is limited to 3 × 3. To overcome these issues, in this paper we propose a convolutional neural network to jointly exploit spectral and spatial features from hyperspectral images. Compared with other state-of-the-art techniques, the proposed method is dedicated to exploiting the correlation of the spatial neighborhood by using a larger spatial window. The main contributions of this paper are as follows: (1) A convolutional neural network is designed to extract spectral and spatial features simultaneously. The network uses larger spatial windows to integrate local features without being confused by redundant spatial information. (2) A fusion scheme is presented for combining the proposed convolutional network with the traditional SVM (Hearst et al., 1998; Jiang et al., 2015b) classifier. The proposed network has a strong ability to extract features from hyperspectral images, but its fully-connected layers are only linear structures. To improve classification performance, some of the fully-connected layers are replaced by an RBF-SVM to increase the nonlinearity of the classifier. (3) We investigate the classification performance of different spatial window sizes, and propose a method to choose the window size adaptively.
In a smooth area which contains only one or two categories, large windows usually yield better classification results. Conversely, in category-mixed areas a large window may contain too much noise from unrelated categories, and the features extracted from those areas with a large window will deteriorate the classification performance. The proposed method uses a classification confidence yielded by the network to guide the selection of the window size. The main limitations of our proposed method include:

Fig. 1. Illustration of a hyperspectral image.

image (Du et al., 2016). Roughly using a single spectral line along the z-axis without considering its spatial neighbors along the x-axis and y-axis makes the classification unsatisfactory (Chen et al., 2009a, b; Jiang et al., 2015c). In the past decades, existing works have mainly focused on extracting joint spectral–spatial features to enhance classification performance. Benediktsson et al. (2005) applied the morphological method to extract features from hyperspectral images. First, representative spectral images were extracted from the spectral channels. Then opening and closing operations were conducted to extract spectral–spatial contour features. Chen et al. (2011) proposed a dictionary-based joint sparse representation (JSR). In that paper, a spectral line was represented linearly by a dictionary. Sparse coefficients were then calculated, based on which classification results could be inferred. To jointly use spectral–spatial information, a small adjacent spatial block of the spectral line was extracted, and all spectral lines in that block were represented simultaneously by the dictionary and sparse coefficients. Song et al. (2016) used a kNN (k-Nearest Neighbor)-based sparse representation for feature extraction. Principal component analysis (PCA) was first applied to reduce the spectral vectors into a low-dimensional feature space. Next, a dictionary was constructed and the sparse coefficients were calculated. Finally, kNN was used to determine the category of the input. Kang et al. (Jiang et al., 2016a) proposed a filter-based method for feature extraction. Firstly, the spectral lines were classified pixel-wise to get a map of classification results. The classification map was then described by several probability maps, each of which was filtered. Finally, the probability maps were merged to get a new map of classification scores (Jiang et al., 2015a, 2016a).
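As context for the PCA-based pipelines discussed above, spectral reduction of a hyperspectral cube can be sketched in plain numpy. The cube size and component count below are toy values for illustration, not settings from any of the cited papers.

```python
import numpy as np

def pca_reduce(cube, n_components=30):
    """Reduce the spectral dimension of an H x W x B hyperspectral cube
    with PCA, returning an H x W x n_components feature cube."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)   # one row per spectral line
    pixels -= pixels.mean(axis=0)                     # center each band
    # Eigendecomposition of the band covariance matrix.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                 # strongest components first
    components = eigvecs[:, order[:n_components]]
    return (pixels @ components).reshape(h, w, n_components)

# Toy cube: 10 x 10 spatial grid, 50 bands.
cube = np.random.rand(10, 10, 50)
reduced = pca_reduce(cube, n_components=5)
print(reduced.shape)  # (10, 10, 5)
```

The reduced cube can then be classified pixel-wise, e.g. by an RBF-SVM or a kNN-based sparse representation as in the works cited above.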
Recently, deep learning methods have been introduced into hyperspectral image analysis (Xia et al., 2014; Chen et al., 2015; Lee and Kwon, 2016). Chen et al. (2015) applied the Deep Belief Network (DBN). In that work, 3-D blocks were first extracted from the hyperspectral image, and then these blocks were flattened into 1 × d vectors and input into fully-connected layers. Lee and Kwon (2016) designed a 9-layer Fully Convolutional Network (FCN). Based on this FCN, a block with d × d spectral lines can be extracted and classified. Compared with conventional feature extraction, deep learning can learn features with more discriminative information. A two-channel deep convolutional neural network (Two-CNN) jointly learned spectral–spatial features from the hyperspectral image and then extracted the characteristics in the fully-connected layer; however, this method only extracted information from some of the bands rather than all of them (Yang et al., 2016). Although current research on hyperspectral image classification has already involved the combination of spectral and spatial features, the processes of feature extraction, merging and dimension reduction

(a) Our method cannot deal well with the image boundary problem, since it expects a large context. (b) Our network is too shallow to fully model the context information, which is a trade-off between performance and running speed. (c) The fully-connected layers require a fixed-size input, which restricts the flexibility of the network to make full use of context information at different scales. In our future work, we will propose effective ways to solve these problems.
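Limitation (a) is the usual trade-off between window size and boundary coverage. One common workaround (a sketch, not the method used in this paper) is to mirror-pad the cube before extracting the P × P blocks, so that boundary pixels still receive full-size windows:

```python
import numpy as np

def extract_block(cube, row, col, p):
    """Extract a P x P x B block centered at (row, col), mirror-padding
    the cube so that boundary pixels still get full-size windows."""
    half = p // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # Pixel (row, col) of the original cube maps to (row + half, col + half) here.
    return padded[row:row + p, col:col + p, :]

cube = np.random.rand(20, 20, 8)          # toy 20 x 20 cube with 8 bands
block = extract_block(cube, 0, 0, 9)      # corner pixel still yields 9 x 9 x 8
print(block.shape)  # (9, 9, 8)
```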

2. Spectral–spatial convolutional neural networks

In this section, we first present our end-to-end spectral–spatial convolutional neural network for hyperspectral image classification. Then we introduce how to combine our convolutional-network-based method with the classical support vector machine. Finally, we present the adaptive window method that further improves the network's performance.

2.1. Network structure

To deal with hyperspectral images, this paper presents a spectral–spatial convolutional neural network. The first convolution layer is called the Multi-scale Feature Extraction layer. That layer makes full use of convolution filters to extract good features which are invariant to deformation and rotation. To make the features scale-invariant at the same time, this paper adopts the Inception Module presented in GoogLeNet (Szegedy et al., 2015).


Fig. 2. The framework of the proposed network. (a) Data blobs of the network. (b) Structure of the network.

We regard the fully-connected layers only as a temporary tool to find the optimal parameters of the convolution layers during back propagation. The SVM is designed to minimize structural risk: it maps the sample vectors to a high-dimensional (or even infinite-dimensional) feature space (a Hilbert space) through a nonlinear mapping φ, and then an optimal hyperplane with low VC dimension is constructed as the decision plane in that feature space. In this way, the nonlinearly separable problem in the original sample space is transformed into a linearly separable problem in the feature space. Let T = {(x1, y1), (x2, y2), …, (xn, yn)} be the samples to be classified; the SVM can be formalized as follows:

1×1×B and 3×3×B dimension filters are used simultaneously in the first convolution layer (B is the number of channels of the hyperspectral image). The features captured by those two kinds of filters are then cascaded. By processing the hyperspectral image with 1×1 and 3×3 filters, the feature extraction layer obtains abundant spectral–spatial information. A pooling layer follows to reduce the number of parameters of the network and accelerate convergence. The second convolution layer is called the Feature Fusion layer, which merges the spatial and spectral features obtained by the feature extraction layer; it is also followed by a pooling layer. The third convolution layer is called the Feature Reduction layer. The number of filters in that layer is smaller than in the first two convolution layers, which helps to remove redundant features and improve their representativeness. The last three layers are used for feature classification and parameter optimization during back propagation. The input of the proposed network is a P × P × B dimensional image block, where P × P is the size of the spatial window and B is the number of channels of the hyperspectral image. The output of the proposed network is a label which represents the predicted class of the central spectral line of the input block. Traditional methods usually lose classification accuracy when dealing with large spatial windows (bigger than 10 × 10), because they are unable to reduce the redundancy and choose representative features in large windows. The proposed network is shown in Fig. 2. It consists of the multi-scale feature extraction layer, the feature fusion layer and the feature reduction layer. In this network, each P × P × B input image block needs one label for supervised training. The size of the spatial window can range from 3 to 30, which helps to capture more spatial correlation information. At the end of the network, a Softmax layer is adopted for classification. The Softmax layer normalizes the output of the fully-connected layers to the range 0–1, providing a degree of confidence for classification, where 0 means the confidence is lowest and 1 means it is highest.
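The shape bookkeeping of the multi-scale feature extraction layer can be sketched in numpy with random, untrained filters. The filter counts below are toy values (the experiments in Section 3 use 200 filters of each kind), and the loop-based convolution is only for illustration, not for speed:

```python
import numpy as np

def conv2d_same(block, filters):
    """Naive zero-padded 'same' 2-D convolution over an H x W x B block.
    `filters` has shape (k, k, B, F); returns an H x W x F response."""
    k = filters.shape[0]
    half = k // 2
    padded = np.pad(block, ((half, half), (half, half), (0, 0)))
    h, w, _ = block.shape
    out = np.empty((h, w, filters.shape[3]))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + k, j:j + k, :]
            # Sum over the k x k x B patch for each of the F filters.
            out[i, j] = np.tensordot(patch, filters, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
block = rng.random((15, 15, 10))    # toy 15 x 15 spatial window, 10 bands
f1 = rng.random((1, 1, 10, 4))      # 1 x 1 x B filters
f3 = rng.random((3, 3, 10, 4))      # 3 x 3 x B filters
# Cascade (concatenate) the two responses along the feature axis.
features = np.concatenate([conv2d_same(block, f1), conv2d_same(block, f3)], axis=2)
print(features.shape)  # (15, 15, 8)
```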

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^{n} ξ_i

s.t.  y_i (w ⋅ φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, …, n    (1)

where w represents the linear weight of the SVM, y_i is the correct classification label, C represents the penalty coefficient, b represents the offset, and ξ_i is the relaxation (slack) variable. The Lagrangian dual problem is usually solved instead:

max_α  ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)

s.t.  ∑_{i=1}^{n} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, …, n    (2)
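The kernel k(x_i, x_j) in the dual problem is, for the RBF case, exp(−‖x_i − x_j‖² / (2σ²)). A minimal numpy version of the Gram matrix computation (a sketch; real experiments in this paper use LIBSVM):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    # Clamp tiny negatives from floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

X = np.random.rand(5, 3)
K = rbf_kernel(X, X, sigma=0.5)
print(K.shape)                       # (5, 5)
print(np.allclose(np.diag(K), 1.0))  # True: k(x, x) = 1
```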

where α_i represents the Lagrangian multiplier and k(x_i, x_j) represents the SVM kernel function. A linear kernel or an RBF kernel can be chosen. The RBF kernel is mainly used in the linearly inseparable case; it has more parameters, so the classification result depends more strongly on the parameter choice, and cross-validation can be used to find appropriate values. Although the process of finding appropriate parameters for the RBF-SVM is time-consuming, the RBF-SVM classifier usually achieves better performance than the linear-kernel SVM once appropriate parameters are determined. To demonstrate the feature extraction performance of the convolution layers, the output of the FC1 (fully-connected) layer is fed to the SVM. The detailed process is as follows: in the first training stage, the proposed network uses a structure of three convolution layers and three fully-connected layers,

2.2. Combination of CNN and SVM

Our proposed network adopts convolution layers to extract spectral–spatial features from hyperspectral images, and then uses fully-connected layers and a Softmax layer to classify the features. Considering that there is no nonlinear structure in the fully-connected layers, we finally use an RBF-SVM to replace some of them. The number of hyperspectral image samples is limited, so to avoid over-fitting we do not design a very deep network. However, there will be bottlenecks in our not-too-deep network if we only apply three linear fully-connected layers for classification. Therefore, we introduce the RBF-SVM in the classification stage.


as shown in Fig. 2. In the second training stage, when the training of the convolutional parameters is finished, we keep the parameters of the first three convolution layers unchanged and replace FC2 and FC3 with the RBF-SVM. At this time, P × P × B image blocks are input into the network; the network produces 1-D vectors after the FC1 layer, and these vectors are the inputs for SVM training. In the test stage, a P × P × B image block is input into the network, the FC1 layer produces a 1-D vector, and the trained SVM model uses that vector to obtain the classification result.

2.3. Adaptation of spatial window size

Traditional machine learning methods find it difficult to adopt large spatial windows when dealing with hyperspectral images, due to the adverse impact of data redundancy. The network proposed in this paper focuses on making better use of the local similarity in the spatial domain. After discussing the structure of the network and its combination with the SVM, we compare the classification accuracy of different window sizes. A larger window means considering more local similarity, but it also means importing noise which is uncorrelated with the classification. A convolutional network has an excellent ability to extract features from images and filter out noise; it thus has potential for dealing with large windows. Using a fixed-size window for the whole image is not a sensible choice: in specific areas where categories intersect, a large window will contain too many irrelevant categories, which may confuse the network. To use a small window in those areas, we propose an adaptive method to choose the best window size. The output of the FC3 layer is also the input of the Softmax layer, so the input size of the Softmax layer is C × 1, where C is the number of categories in the dataset.
Labels use the one-hot encoding format; that is, when the label belongs to the kth category, it is expressed as:

V_label = [a1 = 0, a2 = 0, …, a_{k−1} = 0, a_k = 1, a_{k+1} = 0, …, a_C = 0].    (3)
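The one-hot label format can be sketched in numpy (categories are 1-indexed to match the text):

```python
import numpy as np

def one_hot(k, num_classes):
    """Return the one-hot label vector for category k (1-indexed, as in the text)."""
    v = np.zeros(num_classes)
    v[k - 1] = 1.0
    return v

print(one_hot(3, 6))  # [0. 0. 1. 0. 0. 0.]
```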

Fig. 3. Process of adaptive window size selection.

where 0 < θ1 < 1. If Formula (8) is satisfied, the algorithm determines that the classification result is the mth category, because the confidence of being classified into the mth category is far higher than the confidence of being classified into any other category. In this case it is not necessary to use the adaptive method: the window size P_A × P_A is enough for classification. But if Formula (8) is not satisfied and Formula (9) is satisfied, using the window size P_B × P_B achieves a more confident result than using P_A × P_A, and the input block is finally classified into the m′th category.

And the output of the FC3 layer is expressed as:

V_FC3 = [b1, b2, b3, …, b_{k−1}, b_k, b_{k+1}, …, b_C].    (4)

So b_k, in the output of the FC3 layer, can be regarded as the confidence that the input belongs to the kth category: the confidence is high when b_k is close to 1 and low when b_k is close to 0. The confidence is formalized as follows:

Conf_k = 1 − |1 − b_k|    (6)

where Conf_k represents the possibility of the input being classified into the kth category. When the following formula is satisfied, the input is finally classified into the cth category:

c = argmax_k Conf_k.    (7)
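Formulas (6) and (7) amount to treating the softmax output as a per-class confidence and picking the largest. A minimal sketch (1-indexed categories to match the text):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def classify_with_confidence(fc3_output):
    """Return (category, confidence) from the FC3 output, following
    Conf_k = 1 - |1 - b_k| and c = argmax_k Conf_k."""
    b = softmax(fc3_output)
    conf = 1.0 - np.abs(1.0 - b)   # equals b itself when 0 <= b_k <= 1
    c = int(np.argmax(conf)) + 1   # 1-indexed category
    return c, float(conf[c - 1])

cat, conf = classify_with_confidence(np.array([0.1, 2.0, -1.0, 0.5]))
print(cat)  # 2
```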

The process of the adaptive window size algorithm is shown in Fig. 3. We choose two different spatial window sizes P_A and P_B, assuming P_A > P_B. Then we extract P_A × P_A and P_B × P_B spatial windows from the hyperspectral image separately. Let m be the most probable class when the window size is P_A × P_A, and n the second most probable class. When the following formula is satisfied, the input is finally classified into the mth category:

Conf_{P_A, n} < Conf_{P_A, m} × θ1    (8)

Otherwise, the following condition is checked:

Conf_{P_B, n′} < Conf_{P_B, m′} × θ2    (9)

where m′ is the most probable class when the window size is P_B × P_B, n′ is the second most probable class, and 0 < θ2 < 1. The detailed process of the adaptive method is shown in Fig. 3. First, two networks with different window sizes are trained (the structure of the networks is shown in Fig. 2), so that one of them classifies P_A × P_A × BAND image blocks and the other classifies P_B × P_B × BAND image blocks (BAND is the number of channels of the hyperspectral image). In the testing process, we feed the P_A × P_A × BAND and P_B × P_B × BAND blocks into the networks and calculate m, n, m′ and n′. Formulas (8) and (9) are then used to choose the best window size adaptively. Note that in some cases neither Formula (8) nor Formula (9) is satisfied. In those cases, we use the P_A × P_A window size instead of P_B × P_B to determine the final result, because a larger spatial window usually achieves better classification accuracy than a smaller one in our proposed network. The employment of spatial windows also brings the problem of extracting windows at edge pixels. To solve it, we can decrease the size of the spatial window, or combine the proposed method with a pixel-wise SVM or sparse representation based methods (Chen et al., 2011; Song et al., 2016).

After training, there will be V_FC3 → V_label: b1 → a1 = 0, b2 → a2 = 0, …, b_k → a_k = 1, …, b_C → a_C = 0.    (5)
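Putting Formulas (8) and (9) together, the window-size decision can be sketched as the following function. The confidence lists would come from the two trained networks; the default θ values are the ones used in the experiments of Section 3.

```python
def select_window(conf_A, conf_B, theta1=0.95, theta2=0.90):
    """Choose between the large-window (A) and small-window (B) predictions.

    conf_A, conf_B: per-class confidence lists from the P_A and P_B networks.
    Returns (chosen_window, predicted_class), with classes 1-indexed.
    """
    def top_two(conf):
        order = sorted(range(len(conf)), key=lambda k: conf[k], reverse=True)
        return order[0], order[1]

    m, n = top_two(conf_A)        # best / second-best class under P_A
    m2, n2 = top_two(conf_B)      # best / second-best class under P_B

    if conf_A[n] < conf_A[m] * theta1:      # Formula (8): P_A is confident enough
        return "A", m + 1
    if conf_B[n2] < conf_B[m2] * theta2:    # Formula (9): fall back to P_B
        return "B", m2 + 1
    return "A", m + 1                       # neither holds: prefer the larger window

print(select_window([0.9, 0.05, 0.05], [0.5, 0.4, 0.1]))   # ('A', 1)
print(select_window([0.45, 0.44, 0.11], [0.8, 0.1, 0.1]))  # ('B', 1)
```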

3. Experiment & analysis

Indian-Pines and Pavia University are two datasets widely used in the evaluation of hyperspectral classification algorithms. The Indian-Pines dataset was captured by the US AVIRIS (Green et al., 1998) imaging spectrometer. The resolution of the Indian-Pines image is



Table 1. Comparison of classification results between various methods on Indian-Pines.

Class  SVM     JSR     KNN-SR  Two-CNN  Proposed
1      0.7618  0.9439  0.9967  0.9856   0.9947
2      0.8284  0.9399  0.9645  0.9634   0.9806
3      0.8309  0.9756  0.9677  0.9538   0.9668
4      0.8351  0.9277  0.9764  0.9053   0.9783
5      0.9343  0.9559  0.9897  0.9487   0.9935
6      0.9496  0.9617  0.9112  0.9714   0.9296
7      0.8871  0.9912  0.9916  0.9510   0.9965
8      0.9526  0.8835  0.9774  0.9428   0.9719
9      0.8079  0.8893  0.9565  0.9476   0.9674
10     0.8495  0.9794  0.9767  0.9843   0.9814
11     0.8644  0.9938  0.9719  0.9522   0.9697
12     0.8186  0.9248  0.9527  0.9501   0.9624
13     0.9112  0.9456  0.9982  0.9657   1.00
14     0.9389  0.9745  0.9894  0.9921   0.9939
15     0.8698  0.9877  0.9864  0.9745   1.00
16     0.9896  0.8093  0.9947  0.9978   0.9962
OA     0.8769  0.9427  0.9751  0.9616   0.9802

Fig. 4. Illustration of the influence of different SVM parameters. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

145 × 145 × 220. There are 16 categories in Indian-Pines, including corns, soybeans, wheats and so on. Each pixel in the 145 × 145 spatial domain represents a 20 × 20 m² real surface area on the earth. The Pavia University dataset was acquired with the ROSIS sensor. It is a 610 × 340 image with 103 channels, and each pixel represents a 13 × 13 m² real area. The Pavia University dataset contains 9 categories. The experimental environment is an E3 CPU, a GTX Titan X GPU and the TensorFlow deep learning framework.

in Table 2. Our approach benefits from GPU parallel computing for deep learning; the time required to process each input is 0.23 s, while the KNN-SR method needs 1.62 s. Fig. 7 reports the average running time comparison between the various methods; as shown in the figure, our method has the lowest time complexity. We use the LIBSVM library (Chang and Lin, 2011) for the SVM part of this experiment. The parameters of the SVM classifier include the penalty coefficient C and the Gaussian kernel parameter σ. The image block has a fixed size of 20 × 20 × BAND. After selecting appropriate parameters, the SVM improves the final classification accuracy. The experiment shows that the proposed network can extract good image features and achieve superior performance. The effect of different SVM parameters is shown in Fig. 4, where the x-axis is the logarithm of the penalty coefficient, log2(C), and the y-axis is the logarithm of the Gaussian kernel parameter, log2(σ). The lines in Fig. 4 represent the classification accuracy of the SVM on the training samples, where the green line indicates that the model reaches an accuracy of 100% on the training samples with the optimal parameters. We select log2(C) = 5 and log2(σ) = −7 within the green line, and obtain an accuracy of 98.02% on the test samples. Table 3 compares the effect of FC1 + SVM, FC2 + SVM and FC3 + SVM. It can be seen that as the number of fully-connected layers increases, the accuracy gradually drops, so directly cascading the SVM to the output of FC1 is the best choice in our network. In Table 3, size refers to the spatial window size.

3.1. Experiment on the proposed network

In the multi-scale feature extraction layer, the numbers of 1 × 1 filters and 3 × 3 filters are both 200. In the feature fusion layer and the feature reduction layer, we use 400 and 100 filters respectively; the kernel size of these convolution layers is 3 × 3. The fully-connected layers FC1, FC2 and FC3 have 200, 100 and C nodes respectively, where C is the number of categories in the dataset. The Indian-Pines and Pavia University datasets are divided into training samples and test samples. The training samples are used to train the network, and the test samples are used to evaluate the classification accuracy. Images from the Indian-Pines dataset have a resolution of 145 × 145 pixels. Each pixel in that image represents a spectral line with a specific label, so the dataset has 145 × 145 = 21 025 samples for training and testing. To compare with other methods, we choose training samples as follows: let d_i be the number of samples of each class; we randomly select min{d_i × 20%, 200} samples from each class as training samples, and the remainder are regarded as test samples. The input of the proposed network is a P × P × BAND image block, and the output is a single label. The network is trained with stochastic gradient descent (SGD); the batch size is 128, the number of iterations is 12 000, the momentum is set to 0.9, the weight decay is 0.0001, and the learning rate is initialized to 0.01 and then gradually decays to 0.00001. The convolution layers are initialized using a Gaussian function with a mean of zero and a standard deviation of 0.05. Table 1 shows the effect of using the proposed network for classification on the Indian-Pines dataset. The experimental results indicate that for most of the categories, the proposed method achieves better performance.
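The per-class sampling rule min{d_i × 20%, 200} can be sketched as follows; the class counts in the example are toy values, not the real dataset's:

```python
import numpy as np

def split_train_test(labels, ratio=0.2, cap=200, seed=0):
    """Randomly pick min(d_i * ratio, cap) training samples per class;
    the rest become test samples. `labels` is a 1-D array of class ids."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        k = min(int(len(idx) * ratio), cap)
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.array(train_idx), np.array(test_idx)

labels = np.repeat([1, 2, 3], [50, 400, 1500])   # toy class counts
train, test = split_train_test(labels)
# class 1 -> 10 samples, class 2 -> 80, class 3 -> 200 (capped)
print(len(train))  # 290
```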

3.3. Experiment on adaptive window sizes

In order to study the classification effect of different block sizes, we feed image blocks of size 5 × 5, 10 × 10, 15 × 15, 20 × 20, 25 × 25 and 30 × 30 into the convolution network, and make a comparison with other state-of-the-art methods. Results of the proposed method, the Joint Sparse Representation (JSR) (Chen et al., 2011) and the kNN-based Sparse Representation (Song et al., 2016) are shown in Fig. 5. The accuracy of the proposed method increases continually as the size of the spatial window grows from 5 × 5 to 30 × 30. This proves that with a larger spatial window, the proposed method can capture more local similarity without being confused by the noise of irrelevant classes. The intuitive classification results on the Indian-Pines dataset are shown in Fig. 6. This dataset is a 145 × 145 × 224 image, of which 224 is the number of channels; each 1 × 1 × 224 vector of the image can be assigned to a class. The upper-left map is the ground truth for Indian-Pines. The second map is the result obtained

3.2. Combination of the proposed network and SVM

The three fully-connected layers are equivalent to a linear classifier, which lacks nonlinearity. To improve the classification performance, we keep the first three convolution layers and the first fully-connected layer FC1 unchanged, and replace the remaining network with the RBF-SVM. The three convolution layers are regarded as a feature extraction method. The effect of the combination of our proposed network and the SVM is shown


Table 2. Comparison of classification results between several methods and our CNN + SVM method.

Dataset           RBF-SVM  JSRC    FCN     KNN-SR  CNN     CNN + SVM
Indian Pines      0.8767   0.9427  0.9206  0.9773  0.9802  0.9839
Pavia University  0.8928   0.9103  0.9403  0.9802  0.9826  0.9844

Table 3. Comparisons between different combinations of our spectral–spatial CNN network and the classical SVM classifier (columns give the spatial window size).

Method      5      10     15     20     25     30
FC1 + SVM   0.917  0.957  0.975  0.984  0.985  0.991
FC2 + SVM   0.913  0.952  0.973  0.981  0.982  0.990
FC3 + SVM   0.912  0.950  0.972  0.980  0.981  0.988
CNN         0.913  0.948  0.971  0.979  0.982  0.989

The result of the proposed adaptive window size method is shown in the last map of Fig. 6. Let P_A = 30 and P_B = 20. In Formulas (8) and (9), θ1 is set to 0.95 and θ2 is set to 0.90. The value of θ1 is close to 1, which ensures that we only try the P_B × P_B block when the P_A × P_A block is difficult to classify (i.e., the confidence of recognizing the input as the mth class is close to the confidence of recognizing it as the nth class). The reason why θ1 > θ2 is that the larger window size P_A × P_A usually has a higher accuracy than the smaller size P_B × P_B. Therefore, only when P_A × P_A is very confused and P_B × P_B gives a high confidence do we choose the size P_B × P_B. The result of the adaptive window size method is almost the same as the one using only P_A × P_A windows. But as shown in the last picture of Fig. 6, on the right side there is an area painted red which is classified wrongly with P_A × P_A windows but correctly with the adaptive windows and the P_B × P_B windows. This means that our adaptive method successfully makes the classifier use the relatively smaller P_B × P_B windows there. The area painted red is a class-mixed area; if we use too large a window to extract features in that area, the network will be confused. Our adaptive method can accurately select the superior window size.

Fig. 5. Illustration of the influence of different sizes of spatial windows.

by directly applying the SVM pixel-wise without considering any spatial similarity. Using the proposed network yields an obvious improvement. As can be seen in Fig. 6, as the window size increases, the classification result improves continually. The algorithm has problems dealing with the edge pixels of the image, because we cannot extract large windows at the edge. So we can apply relatively small windows at edge pixels, or use JSR or KNN-SR as a supplement.

Fig. 6. Classification results on the Indian Pines dataset. The first map is the ground truth; the second is the result obtained by directly applying SVM to the spectral vectors; the following three maps use the proposed network with spatial windows of 10 × 10, 20 × 20, and 30 × 30; the last is the result of the adaptive window size method. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)



Fig. 7. The average running time comparison between various methods.

4. Conclusion

To address the problem that hyperspectral images contain many redundant channels, we propose a spectral–spatial convolutional neural network to extract hyperspectral features. The proposed network consists of a multi-scale feature extraction layer, a feature fusion layer, and a feature reduction layer. Our method achieves 98.02% accuracy on the Indian Pines dataset, 0.51% higher than other methods; combined with an SVM classifier, it achieves 98.39% and 98.44% accuracy on the Indian Pines and Pavia University datasets, respectively. The proposed method exploits more information from the local similarity in the spatial domain by using a larger spatial window in the CNN. Finally, we present a method to adaptively select the best local window size. On the AVIRIS and ROSIS datasets, the proposed method significantly improves hyperspectral image classification accuracy. In the adaptive selection of spatial window sizes, we only consider a choice between two fixed, discrete window sizes. In future work, we will investigate how to adaptively choose window sizes within a continuous domain.

Acknowledgments

This work is partially funded by the MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology (Grant No. KM20006074), the National Key Research and Development Program of China (2016YFD0700204–02), the Major State Basic Research Development Program of China (973 Program 2015CB351804), and the National Natural Science Foundation of China under Grants No. 61572155, 61672188 and 61272386. We would also like to acknowledge NVIDIA Corporation, which kindly provided two GPUs.
