
Deep Hybrid Dilated Residual Networks for Hyperspectral Image Classification

Feilong Cao*, Wenhui Guo

Department of Applied Mathematics, College of Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, P. R. China

*Corresponding author. Email addresses: [email protected] (Feilong Cao), [email protected] (Wenhui Guo)

Abstract

This study presents a new architecture for deep convolution networks, an end-to-end hybrid dilated residual network wherein 3D cube images are input for hyperspectral image (HSI) classification, termed 3D-2D SSHDR. The proposed network can greatly improve the performance of HSI classification, as it selects the spectral bands adaptively and avoids the limitations of handcrafted selection. Specifically, 3D spectral residual blocks are initially used to learn discriminant features from the rich spectral features. Subsequently, the extracted 3D images with spectral features are reshaped into 2D feature maps as the input for spatial feature learning, which does not affect the extraction of spatial-spectral features and reduces the number of parameters. This is followed by hybrid dilated convolution (HDC) residual blocks that continue learning discriminative spatial features, expanding the receptive field of the convolution kernel without increasing the computational complexity and avoiding the gridding effect generated by dilated convolution. Finally, the proposed network is trained via supervised learning. Besides, this method can accelerate the convergence speed and alleviate the over-fitting problem owing to the added batch normalization layers and the design of a multi-residual structure. Experimental results validate that the performance of the proposed network is superior to that of existing methods on several HSI benchmark datasets.

Keywords: classification; hyperspectral image; deep residual learning; hybrid dilated convolution; spectral-spatial feature extraction

1. Introduction

Hyperspectral images (HSIs) contain abundant spectral information [1], making the accurate recognition of the corresponding ground materials possible. Their importance is emphasized
in the fields of remote sensing such as environment, crop, surface, and physics-related applications [2–5]. Consequently, HSI classification is a hot topic and has emerged as the focus of several studies. Traditional classification methods for HSI include the support vector machine (SVM) [6], the manifold ranking method [7], kernel-based methods [8], the sparse representation method [9], and artificial neural networks [10]. Recently, deep learning [11, 12], which arose from research into artificial neural networks, has become extremely popular as one of the most advanced machine learning techniques. The main advantage of deep learning is that it combines low-level features to form more abstract high-level attribute categories or features, thereby discovering distributed feature representations of data. Accordingly, a complicated manual feature selection process is not required, and deep learning can automatically discover effective features. Within the deep learning framework, convolutional neural networks (CNNs) constitute classical models and are widely used in image classification and recognition [13]. Results indicate that using CNNs to classify ordinary images is a good strategy. As widely known, HSI is a 3-dimensional (3D) cube, which differs from ordinary natural images: it includes 1-dimensional (1D) spectral information and 2-dimensional (2D) spatial information [14]. There are hundreds of bands collecting the geometric information of the target scene in the spatial domain, and each pixel records the reflectance of its spatial location [15]. HSI has a large number of spectral bands, high redundancy, and a large amount of data with limited training samples, which makes classification difficult. To this end, both spatial and spectral features should be considered to better classify HSI. The existing HSI classification frameworks are mainly divided by some researchers into two branches, namely one for spatial feature learning and another for spectral feature learning. For example, in [16], the spectral-spatial unified networks (SSUN) framework was separated into two branches, where one extracts spectral feature information with a long short-term memory (LSTM) structure and the other extracts spatial feature information with a CNN. This separation might result in the insufficient extraction of information. Recently, more studies have focused on HSI classification based on 3D convolution. Li et al. [17] proposed a 3D convolutional neural network (3D-CNN) for the spectral-spatial classification of HSI. Subsequently, a 3D recurrent convolutional neural network (R-3D-CNN) [18] framework was proposed to classify HSI via 3D-CNN. These 3D network architectures can extract more valuable feature information than the previous methods. However, it is not easy to accurately distinguish between discriminant spectral feature information and spatial feature information, which may lead to the loss of important feature information to a certain extent.


To overcome the above-mentioned disadvantages, it is necessary to design a network that can distinguish spectral and spatial feature information as much as possible. Additionally, the network should prevent the problems of over-fitting, vanishing gradients, and exploding gradients. Therefore, it is natural to consider residual networks [19–21], since they are easier to optimize and their accuracy can be improved by increasing the depth. Based on these works, some interesting studies simply increased the network depth to promote HSI classification performance [22–24]. In [25], two shallow CNNs classified HSIs by stacking spectral patches. In [26], a novel network took advantage of both CNNs and multiple-feature learning to better predict the class labels of HSI pixels, and this method could enhance feature extraction. Zhong et al. [27] proposed the spectral-spatial residual network (SSRN), in which residual blocks are used in the process of feature extraction to improve the classification accuracy. Luo et al. [24] presented a novel model, namely the convolution neural network framework of HSI (HSI-CNN), to boost the performance. In [28], an attention inception neural network (AI-Net) architecture was learned adaptively by dynamic routing between attention inception modules. Shamsolmoali et al. [29] devised a new network-in-network, which exploited spatial and spectral information and produced high-level features from the original HSIs. In [30], an effective statistical metric learning method was developed for the spectral-spatial classification of HSI. Notwithstanding, we still expect to explore advanced methods for the issue of HSI classification based on limited training samples.

Inspired by [27, 31, 32], this study proposes a new HSI classification network by constructing high-dimensional residual networks with hybrid dilated convolution (HDC), referred to as 3D-2D SSHDR. Firstly, the proposed framework designs two groups of 3D convolutional residual blocks to extract spectral features and retain discriminant spectral feature information. Subsequently, the discriminant feature information is preserved and transformed from 3D column data into 2D feature maps. Finally, two groups of HDC residual blocks are used to extract spatial features. Meanwhile, a simple residual network of residual networks (RoR), namely a multi-residual network [33], is embedded in the network to strengthen the learning ability of the residual network, and the final features correspond to a fusion of spectral and spatial features. It should be noted that this paper adopts HDC to enhance the feature learning ability for HSI, which expands the receptive field and avoids the shortcomings of the gridding effect caused by dilated convolution without increasing the number of parameters. This is likely to be beneficial to the classification of HSI. Then, we add skipping layers for spectral feature learning and spatial feature learning, respectively, to extract the spatial-spectral information sufficiently. Next, we exploit a 3D framework in spectral feature extraction and then use a 2D framework to extract spatial features. This structure does not affect the extraction of spatial-spectral features but reduces the number of parameters.

The major contributions of this study are summarized as follows:

• We propose an end-to-end hybrid dilated residual network to enhance the feature learning ability for HSI. Without increasing the number of parameters, this method not only enlarges the receptive field but also avoids the defects of the gridding effect caused by dilated convolution, which is conducive to feature extraction.

• A multi-residual network embedded in the 3D-2D framework is presented to further promote the extraction of deep features. In particular, the spatial extraction with 2D convolution considerably reduces the number of parameters, thereby simplifying the network structure.

• Substantial experiments are conducted on several popular HSI benchmarks, which contain Indian Pines (IN), Kennedy Space Center (KSC), and University of Pavia (UP). The experimental results verify the superior performance of the proposed approach in comparison with existing methods.

The rest of this paper is organized as follows. In Section 2, a detailed description of the proposed framework is presented. The experimental results and analysis are given in Section 3. A summary and future directions of this study are discussed in Section 4.

2. Methodology

HSI is typically composed of hundreds of spectral bands spanning the visible to the infrared spectrum. It corresponds to a 3D image that reflects the spatial and spectral information of objects, and it carries a large amount of data. In this section, we design the framework: an end-to-end, continuous process of extracting spectral and spatial features for HSI classification. This framework takes into account not only the information in each dimension but also generality. The proposed architecture is termed 3D-2D SSHDR and comprises five parts: a spectral feature learning process, a 3D-to-2D deformation part, a spatial feature learning process, an average pooling layer, and a fully connected (FC) layer. Additionally, some convolution layers are followed by batch normalization (BN) [34] and the ReLU activation function, and a softmax layer is connected to output a 1 × 1 × L classification result. We elaborate on the proposed 3D-2D SSHDR framework using 3D samples of size 7 × 7 × 103 as an example, as shown in Fig. 1.


Fig. 1: 3D-2D SSHDR with 7 × 7 × 103 input HSI cube.

2.1. Spectral feature learning

First, we consider 3D multi-channel convolution as an example to illustrate the principle of 3D convolution. Subsequently, we explain the spectral feature learning step by step. Assume that the number of channels of the input image is 3. Then the size of the input image is expressed as (3, depth, height, width), and the size of the convolution kernel is expressed as (3, k_d, k_h, k_w). The 3D convolution operation is similar to the 2D convolution operation: at each sliding position, the kernel window of size (k_d, k_h, k_w) is convolved with the corresponding input values, and each such operation yields one value of the output 3D feature volume. A diagrammatic sketch of 3D multi-channel convolution is displayed in Fig. 2. To describe the convolution calculation process clearly, we calculate the convolution

Fig. 2: 3D multi-channel convolution.


Fig. 3: Spectral feature learning includes a skipping layer and two spectral residual blocks.

using the following equation:

X_{i,j}^{x,y,z} = \sigma\left( \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} W_{i,j,m}^{p,q,r} \, X_{i-1,m}^{(x+p),(y+q),(z+r)} + b_{i,j} \right),   (1)

where X_{i,j}^{x,y,z} represents the j-th feature cube in the i-th layer at position (x, y, z), m indexes the feature maps in the (i−1)-th layer that are connected to the current feature map, W_{i,j,m}^{p,q,r} denotes the weight connected to the m-th feature cube at position (p, q, r), b_{i,j} stands for the bias of the j-th feature cube in the i-th layer, P_i and Q_i denote the length and width of the kernel in the spatial domain, respectively, R_i expresses the spectral extent of the kernel, and \sigma(\cdot) is the activation function, for which the ReLU function is used.

In the first part, we introduce the spectral feature learning in detail, as shown in Fig. 3. The spectral feature learning consists of two convolution layers, two spectral residual blocks, and a skipping connection. The input of the network is a cube of size 7 × 7 × 103.

Initially, we use 24 convolution kernels of size 1 × 1 × 7 with a stride of (1, 1, 2) to convolve the HSI cube and obtain 24 feature columns of size 7 × 7 × 49; this step alleviates the redundancy of the spectral information. Later, two spectral residual blocks are connected, wherein each constitutes a shortcut and two convolution layers using padding to ensure that the size of the output feature maps is the same as that of the input. A skipping connection of the outermost layer is used to alleviate the problem of vanishing gradients; it also contributes to the back-propagation of the gradient and to extracting features more fully. At each convolution layer, we exploit 24 convolution kernels of size 1 × 1 × 7 to learn rich spectral features and add a BN layer and the ReLU activation function in front of some convolution layers to improve the classification results. The last convolution layer, adopting 128 convolution kernels with a size of 1 × 1 × 49, is implemented to retain discriminative spectral features. Let X^{u} and W^{u} represent the 3D input samples and spectral kernels, respectively. The output of the first convolution layer is computed as follows:

X^{u+1} = X^{u} \ast W^{u+1} + b^{u}.   (2)

The output of the convolution in the residual blocks is

X^{u+3} = X^{u+1} \ast W' + F(X^{u+1}; \theta),   (3)

where

F(X^{u+1}; \theta) = \sigma(X^{u+2}) \ast W^{u+3} + b^{u+3}   (4)

denotes a residual function, \theta = \{W^{u+2}, W^{u+3}, b^{u+2}, b^{u+3}\}, and

X^{u+2} = \sigma(X^{u+1}) \ast W^{u+2} + b^{u+2}.   (5)

Similarly, the output of the second residual block is

X^{u+5} = X^{u+3} \ast W'' + F(X^{u+3}; \delta).   (6)

The final output of the spectral feature learning is

X_{spectral} = X^{u+3} \ast W'' + F(X^{u+3}; \delta) + X^{u+1} \ast W''',   (7)

where

F(X^{u+3}; \delta) = \sigma(X^{u+4}) \ast W^{u+5} + b^{u+5}   (8)

denotes a residual function, \delta = \{W^{u+4}, W^{u+5}, b^{u+4}, b^{u+5}\}, and

X^{u+4} = \sigma(X^{u+3}) \ast W^{u+4} + b^{u+4},   (9)

where "\ast" represents a convolution operation, and W', W'', and W''' are the convolution kernels of the outer skipping structures, respectively.

2.2. 3D to 2D deformation

After the spectral features are learned, 128 discriminative spectral feature columns of size 7 × 7 × 1 are preserved. Next, we use the retained features to learn the spatial features so as to ultimately achieve discriminative spectral-spatial features. Obviously, the spatial features only need to be learned in 2D space, which does not affect the extraction of spatial-spectral features but reduces the number of parameters. Therefore, the 3D features are reshaped into 128 2D feature maps of size 7 × 7, and the spatial features are learned on these feature maps. The transformation process is shown in Fig. 4.
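To make the structure above concrete, the following is a minimal Keras sketch of the spectral stage and the subsequent 3D-to-2D reshape. It is not the authors' released implementation: the shortcut convolutions W', W'', W''' of Eqs. (3)–(7) are simplified to identity connections, and the exact BN/ReLU placement is an assumption.

```python
# Minimal sketch of the spectral stage (Section 2.1) and the 3D-to-2D reshape
# (Section 2.2), assuming a 7 x 7 x 103 input cube. Identity shortcuts are used
# in place of the shortcut convolutions W', W'', W''' of Eqs. (3)-(7).
from tensorflow.keras import layers, models

def spectral_res_block(x, filters=24):
    # Pre-activation residual block: (BN -> ReLU -> 1x1x7 Conv3D) twice, plus a shortcut.
    shortcut = x
    for _ in range(2):
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.Conv3D(filters, (1, 1, 7), padding='same')(x)
    return layers.Add()([x, shortcut])

inputs = layers.Input(shape=(7, 7, 103, 1))
x = layers.Conv3D(24, (1, 1, 7), strides=(1, 1, 2))(inputs)  # 24 columns of 7 x 7 x 49
skip = x                                                     # outermost spectral skip
x = spectral_res_block(x)
x = spectral_res_block(x)
x = layers.Add()([x, skip])                                  # Eq. (7) with identity skip
x = layers.Conv3D(128, (1, 1, 49))(x)                        # 128 columns of 7 x 7 x 1
x = layers.Reshape((7, 7, 128))(x)                           # 3D-to-2D deformation
spectral_model = models.Model(inputs, x)
```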


Fig. 4: Framework of the 3D-to-2D deformation.

2.3. Spatial feature learning

Generally, the smaller the convolution kernel, the smaller the receptive field and the less global the view. Conversely, increasing the size of the convolution kernel enlarges the receptive field, thereby increasing the amount of information observed in the image and promoting feature extraction. However, enlarging the convolution kernels also increases the corresponding computation. To solve this issue, dilated convolution is introduced, which keeps the computational complexity unchanged while the receptive field expands. Dilated convolution makes the convolution kernel "fluffy": the inserted positions of the dilated kernel are filled with 0, and the subsequent calculation follows the original convolution operation. The form of dilated convolution is shown in Fig. 5. The 2D dilated convolution is defined as in Eq. (10):

y(p, q) = \sum_{i=1}^{P} \sum_{j=1}^{Q} x(p + r \times i, \, q + r \times j) \, w(i, j),   (10)

where y(p, q), x(p, q) and w(i, j) denote the output, input and kernel of dilated convolution, respectively; P and Q represent the length and width of kernel w(i, j), respectively; r is the dilation coefficient of dilated convolution, and r = 1 denotes standard convolution.
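For illustration, the following is a direct NumPy transcription of Eq. (10) with "valid" boundary handling; the function name and the boundary treatment are our own choices and only serve to make the definition concrete.

```python
# Direct transcription of Eq. (10); r = 1 corresponds to standard convolution,
# while a larger r spreads the P x Q kernel over a footprint of r(P-1)+1 positions
# per axis without adding parameters.
import numpy as np

def dilated_conv2d(x, w, r=1):
    P, Q = w.shape
    H, W = x.shape
    out_h, out_w = H - r * P, W - r * Q        # 'valid' output size (assumed)
    y = np.zeros((out_h, out_w))
    for p in range(out_h):
        for q in range(out_w):
            for i in range(1, P + 1):
                for j in range(1, Q + 1):
                    y[p, q] += x[p + r * i, q + r * j] * w[i - 1, j - 1]
    return y
```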

Fig. 5: 3 × 3 convolution kernels with different dilation coefficients corresponding to 1, 2 and 3.


Fig. 6: Combination of dilation coefficients of hybrid dilated convolution (1, 2, 3).

However, due to the gridding effect caused by dilated convolution, the percentage of kernel weights that actually participate in the calculation is reduced. In the process of feature extraction, HDC residual blocks are therefore used to extract spatial features. By combining convolution layers with different dilation coefficients, as shown in Fig. 6, the receptive field is expanded and the gridding effect is avoided. The specific process of spatial feature extraction is given in Fig. 7. First, 24 convolution kernels of 3 × 3 are used to convolve the input of 128 feature maps of size 7 × 7. The stride of the convolution is (1, 1), and the padding is "valid", yielding 24 feature maps of size 5 × 5. Afterward, a combination of HDC residual blocks and a skip structure is used to learn spatial features, wherein the dilation coefficients of this group of convolution layers are 1 and 3, and the padding is "same" in each convolution layer. Then we utilize 24 kernels of 3 × 3 to learn spatial features. The BN layer and ReLU activation function are added to improve the classification results. The use of residual blocks alleviates the problem of vanishing gradients and avoids accuracy decay. The calculation process is similar to that of the spectral feature extraction.

Ultimately, an average pooling layer with a size of 5 × 5 is connected after the spectral-spatial feature learning to obtain 24 feature maps of size 1 × 1, which aims to reduce data variance and computational complexity. Then an FC layer and a softmax layer are used to classify the HSI.
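A corresponding sketch of the spatial stage, continuing from the reshaped 7 × 7 × 128 spectral features, is given below. Again this is only indicative: identity shortcuts, the BN/ReLU placement, and the number of classes (here 16, as for the IN dataset) are assumptions.

```python
# Minimal sketch of the spatial stage (Section 2.3): a 3x3 convolution, two HDC
# residual blocks with dilation rates 1 and 3, average pooling, and FC + softmax.
from tensorflow.keras import layers, models

num_classes = 16  # e.g., 16 land-cover classes for the IN dataset (assumption)

def hdc_res_block(x, dilations=(1, 3), filters=24):
    # Hybrid dilated residual block: two 3x3 convolutions with different dilation
    # rates ('same' padding), plus an identity shortcut.
    shortcut = x
    for rate in dilations:
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.Conv2D(filters, (3, 3), dilation_rate=rate, padding='same')(x)
    return layers.Add()([x, shortcut])

inputs_2d = layers.Input(shape=(7, 7, 128))                  # reshaped spectral features
x = layers.Conv2D(24, (3, 3), padding='valid')(inputs_2d)    # 24 feature maps of 5 x 5
skip = x                                                     # outermost spatial skip
x = hdc_res_block(x)
x = hdc_res_block(x)
x = layers.Add()([x, skip])
x = layers.AveragePooling2D(pool_size=(5, 5))(x)             # 24 maps of 1 x 1
x = layers.Flatten()(x)
outputs = layers.Dense(num_classes, activation='softmax')(x) # 1 x 1 x L result
spatial_model = models.Model(inputs_2d, outputs)
```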

Fig. 7: Spatial feature learning including a skipping layer and two HDC residual blocks.


Fig. 8: The process of HSI classification based on the 3D-2D SSHDR.

The output is \hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L], where L denotes the number of categories of samples used for classification.

2.4. Network of HSI classification based on the 3D-2D SSHDR

Fig. 8 shows the entire process of obtaining the experimental results based on the proposed 3D-2D SSHDR network. For the different datasets, we set different percentages of training, validation, and test data. For example, the IN and KSC datasets are divided into 20% for training, 10% for validation, and 70% for testing, while the UP dataset is split into 10% for training, 10% for validation, and 80% for testing. Let A_1, A_2, and A_3 denote the training, validation, and test data, respectively, and let Y_1, Y_2, and \hat{Y}_3 stand for the training labels, validation labels, and final prediction labels, respectively, where A_i = \{a_1, a_2, \ldots, a_N\} \in R^{m \times m \times n} and Y_i = \{y_1, y_2, \ldots, y_N\} \in R^{1 \times 1 \times l}. The complete experimental process is mainly divided into three steps. To begin with, the training set is used to train the model, and the validation set is used to validate it, wherein the optimal model is retained. The optimal model then predicts the remaining test set to obtain the prediction labels \hat{Y}_3 = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L\}. The prediction labels are compared with the real object labels Y_3 = \{y_1, y_2, \ldots, y_L\} to compute the classification accuracy. In the process of training, the cross-entropy loss function is used to update the model. The loss function is given as Eq. (11):

Loss = -\frac{1}{L} \sum_{i=0}^{L} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right),   (11)

where L denotes the number of categories of each dataset, y_i represents the real sample labels, and \hat{y}_i denotes the predicted sample labels.
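A minimal training sketch consistent with this procedure is shown below. The optimizer follows the RMSProp choice reported in Section 3.3; the epoch count, batch size, and checkpoint criterion are illustrative assumptions, the standard categorical cross-entropy stands in for Eq. (11), and A1, Y1, A2, Y2 follow the notation above.

```python
# Sketch of the training step: fit on (A1, Y1), validate on (A2, Y2), and keep
# the best model for later evaluation on the test set A3. `model`, A1, Y1, A2,
# and Y2 are placeholders following the notation of Section 2.4.
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint

model.compile(optimizer=RMSprop(),
              loss='categorical_crossentropy',   # stand-in for the loss in Eq. (11)
              metrics=['accuracy'])
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_accuracy',
                             save_best_only=True)             # retain the optimal model
model.fit(A1, Y1, validation_data=(A2, Y2),
          epochs=200, batch_size=16, callbacks=[checkpoint])  # assumed hyperparameters
```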

Table 1: The information of datasets.

IN dataset:
No. | Class | Number of samples
1 | Alfalfa | 46
2 | Corn-notill | 1428
3 | Corn-mintill | 830
4 | Corn | 237
5 | Grass-pasture | 483
6 | Grass-tree | 730
7 | Grass-pasture-mowed | 28
8 | Hay-windrowed | 478
9 | Oats | 20
10 | Soybean-notill | 972
11 | Soybean-mintill | 2455
12 | Soybean-clean | 593
13 | Wheat | 205
14 | Woods | 1265
15 | Buildings-Grass-Trees | 386
16 | Stone-Steel-Towers | 93
Total | | 10249

KSC dataset:
No. | Class | Number of samples
1 | Scrub | 347
2 | Willow swamp | 243
3 | CP hammock | 256
4 | Slash pine | 252
5 | Oak/broadleaf | 161
6 | Hardwood | 229
7 | Swamp | 105
8 | Graminoid marsh | 390
9 | Spartina marsh | 520
10 | Cattail marsh | 404
11 | Salt marsh | 419
12 | Mud flats | 503
13 | Water | 927
Total | | 5211

UP dataset:
No. | Class | Number of samples
1 | Asphalt | 6631
2 | Meadows | 18649
3 | Gravel | 2099
4 | Trees | 3064
5 | Painted metal sheets | 1345
6 | Bare soil | 5029
7 | Bitumen | 1330
8 | Self-blocking bricks | 3682
9 | Shadows | 976
Total | | 42776

3. Experiment results and discussions

3.1. Datasets and evaluation indices

In this subsection, we use three HSI datasets, Indian Pines (IN), Kennedy Space Center (KSC), and University of Pavia (UP) (available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes), to test the performance of the proposed architecture. For the three datasets, we run ten trials and take the mean and standard deviation of the classification metrics as the final results. In the following, a brief introduction to these three datasets is given, and the information of the datasets is listed in Table 1. The IN dataset was collected by AVIRIS over Indiana in 1992 and comprises 16 types of surface materials. It contains 145 × 145 pixels with a resolution of 20 m per pixel and originally 220 bands. Given that 20 bands are corrupted, the remaining 200 bands are used for the experiments, and thus the data size becomes 145 × 145 × 200. The KSC dataset was gathered by AVIRIS over Florida in 1996 and contains 13 categories with 512 × 614 pixels and 18 m resolution. There are 176 spectral bands for evaluation, and the data size is 512 × 614 × 176. The UP dataset contains 9 types of urban land covers that were acquired via the Reflective Optics System Imaging Spectrometer in northern Italy in 2001. The data size is 610 × 340 × 103 with a resolution of 1.3 m/pixel. Later, three indices [35], the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa), are measured to illustrate the effectiveness of this structure. The OA represents the ratio of the number of samples that are accurately classified to
the total number of samples, i.e.,

OA = \frac{\sum_{i=1}^{n} a_{ii}}{N} \times 100\%,   (12)

where N denotes the total number of samples, n is the number of categories, and a_{ii} represents the diagonal elements of the corresponding confusion matrix. The class accuracy (CA) refers to the percentage of pixels of each category that are accurately classified, that is,

CA_i = \frac{a_{ii}}{\sum_{j=1}^{n} a_{ij}} \times 100\%,   (13)

where a_{ij} represents the elements that originally belong to class i and are assigned to class j. The AA stands for the average value of CA_i. It is defined as:

AA = \frac{\sum_{i=1}^{n} CA_i}{n}.   (14)

Kappa is an index to measure the accuracy of classification. Its mathematical expression is:

Kappa = \frac{N \sum_{i=1}^{n} a_{ii} - \sum_{i=1}^{n} (a_{i+} a_{+i})}{N^2 - \sum_{i=1}^{n} (a_{i+} a_{+i})},   (15)

where a_{i+} denotes the sum of the i-th row and a_{+i} denotes the sum of the i-th column. In short, larger OA, AA, and Kappa reflect better effectiveness of the proposed method.

3.2. Comparison experiments

In this subsection, we report the classification results of the proposed approach in comparison with other methods, including SVM-RBF (SVM with radial basis function) [36], 1D-CNN [37], AI-Net [28], 2D-CNN [38], 3D-CNN [17], 3D-DenseNet [39], and SSRN [27]. Tables 2–4 give a comparison of OA, AA, and Kappa with the different methods on the three benchmark HSI datasets. It can be seen clearly that the accuracies of the proposed method are superior to those of the other methods. One possible reason is that the proposed framework adopts HDC, in which the dilated convolution enlarges the receptive field of the convolution kernel without adding parameters. Another point is that the hybrid dilated convolution relieves the gridding effect. This design facilitates capturing more contextual information.
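Before turning to the comparisons, the following helper (our own sketch, not code from the paper) shows how the three indices of Eqs. (12)–(15) can be computed from a confusion matrix:

```python
# OA, AA and Kappa (x100) from a confusion matrix whose rows are true classes
# and columns are predicted classes, following Eqs. (12)-(15).
import numpy as np

def classification_metrics(conf):
    conf = conf.astype(float)
    N = conf.sum()                                   # total number of samples
    oa = np.trace(conf) / N                          # Eq. (12)
    ca = np.diag(conf) / conf.sum(axis=1)            # per-class accuracy, Eq. (13)
    aa = ca.mean()                                   # Eq. (14)
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / N ** 2
    kappa = (oa - pe) / (1 - pe)                     # Eq. (15)
    return 100 * oa, 100 * aa, 100 * kappa
```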


Table 2: Classification results with different methods for the IN dataset.

Class | SVM-RBF | 1D-CNN | AI-Net | 2D-CNN | 3D-CNN | 3D-DenseNet | SSRN | Ours
1 | 75.00 | 66.66 | 96.97 | 88.88 | 100 | 100 | 97.82 | 100
2 | 78.63 | 91.68 | 97.06 | 99.29 | 98.90 | 97.88 | 99.17 | 99.70
3 | 64.41 | 75.00 | 99.12 | 99.27 | 99.65 | 99.64 | 99.53 | 99.82
4 | 59.09 | 67.04 | 95.65 | 97.15 | 98.75 | 99.38 | 97.79 | 99.35
5 | 86.98 | 86.09 | 100 | 97.92 | 98.56 | 99.70 | 99.24 | 96.88
6 | 96.91 | 97.43 | 98.42 | 100 | 99.80 | 100 | 99.51 | 99.41
7 | 77.72 | 81.81 | 100 | 100 | 94.73 | 81.82 | 98.70 | 100
8 | 97.38 | 99.47 | 100 | 100 | 100 | 99.70 | 99.85 | 100
9 | 18.75 | 56.25 | 100 | 68.75 | 100 | 93.75 | 98.50 | 100
10 | 77.94 | 87.41 | 96.01 | 97.29 | 99.55 | 99.71 | 98.74 | 99.27
11 | 87.92 | 86.22 | 99.40 | 98.19 | 98.83 | 99.42 | 99.30 | 99.83
12 | 73.21 | 87.50 | 88.45 | 98.21 | 96.52 | 99.25 | 98.43 | 98.55
13 | 97.56 | 96.34 | 100 | 100 | 100 | 100 | 100 | 100
14 | 97.72 | 98.81 | 98.66 | 100 | 99.77 | 99.55 | 99.31 | 99.86
15 | 50.94 | 55.18 | 97.12 | 96.22 | 100 | 98.87 | 99.20 | 98.19
16 | 75.67 | 89.18 | 97.05 | 98.64 | 98.57 | 95.38 | 97.82 | 100
OA(%) | 83.67 ± 0.40 | 88.34 ± 0.56 | 96.97 ± 0.56 | 98.89 ± 0.67 | 99.07 ± 0.13 | 99.20 ± 0.20 | 99.19 ± 0.26 | 99.46 ± 0.07
AA(%) | 75.96 ± 0.53 | 82.63 ± 0.61 | 97.58 ± 0.92 | 97.82 ± 0.93 | 98.94 ± 0.15 | 97.75 ± 0.58 | 98.93 ± 0.59 | 99.43 ± 0.21
k × 100 | 81.21 ± 0.48 | 86.63 ± 0.83 | 96.54 ± 0.56 | 98.56 ± 0.62 | 98.97 ± 0.11 | 99.09 ± 0.22 | 99.07 ± 0.30 | 99.38 ± 0.47

Table 3: Classification results with different methods for KSC dataset.

Class | SVM-RBF | 1D-CNN | AI-Net | 2D-CNN | 3D-CNN | 3D-DenseNet | SSRN | Ours
1 | 95.49 | 98.31 | 99.43 | 99.43 | 100 | 100 | 99.24 | 100
2 | 87.58 | 86.08 | 100 | 90.20 | 100 | 100 | 99.40 | 100
3 | 78.43 | 83.33 | 95.65 | 93.13 | 96.35 | 100 | 100 | 98.88
4 | 51.65 | 78.10 | 89.66 | 83.58 | 97.64 | 100 | 98.32 | 99.41
5 | 43.75 | 54.68 | 93.94 | 85.93 | 93.96 | 100 | 100 | 100
6 | 30.65 | 33.33 | 87.43 | 97.81 | 99.37 | 98.83 | 100 | 100
7 | 88.88 | 96.42 | 94.03 | 96.42 | 100 | 100 | 100 | 98.73
8 | 74.89 | 90.60 | 99.34 | 97.27 | 100 | 99.33 | 100 | 100
9 | 87.82 | 89.42 | 100 | 98.31 | 100 | 100 | 100 | 100
10 | 90.90 | 96.28 | 98.95 | 95.66 | 100 | 100 | 100 | 100
11 | 98.00 | 98.80 | 100 | 100 | 100 | 100 | 100 | 100
12 | 83.38 | 90.79 | 98.91 | 98.75 | 100 | 100 | 100 | 100
13 | 99.60 | 100 | 100 | 100 | 100 | 100 | 100 | 100
OA(%) | 84.09 ± 0.63 | 89.56 ± 0.76 | 98.22 ± 0.57 | 96.78 ± 0.54 | 99.47 ± 0.49 | 99.75 ± 0.06 | 99.61 ± 0.22 | 99.89 ± 0.07
AA(%) | 77.77 ± 0.85 | 84.32 ± 1.04 | 96.82 ± 0.66 | 95.12 ± 0.89 | 99.02 ± 0.66 | 99.63 ± 0.11 | 99.33 ± 0.57 | 99.77 ± 0.11
k × 100 | 82.33 ± 0.71 | 88.37 ± 0.99 | 98.02 ± 0.63 | 96.42 ± 0.75 | 99.42 ± 0.44 | 99.73 ± 0.07 | 99.56 ± 0.25 | 99.88 ± 0.06

Table 4: Classification results with different methods for UP dataset.

Class | SVM-RBF | 1D-CNN | AI-Net | 2D-CNN | 3D-CNN | 3D-DenseNet | SSRN | Ours
1 | 92.71 | 94.63 | 96.35 | 98.63 | 99.43 | 99.28 | 99.92 | 99.85
2 | 98.42 | 98.59 | 99.54 | 99.93 | 99.87 | 100 | 99.96 | 99.90
3 | 73.90 | 76.39 | 99.29 | 90.88 | 99.07 | 97.46 | 98.46 | 99.81
4 | 93.53 | 96.64 | 99.14 | 98.97 | 99.54 | 100 | 99.69 | 100
5 | 99.29 | 99.84 | 99.90 | 100 | 100 | 99.91 | 99.99 | 100
6 | 86.37 | 87.90 | 99.70 | 99.72 | 99.47 | 99.93 | 99.94 | 99.92
7 | 87.49 | 89.54 | 96.46 | 99.46 | 99.62 | 99.90 | 99.82 | 99.07
8 | 92.10 | 92.99 | 92.83 | 98.77 | 98.51 | 99.49 | 99.22 | 99.22
9 | 99.44 | 99.55 | 100 | 98.77 | 100 | 100 | 99.95 | 99.49
OA(%) | 93.60 ± 0.09 | 94.68 ± 0.53 | 98.47 ± 0.59 | 98.87 ± 0.29 | 99.58 ± 0.10 | 99.70 ± 0.04 | 99.79 ± 0.09 | 99.81 ± 0.06
AA(%) | 91.47 ± 0.06 | 92.90 ± 0.34 | 98.23 ± 0.59 | 97.82 ± 0.57 | 99.50 ± 0.09 | 99.55 ± 0.04 | 99.66 ± 0.17 | 99.69 ± 0.10
k × 100 | 91.62 ± 0.14 | 93.05 ± 0.49 | 98.01 ± 0.39 | 98.56 ± 0.77 | 99.44 ± 0.13 | 99.60 ± 0.03 | 99.72 ± 0.12 | 99.74 ± 0.14

Further, Figs. 9–11 visualize the classification results of the best-trained models on these benchmark datasets, including the false-color images, the corresponding ground-truth maps, and the visualization images obtained by the different comparison methods. We can observe that the visual maps generated by SVM-RBF and 1D-CNN are relatively coarse, with lower classification accuracies and obvious noise. The probable explanation is that these conventional methods cannot effectively extract the spectral-spatial features, making the classification effect unsatisfactory. Due to the use of CNNs, the visual maps obtained by AI-Net, 2D-CNN, and 3D-CNN are rather smooth. Also, the visual maps acquired by 3D-DenseNet and SSRN are smoother under the end-to-end deep learning framework. Compared with the other methods, the proposed method has higher classification accuracies and better visualization results for the three HSI datasets. The important reason is that we extract feature information with the HDC and embed it into the multi-residual network, whereby the discriminative spectral and spatial features are fully extracted sequentially. In a nutshell, the proposed framework is effective in promoting the classification of images, and it is superior to the existing approaches.

3.3. Discussion and analysis

To select a better combination of dilation coefficients, we carry out some comparative experiments. Table 5 gives a comparison of the experimental results in the cases of CNN_{d=1}, HDC_{d=1,2,3}, and HDC_{d=1,3}, where d represents the dilation coefficients; for example, a dilated block composed of convolution layers with dilation coefficients of 1, 2, and 3 is denoted by d = 1, 2, 3. In the experiment, we compare the OA, AA, and Kappa on the three datasets IN, KSC, and UP.

Fig. 9: Visual classification results of the best models with different methods for the IN dataset: (a) false-color image, (b) ground-truth labels, (c) SVM, (d) 1D-CNN, (e) AI-Net, (f) 2D-CNN, (g) 3D-CNN, (h) 3D-DenseNet, (i) SSRN, (j) ours, (k) color legend of the ground categories.

Fig. 10: Visual classification results of the optimal models with different methods for the KSC dataset: (a) false-color image, (b) ground-truth labels, (c) SVM, (d) 1D-CNN, (e) AI-Net, (f) 2D-CNN, (g) 3D-CNN, (h) 3D-DenseNet, (i) SSRN, (j) ours, (k) color legend of the ground categories.


Fig. 11: Visual classification results of the best models with different methods for the UP dataset: (a) false-color image, (b) ground-truth labels, (c) SVM, (d) 1D-CNN, (e) AI-Net, (f) 2D-CNN, (g) 3D-CNN, (h) 3D-DenseNet, (i) SSRN, (j) ours, (k) color legend of the ground categories.

In order to achieve comparability, these networks are built with the same training, validation, and test sets, which correspond to 20%, 10%, and 70% for the IN and KSC datasets, and 10%, 10%, and 80% for the UP dataset. Furthermore, RMSProp [40] is used to optimize the network. The experimental results in Table 5 indicate that the accuracies of the proposed residual network structure with HDC exceed those of the conventional convolution network structure. Simultaneously, the experiments verify that the accuracy of HDC_{d=1,3} is higher than that of HDC_{d=1,2,3}. Therefore, HDC_{d=1,3} is used to train the network in our experiments.

Table 5: Contrastive experiments and accuracy with convolution CNN_{d=1}, HDC_{d=1,2,3}, and HDC_{d=1,3}.

Sizes

Acc

7×7

OA(%)

7×7

k × 100

9×9

OA(%)

9×9

k × 100

7×7

AA(%)

IN(CNNd=1 ) IN(HDCd=1,2,3 ) IN(HDCd=1,3 ) KSC(CNNd=1 ) KSC(HDCd=1,2,3 ) KSC(HDCd=1,3 ) UP(CNNd=1 ) UP(HDCd=1,2,3 ) UP(HDCd=1,3 ) 98.49 ± 0.34 97.69 ± 0.52 98.28 ± 0.39 99.17 ± 0.32

98.85 ± 0.27

98.93 ± 0.14

99.50 ± 0.13

99.53 ± 0.25

99.64 ± 0.14

99.49 ± 0.10

99.61 ± 0.09

99.64 ± 0.07

98.69 ± 0.31

98.87 ± 0.09

99.44 ± 0.21

99.48 ± 0.27

99.60 ± 0.21

99.31 ± 0.11

99.48 ± 0.12

99.53 ± 0.06

99.27 ± 0.14

99.28 ± 0.22

99.17 ± 0.16

99.08 ± 0.24

98.89 ± 0.09

98.93 ± 0.32

99.71 ± 0.07

11 × 11 OA(%)

98.68 ± 0.29

11 × 11 k × 100

97.81 ± 0.34

9×9

AA(%)

11 × 11 AA(%)

99.06 ± 0.37 98.70 ± 0.21

98.39 ± 0.23

99.56 ± 0.16

99.43 ± 0.12

99.36 ± 0.10

99.58 ± 0.12

99.67 ± 0.06

99.75 ± 0.09

99.72 ± 0.14

99.58 ± 0.05

99.68 ± 0.04

99.77 ± 0.05

99.63 ± 0.05

99.73 ± 0.07

99.69 ± 0.11

99.44 ± 0.08

99.58 ± 0.06

99.64 ± 0.08

99.46 ± 0.07

99.75 ± 0.11

99.81 ± 0.12

99.89 ± 0.07

99.75 ± 0.05

99.70 ± 0.03

99.81 ± 0.06

99.38 ± 0.47

99.72 ± 0.12

99.79 ± 0.13

99.88 ± 0.06

99.67 ± 0.07

99.60 ± 0.03

99.74 ± 0.14

98.99 ± 0.19

99.63 ± 0.06

99.41 ± 0.25 99.33 ± 0.29

99.51 ± 0.07

99.10 ± 0.38

99.42 ± 0.22

99.43 ± 0.21

99.69 ± 0.21

99.62 ± 0.06

99.82 ± 0.09


99.60 ± 0.16

99.77 ± 0.11

99.32 ± 0.07

99.65 ± 0.06

99.61 ± 0.07

99.68 ± 0.05

99.63 ± 0.11

99.69 ± 0.10

Further, we discuss how to choose a more appropriate patch size and training percentage. To this end, we design patches of different sizes and different training proportions for these HSI datasets to establish the relationship between classification accuracy and patch size and proportion, respectively. In this experiment, the input patch size is set to 3 × 3, 5 × 5, 7 × 7, 9 × 9, and 11 × 11, and the training proportion is taken as 5%, 10%, 12.5%, 15%, and 20%.

Figs. 12 and 13 show the test accuracy of training samples with different spatial sizes and different proportions. As displayed in Fig. 12, the accuracy increases rapidly and then grows slowly as the input spatial size gradually increases. Meanwhile, the number of calculation parameters also gradually increases, and we do not verify larger input scales. Accordingly, the input size in our contrast experiments is selected as 11 × 11. Under such a spatial input size, the accuracies on the three datasets are 99.46% (IN), 99.89% (KSC), and 99.81% (UP). As displayed in Fig. 13, increasing the percentage of input training samples gradually increases the accuracy. When the training proportion reaches 20%, the accuracy on the three datasets exceeds 99%; in particular, the UP dataset approaches 99.94%. In summary, the performance of the proposed network is superior to that of the other comparative methods. It should be noted that the experiments were run on an NVIDIA Quadro K2200 GPU with an Intel(R) Xeon(R) CPU E5-1620 v3 and 32 GB RAM under Windows 10, and Keras with Python 3.5.2 was adopted to train the network.
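For reference, the patch extraction and the percentage-based split described above can be sketched as follows; the reflect padding, the split routine, and the variable names hsi_cube and gt_labels are our assumptions, since these details are not spelled out in the paper.

```python
# Extract size x size x n_bands patches centred on every labelled pixel and
# split them by percentage, e.g. 20% training for the IN and KSC datasets.
import numpy as np
from sklearn.model_selection import train_test_split

def extract_patches(cube, labels, size=11):
    # cube: (H, W, n_bands) HSI array; labels: (H, W) with 0 for unlabelled pixels.
    pad = size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode='reflect')
    patches, targets = [], []
    for r, c in zip(*np.nonzero(labels)):
        patch = padded[r:r + size, c:c + size, :]
        patches.append(patch[..., np.newaxis])   # channel axis for Conv3D input
        targets.append(labels[r, c] - 1)         # classes re-indexed from 0
    return np.stack(patches), np.array(targets)

X, y = extract_patches(hsi_cube, gt_labels, size=11)   # hsi_cube, gt_labels are placeholders
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.2, stratify=y)
```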

Fig. 12: Accuracy of training samples with different spatial sizes: (a) OA, (b) AA, (c) Kappa.

Fig. 13: Accuracy of different training samples for the three datasets: (a) IN, (b) KSC, (c) UP.


Table 6: The influence of the skipping layer on three datasets.

Datasets | With the skipping layer: OA(%) / AA(%) / Kappa(%) | Without the skipping layer: OA(%) / AA(%) / Kappa(%)
IN | 99.46 ± 0.07 / 99.43 ± 0.21 / 99.38 ± 0.47 | 99.26 ± 0.05 / 99.18 ± 0.09 / 99.16 ± 0.05
KSC | 99.89 ± 0.07 / 99.77 ± 0.11 / 99.88 ± 0.06 | 99.61 ± 0.05 / 99.22 ± 0.11 / 99.57 ± 0.06
UP | 99.81 ± 0.06 / 99.69 ± 0.10 / 99.74 ± 0.14 | 99.70 ± 0.06 / 99.73 ± 0.10 / 99.60 ± 0.14

Fig. 14: Test accuracies (OA, AA, Kappa) with different numbers of residual blocks for the three datasets: (a) IN, (b) KSC, (c) UP. On the horizontal axis, the notation k1 + k2 represents k1 spectral residual blocks and k2 spatial residual blocks.

Also, we verify the effect of the added skipping layer on the three HSI datasets, and the experimental results are given in Table 6. Adding the skipping layer promotes the HSI classification performance. The main reason is that the skipping layer facilitates the full utilization and extraction of features and thereby improves the classification results of HSI. Moreover, the influence of different numbers of residual blocks on the three datasets is considered. In this experiment, we give statistical graphs of the test accuracies (OA, AA, Kappa) with various numbers of residual blocks on the three datasets to select the appropriate number of residual blocks, as shown in Fig. 14. Clearly, it can be seen that when the numbers of spectral and spatial residual blocks are both 2, the accuracies on the three datasets are better than in the other cases. Consequently, we choose the combination of 2 + 2 as the final network in this study. Besides, the HSI classification accuracy does not continue to increase as the number of residual blocks increases. One possible reason is that the HSI data used for training are limited, and a deeper network is not conducive to feature extraction.

4. Conclusions

This paper proposes a supervised deep 3D-2D SSHDR framework for HSI classification, which continuously extracts spectral feature information and spatial feature information by integrating spectral residual blocks, spatial HDC residual blocks, and an outermost skip structure. Concretely, in spatial feature learning, HDC is used to extract valuable features, which enlarges the receptive field and avoids the drawbacks of the gridding effect caused by dilated convolution without increasing the number of parameters. Further, the introduction of the skipping layers for spectral feature learning and spatial feature learning extracts more of the available spatial-spectral information, improving the optimization and learning ability of the proposed network. Moreover, we exploit a 3D convolution in spectral feature learning and then use a 2D convolution to learn spatial features. This structure reduces the number of parameters without influencing the extraction of spatial-spectral features. The experimental results on several HSI datasets support the proposed 3D-2D SSHDR method and exhibit its superiority, even when training samples are insufficient. Additionally, HSI belongs to high-dimensional data with small and imbalanced samples [41]. Therefore, future work will focus on adopting transfer learning [42] or other methods of expanding the samples, combined with the characteristics of HSI, to address this issue. Further, we could also consider spatial feature extraction in the 3D case in the future.

Conflict of interest

There is no conflict of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China under Grant 61672477.

References [1] D. Landgrebe, Hyperspectral image data analysis, IEEE Signal Process. Mag. 19 (1) (2002) 17–28. doi:10.1109/79.974718. [2] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, J. Chanussot, Hyperspectral remote sensing data analysis and future challenges, IEEE Geosci. Remote Sens. Mag. 1 (2) (2013) 6–36. doi:10.1109/MGRS.2013.2244672. [3] X. Zhang, Y. Sun, K. Shang, L. Zhang, S. Wang, Crop classification based on feature band set construction and object-oriented approach using hyperspectral images, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9 (9) (2016) 4117–4128. doi:10.1109/ JSTARS.2016.2577339.


[4] B. Uzkent, A. Rangnekar, M. Hoffman, Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 2017, pp. 39–48. doi:10.1109/ CVPRW.2017.35. [5] G. Carter, K. Lucas, G. Blossom, C. Lassitter, D. Holiday, D. Mooneyhan, D. Fastring, T. Holcombe, J. Griffith, Remote sensing and mapping of tamarisk along the colorado river, USA: a comparative use of summer-acquired hyperion, thematic mapper and quickbird data, Remote Sens. 1 (3) (2009) 318–329. doi:10.3390/rs1030318. [6] B. Sch¨ olkopf, A. J. Smola, F. Bach, et al., Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT press, USA, 2002. [7] Q. Wang, J. Lin, Y. Yuan, Salient band selection for hyperspectral image classification via manifold ranking, IEEE Trans. Neural Netw. Learn. Syst. 27 (6) (2016) 1279–1289. doi:10.1109/TNNLS.2015.2477537. [8] G. Camps-Valls, L. Bruzzone, Kernel-based methods for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 43 (6) (2005) 1351–1362. doi:10.1109/ TGRS.2005.846154. [9] Y. Yuan, J. Lin, Q. Wang, Hyperspectral image classification via multitask joint sparse representation and stepwise MRF optimization, IEEE Trans. Cybern. 46 (12) (2015) 2966–2977. doi:10.1109/TCYB.2015.2484324. [10] C. Hern´ andez-Espinosa, M. Fern´ andez-Redondo, J. Torres-Sospedra, Some experiments with ensembles of neural networks for classification of hyperspectral images, in: Proceedings of International Symposium on Neural Networks, Springer, 2004, pp. 912–917. doi:10.1007/978-3-540-28647-9_150. [11] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436. doi: 10.1038/nature14539. [12] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Proceedings of Thirty-First AAAI Conference on Artificial Intelligence, AAAI, 2017, pp. 1–12. [13] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.


[14] A. Plaza, P. Martinez, R. Perez, J. Plaza, A new approach to mixed pixel classification of hyperspectral imagery based on extended morphological profiles, Pattern Recognit. 37 (6) (2004) 1097–1116. doi:10.1016/j.patcog.2004.01.006. [15] P. Ghamisi, N. Yokoya, J. Li, W. Liao, S. Liu, J. Plaza, B. Rasti, A. Plaza, Advances in hyperspectral image and signal processing: a comprehensive overview of the state of the art, IEEE Geosci. Remote Sens. Mag. 5 (4) (2017) 37–78. doi:10.1109/MGRS. 2017.2762087. [16] Y. Xu, L. Zhang, B. Du, F. Zhang, Spectral–spatial unified networks for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 56 (10) (2018) 5893–5909. doi:10.1109/TGRS.2018.2827407. [17] Y. Li, H. Zhang, Q. Shen, Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sens. 9 (1) (2017) 67. doi:10.3390/ rs9010067. [18] X. Yang, Y. Ye, X. Li, R. Y. Lau, X. Zhang, X. Huang, Hyperspectral image classification with deep learning models, IEEE Trans. Geosci. Remote Sens. 56 (9) (2018) 5408–5423. doi:10.1109/TGRS.2018.2815613. [19] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: Proceedings of European Conference on Computer Vision, Springer, 2016, pp. 630– 645. doi:10.1007/978-3-319-46493-0_38. [20] F. Yu, V. Koltun, T. Funkhouser, Dilated residual networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 472–480. doi:10.1109/CVPR.2017.75. [21] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab, Deeper depth prediction with fully convolutional residual networks, in: Proceedings of 2016 Fourth International Conference on 3D Vision, IEEE, 2016, pp. 239–248. doi:10.1109/3DV.2016.32. [22] T. Wang, M. Sun, K. Hu, Dilated deep residual network for image denoising, in: Proceedings of 2017 IEEE 29th International Conference on Tools with Artificial Intelligence, IEEE, 2017, pp. 1272–1279. doi:10.1109/ICTAI.2017.00192. [23] S. Pouyanfar, S.-C. Chen, M.-L. Shyu, An efficient deep residual-inception network for multimedia classification, in: Proceedings of 2017 IEEE International Conference on Multimedia and Expo, IEEE, 2017, pp. 373–378. doi:10.1109/ICME.2017.8019447.


[24] Y. Luo, J. Zou, C. Yao, X. Zhao, T. Li, G. Bai, Hsi-cnn: A novel convolution neural network for hyperspectral image, in: Proceedings of 2018 International Conference on Audio, Language and Image Processing, IEEE, 2018, pp. 464–469. doi:10.1109/ ICALIP.2018.8455251. [25] L. Shu, K. McIsaac, G. R. Osinski, Hyperspectral image classification with stacking spectral patches and convolutional neural networks, IEEE Trans. Geosci. Remote Sens. 56 (10) (2018) 5975–5984. doi:10.1109/TGRS.2018.2829400. [26] Q. Gao, S. Lim, X. Jia, Hyperspectral image classification using convolutional neural networks and multiple feature learning, Remote Sens. 10 (2) (2018) 299. doi:10.3390/ rs10020299. [27] Z. Zhong, J. Li, Z. Luo, M. Chapman, Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework, IEEE Trans. Geosci. Remote Sens. 56 (2) (2017) 847–858. doi:10.1109/TGRS.2017.2755542. [28] Z. Xiong, Y. Yuan, Q. Wang, AI-NET: Attention inception neural networks for hyperspectral image classification, in: Proceedings of 2018 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2018, pp. 2647–2650. doi:10.1109/IGARSS. 2018.8517365. [29] P. Shamsolmoali, M. Zareapoor, J. Yang, Convolutional neural network in network (CNNiN): hyperspectral image classification and dimensionality reduction, IET Image Process. 13 (2) (2018) 246–253. doi:10.1049/iet-ipr.2017.1375. [30] Z. Gong, P. Zhong, W. Hu, Z. Xiao, X. Yin, A novel statistical metric learning for hyperspectral image classification, arXiv preprint arXiv:1905.05087 (2019). [31] N. Audebert, B. Le Saux, S. Lef`evre, Deep learning for classification of hyperspectral data: a comparative review, IEEE Geosci. Remote Sens. Mag. 7 (2) (2019) 159–173. doi:10.1109/MGRS.2019.2912563. [32] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, G. Cottrell, Understanding convolution for semantic segmentation, in: Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision, IEEE, 2018, pp. 1451–1460. doi:10.1109/WACV.2018.00163. [33] K. Zhang, M. Sun, T. X. Han, X. Yuan, L. Guo, T. Liu, Residual networks of residual networks: multilevel residual networks, IEEE Trans. Circuits Syst. Video Technol. 28 (6) (2017) 1303–1314. doi:10.1109/TCSVT.2017.2654543. 22

[34] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning, ACM, 2015, pp. 448–456. [35] J. A. Richards, Remote sensing digital image analysis: an introduction, 5th Edition, Springer, Berlin, 2013. doi:10.1007/978-3-642-30062-2. [36] B.-C. Kuo, H.-H. Ho, C.-H. Li, C.-C. Hung, J.-S. Taur, A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7 (1) (2013) 317–326. doi:10.1109/JSTARS.2013.2262926. [37] H. Luo, Y. Y. Tang, Y. Wang, J. Wang, L. Yang, C. Li, T. Hu, Hyperspectral image classification based on spectral–spatial one-dimensional manifold embedding, IEEE Trans. Geosci. Remote Sens. 54 (9) (2016) 5319–5340. doi:10.1109/TGRS.2016.2560529. [38] Q. Wang, J. Gao, Y. Yuan, A joint convolutional neural networks and context transfer for street scenes labeling, IEEE Trans. Intell. Transp. Syst. 19 (5) (2017) 1457–1470. doi:10.1109/TITS.2017.2726546. [39] C. Zhang, G. Li, S. Du, W. Tan, F. Gao, Three-dimensional densely connected convolutional network for hyperspectral remote sensing image classification, J. Appl. Remote Sens. 13 (1) (2019) 016519. doi:10.1117/1.JRS.13.016519. [40] M. C. Mukkamala, M. Hein, Variants of RMSProp and Adagrad with logarithmic regret bounds, in: Proceedings of the 34th International Conference on Machine Learning, ACM, 2017, pp. 2545–2553. [41] H. Binol, G. Bilgin, S. Dinc, A. Bal, Kernel Fukunaga–Koontz transform subspaces for classification of hyperspectral images with small sample sizes, IEEE Geosci. Remote Sens. Lett. 12 (6) (2015) 1287–1291. doi:10.1109/LGRS.2015.2393438. [42] M. Long, H. Zhu, J. Wang, M. I. Jordan, Deep transfer learning with joint adaptation networks, in: Proceedings of the 34th International Conference on Machine Learning, ACM, 2017, pp. 2208–2217.


Feilong Cao received the Ph.D. degree in Applied Mathematics from Xi’an Jiaotong University, China, in 2003. He was a Research Fellow with the Center of Basic Sciences, Xi’an Jiaotong University, China, from 2003 to 2004. From 2004 to 2006, he was a PostDoctoral Research Fellow with the School of Aerospace, Xi’an Jiaotong University, China. From June 2011 to December 2011 and October 2013 to January 2014, he was a Visiting Professor with the Department of Computer Science, Chonbuk National University, South Korea, and the Department of Computer Sciences and Computer Engineering, La Trobe University, Melbourne, Australia, respectively. He is currently a Professor with China Jiliang University. He has authored or co-authored over 200 scientific papers in refereed journals. His current research interests include neural networks, and approximation theory.

Wenhui Guo received the B.Sc. degree in applied mathematics from Xinzhou Normal University, Shanxi, China, in 2017. She is currently pursuing the M.Sc. degree in applied mathematics with China Jiliang University, Hangzhou, China. Her current research interests include deep learning and image processing.
