A Dual-Branch Attention Fusion Deep Network for Multiresolution Remote-Sensing Image Classification
Hao Zhu, Wenping Ma, Lingling Li, Licheng Jiao, Shuyuan Yang, Biao Hou
PII: S1566-2535(18)30818-2
DOI: https://doi.org/10.1016/j.inffus.2019.12.013
Reference: INFFUS 1187
To appear in: Information Fusion
Received: 19 November 2018; Revised: 29 August 2019; Accepted: 25 December 2019
Please cite this article as: Hao Zhu, Wenping Ma, Lingling Li, Licheng Jiao, Shuyuan Yang, Biao Hou, A Dual-Branch Attention Fusion Deep Network for Multiresolution Remote-Sensing Image Classification, Information Fusion (2019), doi: https://doi.org/10.1016/j.inffus.2019.12.013


Highlights
• A dual-branch attention fusion deep network for classification is proposed.
• An adaptive center-offset sampling strategy for each patch is proposed.
• Two attention-based modules are designed to highlight the advantages of the two different data.
• The feature-level fusion is achieved by the proposed network.


A Dual-Branch Attention Fusion Deep Network for Multiresolution Remote-Sensing Image Classification
Hao Zhu, Wenping Ma, Lingling Li, Licheng Jiao∗, Shuyuan Yang, Biao Hou
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, International Research Center for Intelligent Perception and Computation, and Joint International Research Laboratory of Intelligent Perception and Computation, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi Province 710071, China.

Abstract

In recent years, with the diversification of acquisition methods for very high resolution panchromatic (PAN) and multispectral (MS) remote-sensing images, multiresolution remote-sensing classification has become a research hotspot. In this paper, from the perspective of data-driven deep learning, we design a dual-branch attention fusion deep network (DBAF-Net) for multiresolution classification. It aims to integrate feature-level fusion and classification into an end-to-end network model. In the process of establishing a training sample library, unlike the traditional pixel-centric sampling strategy with a fixed patch size, we propose an adaptive center-offset sampling strategy (ACO-SS), which allows each patch to adaptively determine its neighborhood range by finding the texture structure of the pixel to be classified. Because this neighborhood range is not symmetric about the pixel, we expect it to capture the neighborhood information that is more conducive to classification. In the network structure, based on the patches captured by ACO-SS, we design a spatial attention module (SA-module) for PAN data and a channel attention module (CA-module) for MS data, thus highlighting the spatial-resolution advantage of PAN data and the multi-channel advantage of MS data, respectively. These two kinds of features are then interfused to improve and strengthen the fused features in both the spatial and channel dimensions. Quantitative and qualitative experimental results verify the robustness and effectiveness of the proposed method.

Keywords: Deep learning; feature fusion; attention mechanism;


panchromatic (PAN) and multispectral (MS) images; multiresolution image classification.

1. Introduction

Nowadays, with the development of earth space information technology and the support of advanced equipment, images of different resolutions of the same scene can be acquired simultaneously. Although this cannot be achieved by a single sensor because of technical constraints, current spaceborne passive Earth-observation systems can jointly acquire two images of the same scene, i.e., a panchromatic (PAN) component with high spatial resolution but less spectral information, and a multispectral (MS) component with low spatial resolution but more spectral information. Compared with a single-resolution image, the combination of these different-resolution images (for brevity, "multiresolution images") enables users to obtain richer spatial and spectral information simultaneously. MS data are helpful for identifying land covers, while PAN data are very useful for accurately describing the shape and structure of objects in images. Therefore, the intrinsic complementarity between PAN and MS data conveys a vital potential for multiresolution image classification tasks [1, 2, 3].

In recent years, the commonly used methods in this field can be divided into two categories: one first pan-sharpens the MS data into a fused image and then classifies it [4, 5, 6, 7, 8, 9, 10, 11]; the other first extracts features separately from the PAN and MS data and then fuses these features for classification [12, 13, 14, 15]. The former category can be regarded as a data fusion-then-classify approach, which requires a good pan-sharpening algorithm [16, 17, 18] to improve the quality of the fused image. Recently, various excellent pan-sharpening algorithms have been proposed, including component substitution (e.g., the intensity-hue-saturation transform [19, 20, 21, 22, 23], Gram-Schmidt (GS) [24, 23, 25], and the Brovey transform [26]), multiresolution analysis (e.g., the wavelet transform [27, 28, 29, 30, 31], Laplacian pyramid representations [32, 33, 34], and the support value transform [35]), and sparse representations [36, 37, 38, 39, 40]. Masi et al. [6] and Zhong et al. [7] used the image super-resolution neural network [41] to pan-sharpen MS images. Shingare et al. [8] first evaluated several kinds of pan-sharpening methods and then embedded several indexes into a decision tree algorithm to classify the fused images. Shackelford et

al. [10] applied a color normalization method to the fused images and then proposed a fuzzy-logic methodology to improve classification accuracy. Although these techniques accurately retain spatial and spectral information, the pan-sharpened classification maps are usually affected by high-frequency spectral distortion, because the obtained PAN and MS components overlap only partially in their spectral ranges [42].

The latter category can be regarded as a feature fusion-then-classify approach, where features are usually extracted separately from the PAN and MS images and then classified by a learned classifier or by clustering. Moser et al. [13] combined a graph-cut method with the linear mixed model and iterated the relationship between PAN and MS data to generate a contextual classification map. Mao et al. [14] proposed a unified Bayesian framework to first discover semantic segments from PAN images and then assign the corresponding cluster labels from MS images for effective classification results. Zhang et al. [15] combined the mid-level bag-of-visual-words model with the optimal segmentation scale to bridge high-level semantic information and low-level detail information; these features are then sent into a support vector machine (SVM) for image classification. The main challenge of these strategies is how to effectively extract remarkable features and then fuse features with different attributes from the PAN and MS images.

In the past three years, data-driven deep learning (DL) algorithms [43] have been applied to various remote-sensing image fields [44, 45, 46, 47, 48, 49, 50, 51]. By establishing a suitable training database and carefully designing the input and output terminals of the whole network, DL algorithms have been proven to handle complex remote-sensing data characteristics well. In multiresolution remote-sensing image classification, Liu et al. [46] extracted features from PAN and MS images through a stacked autoencoder (Stacked-SAE) and a convolutional neural network (CNN), respectively, and these two features are concatenated together for the final classification results. Zhao et al. [47] used a superpixel-based multiple local network model to classify the MS image, while the corresponding PAN image is used to fine-tune the classification results. These explorations give us good inspiration to design our network modules. However, there are two main problems that need to be overcome when DL algorithms are applied to multiresolution remote-sensing image classification, as follows.

1) Remote-sensing images usually include various types of objects of different sizes in the same large scene, and a single fixed range of neighborhood information may not allow robust and discriminative feature representations to be learned for all samples. Moreover, the training samples are generally pixel-centric image patches, so pixels with very close Euclidean distances but belonging to different categories will obtain very similar patch information, which confuses the training of the classification network. 2) The principle of remote-sensing imaging means that images of the same scene at different resolutions usually contain differences and local distortions. Therefore, the network needs a more powerful module to extract more robust feature representations. Moreover, PAN and MS data have different resolutions and characteristics, so the network should extract features that highlight their respective data characteristics, rather than using the same network structure for both. Finally, how to effectively eliminate the differences between the features obtained by the two branches and then fuse them is also a problem that needs to be solved.

In this paper, we propose a dual-branch attention fusion deep network (DBAF-Net) for multiresolution remote-sensing image classification. In view of the above two problems, the proposed method mainly includes two corresponding aspects: 1) An adaptive center-offset sampling strategy (ACO-SS) based on texture information is proposed. For a pixel to be classified, we do not crop its patch centered on it with a fixed neighborhood range; instead, we first determine the texture structure in which it is located and then adaptively offset the pixel toward the center of the texture structure to determine the final center of the patch. The size of the patch is not fixed and is adaptively determined by the size of the detected texture structure, so that it can better cope with objects of different sizes. Such a patch tends to contain more texture information that is homogeneous with the pixel to be classified, thus effectively avoiding the above-mentioned edge-category sampling problem and providing better positive feedback for its classification. 2) In the design of the network structure, for the first time, we introduce attention mechanism modules into the field of remote-sensing data in the expectation of more robust features. Based on the original attention modules used for natural-scene data, we design a spatial attention module (SA-module) and a channel attention module (CA-module) especially for PAN and MS data, respectively, thus highlighting the spatial-resolution advantage of PAN and the multi-channel


Figure 1: Two different types of attention mechanism modules. Here, P denotes the max-pooling operation and P(G) denotes the global average pooling operation; R is a combination of several nonlinear mappings including convolution, batch normalization (BN), and rectified linear units (ReLU). In denotes the linear interpolation operation, and Rs is the same as R except that the final activation function is the sigmoid. Fr denotes a dimensionality-reduction fully-connected (FC) layer with ReLU, while Fs denotes a dimensionality-increasing FC layer with sigmoid. ⊗ denotes the element-wise product. (a) A bottom-up top-down attention module. (b) A squeeze-and-excitation attention module.

advantages of MS. Finally, the two kinds of features are interfused into a whole, and we also insert a CA-module into an SA-module to further extract deeper features from the fused features for classification.

The rest of this paper is organized as follows: Section II briefly introduces the preliminaries of the related attention modules. Section III elaborates the proposed method in detail. Section IV first introduces the details of the data used and the experimental setup, and then presents the experimental results and the corresponding analysis on the four data sets. Finally, Section V draws the conclusion of this paper.

2. Related Work

In this section, we mainly recall background information on the attention mechanism models commonly used in the interpretation of natural-scene data.

2.1. Bottom-up Top-down Attention Model

The bottom-up top-down attention (BuTdA) model was first proposed to handle localization-oriented tasks in semantic segmentation [52, 53, 54].

This structure combines a bottom-up feedforward operation and a top-down feedback operation into one feedforward pass. The bottom-up feedforward operation produces strong semantic information with low-spatial-resolution features, while the subsequent top-down operation produces dense features to infer each pixel. In an attention module, starting from the input, convolution and max-pooling are performed several times to extract high-level features and increase the receptive field, so that the pixels activated in the high-level features can reflect the area where the attention is located. After the lowest resolution is reached, a symmetrical top-down structure expands the global information so that the feature maps can determine the response at each input location. Linear interpolation is then used to ensure that the output and input feature maps have the same size. Next, after two consecutive 1 × 1 convolutional layers, a sigmoid layer normalizes the output range to [0, 1] as the attention mask. Each pixel of the mask is the weight of the corresponding pixel in the input feature map; it enhances meaningful features and suppresses meaningless information. Finally, the element-wise product of the mask and the input feature gives the weighted attention map. The process is shown in Fig. 1(a); this encoder-decoder framework can be regarded as a solution to a weakly-supervised spatial positioning problem.
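To make the mechanism concrete, the following PyTorch-style sketch shows one possible bottom-up top-down attention block of this kind; it is an illustrative simplification (a single pooling stage and an arbitrary channel width), not the configuration used in [52, 53, 54] or in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomUpTopDownAttention(nn.Module):
    """Illustrative bottom-up top-down attention mask (cf. Fig. 1(a))."""
    def __init__(self, channels: int):
        super().__init__()
        # Bottom-up path: max-pooling + convolution enlarge the receptive field.
        self.down = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Two 1x1 convolutions and a sigmoid turn the top-down features into a mask in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.down(x)                                        # bottom-up: strong semantics, low resolution
        h = F.interpolate(h, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)                  # top-down: back to the input resolution
        mask = self.mask_head(h)                                # attention mask
        return x * mask                                         # element-wise re-weighting of the input
```

For a feature map of shape (1, 32, 64, 64), `BottomUpTopDownAttention(32)` returns a re-weighted map of the same shape.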


2.2. Squeeze-and-Excitation Attention Model

Different from the spatial-based attention model described above, Hu et al. [55] focused on channels and proposed a novel channel-based attention architectural unit, called the squeeze-and-excitation attention (SEA) module. The aim of the authors is to improve the expressive ability of the network model by accurately modeling the interaction of convolution features between different channels. For this purpose, they propose a mechanism to recalibrate the features of the network, which enables the network not only to selectively enhance useful information features, but also to suppress less useful features in the global information. The basic structure of the SEA module is illustrated in Fig. 1(b). The feature maps U first undergo a global average pooling operation (called the squeeze operation) to aggregate the feature maps across the spatial dimensions W × H, thus obtaining a channel feature descriptor. The c-th element of the output Z ∈ R^C is computed as

Z_c = F_{sq}(U_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} U_c(i, j)    (1)

The global spatial distribution of the channel-wise feature responses is included in this descriptor, so that information from the global receptive field can be utilized by the subsequent layers. Then, an excitation operation is performed to fully capture the channel-wise dependencies. This operation consists of two fully-connected (FC) layers (a dimensionality-reduction layer and a dimensionality-increasing layer) wrapped around the nonlinear activations (ReLU and sigmoid). These two layers are learned as a bottleneck that controls the excitation of each channel. The output S is calculated by

S = F_{ex}(Z, W) = \sigma(g(Z, W)) = \sigma\big(W_2\, \delta(W_1 Z)\big)    (2)

where δ refers to the ReLU function and σ to the sigmoid function, with the two weight matrices W_1 ∈ R^{g×h} (g < h) and W_2 ∈ R^{h×g}. Finally, the original feature maps U are re-scaled with the activations S to generate the SE module's output X as

X_c = F_{scale}(U_c, S_c) = U_c \cdot S_c    (3)

This output X can be fed directly into subsequent layers.
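A minimal PyTorch sketch of Eqs. (1)-(3) is given below; the reduction ratio and the layer names are assumptions made for illustration, not the exact configuration of [55] or of the proposed CA-module.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Squeeze-and-excitation channel attention, following Eqs. (1)-(3)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # Eq. (1): global average pooling over W x H
        self.excite = nn.Sequential(                    # Eq. (2): FC (reduce) -> ReLU -> FC (increase) -> sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                  # channel descriptor Z
        s = self.excite(z).view(b, c, 1, 1)             # channel weights S
        return u * s                                    # Eq. (3): re-scale each channel of U
```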

2.3. Stacked Attention Model

A common way to enhance the attentional reasoning ability for multimodal understanding is the stacked attention architecture. Because each attention mask is learned specifically for the features of its own layer, the attention modules of different layers can capture different types of attention. Usually, the attention masks in shallow layers mainly suppress background details, while the attention masks in deep layers mainly enhance the main content.

3. Methodology

In this section, the adaptive center-offset sampling strategy (ACO-SS) and the dual-branch attention fusion deep network (DBAF-Net) are explained and analyzed in detail.


3.1. Adaptive Center-Offset Sampling Strategy (ACO-SS)

As we all know, the quality of the training sample set is very important and directly affects the performance of the network model. Therefore, how to obtain effective samples is the first problem to be solved. In remote-sensing pixel-level classification tasks, the training samples are generally pixel-centric image patches. A patch provides neighborhood information for its central pixel in order to determine the category of that pixel. Unlike natural images taken at close range, remote-sensing images of large scenes usually contain many object structures at different scales, so it may not be reasonable to extract features from patches of one fixed size. Moreover, since the samples are taken from one large remote-sensing image, if the traditional pixel-centric sampling strategy is used, pixels with very close Euclidean distances but belonging to different categories will obtain very similar patches. Based on this, we put forward an adaptive center-offset sampling strategy (ACO-SS) that allows each patch to adaptively determine its neighborhood range according to the texture structure of the pixel to be classified. This strategy shifts the original patch center (i.e., the pixel to be classified) toward the texture structure to obtain more neighborhood information that is homogeneous with this pixel; it is expected to provide more positive-feedback neighborhood information for the classifier and ensures that patches obtained on the boundary between two categories do not overlap too much. The main steps are as follows:

1) The texture structure should first be detected and its effective area determined. We choose the texture structure mainly because it is easy to obtain and relatively stable. It is well known that the most stable texture structures can be detected in the scale-normalized Laplacian-of-Gaussian (s-LoG: σ²∇²G) scale space [56]. In this paper, however, we use the Difference-of-Gaussian (DoG) scale space [57] to capture texture structures, because DoG is an approximation of s-LoG (DoG ≈ (k − 1)σ²∇²G) and is easier to calculate. The DoG function can be expressed as

\mathrm{DoG} = G(x, y, \sigma_1) - G(x, y, \sigma_2) = \frac{1}{2\pi\sigma_1^2}\, e^{-\frac{x^2 + y^2}{2\sigma_1^2}} - \frac{1}{2\pi\sigma_2^2}\, e^{-\frac{x^2 + y^2}{2\sigma_2^2}}    (4)

where σ1 is the current scale in the scale space, σ2 = kσ1, and k is a constant representing the ratio between adjacent scales. The cross sections of DoG and s-LoG are shown in Fig. 2(a).


Figure 2: (a) Cross sections of DoG and s-LoG. (b) The overall process of determining the neighborhood range and center position of the patches for pixels K and K2. The two blue patches are the selected neighborhood ranges for K and K2.

It can be seen that the size of this funnel-shaped texture structure can be captured by calculating the Euclidean distance between the two maximum points (denoted as D_E in Fig. 2(a))¹. Therefore, we set the derivative of DoG with respect to D_E equal to zero to obtain the expression of D_E:

D_E = 2\sqrt{x^2 + y^2},\qquad \frac{\partial\,\mathrm{DoG}}{\partial D_E} = 0 \;\Rightarrow\; D_E = \sqrt{\frac{32\, k^2 \sigma_1^2 \ln k}{k^2 - 1}}    (5)

In this paper, we follow [57] and set k to 2^{1/3}, thus D_E ≈ 9.50σ1. Therefore, we can capture the texture structure in the DoG scale space by detecting its central extreme point (E in Fig. 2(a)), because E provides the location and the current scale σ1 needed to determine the corresponding D_E. (¹ In fact, the circular region with D_E as its diameter is not exactly equivalent to the effective region of the texture structure (some peripheral edge regions are not covered); we only need D_E to distinguish structures of different scales, and D_E is also easy to calculate.)
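As a rough illustration of this detection step, the sketch below builds a few DoG responses with SciPy, keeps local extrema as candidate texture-structure centers E, and attaches the corresponding D_E of Eq. (5). The scale samples, the threshold, and the normalization assumption (image values in [0, 1]) are ours, not the authors'.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

K = 2.0 ** (1.0 / 3.0)   # assumed ratio k between adjacent scales, as in the text

def dog_texture_structures(img, sigmas=(1.6, 2.0, 2.5), thr=0.02):
    """Return candidate centers E as (row, col, sigma1, D_E) tuples (single-octave sketch)."""
    points = []
    for s1 in sigmas:
        dog = gaussian_filter(img, s1) - gaussian_filter(img, K * s1)       # Eq. (4)
        resp = np.abs(dog)
        extrema = (resp == maximum_filter(resp, size=3)) & (resp > thr)     # local extrema of the response
        d_e = np.sqrt(32.0 * K**2 * s1**2 * np.log(K) / (K**2 - 1.0))       # Eq. (5)
        for r, c in zip(*np.nonzero(extrema)):
            points.append((r, c, s1, d_e))
    return points
```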


2) Then, based on texture structures of different scales, we adaptively determine the neighborhood range for each pixel. For a certain pixel, there may be several candidate texture structures around it; the selected texture structure should be the closest to this pixel in spatial distance and should lie in the same homogeneous region as this pixel. Therefore, we use a Voronoi diagram [58] to partition all extreme points detected in the DoG scale space. According to the principle of the Voronoi diagram, each Voronoi region contains exactly one extreme point, i.e., the nearest texture structure of any pixel in a Voronoi region is the texture structure of that region's extreme point.

The overall process of determining the neighborhood range and center position of the patches is illustrated in Fig. 2(b). Assume that K and K2 are two pixels with very close Euclidean distances but belonging to different categories, and that E and E2 are the central extreme points of the corresponding texture structures, respectively. By converting the spatial relationships, we can calculate the new center positions F and F2, and the corresponding neighborhood ranges (the two blue patches with adaptive neighborhood ranges). Compared with the original pixel-centric sampling strategy, the two blue patches obtained by the proposed strategy do not share too much neighborhood information. Taking K as an example, the specific calculation of the spatial relationship is as follows:

\begin{cases} |K_y - E_y| \ge |K_x - E_x|;\;\; K \in E;\;\; F \in \overline{EK} \\ \text{s.t.}\;\; |E_y - L_y| = |K_y - E_y| = d,\;\; |R_y - E_y| = 0.5 D_E,\;\; |F_y - L_y| = |R_y - F_y| = \tfrac{|R_y - L_y|}{2} \end{cases}

\Rightarrow\; |R_y - L_y| = |R_y - E_y| + |E_y - L_y| = 0.5 D_E + d

\Rightarrow\; |F_y - E_y| = |F_y - L_y| - |E_y - L_y| = \frac{0.5 D_E - d}{2}

\Rightarrow\; F_y = \frac{3E_y - K_y + \frac{K_y - E_y}{|K_y - E_y|}\, 0.5 D_E}{2},\qquad F_x = \frac{K_x - E_x}{K_y - E_y}\,(F_y - E_y) + E_x    (6)

where L_y and R_y are the boundaries of the selected patch, so the patch size S_p equals |R_y − L_y|. Here, if |K_y − E_y| < |K_x − E_x|, the symbols x and y are swapped. If K ∉ E, it is considered that there is no prominent texture structure around K, so we follow the traditional pixel-centric sampling strategy and set a fixed neighborhood size S_fix for K.
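The geometry of Eq. (6) can be condensed into a few lines of Python; the sketch below is our reading of the equation (the coordinate convention and the fixed fallback size are assumptions), not the authors' released implementation.

```python
def aco_center_offset(K, E, d_e, s_fix=16):
    """Return the offset patch center F and patch size Sp for pixel K,
    given the nearest texture-structure center E and its diameter D_E (Eq. (6))."""
    kx, ky = float(K[0]), float(K[1])
    ex, ey = float(E[0]), float(E[1])
    swapped = abs(ky - ey) < abs(kx - ex)     # Eq. (6) assumes |Ky - Ey| >= |Kx - Ex|
    if swapped:                               # otherwise swap the roles of x and y
        kx, ky, ex, ey = ky, kx, ey, ex
    if ky == ey:                              # K coincides with E: no offset is possible
        return E, s_fix
    d = abs(ky - ey)
    sp = 0.5 * d_e + d                        # patch size Sp = |Ry - Ly|
    sign = (ky - ey) / abs(ky - ey)
    fy = (3.0 * ey - ky + sign * 0.5 * d_e) / 2.0
    fx = (kx - ex) / (ky - ey) * (fy - ey) + ex
    return ((fy, fx) if swapped else (fx, fy)), sp
```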

As can be seen from Eq. (6), for a certain pixel K, the size of its patch S_p is determined by the scale of its texture structure (D_E) and by the Euclidean distance d between the pixel and the central extreme point of the texture structure, so that the range of neighborhood information can be adapted to objects of different sizes. Moreover, while retaining some of its unique neighborhood information, we offset the original patch center toward the center of the texture structure, aiming to capture more neighborhood information homogeneous with the original center pixel for feature extraction.

3) Before entering the network, in order to allow the network to be trained efficiently, we finally quantize all patches into three fixed sizes as

S = \begin{cases} S_1, & S_p \le S_1 \\ S_{fix}, & S_1 < S_p \le S_{fix} \\ S_3, & S_{fix} < S_p \le S_3 \\ \tilde{S}_3, & S_p > S_3 \end{cases}    (7)

where S_1, S_{fix}, and S_3 are three threshold constants, and \tilde{S}_3 means that a patch of size S_p is resized to S_3 rather than directly cropped to S_3, to preserve the integrity of the original neighborhood information. These patches are then used as the input of the subsequent network.

3.2. Dual-Branch Attention Fusion Deep Network (DBAF-Net)

3.2.1. The proposed spatial attention module (SA-module) and channel attention module (CA-module)

PAN data have a higher spatial resolution than MS data, so we design a spatial-based attention module that adds soft weights to the PAN feature maps. MS data have more channel information, so we design a channel-based attention module for MS data. In the field of natural image processing, the bottom-up top-down attention (BuTdA) module [52, 53, 54] is a very common spatial attention model, while the squeeze-and-excitation attention (SEA) module [55] is used as a channel attention model. However, these native stacked attention modules usually lead to several shortcomings when they are applied to remote-sensing data. First, within an attention module, the element-wise product of the feature maps and the corresponding attention masks (whose values lie between zero and one) gradually becomes smaller as the network deepens, so the network easily falls into suboptimal performance due to the vanishing gradient problem.


Figure 3: The proposed attention modules. The meaning of the symbols is the same as in Figure 1. (a) The proposed spatial attention module (SA-module) for PAN data. (b) The proposed channel attention module (CA-module) for MS data.

Meanwhile, relying directly on the mask as the output may also break the characteristics of the original features, making the network unstable. Second, between two attention modules, the native designs pass only the mask-weighted output features to the deeper layers and often abandon the potential features obtained by the intermediate reasoning steps. This reduces the inherent reasoning ability of the module and may also make the later attention layers particularly vulnerable to errors made in the former layers. Moreover, all activation functions and sigmoid layers are located on the same branch; these nonlinear operations narrow their input features, so information bottlenecks easily occur between two attention modules, resulting in the loss of useful information.

To mitigate the above problems, we propose two improved attention modules (the SA-module and the CA-module) for PAN and MS data. In Fig. 3(a)(b), each stacked attention model includes two modules, and the dark-green blocks in each module represent the original attention modules in Fig. 1(a)(b). We make two main improvements on the modules as follows:

1) Within an attention module, for each feature map after up-sampling, we element-wise add the feature map produced by the corresponding preceding down-sampling step (i.e., the purple arrows in Fig. 3). This can be seen as opening up a new branch, which reduces the weight of purely relying on the original


attention mask for the network output. It maintains the good properties of the original features and lets them advance to the top layers by bypassing the attention mask, thus weakening the feature-selection dominance of the mask branch. In the attention mask process, some useful information may be lost during up-sampling and down-sampling. With this new branch, the lost information can be compensated, so that the features can be better transmitted to the deeper layers and the stability of the network is improved, which is very important for remote-sensing data with complex characteristics. Moreover, due to the addition of the original features, the output does not become smaller as the network deepens, allowing good gradient back-propagation at deeper levels. In fact, we believe that this branch is equivalent to a reasonable preconditioning of the corresponding attention branch, making the complex nonlinear functions implied by the network easier to fit.

Assume that x_t is the input of the t-th layer and y_T is the output of the T-th layer (T > t). F_i(·) is a mapping function (e.g., pooling, convolution, BN, ReLU, nonlinear interpolation, or element-wise product) in the i-th layer, and w_i refers to the parameters of the i-th layer. The process of the purple-line connection can then be expressed as

y_T = F_T\Big(x_t + \prod_{i=t}^{T-1} F_i(x_i, w_i),\; w_T\Big)    (8)

In the back-propagation process, the gradient of the error E from the T-th layer to the t-th layer is calculated as

\frac{\partial E}{\partial x_t} = \frac{\partial E}{\partial y_T} \cdot \frac{\partial y_T}{\partial x_t} = \frac{\partial E}{\partial y_T} \cdot \frac{\partial F}{\partial x_t}\Big(1 + \frac{\partial}{\partial x_t} \prod_{i=t}^{T-1} F(x_i, w_i)\Big)    (9)

Due to the presence of the number 1 in parentheses, the gradient of this layer does not become too small as E decreases. The gradient error of the T-th layer can be stably transmitted to the t-th layer, which effectively alleviates the problem of vanishing gradients. Our experiments also show that adding a clean and concise information branch helps to ease optimization. Although this is similar in form to classic residual learning [59, 60, 61], the purpose differs: F_i(·) in residual networks mainly learns the residual between output and input, while F_i(·) here aims to fit the significant features of the deep network module.

Combined with the output of the attention mask, the important features in the output feature map can be enhanced, while the unimportant features can be suppressed; the attention mechanism can thus pay more attention to targets that are helpful for classification.

2) Between two attention modules, we concatenate the potential (intermediate) information of the previous attention module into the next attention module (i.e., the blue arrows in Fig. 3). This effectively captures the intermediate reasoning information of the previous attention module and transmits it to the rest of the network independently and completely, thus enhancing the representation ability of the features. Doing so also provides positional reasoning to the later attention module, allowing possible attention errors of the previous module to be corrected. Even if the attention of the previous layer is focused on a wrong position, the attention of the later layer can correct it, which increases the robustness of the attention mechanism. Moreover, this operation reuses the previous features, so no extra hyper-parameters are introduced. Similar to the previous improvement, we essentially provide a new branch that reduces the impact of information bottlenecks on the original branch, making the attention model easier to optimize. To maintain the feed-forward characteristic, each layer receives additional input from all the previous layers and passes its own feature maps to all later layers. In fact, this branch completely preserves the new features learned by each module, and each module can completely separate out the features of its preceding modules. During back-propagation, the error gradient passed to a layer can also flow directly to all its preceding layers, thus enhancing the flow of information between layers. The concatenation operation can be written in a form similar to Eq. (8); the process of the blue arrows in Fig. 3 can be expressed as

y_T = f_s(x_t) + \prod_{i=t}^{T-1} F_i(x_i, w_i)    (10)

where f_s(·) is a shifting function: f_s(x) shifts x to the right, and the offset should be equal to the dimension of \prod_{i=t}^{T-1} F_i(x_i, w_i). The gradient of the error E from the T-th layer to the t-th layer is calculated as

\frac{\partial E}{\partial x_t} = \frac{\partial E}{\partial y_T} \cdot \frac{\partial y_T}{\partial x_t} = \frac{\partial E}{\partial y_T}\Big(1 + \frac{\partial}{\partial x_t} \prod_{i=t}^{T-1} F_i(x_i, w_i)\Big)    (11)

So, similar to Eq. (9), this connection alleviates the gradient vanishing between the two modules caused by the information bottleneck.
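A compact sketch of how these two improvements might be realized is given below; the channel widths, depths, and exact placement of the operations are illustrative assumptions, not the Table 1 configuration of the actual SA-module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAModuleSketch(nn.Module):
    """Sketch of one spatial-attention module with the identity-style branch of Eq. (8):
    the feature saved before down-sampling is element-wise added back after up-sampling
    (purple arrows), and the module input bypasses the attention mask."""
    def __init__(self, channels: int):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.post = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.mask = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        skip = self.pre(x)                                  # feature before down-sampling
        h = F.max_pool2d(skip, 2)                           # bottom-up
        h = self.post(h)
        h = F.interpolate(h, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        h = h + skip                                        # "purple arrow": add the pre-pooling feature
        m = self.mask(h)                                    # attention mask in [0, 1]
        return x + m * x                                    # keep the identity branch alongside the mask

# Between two stacked modules, the intermediate feature of the previous module is
# concatenated with its output ("blue arrow", Eq. (10)) before entering the next module:
#   nxt_in = torch.cat([prev_intermediate, prev_out], dim=1)
```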


Figure 4: The flowchart of the proposed dual-branch attention fusion deep network (DBAF-Net). The meaning of the symbols is the same as in Figure 1. A PAN patch of size (4S × 4S, 1) and its corresponding MS patch of size (S × S, 4) serve as the input. Each branch first extracts its features with three stacked proposed modules; these features are then interfused and finally passed through a fused module, an SPP layer, and several fully connected layers for classification.

Moreover, as mentioned above, because Eq. (10) retains all the information of its two inputs, it is actually a reversible process that can be disassembled. This connection mode improves the flow of information and gradients throughout the network, which makes it easy to train [62]. This is the reason why we use concatenation instead of summation.

3.2.2. Dual-branch attention fusion deep network

In this part, based on the proposed SA-module and CA-module described above, we design a dual-branch attention fusion deep network (DBAF-Net) for multiresolution classification. The network first extracts features from two asymmetric branches, with a PAN patch and its corresponding MS patch as their inputs. The two kinds of features are then interfused to achieve classification. The proposed network framework is shown in Fig. 4, and the details are as follows:

(a) Preprocessing: Both the PAN and MS patches (in this paper, the side-length ratio of PAN to MS data is 4:1) go through an R operation (convolution, batch normalization (BN), and rectified linear units (ReLU)) to prepare for network training.

(b) Feature Extraction Based on Attention Modules: As shown in Fig. 3, we use three SA-modules to form a Stacked-SA model on the PAN branch and three CA-modules to form a Stacked-CA model on the MS branch, which serve as feature extractors. In this process, the weights of the two branches are not shared and are independent of each other. As can be seen, the two branches are asymmetric. The extracted features further enhance the original information advantage of each image data type. On one branch, the attention masks of the different modules capture different types of attention, and they are applied to their respective features in the form of soft weights. The shallow masks mainly suppress unimportant information such as the image background, and as the network deepens, the masks gradually enhance the important information of interest.

(c) Feature Fusion and Classification: In order to effectively fuse the features of the two branches, we apply the following operations to the outputs of the third modules. Let A^a_{i,c} be the input of the third module in the PAN branch with its same-size attention mask α_{i,c}, and B^b_{i,c} the corresponding input in the MS branch with its same-size attention mask β_{i,c}, where c is the index of the image channel and i covers all spatial locations. The fused feature Y^f_{i,c} (the process represented by the red arrows in Fig. 4) can be expressed as

Y^f_{i,c} = f_s\big(F(B^b_{i,c})\big) + F\Big(N\big(\alpha_{i,c} \cdot A^a_{i,c} + F(\beta_{i,c} \cdot B^b_{i,c})\big) + A^a_{i,c}\Big)    (12)

where N(·) is a normalization function, and F(·) and f_s(·) have the same meanings as in the equations above. This step effectively combines the advantages of the two branches to enhance the fused features in both the spatial and channel dimensions. Before classification, the features pass through one more module to better integrate the fused information. In this module, we insert the CA-module into the SA-module; since there is only one such module, it does not contain the process represented by the blue arrows of the CA-module and SA-module. Because the input patches have three different sizes after ACO-SS, we insert a spatial pyramid pooling (SPP) [63] layer before the fully connected layers to obtain fixed-dimensional vectors. In this paper, we set up a 3-level pyramid pooling model with pool_{1×1}, pool_{2×2}, and pool_{4×4}. Therefore, the feature maps of size S × S × 4C in Fig. 4 are turned into a (1 × 1 × 4C + 2 × 2 × 4C + 4 × 4 × 4C)-dimensional vector.
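For reference, the 3-level SPP step can be sketched as follows (max pooling is assumed here, following the original SPP formulation in [63]); with 4C = 64 channels the three levels give the 1344-dimensional vector mentioned above.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """3-level SPP: pool a (B, C, H, W) fused feature map into fixed 1x1, 2x2 and 4x4 grids
    and concatenate, so patches of different sizes yield vectors of the same length."""
    b, c = feat.shape[:2]
    pooled = [F.adaptive_max_pool2d(feat, n).view(b, -1) for n in levels]
    return torch.cat(pooled, dim=1)

# Example: a 24x24 and a 12x12 feature map both become 1344-dimensional vectors.
v1 = spatial_pyramid_pool(torch.randn(1, 64, 24, 24))
v2 = spatial_pyramid_pool(torch.randn(1, 64, 12, 12))
assert v1.shape == v2.shape == (1, 21 * 64)
```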

After the concatenated vector passes through several fully connected layers, the class probability of the pair of patches is finally estimated. In this paper, the cross-entropy error is used as the ultimate loss function and is defined as

E = -\frac{1}{n_b} \sum_{i=1}^{n_b} \big[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\big]    (13)

where n_b denotes the batch size, y_i is the label of the i-th input pair, and \hat{y}_i is the class probability of the i-th input pair. We train this end-to-end network using the stochastic gradient descent (SGD) strategy.

4. Experimental Study

4.1. Data Sets Description and Experimental Environment

In this section, we use four data sets to verify the robustness and effectiveness of the proposed method. The first two data sets are provided by the organizers of the 2016 IEEE GRSS Data Fusion Contest [64]. Each data set was acquired by the DEIMOS-2 satellite over Vancouver, Canada, on 31 March 2015 and 30 May 2015, and includes a PAN image with 1-m resolution and an MS image with 4-m resolution (R, G, B, and NIR), as follows:

1) Fig. 5(a) shows the level 1B images, where the data are a non-resampled product that has been calibrated and radiometrically corrected. The MS image includes 3249 × 2928 × 4 pixels, while the PAN image includes 12996 × 11712 pixels. The data were divided into 11 categories, including vegetation, four kinds of buildings, boat, road, port, bridge, tree, and water.

2) Fig. 5(b) shows the level 1C images, where the data are a manually corrected and then re-sampled product that has been calibrated and radiometrically corrected. The MS image includes 1311 × 873 × 4 pixels, while the PAN image includes 5244 × 3492 pixels. The data were divided into 8 categories, including vegetation, three kinds of buildings, boat, road, tree, and water.

The next two data sets were obtained by the Quickbird satellite over Xi'an, China, on May 30, 2008. Each data set includes a PAN image with 0.61-m resolution and an MS image with 2.44-m resolution (R, G, B, and NIR), as follows:

1) Fig. 5(c) shows the Xi'an suburban area, covering the southwest corner of Xi'an City.

[Figure 5 legends]
(a) Vancouver Level 1B: Vegetation 0.55%, Building1 3.19%, Building2 0.80%, Building3 5.39%, Building4 1.80%, Boat 0.09%, Road 0.63%, Port 0.85%, Bridge 0.07%, Tree 4.87%, Water 13.75%, Unlabeled 67.99%.
(b) Vancouver Level 1C: Vegetation 0.40%, Building1 3.25%, Tree 0.86%, Building2 3.15%, Water 8.55%, Road 1.02%, Building3 4.12%, Boat 0.34%, Unlabeled 78.26%.
(c) Xi'an suburban area: Building1 0.46%, Building2 1.17%, Vegetation1 2.46%, Vegetation2 0.70%, Land 0.66%, Building3 1.75%, Road 4.39%, Building4 0.38%, Unlabeled 88.02%.
(d) Xi'an urban area: Building 11.28%, Flat_land 2.78%, Road 2.58%, Shadow 3.25%, Soil 8.77%, Tree 16.36%, Water 2.17%, Unlabeled 52.82%.

Figure 5: First column: MS images. Second column: PAN images (the PAN images are reduced to the same size as their MS images for more convenient display). Third column: ground-truth image. Last column: the class labels corresponding to the ground-truth image. (a) Vancouver Level 1B image set. (b) Vancouver Level 1C image set. (c) Xi'an suburban area image set. (d) Xi'an urban area image set.


The PAN image consists of 6600 × 6200 pixels, while the MS image consists of 1650 × 1550 × 4 pixels. The scene was divided into 8 categories, including two kinds of vegetation, four kinds of construction areas, road, and land.

2) Fig. 5(d) shows the Xi'an urban area, covering the east of Xi'an City. The PAN image consists of 3200 × 3320 pixels, while the MS image consists of 800 × 830 × 4 pixels. The scene was divided into 7 categories, consisting of building, road, tree, soil, flat land, water, and shadow. Flat land represents all kinds of land except soil.

For the Vancouver and Xi'an data, note that these are actually four relatively independent image data sets captured at two locations, so the training samples and testing samples in each experiment belong to the same image data set, and different data sets do not affect each other. The experiments in this paper are run on an HP-Z840 workstation with a TITAN X 12GB GPU and 64GB RAM under Ubuntu 14.04 LTS. The proposed network is trained with Caffe [65], and the other operations in our experiments are implemented in MATLAB R2014a.

4.2. Experimental Setup

To evaluate the classification performance, the metrics overall accuracy (OA), average accuracy (AA), and the kappa statistic (Kappa) are calculated for quantitative analysis. Since the PAN image is often difficult to annotate, the ground-truth image corresponds to the MS image pixel by pixel. Therefore, we first crop the MS sample patch from the MS image by ACO-SS, map the center point of this patch to the corresponding PAN image, and then crop the PAN sample patch centered on it. Corresponding to the flowchart of Fig. 4, the detailed hyper-parameters of the proposed DBAF-Net are shown in Table 1. The input sizes of the PAN and MS patches are (4S × 4S, 1) and (S × S, 4), respectively; note that the symbol S for PAN is the same as for MS. In this paper, the three patch sizes S1, Sfix, and S3 in Eq. (7) are set to 12, 16, and 24 pixels, respectively. In this way, the numbers of samples of the three sizes entering the network are roughly balanced, which helps to stabilize the network training. Moreover, we chose even numbers mainly so that the pooling operation of the network (stride = 2, pad = 0, kernel size = 2) does not require additional zero-filling, which keeps it fast and accurate. In Table 1, the symbol max pool./inter. means that, over the four iterations of the loop, the first two operations are max pooling, while the last two are interpolation.
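For completeness, the three metrics used throughout the experiments (OA, AA, and Kappa) can be computed from a confusion matrix as in the short sketch below; this is the standard formulation, not code taken from the authors.

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA and Cohen's kappa from a confusion matrix `conf`,
    where conf[i, j] is the number of class-i test pixels predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                                   # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))                # average per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```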

Table 1: Hyper-parameters of each network layer. In each row, the setting of the PAN branch is given first and the corresponding setting of the MS branch follows in parentheses; output sizes are written as (spatial size, channels).

Input: 4S × 4S, 1 (S × S, 4)
Preprocessing R: conv. 3 × 3, 16 (conv. 3 × 3, 64) -> output 4S × 4S, 16 (S × S, 64)
Attention module 1: [conv. 1 × 1, 16; conv. 3 × 3, 8; conv. 1 × 1, 16; max pool./inter.] × 4, max pool. ([conv. 1 × 1, 64; conv. 3 × 3, 32; conv. 1 × 1, 64]; global pool.; fc. [16, 64]) -> output 2S × 2S, 32 (S × S, 128)
Attention module 2: [conv. 1 × 1, 32; conv. 3 × 3, 16; conv. 1 × 1, 32; max pool./inter.] × 4, max pool. ([conv. 1 × 1, 128; conv. 3 × 3, 64; conv. 1 × 1, 128]; global pool.; fc. [32, 128]) -> output S × S, 64 (S × S, 256)
Attention module 3: [conv. 1 × 1, 64; conv. 3 × 3, 32; conv. 1 × 1, 64; max pool./inter.] × 4 (bottleneck conv. layers as above; global pool.; fc. [64, 256]) -> output S × S, 64
Fused module: conv. 1 × 1, 64; conv. 3 × 3, 64; batchnorm; [conv. 1 × 1, 64; conv. 3 × 3, 32; conv. 1 × 1, 64; max pool./inter.] × 4; global pool.; fc. [16, 64] -> output S × S, 64
SPP layer: pool 1×1 × 64 + pool 2×2 × 64 + pool 4×4 × 64 -> 1344-d vector
Classifier: fc. [800, 200, softmax]

The output size of all convolution operations in Table 1 is the same as the input size. Note that we also tried to reduce the corresponding hyper-parameters of the proposed network (e.g., reducing the number of features per layer or the number of layers), but the accuracy of the resulting classification is lower than that in Table 1. The code for the network structure is available at https://github.com/zhuhao1357/DualBranch-Attention-Fusion-Deep-Network.

In the training of the networks, we randomly select 10% of the labeled data from each class as training samples, and the others are used as test samples. In fact, there is a certain degree of sample imbalance in our data, but because the overall number of extracted samples is not very large and the network is not very deep, the experimental results for categories with few samples are only affected to a limited extent and the overall accuracy reduction is not large. This is a very valuable problem, and we will focus on sample imbalance in the near future. The initial learning rate is 0.005, the weight decay is 0.0005, the number of iterations is 60000, and the batch size is 50. To account for the variability of the proposed framework, we run it 10 times with different random training samples and take the average result as the final result for each metric.

4.3. Performance of the Proposed Attention Models and ACO-SS Strategy

In this section, taking the Vancouver level 1B images as an example, we present a detailed comparison and analysis of the proposed attention models and the ACO-SS strategy, respectively.

4.3.1. Validation of the proposed attention modules' performance

In this part, we compare the stacked bottom-up top-down attention (Stacked-BuTdA) model with the proposed stacked spatial attention (Stacked-SA) model, and the stacked squeeze-and-excitation attention (Stacked-SEA) model with the proposed stacked channel attention (Stacked-CA) model, in order to verify whether the proposed modules are more suitable for our remote-sensing image classification task. Here, the sampling strategy of all these network models is ACO-SS, and each pair of compared models has the same hyper-parameters and number of iterations. The PAN image of the Vancouver level 1B data set is used as the input of the Stacked-BuTdA model and the Stacked-SA model, while the MS image of the Vancouver level 1B data set is used as the input of the Stacked-SEA model and the Stacked-CA model.

A quantitative comparison on the Vancouver level 1B image data set is shown in Table 2. It can be seen that our Stacked-SA and Stacked-CA obtain higher classification results than the corresponding original attention models. When the network becomes deeper, a general attention network easily falls into a local optimum, so the trained network is not stable enough. Due to the complex imaging principle, the training samples of remote-sensing images usually show some differences within the same category as well as some similarities between categories, which requires a more powerful and stable network. Within an attention module, the proposed models open up another branch and reduce the weight of the original attention-mask branch, so that the module does not rely solely on the mask as its output. This complements the original feature information lost during mask generation and makes the network more stable in dealing with complex data. Between two attention modules, the information provided to the latter module is not only the output of the former module; our model also makes full use of the inference information of the former module, which enhances the error-correction ability of the network and makes its training more stable. Meanwhile, the nonlinear mappings of the network are not all located on one branch, which compensates for the loss of useful information caused by the information bottleneck between modules. The accuracy of each class of our models is higher than that of the original attention models. The experiments show that our models can extract more stable features from complex remote-sensing data and that the improved modules are effective for both MS and PAN data.

4.3.2. Validation of the proposed adaptive center-offset sampling strategy (ACO-SS) performance

In this part, the traditional pixel-centric sampling strategy is used to verify the effectiveness of the ACO-SS strategy. ACO-SS adaptively determines the size of a patch from the scale of the texture structure around it, and finally quantizes all patches into three relatively appropriate sizes: 12, 16, and 24. As a comparison method, we select the pixel-centric sampling strategy with the largest fixed size of 24 (Pixel-Centric (S = 24)) for all patches. ACO-SS (S = 24) means that the sampling strategy is ACO-SS, but all captured patches are uniformly resized to S = 24 instead of using Eq. (7).

Table 2: Quantitative comparison over different network models on the Vancouver level 1B image data.

Class      Stacked-BuTdA   Stacked-SA   Stacked-SEA   Stacked-CA
c1         97.60           98.26        97.92         98.53
c2         91.12           93.70        88.33         95.64
c3         95.43           97.95        97.25         99.28
c4         92.64           95.70        92.49         97.04
c5         88.99           96.44        97.88         98.67
c6         99.06           99.41        99.28         99.27
c7         82.71           88.22        94.41         89.68
c8         92.77           97.00        97.25         98.25
c9         98.39           99.13        98.50         98.84
c10        90.51           94.69        93.13         95.63
c11        98.47           98.85        98.30         98.90
AA(%)      93.43           96.31        95.89         97.24
OA(%)      92.78           95.94        95.52         97.03
Kappa(%)   92.00           95.50        95.04         96.71

ACO-SS without center-offset version (ACO∗ −SS) are also to considered for comparison. The proportion of training samples selected by these methods is the same. The network used by these sampling algorithms is our DBAF-Net, and network-related hyper-parameters are also the same for fair. Table 3: Quantitative comparison over different sampling strategies on the Vancouver level 1B image data. Pixel-Centric ACO-SS ACO∗ − SS ACO-SS Sampling Strategies (S = 24) (S = 24) (S = 12, 16, 24) (S = 12, 16, 24) c1 99.11 99.44 99.50 99.52 c2 96.86 96.54 97.62 98.93 c3 98.21 99.57 99.47 99.59 c4 95.31 97.98 97.83 98.91 c5 96.16 96.98 97.64 98.29 c6 98.49 99.98 99.98 99.98 c7 94.26 98.59 98.91 98.96 c8 97.86 99.58 99.29 99.28 c9 98.55 99.94 99.98 99.98 c10 96.54 97.43 99.22 99.39 c11 99.39 99.76 99.81 99.94 AA(%) 97.34 98.71 99.02 99.34 OA(%) 97.19 98.56 98.95 99.18 Kappa(%) 96.89 98.40 98.83 99.06


The experimental results are shown in Table 3. It can be seen that our ACO-SS (S = 12, 16, 24) obtains the highest classification results. Under the same network structure, the result of ACO-SS (S = 24) is better than that of Pixel-Centric (S = 24), which means that different samples require different neighborhood information. Through the experiments, we find that when the pixel to be classified is near the edges of several categories, or the area of its category is very small, a patch that is too large mixes in information from many other categories, which negatively affects the determination of the central pixel's category. On the other hand, when the actual object or structure containing the pixel is very large, too little neighborhood information cannot reflect the actual characteristics of the object or structure. Therefore, for a remote-sensing image of a large scene, the traditional strategy of fixing the neighborhood range for all samples is usually not very reasonable, while our ACO-SS, which adaptively sets the neighborhood range according to the texture structure, obtains higher results. Moreover, we find that, compared with ACO-SS (S = 24), quantizing the patches into three sizes by Eq. (7) better maintains the authenticity of the original information and avoids losing too much information through the up-sampling interpolation of originally small patches. Finally, compared with ACO-SS (S = 12, 16, 24), ACO∗-SS (S = 12, 16, 24) reduces the accuracy of all categories to varying degrees, which indicates that the center-offset strategy is effective for improving accuracy. Instead of the traditional pixel-centric strategy, we break the rule of symmetrical neighborhood information centered on the pixel to be classified and shift the patch toward the homogeneous region containing that pixel, so that more of the neighborhood information provides positive feedback for classification. Especially when the pixels to be classified are located at the edges of several categories, this strategy allows them to obtain distinct neighborhood information that distinguishes them well, thereby increasing the confidence of the network. Since all network structures in Table 3 use the proposed DBAF-Net, the overall accuracy is relatively high and better than that in Table 2, which also indicates that our feature-fusion classification network is better than a single-branch classification network.


4.4. Experimental Results With Vancouver Level 1B Data

Table 4: Quantitative classification results of Vancouver level 1B images. Methods: (1) Pixel-Centric (S = 24)+Attention Net; (2) ACO-SS (S = 24)+Attention Net; (3) ACO-SS (S = 12, 16, 24)+Attention Net; (4) ACO-SS (S = 12, 16, 24)+DBFA-Net; (5) DMIL [46]; (6) SML-CNN [47]; (7) N-IHS [21] (S = 24)+Stacked-SEA.

Class       (1)      (2)      (3)      (4)      (5)      (6)      (7)
c1          99.62    99.74    99.78    99.82    99.81    98.21    98.27
c2          94.35    94.84    95.48    99.03    93.24    98.90    87.19
c3          94.59    95.87    95.96    99.59    95.50    99.00    96.31
c4          93.55    93.52    94.04    98.91    98.43    99.19    88.40
c5          97.11    97.42    98.05    98.29    91.82    98.94    96.30
c6          95.55    95.68    97.68    99.88    94.08    91.34    97.96
c7          90.11    91.98    93.95    98.66    95.63    87.67    92.94
c8          89.80    96.02    96.14    99.28    96.89    97.53    95.01
c9          97.56    97.05    97.56    99.98    96.03    80.80    99.45
c10         97.19    98.26    98.32    99.39    96.21    99.38    98.17
c11         99.65    99.72    99.71    99.94    99.73    99.96    99.19
AA(%)       95.50    96.37    96.97    99.34    96.06    95.54    94.78
OA(%)       95.33    96.37    96.89    99.18    96.09    99.15    94.47
Kappa(%)    94.83    95.97    96.56    99.06    86.81    98.87    93.20
Time(s)     730.12   778.67   795.15   901.23   768.34   349.78   383.62

In this part, taking the Vancouver Level 1B images as an example, we compare the various methods to verify the effectiveness of the proposed method in detail. Two state-of-the-art methods, namely DMIL [46] and SML-CNN [47], are used as comparison methods. Besides, an improved intensity-hue-saturation (IHS) version called Nonlinear IHS [21] (N-IHS for short) is also used as a comparison method to verify the effectiveness of the proposed method more comprehensively. After obtaining a fused image of the MS data with the same size as the PAN data by N-IHS pan-sharpening, we refer to the results in Table 2 and select the Stacked-SEA model as the classification network of this comparison method for a fair comparison. As mentioned in Section I, both DMIL and SML-CNN are multiresolution classification methods based on neural networks, so it is reasonable and suitable to use them as comparison algorithms. For both networks, we follow the experimental setup in their respective papers to achieve their best results. Moreover, we give the test time to compare the efficiency of each method.

The corresponding quantitative and qualitative results on the Vancouver level 1B images are shown in Table 4 and Fig. 6 (due to space constraints, the ACO-SS (S = 24)+Attention Net method of the second column is not shown in Fig. 6, and the same holds for the experimental results of the following data sets). Here, Attention Net in Table 4 denotes that the features of the PAN branch and the MS branch are extracted by Stacked-BuTdA and Stacked-SEA models, respectively, and the obtained features of the two branches are then concatenated for the final classification. The hyper-parameters of Attention Net are the same as those of DBFA-Net in Table 1 to ensure a fair comparison.


Figure 6: Classification maps of different methods on the Vancouver Level 1B image; each method has two classification maps: one is the overall image, while the other is the image with ground truth. (a) and (b) Pixel-Centric (S = 24)+Attention Net. (c) and (d) ACO-SS (S = 12, 16, 24)+Attention Net. (e) and (f) ACO-SS (S = 12, 16, 24)+DBFA-Net. (g) and (h) DMIL [46]. (i) and (j) SML-CNN [47]. (k) and (l) N-IHS [21] (S = 24)+Stacked-SEA.


The comparison and analysis of the experimental results of the different methods are as follows. Based on the same Attention Net, the proposed ACO-SS (S = 24) obtains higher results than Pixel-Centric (S = 24). ACO-SS (S = 24) improves the accuracy of most categories; in particular, the accuracy of c8 is significantly improved from 89.80% to 96.02%. As can be seen from the ground-truth map, the pixels of c8 are widely distributed and belong to objects of multiple shapes, so a fixed neighborhood range does not reflect the true nature of this category well. These results indicate that our ACO-SS is not only feasible with the specific DBAF-Net, but can also be adopted in other networks. Moreover, compared with ACO-SS (S = 24), ACO-SS (S = 12, 16, 24) obtains higher metrics, which indicates that simply expanding the neighborhood information does not necessarily improve the final accuracy. Remote-sensing images taken from distant perspectives contain structures and objects of different sizes, so the neighborhood information of each sample should be adapted to local conditions. We should consider the size of the actual structure or object for each sample and increase the proportion of neighborhood information that is actually beneficial to its classification, rather than adding meaningless or even negative-feedback neighbors for all samples. It is noted that, due to the limited performance of the Attention Net, the sampling strategy does not raise the accuracy very much, and some categories (c2 ∼ c4, c7 ∼ c9) obtain lower accuracy. Therefore, compared with the results in Table 3, the overall OA, AA, and Kappa of the above three methods are not very high.

Based on the same ACO-SS (S = 12, 16, 24), the proposed DBFA-Net obtains higher results than the Attention Net. Our DBFA-Net makes up for the shortcoming of the general attention network that easily falls into local optima, and enhances the ability to extract features from complex remote-sensing patches. The accuracy of most categories is improved significantly. The accuracies of some categories (e.g., c1 (vegetation), c10 (tree), and c11 (water)) are not improved much, because the differences among the samples of these categories are not significant and their information distribution is relatively stable, so these categories are not very demanding on sample quality and network performance. Moreover, our ACO-SS (S = 12, 16, 24)+DBFA-Net is also better than the three state-of-the-art methods. The network used in DMIL [46] (a CNN for the MS branch and an SAE for the PAN branch) is relatively shallow and cannot adequately extract robust and significant feature representations when dealing with remote-sensing data with complex characteristics.

dealing with remote sensing data with complex characteristics. Therefore, most of the categories of DMIL are less accurate than DBFA-Net. And the strategy of feature fusion by a concatenated way is relatively rude and straightforward, it does not take into account the differences between the two features. SML-CNN [47] consider six local regions (four corner regions, an original region, and a central region) for a sample, and design six CNN for feature extraction. It obtain achieved better results than DMIL, but it has very low precision at categories with less training samples (e.g, c6 : 0.09%, c7 : 0.63%, and c9 : 0.07%). Considering that the networks in SML-CNN use 25% of labeled data as training samples, this indicates that the network is likely to fall into some local overfitting, thus the AA of SML-CNN is very low. N-IHS(S = 24)+Stacked-SEA uses the way of data fusion, and its fusion process will not only remove redundant information, but also inevitably lose and conceal some unique features of PAN and MS (the spectral range of PAN and MS does not overlap completely). When the feature information is not fully extracted from each data, the information of PAN and MS can not be completely retained. Due to c2 (Building1) and c4 (Building3) have similar spectral information and dense contacts, and data fusion operations flatten the uniqueness of these two categories, they are misclassified each other (c2 : 87.19%, c4 : 88.40%). Finally, we compare the efficiency of all algorithms. Compared with DMIL, SML-CNN and N-IHS(S = 24)+Stacked-SEA, Since N-IHS(S = 24)+Stacked-SEA has only one branch network, even though its network is more complex than that of DMIL, its overall running time is similar to that of DMIL. While our DBFA-Net needs longer running time because of the deeper and more complex network structure. In fact, when we started designing networks based on attention mechanisms especially for remote sensing data, we mainly focused on how to design a deeper and structure-based attention network to extract more stable and robust features. In the case of keeping the main structure of the DBFA-Net unchanged, we have also tried to reduce the number of feature maps per layer or reduce the number of layers of the network, but the results are not satisfactory. So we will study how to further increase the efficiency of the algorithm while maintaining high precision in the follow-up work.
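To make the idea of feature-level fusion by concatenation (the Attention Net baseline described above) more concrete, the following minimal PyTorch-style sketch extracts features from a PAN patch and an MS patch with two separate branches and concatenates them before a classifier. The branch definitions, layer sizes, and names (SimpleBranch, ConcatFusionNet, the 4-band MS assumption, 11 classes) are illustrative placeholders only, not the actual Stacked-BuTdA/Stacked-SEA configurations or the DBFA-Net architecture used in the paper.

import torch
import torch.nn as nn

# Illustrative placeholder branch: maps one patch to a fixed-length feature vector.
class SimpleBranch(nn.Module):
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> (B, 32, 1, 1)
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class ConcatFusionNet(nn.Module):
    """Concatenation-based feature-level fusion of a PAN branch and an MS branch."""
    def __init__(self, num_classes=11, feat_dim=128):
        super().__init__()
        self.pan_branch = SimpleBranch(in_channels=1, feat_dim=feat_dim)  # PAN: 1 band
        self.ms_branch = SimpleBranch(in_channels=4, feat_dim=feat_dim)   # MS: 4 bands (assumed)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, pan_patch, ms_patch):
        f_pan = self.pan_branch(pan_patch)
        f_ms = self.ms_branch(ms_patch)
        fused = torch.cat([f_pan, f_ms], dim=1)  # simple concatenation fusion
        return self.classifier(fused)

# Example: a 24x24 PAN patch and the corresponding (smaller) MS patch.
logits = ConcatFusionNet()(torch.randn(2, 1, 24, 24), torch.randn(2, 4, 6, 6))
print(logits.shape)  # torch.Size([2, 11])

The sketch only illustrates why simple concatenation ignores the differences between the two feature streams, which is the limitation that the attention-based fusion in DBFA-Net is designed to address.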


4.5. Experimental Results With Vancouver Level 1C Data, Xi’an Suburban Area Data, and Xi’an Urban Area Data

The comparison results on the remaining three data sets are similar to those on the Vancouver Level 1B data set. For all three data sets, ACO-SS(S = 12, 16, 24)+DBFA-Net obtains the highest AA, OA, and Kappa, verifying the effectiveness of ACO-SS and DBFA-Net. For the Vancouver Level 1C data set, the quantitative and qualitative results are shown in Table 5 and Fig. 7. Overall, the results of ACO-SS are better than those of Pixel-Centric. With Pixel-Centric, the accuracies of c5 (92.36%) and c8 (92.15%) are low. We analyze its confusion matrix and find that these two categories are easily misclassified as each other, which indicates that fixed-size neighborhood information does not distinguish the two similar categories well. From the colors of the overall images, we can see that the results of the Attention Net-based methods show some over-fitting. Our ACO-SS(S = 12, 16, 24)+DBFA-Net yields very high results, probably because the characteristics of each category in the Vancouver Level 1C image are relatively simpler than those in the Vancouver Level 1B image. Our method not only has higher precision than DMIL and N-IHS(S = 24)+Stacked-SEA in c2 (99.85% vs. 87.12% vs. 91.92%) and c4 (99.89% vs. 80.35% vs. 90.78%), but also has higher precision than SML-CNN in c6 (99.94% vs. 86.69%) and c8 (100.00% vs. 87.87%).

Table 5: Quantitative classification results of Vancouver Level 1C images.

Methods                                    c1      c2      c3      c4      c5      c6      c7      c8      AA(%)   OA(%)   Kappa(%)  Time(s)
Pixel-Centric (S = 24) + Attention Net     97.71   97.97   99.49   98.61   92.36   96.16   94.15   92.15   96.08   96.09   95.28     360.33
ACO-SS (S = 24) + Attention Net            99.43   97.67   99.22   96.68   98.81   97.44   94.78   97.58   97.70   97.35   96.79     390.76
ACO-SS (S = 12, 16, 24) + Attention Net    99.25   99.22   99.22   98.35   97.37   96.40   95.67   98.98   98.06   97.91   97.48     400.56
ACO-SS (S = 12, 16, 24) + DBFA-Net         100.00  99.85   100.00  99.89   99.89   99.94   99.80   100.00  99.92   99.89   99.86     446.45
DMIL [46]                                  99.63   87.12   98.68   80.35   99.63   93.32   92.98   99.45   93.90   94.82   93.32     390.76
SML-CNN [47]                               96.24   98.67   98.95   97.19   99.23   86.69   97.39   87.87   95.28   97.67   96.93     180.37
N-IHS [21] (S = 24) + Stacked-SEA          98.72   91.92   94.52   90.78   99.41   93.55   93.13   99.52   95.19   94.18   93.09     198.56

For the Xi’an Suburban Area data set, the quantitative and qualitative results are shown in Table 6 and Fig. 8. It can be seen that c3, c4, and c7 are the more difficult categories. We examine the confusion matrix and find that c3 (Vegetation1) and c4 (Vegetation2) are easily misclassified because of their similar information distribution and spatial proximity.


Figure 7: Classification maps of different methods on the Vancouver Level 1C Image. Each method has two classification maps: one is the overall image, and the other is the image with ground truth. (a) and (b) Pixel-centric(S = 24)+Attention Net. (c) and (d) ACO-SS(S = 12, 16, 24)+Attention Net. (e) and (f) ACO-SS(S = 12, 16, 24)+DBFA-Net. (g) and (h) DMIL [46]. (i) and (j) SML-CNN [47]. (k) and (l) N-IHS [21] (S = 24)+Stacked-SEA.


In addition, c7 (Road) is difficult to classify correctly because of its wide geographical distribution and large sample diversity. ACO-SS can adaptively select the neighborhood range based on the information distribution around each sample, alleviating the impact of the sample diversity of c7 on accuracy, and the center-offset strategy can also distinguish the spatially adjacent c3 and c4. Finally, relying on our attention-based feature-fusion network DBFA-Net, the accuracy of each category is further improved. For ACO-SS(S = 12, 16, 24)+DBFA-Net, the AA is 99.86%, the OA is 99.79%, and the Kappa is 99.75%.

Table 6: Quantitative classification results of Xi’an suburban area images.

Methods                                    c1      c2      c3      c4      c5      c6      c7      c8      AA(%)   OA(%)   Kappa(%)  Time(s)
Pixel-Centric (S = 24) + Attention Net     99.18   99.17   86.48   96.64   99.48   98.14   91.65   98.88   96.20   94.98   94.03     450.89
ACO-SS (S = 24) + Attention Net            99.45   99.40   91.99   97.21   99.71   99.05   95.20   99.49   97.69   97.02   96.45     486.56
ACO-SS (S = 12, 16, 24) + Attention Net    99.56   99.43   93.20   97.42   99.71   99.15   95.75   99.63   97.98   97.40   96.91     512.78
ACO-SS (S = 12, 16, 24) + DBFA-Net         100.00  100.00  99.14   100.00  100.00  99.97   99.76   100.00  99.86   99.79   99.75     567.89
DMIL [46]                                  100.00  95.11   94.58   96.21   99.32   98.79   91.26   99.88   96.89   95.76   94.86     493.41
SML-CNN [47]                               100.00  98.86   98.25   98.25   99.57   99.58   97.90   99.32   98.96   98.55   98.15     279.36
N-IHS [21] (S = 24) + Stacked-SEA          100.00  94.99   93.66   85.29   99.48   94.18   97.84   99.97   95.68   95.10   94.34     298.43

For the Xi’an Urban Area data set, the quantitative and qualitative results are shown in Table 7 and Fig. 9. It can be seen that c1 and c6 are the more difficult categories. Unlike the Xi’an Suburban Area, all buildings in the Xi’an Urban Area are labeled as a single category, although their shapes and colors differ considerably. As a result, c1 (Building) is widely distributed and the intra-class differences are relatively large. Moreover, c6 (Tree) is interspersed among the other categories and is easily confused with them. As mentioned above, our ACO-SS can set an appropriate neighborhood range for each sample, and the attention mechanism in DBFA-Net can strengthen the important parts of the delineated neighborhood information. Therefore, the proposed ACO-SS(S = 12, 16, 24)+DBFA-Net obtains the highest AA, OA, and Kappa.


Figure 8: Classification maps of different methods on the Xi’an Suburban Area Image. Each method has two classification maps: one is the overall image, and the other is the image with ground truth. (a) and (b) Pixel-centric(S = 24)+Attention Net. (c) and (d) ACO-SS(S = 12, 16, 24)+Attention Net. (e) and (f) ACO-SS(S = 12, 16, 24)+DBFA-Net. (g) and (h) DMIL [46]. (i) and (j) SML-CNN [47]. (k) and (l) N-IHS [21] (S = 24)+Stacked-SEA.

Table 7: Quantitative classification results of Xi’an urban area images.

Methods                                    c1      c2      c3      c4      c5      c6      c7      AA(%)   OA(%)   Kappa(%)  Time(s)
Pixel-Centric (S = 24) + Attention Net     84.69   99.78   92.76   96.43   98.84   94.64   99.41   95.22   95.22   93.75     241.41
ACO-SS (S = 24) + Attention Net            85.75   99.41   97.54   97.57   99.19   95.59   99.37   96.35   95.76   95.00     279.52
ACO-SS (S = 12, 16, 24) + Attention Net    87.47   99.35   98.45   97.80   99.45   96.04   99.42   96.85   96.43   95.74     306.16
ACO-SS (S = 12, 16, 24) + DBFA-Net         98.11   99.91   99.90   99.29   99.81   98.43   99.61   99.29   99.08   98.88     364.79
DMIL [46]                                  93.68   95.15   96.43   93.47   98.59   96.65   98.67   96.09   94.89   93.21     263.19
SML-CNN [47]                               92.77   95.29   92.25   92.27   98.81   97.66   97.35   95.20   95.58   94.69     165.57
N-IHS [21] + Stacked-SEA                   91.80   99.66   93.40   97.48   98.50   96.52   89.36   95.25   95.36   93.88     130.63


Figure 9: Classification maps of different methods on the Xi’an Urban Area Image. Each method has two classification maps: one is the overall image, and the other is the image with ground truth. (a) and (b) Pixel-centric(S = 24)+Attention Net. (c) and (d) ACO-SS(S = 12, 16, 24)+Attention Net. (e) and (f) ACO-SS(S = 12, 16, 24)+DBFA-Net. (g) and (h) DMIL [46]. (i) and (j) SML-CNN [47]. (k) and (l) N-IHS [21] (S = 24)+Stacked-SEA.


4.6. Experimental Results With Vancouver Data

To further verify the robustness and generalization performance of the proposed method, we conducted a transfer-learning experiment. In this experiment, we extracted the training sample set from the Vancouver Level 1B image data and then used the labeled Vancouver Level 1C image data as the test set to calculate the metrics. We keep only the categories contained in both image data sets, namely vegetation, boats, roads, trees, and water. The other parameter settings are the same as those in the previous experiments. The experimental result is shown in Table 8. The Kappa of this experiment is 96.23%. Although the accuracies of c3 (tree), c5 (water), and c8 (boat) are all above 98%, the accuracies of c1 (vegetation) and c6 (road) are slightly lower. For c1 (vegetation), the acquisition times of the two image data sets differ by two months, and low vegetation changes over time. In addition, the atmospheric conditions and climate at the two acquisition times are different, so the image information of low vegetation differs considerably between the two data sets, resulting in a decrease in accuracy. For c6 (road), the morphology and environment in the two image data sets are quite different, and the training samples of this category are relatively few, reducing the overall accuracy. Although the classification result is not as good as that of the experiments in which the training and testing samples come from the same data set, the overall result is still acceptable, reflecting that the proposed method has a certain degree of robustness. Our future research will investigate this issue in more depth.

Table 8: Experimental Result With Vancouver Data.

Methods                               c1 (vegetation)  c3 (tree)  c5 (water)  c6 (road)  c8 (boat)  AA(%)   OA(%)   Kappa(%)
ACO-SS (S = 12, 16, 24) + DBFA-Net    90.10            98.65      99.56       80.43      99.04      93.56   98.70   96.23
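As a sketch of the cross-scene protocol used in this transfer-learning experiment (train on the Level 1B scene, test on the labeled Level 1C scene, restricted to the shared classes), the outline below uses hypothetical helper names (load_samples, train, predict) that are not part of the paper's code; only the class filtering is shown as runnable code.

# Hedged sketch of the cross-scene evaluation protocol described above.
SHARED_CLASSES = {"vegetation", "tree", "water", "road", "boat"}

def filter_shared(samples, labels):
    """Keep only samples whose label belongs to the classes shared by both scenes."""
    keep = [i for i, y in enumerate(labels) if y in SHARED_CLASSES]
    return [samples[i] for i in keep], [labels[i] for i in keep]

# x_train, y_train = load_samples("vancouver_level_1B")   # training scene (hypothetical loader)
# x_test,  y_test  = load_samples("vancouver_level_1C")   # unseen test scene
# x_train, y_train = filter_shared(x_train, y_train)
# x_test,  y_test  = filter_shared(x_test,  y_test)
# model = train(x_train, y_train)                         # e.g., ACO-SS + DBFA-Net
# conf  = confusion_matrix(y_test, predict(model, x_test))
# print(oa_aa_kappa(conf))                                # metrics as in the earlier sketch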

5. Conclusion

In this paper, we propose a dual-branch attention fusion deep network (DBAF-Net) for multiresolution remote-sensing image classification. The above experiments show that ACO-SS is more suitable for extracting samples from remote-sensing images of large scenes. Unlike the general pixel-centric sampling strategy with a fixed patch size for all samples, ACO-SS adaptively determines the neighborhood range and adjusts the original patch center, so that the patch provides more neighborhood information that is beneficial to its classification. Based on ACO-SS, the proposed spatial attention module (SA-module) and channel attention module (CA-module) can enhance important features and suppress unimportant information, thus extracting more powerful and stable feature representations for PAN and MS data, respectively. Finally, DBAF-Net effectively interfuses these two features and obtains state-of-the-art classification accuracy on four remote-sensing data sets. In the near future, we will further explore better network modules so that the network can improve efficiency while ensuring high precision.

Acknowledgment

The authors would like to thank Deimos Imaging for acquiring and providing the data used in this paper, and the IEEE GRSS Image Analysis and Data Fusion Technical Committee. They would also like to thank Dr. W. Zhao and Dr. X. Liu at Xidian University, China, for their kind assistance.

References

[1] C. K. Munechika, J. S. Warnick, C. Salvaggio, J. R. Schott, Resolution enhancement of multispectral image data to improve classification accuracy, Photogramm. Eng. Remote Sens. 59 (1) (1993) 67–72.
[2] X. Jia, J. A. Richards, Cluster-space representation for hyperspectral data classification, IEEE Trans. Geosci. Remote Sens. 40 (3) (2002) 593–598.
[3] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: A survey of the state of the art, Inform. Fusion 33 (2017) 100–112.
[4] I. Amro, J. Mateos, M. Vega, R. Molina, A. K. Katsaggelos, A survey of classical methods and new trends in pansharpening of multispectral images, EURASIP J. Adv. Signal Process. 2011 (1) (2011) 79.
[5] I. L. Castillejogonzalez, F. Lopezgranados, A. Garciaferrer, J. M. Penabarragan, M. Juradoexposito, M. S. D. La Orden, M. Gonzalezaudicana, Object- and pixel-based analysis for mapping crops and their agro-environmental associated measures using QuickBird imagery, Comput. Electron. Agric. 68 (2) (2009) 207–215.

[6] G. Masi, D. Cozzolino, L. Verdoliva, G. Scarpa, Pansharpening by convolutional neural networks, Remote Sens. 8 (7) (2016) 594.
[7] J. Zhong, B. Yang, G. Huang, F. Zhong, Z. Chen, Remote sensing image fusion with convolutional neural network, Sens. Imag. 17 (1) (2016) 10.
[8] P. P. Shingare, P. M. Hemane, D. S. Dandekar, Fusion classification of multispectral and panchromatic image using improved decision tree algorithm, in: Proc. Int. Conf. Signal Propag. Comput. Technol., 2014, pp. 598–603.
[9] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, X. Wang, Deep learning for pixel-level image fusion: recent advances and future prospects, Inform. Fusion 42 (2018) 158–173.
[10] A. K. Shackelford, C. H. Davis, A hierarchical fuzzy classification approach for high-resolution multispectral data over urban areas, IEEE Trans. Geosci. Remote Sens. 41 (9) (2003) 1920–1932.
[11] Z.-H. Huang, W.-J. Li, J. Wang, T. Zhang, Face recognition based on pixel-level and feature-level fusion of the top-levels wavelet sub-bands, Inform. Fusion 22 (2015) 95–104.
[12] G. Moser, S. B. Serpico, Joint classification of panchromatic and multispectral images by multiresolution fusion through Markov random fields and graph cuts, in: 17th Proc. Int. Conf. Digit. Signal Process., IEEE, 2011, pp. 1–8.
[13] G. Moser, A. De Giorgi, S. B. Serpico, Multiresolution supervised classification of panchromatic and multispectral images by Markov random fields and graph cuts, IEEE Trans. Geosci. Remote Sens. 54 (9) (2016) 5054–5070.
[14] T. Mao, H. Tang, J. Wu, W. Jiang, S. He, Y. Shu, A generalized metaphor of Chinese restaurant franchise to fusing both panchromatic and multispectral images for unsupervised classification, IEEE Trans. Geosci. Remote Sens. 54 (8) (2016) 4594–4604.
[15] J. Zhang, T. Li, X. Lu, Z. Cheng, Semantic classification of high-resolution remote-sensing images based on mid-level features, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 9 (6) (2016) 2343–2353.

[16] X. Meng, H. Shen, H. Li, L. Zhang, R. Fu, Review of the pansharpening methods for remote sensing images based on the idea of meta-analysis: practical discussion and challenges, Inform. Fusion 46 (2019) 102–113.
[17] Q. Yuan, Y. Wei, X. Meng, H. Shen, L. Zhang, A multiscale and multidepth convolutional neural network for remote sensing imagery pansharpening, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 11 (3) (2018) 978–989.
[18] X. Meng, H. Shen, Q. Yuan, H. Li, L. Zhang, W. Sun, Pansharpening for cloud-contaminated very high-resolution remote sensing images, IEEE Trans. Geosci. Remote Sens. 57 (5) (2019) 2840–2854.
[19] T. Tu, S. Su, H. Shyu, P. S. Huang, A new look at IHS-like image fusion methods, Inform. Fusion 2 (3) (2001) 177–186.
[20] T. Tu, P. S. Huang, C. Hung, C. Chang, A fast intensity-hue-saturation fusion technique with spectral adjustment for IKONOS imagery, IEEE Trans. Geosci. Remote Sens. 1 (4) (2004) 309–312.
[21] M. Ghahremani, H. Ghassemian, Nonlinear IHS: A promising method for pan-sharpening, IEEE Geosci. Remote Sens. Lett. 13 (11) (2016) 1606–1610.
[22] M. Choi, A new intensity-hue-saturation fusion approach to image fusion with a tradeoff parameter, IEEE Trans. Geosci. Remote Sens. 44 (6) (2006) 1672–1682.
[23] B. Aiazzi, S. Baronti, M. Selva, Improving component substitution pansharpening through multivariate regression of MS+PAN data, IEEE Trans. Geosci. Remote Sens. 45 (10) (2007) 3230–3239.
[24] C. A. Laben, B. V. Brower, Process for enhancing the spatial resolution of multispectral imagery using pan-sharpening, U.S. Patent 6011875 A.
[25] H. Fenghua, Y. Luming, Study on the hyperspectral image fusion based on the Gram-Schmidt improved algorithm, J. Inform. Technol. 12 (22) (2013) 6694–6701.
[26] A. R. Gillespie, A. B. Kahle, R. E. Walker, Color enhancement of highly correlated images. II - Channel ratio and ’chromaticity’ transformation techniques, Remote Sens. of Environ. 22 (3) (1987) 343–365.

[27] P. S. Pradhan, R. L. King, N. H. Younan, D. W. Holcomb, Estimation of the number of decomposition levels for a wavelet-based multiresolution multisensor image fusion, IEEE Trans. Geosci. Remote Sens. 44 (12) (2006) 3674–3686.
[28] S. Li, Multisensor remote sensing image fusion using stationary wavelet transform: Effects of basis and decomposition level, Int. J. Wavelets Multiresolut. Inf. Process. 6 (01) (2008) 37–50.
[29] X. Otazu, M. Gonzalezaudicana, O. Fors, J. Nunez, Introduction of sensor spectral response into image fusion methods. Application to wavelet-based methods, IEEE Trans. Geosci. Remote Sens. 43 (10) (2005) 2376–2385.
[30] G. Hong, Y. Zhang, Comparison and improvement of wavelet-based image fusion, Int. J. Remote Sens. 29 (3) (2008) 673–691.
[31] S. Zheng, W. Shi, J. Liu, J. Tian, Remote sensing image fusion using multiscale mapped LS-SVM, IEEE Trans. Geosci. Remote Sens. 46 (5) (2008) 1313–1322.
[32] B. Aiazzi, L. Alparone, S. Baronti, A. Garzelli, M. Selva, MTF-tailored multiscale fusion of high-resolution MS and PAN imagery, Photogramm. Eng. Remote Sens. 72 (5) (2006) 591–596.
[33] J. Lee, C. Lee, Fast and efficient panchromatic sharpening, IEEE Trans. Geosci. Remote Sens. 48 (1) (2010) 155–163.
[34] L. Alparone, L. Wald, J. Chanussot, C. Thomas, P. Gamba, L. M. Bruce, Comparison of pansharpening algorithms: Outcome of the 2006 GRS-S data-fusion contest, IEEE Trans. Geosci. Remote Sens. 45 (10) (2007) 3012–3021.
[35] S. Yang, M. Wang, L. Jiao, Fusion of multispectral and panchromatic images based on support value transform and adaptive principal component analysis, Inform. Fusion 13 (3) (2012) 177–184.
[36] W. Wang, L. Jiao, S. Yang, Fusion of multispectral and panchromatic images via sparse representation and local autoregressive model, Inform. Fusion 20 (2014) 73–87.

[37] S. Li, B. Yang, A new pan-sharpening method using a compressed sensing technique, IEEE Trans. Geosci. Remote Sens. 49 (2) (2011) 738–746.
[38] X. X. Zhu, R. Bamler, A sparse image fusion algorithm with application to pan-sharpening, IEEE Trans. Geosci. Remote Sens. 51 (5) (2013) 2827–2836.
[39] S. Li, H. Yin, L. Fang, Remote sensing image fusion via sparse representations over learned dictionaries, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4779–4789.
[40] C. Jiang, H. Zhang, H. Shen, L. Zhang, Two-step sparse coding for the pan-sharpening of remote sensing images, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 7 (5) (2013) 1792–1805.
[41] C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2015) 295–307.
[42] G. Vivone, L. Alparone, J. Chanussot, M. D. Mura, A. Garzelli, G. Licciardi, R. Restaino, L. Wald, A critical comparison among pansharpening algorithms, IEEE Trans. Geosci. Remote Sens. 53 (5) (2015) 2565–2586.
[43] Y. Lecun, Y. Bengio, G. E. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[44] L. Zhang, L. Zhang, B. Du, Deep learning for remote sensing data: A technical tutorial on the state of the art, IEEE Geosci. Remote Sens. Mag. 4 (2) (2016) 22–40.
[45] L. Jiao, J. Zhao, S. Yang, F. Liu, Deep Learning, Optimization and Recognition, Tsinghua University Press, 2017.
[46] X. Liu, L. Jiao, J. Zhao, J. Zhao, D. Zhang, F. Liu, S. Yang, X. Tang, Deep multiple instance learning-based spatial-spectral classification for PAN and MS imagery, IEEE Trans. Geosci. Remote Sens. PP (99) (2017) 1–13.


[47] W. Zhao, L. Jiao, W. Ma, J. Zhao, J. Zhao, H. Liu, X. Cao, S. Yang, Superpixel-based multiple local CNN for panchromatic and multispectral image classification, IEEE Trans. Geosci. Remote Sens. 55 (7) (2017) 4141–4156.
[48] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images, IEEE Trans. Geosci. Remote Sens. 54 (12) (2016) 7405–7415.
[49] M. Gong, J. Zhao, J. Liu, Q. Miao, L. Jiao, Change detection in synthetic aperture radar images based on deep neural networks, IEEE Trans. Neural Netw. Learn. Syst. 27 (1) (2016) 125–138.
[50] Y. Duan, F. Liu, L. Jiao, P. Zhao, L. Zhang, SAR image segmentation based on convolutional-wavelet neural network and Markov random field, Pattern Recogn. 64 (2017) 255–267.
[51] Z. Zhao, L. Jiao, J. Zhao, J. Gu, J. Zhao, Discriminant deep belief network for high-resolution SAR image classification, Pattern Recogn. 61 (2017) 686–701.
[52] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
[53] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1520–1528.
[54] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (12) (2017) 2481–2495.
[55] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[56] M. Kocan, G. A. Umana-Membreno, F. Recht, A. Baharin, N. A. Fichtenbaum, L. Mccarthy, S. Keller, R. Menozzi, U. K. Mishra, G. Parish, Detection of local features invariant to affine transformations: application to matching and recognition, Bibliogr 43 (7-8) (2002) 64–65.

[57] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[58] F. Aurenhammer, Voronoi diagrams: A survey of a fundamental geometric data structure, ACM Comput. Surv. 23 (3) (1991) 345–405.
[59] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[60] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1492–1500.
[61] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, X. Tang, Residual attention network for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3156–3164.
[62] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
[63] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[64] 2016 IEEE GRSS Data Fusion Contest. Online: http://www.grssieee.org/community/technical-committees/data-fusion.
[65] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.
