
Object-based multi-modal convolution neural networks for building extraction using panchromatic and multispectral imagery

Yang Chen a, Luliang Tang a,*, Xue Yang b, Muhammad Bilal c, Qingquan Li a,d

a. State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China.
b. School of Geography and Information Engineering, China University of Geosciences, Wuhan 430074, China.
c. School of Marine Sciences, Nanjing University of Information Science and Technology, Nanjing 210044, China.
d. College of Civil Engineering, Shenzhen University, Shenzhen 518060, China.

* Corresponding author: Luliang Tang; E-mail: [email protected]

Abstract: Building extraction is an important task for urbanization monitoring, city planning, and urban change detection. It is not an easy task due to the spectral heterogeneity and structural diversity of complex backgrounds. In this paper, an object-based multi-modal convolutional neural network (OMM-CNN) is proposed for building extraction using panchromatic and multispectral imagery. Specifically, a multi-modal deep CNN architecture (a multispectral CNN and a panchromatic CNN) is designed that can mine multiscale spectral-spatial contextual information. To fully explore the spatial-spectral information embedded in panchromatic and multispectral images, a complex-valued convolution layer and a complex-valued self-adaptive pooling layer are developed. Furthermore, to improve building extraction accuracy and efficiency, a simple linear iterative clustering (SLIC) algorithm is used to segment the panchromatic and multispectral remote sensing imagery simultaneously. Results demonstrate that the proposed method can extract different types of buildings and is more accurate and efficient than recent building extraction methods.

Keywords: Multi-modal convolution neural networks; building extraction; remote sensing imagery; superpixel


1. Introduction

The availability of high-resolution remote sensing images capturing fine details enables the extraction of objects such as buildings, urban water, and roads (Sharma et al., 2017). Among these object extraction tasks, building extraction has been a hot research topic for many urban applications, such as urban planning, urban change detection, and urban management (Vakalopoulou et al., 2015). Therefore, the precise extraction of urban buildings is a significant task for urban planning and development. Although many building extraction approaches have been developed (Weidner & Förstner, 1995), it is hard to distinguish buildings from backgrounds using spectral-based classifiers such as support vector machines (SVM) and random forests (Eskandarpour & Khodaei, 2018), because building roofs are made of different materials. Moreover, traditional building extraction methods based on pixel-level predictions tend to produce inconsistent extraction results with "salt-and-pepper" noise in high-resolution imagery (Al-Nahas & Shafri, 2018). The object-based image analysis (OBIA) technique can overcome the "salt-and-pepper" shortcoming and has been widely used in remote sensing image processing (Chima, Trodd, & Blackett, 2018). However, building extraction remains challenging due to the spectral heterogeneity and structural diversity of complex backgrounds.

To distinguish buildings from non-building objects in very-high-resolution images, a large number of cues, including regular shape, lines, and color, have been used in previous building extraction approaches (Li et al., 2018). Cote and Saeedi (2013) developed a method combining distinctive corners and a variational level set formulation to extract building outlines, but it cannot extract irregular buildings. Cui, Yan, and Reinartz (2012) exploited the Hough transform (HT) to extract buildings, but the HT method is sensitive to fitting parameters. Huang and Zhang (2011) proposed the morphological building index (MBI) for automatic building extraction in high-resolution images, but the MBI method is limited by shadow detection performance. Some of these methods based on multispectral images are further limited by the quality of the hand-crafted feature design (Sun et al., 2019). Overall, the performance of traditional methods is limited by hand-crafted features, which do not take advantage of high-level information (Gao et al., 2019; Ramachandram et al., 2018).

In recent years, deep convolutional neural network (CNN) algorithms have achieved state-of-the-art performance for semantic image segmentation, data fusion, and image classification (Nogueira et al., 2016).

In data fusion, a CNN architecture composed of fusion paths with and without a pooling layer has been designed to capture spatial-temporal information (Jin et al., 2018). A modified fusion CNN has been proposed to consider multi-level and multi-scale features of the input images (Xu et al., 2019). A novel fusion CNN has been exploited to fuse RGB and depth information (Hoffman et al., 2016). Deep CNNs can effectively extract high-level information from input remote sensing data (Wu & Gu, 2015); they use local connections to effectively extract spatial information and shared weights to significantly reduce the number of parameters (Lunga et al., 2018). Nowadays, deep CNNs have been applied to information extraction from remote sensing imagery, including building extraction, water extraction, and road extraction. For example, a single CNN has been used to extract roads and buildings simultaneously (Alshehhi et al., 2017), but the single-CNN method cannot extract different types of buildings. Generally, it is not straightforward to obtain good building extraction results when using only a single data source.

This study aims to extract different types of buildings and to solve the problem of separating adjacent buildings on complex urban surfaces. In this paper, an object-based multi-modal convolutional neural network is proposed for building extraction using panchromatic and multispectral imagery. Specifically, a novel deep CNN architecture (a multispectral CNN and a panchromatic CNN) is designed that can mine multiscale spectral-spatial contextual information. Furthermore, to improve building extraction accuracy and efficiency, the spectral-spatial deep CNNs are integrated with the spectral and spatial information of simple linear iterative clustering (SLIC) (Achanta et al., 2012) segmentation in the preprocessing stage. The experimental results demonstrate that the multi-modal CNN architecture can extract irregular buildings and buildings of different sizes. The major contributions of this paper are outlined below.

(1) The multi-modal CNN architecture (a multispectral CNN and a panchromatic CNN) is designed, which can deeply mine building spectral-spatial contextual information.

(2) To fully explore the spatial-spectral information embedded in panchromatic and multispectral images, a complex-valued convolution layer and a complex-valued self-adaptive pooling layer are developed.

(3) To better exploit spectral-spatial information and overcome the "salt-and-pepper" shortcoming, a simple linear iterative clustering (SLIC) algorithm is used to segment the panchromatic and multispectral remote sensing imagery simultaneously.

The remainder of this paper is organized as follows: the proposed deep CNN architecture for building extraction is introduced in Section 2, the experimental results are discussed in Section 3, and finally, the conclusion is summarized in Section 4.

2. Methods

In this section, the proposed framework for building extraction based on spectral-spatial deep convolutional neural networks is introduced, and its main contributions are highlighted. First, a novel SLIC algorithm for the simultaneous segmentation of panchromatic and multispectral remote sensing imagery is discussed, which is used to enhance the CNN outputs. Second, the parallel spectral-spatial deep CNN architecture (the multispectral CNN and the panchromatic CNN) is introduced. The overall structure of the proposed framework for building extraction from panchromatic and multispectral imagery is shown in Fig. 1: image pre-processing (SLIC segmentation of the multispectral and panchromatic images), feature extraction (the multispectral and panchromatic CNNs of the multi-modal CNN), and feature fusion producing the building extraction result.

Fig. 1. Proposed building extraction framework

2.1. Pre-processing

Generally, a CNN can typically address only a small patch of the input image due to graphics processing unit (GPU) memory limitations. For building extraction, pixel-based methods tend to produce inconsistent extraction results with "salt-and-pepper" noise in high-resolution imagery. To solve this problem, in this paper, superpixels are used as the objects from which buildings are extracted. Therefore, a pre-processing step is used to improve the accuracy and efficiency of building extraction.

2.1.1. Creation of Objects in Panchromatic Remote Sensing Images

The SLIC algorithm is well known for over-segmenting three-channel natural images (Fulkerson, Vedaldi, & Soatto, 2009); an image object may contain multiple superpixels. However, unlike three-channel natural images, panchromatic remote sensing images comprise only one band (the Pan band, e.g., the IKONOS 0.45-0.90 µm band). Therefore, the regular SLIC algorithm is no longer practical. To avoid the disadvantages of applying the SLIC method to panchromatic remote sensing images, this paper proposes a novel segmentation of the high-resolution panchromatic image based on the regular SLIC algorithm. The spectral-spatial dissimilarity of two pixels can be obtained by measuring the dissimilarity of the two local patches centered on them (Fang et al., 2015). According to Yu, Wang, Liu, and He (2016), the dissimilarity of two patches can be expressed as:

$D(C_m, C_n) = -\ln\left[L_G(C_m, C_n)\right]$    (1)

where $L_G(C_m, C_n)$ represents the generalized likelihood ratio (GLR) test for patches $C_m$ and $C_n$. Assuming that the pixel values of a single-channel panchromatic remote sensing image follow a Gamma distribution, the dissimilarity measure between the $C_m$-th patch and the $C_n$-th patch can be expressed by Equation (2).

$d(C_m, C_n) = 2M \ln \dfrac{\bar{C}_m + \bar{C}_n}{2\sqrt{\bar{C}_m \bar{C}_n}}$    (2)

where $\bar{C}_m$ represents the average spectral value in patch $C_m$, $\bar{C}_n$ represents the average spectral value in patch $C_n$, and $M$ is the number of pixels in $C_m$. The patch dissimilarity $d(C_m, C_n)$ in Equation (2) is used to measure the spectral dissimilarity of pixels $m$ and $n$, with $C_m$ and $C_n$ being the two patches centered on them. Besides the spectral dissimilarity, the spatial dissimilarity of patches should also be considered. In this study, the spatial distance of two local patches is obtained by measuring the spatial distance of their cluster center pixels. Referring to Xiang, Tang, Zhao, and Su (2013), the spatial distance between the cluster center pixels of two local patches can be defined by Equation (3).


$d(m, n) = \sqrt{(x_m - x_n)^2 + (y_m - y_n)^2}$    (3)

where $(x_m, y_m)$ and $(x_n, y_n)$ are the spatial coordinates of the cluster center pixels $m$ and $n$ of the two local patches $C_m$ and $C_n$, respectively. For panchromatic remote sensing images, the spectral-spatial dissimilarity of pixels $m$ and $n$ can be expressed by Equation (4).

$D(m, n) = d(C_m, C_n) + \lambda \, d(m, n)$    (4)

where $\lambda$ is used to control the relative weight between the spectral dissimilarity and the spatial dissimilarity; a larger value of $\lambda$ generates more compact superpixels. In this paper, the spectral-spatial dissimilarity of two pixels is obtained by measuring the dissimilarity of the two local patches centered on them. The other parts of the modified SLIC are similar to the regular SLIC method.
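To make the modified dissimilarity concrete, the following NumPy sketch evaluates Equations (2)-(4) for a single-band image. This is a minimal illustration, not the authors' code: the patch half-width `half` and the weight `lam` are illustrative values.

```python
import numpy as np

def patch_dissimilarity(patch_m, patch_n):
    """Spectral dissimilarity of two patches (Eq. 2), assuming
    Gamma-distributed pixel values and equal patch sizes."""
    m_bar = patch_m.mean()  # average spectral value in patch C_m
    n_bar = patch_n.mean()  # average spectral value in patch C_n
    M = patch_m.size        # number of pixels in C_m
    return 2.0 * M * np.log((m_bar + n_bar) / (2.0 * np.sqrt(m_bar * n_bar)))

def spectral_spatial_dissimilarity(img, cm_center, cn_center, half=3, lam=0.5):
    """Combined dissimilarity D(m, n) of Eq. (4) for a single-band image.
    `half` (patch half-width) and `lam` (the weight lambda) are
    illustrative values, not the authors' settings."""
    (xm, ym), (xn, yn) = cm_center, cn_center
    patch_m = img[ym - half:ym + half + 1, xm - half:xm + half + 1]
    patch_n = img[yn - half:yn + half + 1, xn - half:xn + half + 1]
    d_spec = patch_dissimilarity(patch_m, patch_n)  # Eq. (2)
    d_spat = np.hypot(xm - xn, ym - yn)             # Eq. (3)
    return d_spec + lam * d_spat                    # Eq. (4)
```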

2.1.2 Creation of Objects in Multispectral Remote Sensing Images

Superpixels are produced by the SLIC algorithm, which outputs compact superpixels (Qin, Guo, & Lang, 2015); therefore, the SLIC method is applied to produce superpixel blocks in the preprocessing stage. In general, the SLIC algorithm is used for natural images (Achanta et al., 2012). However, high-resolution multispectral (MS) images often have more than three bands (blue, green, red, and infrared). To fully consider the information of all four bands, the SLIC algorithm is first improved to segment the MS image with four bands. The CIELAB space is used in the SLIC algorithm (Achanta et al., 2012); therefore, the three bands of red, green, and blue are converted to the CIELAB color space [l, a, b] and can be used directly to calculate the CIELAB color-space distance (Wang et al., 2018). To better consider the infrared band information, the distance of the infrared band is calculated independently. The spectral-spatial dissimilarities between the cluster center and the pixel are calculated separately, and the final dissimilarity between the two points is determined by a weighted calculation.

$d_{nir} = \sqrt{(nir_i - nir_j)^2}$    (5)

$d_s = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$    (6)

$d_c = \sqrt{(l_i - l_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2}$    (7)

$D = \sqrt{\left(\dfrac{d_c}{m}\right)^2 + \left(\dfrac{d_s}{S}\right)^2 + \left(\dfrac{d_{nir}}{m}\right)^2}$    (8)

where $d_{nir}$ is the infrared-band distance between pixels, $d_s$ is the spatial distance between pixels, and $d_c$ is the spectral distance between pixels. The parameter $D$ is the spectral-spatial dissimilarity between the cluster center and the pixel. The parameter $m$ controls the relative importance of spectral similarity and spatial proximity. $S$ is the scale parameter, which controls the number and size of the superpixels. In this study, the modified SLIC clustering is carried out in the 6-D space $[l_i, a_i, b_i, nir_i, x_i, y_i]^T$. The other parts of the modified SLIC are similar to the regular SLIC algorithm.
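Under these definitions, the modified SLIC distance between a cluster center and a pixel reduces to a few lines; a minimal NumPy sketch follows, where the compactness m and scale S values shown are illustrative only.

```python
import numpy as np

def slic_distance(center, pixel, m=10.0, S=20.0):
    """Spectral-spatial dissimilarity D of Eq. (8) between a cluster
    center and a pixel, both given as 6-D vectors [l, a, b, nir, x, y].
    m (compactness) and S (scale) are illustrative values."""
    l1, a1, b1, nir1, x1, y1 = center
    l2, a2, b2, nir2, x2, y2 = pixel
    d_c = np.sqrt((l1 - l2)**2 + (a1 - a2)**2 + (b1 - b2)**2)  # Eq. (7)
    d_s = np.sqrt((x1 - x2)**2 + (y1 - y2)**2)                 # Eq. (6)
    d_nir = np.sqrt((nir1 - nir2)**2)                          # Eq. (5)
    return np.sqrt((d_c / m)**2 + (d_s / S)**2 + (d_nir / m)**2)  # Eq. (8)
```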

In Fig. 2(a), an original image of 440×450 pixels is segmented into 500 superpixels, using the same parameters for the modified SLIC and the regular SLIC algorithm. As shown in Fig. 2(b-c), the modified SLIC can segment irregularly shaped buildings and ensures better building boundary adherence than the regular SLIC, because it uses more band information to segment the images.

Fig. 2. An instance of superpixel segmentation: (a) original image; (b) the modified SLIC; (c) the regular SLIC

2.2 Object-Based Multi-modal Convolutional Neural Network Structure

In the deep learning field, the convolutional neural network (CNN) is one of the most representative deep learning algorithms. CNNs have been successfully applied to remote sensing image processing, including building extraction (Chen et al., 2018). In general, a regular CNN structure consists of five types of layers: the input layer, the convolution layer, the pooling layer, the fully connected layer, and the output layer (Chen et al., 2016), as illustrated in Fig. 3.


Fig. 3. The regular structure of the CNN (input layer, convolution layer, pooling layer, fully connected layer, and output layer)

The convolution layer produces new feature maps from previous feature maps and acts as multiple learnable filters on the input image. To fully explore the rich information embedded in panchromatic and multispectral images, a novel convolutional layer is developed in the complex-valued space. The complex-valued convolutional (CV-Conv) process can be expressed by Equation (9).

$G = W * X + b = (W_1 * X_1 - W_2 * X_2 + b_1) + (W_2 * X_1 + W_1 * X_2 + b_2)i$    (9)

where $G \in \mathbb{C}^{m \times n \times I}$ are the complex output feature maps, $X = X_1 + X_2 i \in \mathbb{C}^{m \times n \times I}$ are the previous layer's input feature maps, $W = W_1 + W_2 i$ is a complex-valued tensor, $i = \sqrt{-1}$ is the imaginary unit, the symbol $*$ denotes the convolution operation, and $b = b_1 + b_2 i$ is the offset vector in the complex-valued space. So as not to change the spatial extent of the activations after convolutions, a zero-padding method is used in this paper, since it does not change the activations and compensates for the pixels lost at the borders of the feature maps.

Generally, the output of a convolutional layer is passed through an activation function before the pooling layers. The Rectified Linear Unit (ReLU) is the most commonly used activation function in the literature, because neurons with rectified functions perform well in overcoming saturation during the learning process (Chen et al., 2018); therefore, it is used in this paper. The modified ReLU (M-ReLU) function, extended from the real to the complex domain, can be expressed by Equation (10).

$f(x + yi) = \max(0, x) + \max(0, y)i$    (10)

where $x + yi$ is the input to a neuron and $i = \sqrt{-1}$ is the imaginary unit.
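Equations (9) and (10) together define one complex-valued layer, and both can be implemented with real-valued operations. The following TensorFlow sketch illustrates this; the function names, tensor shapes, and the use of 'SAME' padding to realize the zero-padding scheme are our illustrative choices, not code from the authors.

```python
import tensorflow as tf

def complex_conv2d(x_re, x_im, w_re, w_im, b_re, b_im):
    """Complex-valued convolution of Eq. (9), implemented with four
    real-valued convolutions (a sketch; shapes are illustrative).
    x_*: [batch, H, W, C_in] tensors, w_*: [k, k, C_in, C_out] kernels."""
    conv = lambda x, w: tf.nn.conv2d(x, w, strides=1, padding='SAME')
    g_re = conv(x_re, w_re) - conv(x_im, w_im) + b_re  # real part
    g_im = conv(x_re, w_im) + conv(x_im, w_re) + b_im  # imaginary part
    return g_re, g_im

def m_relu(g_re, g_im):
    """Modified ReLU of Eq. (10): ReLU applied to the real and
    imaginary parts independently."""
    return tf.nn.relu(g_re), tf.nn.relu(g_im)
```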

(10)

where x  yi is the input to a neuron, i  1 is the imaginary unit. A pooling operator is used to reduce the spatial dimension (Lau, Lim, & Gopalai, 2015). The common pooling methods include average pooling and max pooling (Zhang & Wu, 2018). For max pooling, a common way is to simply take the max value. For average pooling, a common way is to simply take the average value. However,the intensity value of building often was higher than background at remote sensing images. The red color indicates dominant features in the pooling regions as shown in Fig 4. For the case of Fig. 4(a), if pooling is done with the average pooling model, the features of the new feature map will be weakened. For the case of Fig. 4(b), if pooling is done with the max-pooling model, some features of the new feature map will be ignored. For example, if most of the elements in the pooling region are of red magnitudes, the distinguishing feature (white region) vanishes 8

after max pooling as shown in Fig. 4(b).

Fig. 4. Illustration of the shortcomings of max pooling and average pooling: (a) the average-pooling shortcoming; (b) the max-pooling shortcoming (each panel shows the input feature map and the results after max pooling and after average pooling)

To reduce the loss of building features during pooling, a self-adaptive pooling layer is proposed, based on max pooling and average pooling in the complex-valued space. The self-adaptive pooling (SAP) process, extended from the real to the complex domain, can be expressed by Equation (11).

$F_{ij} = \dfrac{1-\beta}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{ij} + \beta \left[ \max_{i,j}(G_{ij}) + \max_{i,j}(G_{ij})\,i \right] + b$    (11)

where $\max(G_{ij}) + \max(G_{ij})i$ is the maximum value of the feature map $G_{ij}$ in the complex-valued space (the maxima of the real and imaginary parts, respectively), $\beta$ is the pooling factor indicating the balance between max pooling and average pooling, $\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{ij}$ is the average-pooling function in the complex-valued space, and the size of the pooling area is $N \times N$ pixels. $F_{ij}$ is the pooled output of the feature map $G_{ij}$.

The role of $\beta$ is to dynamically balance max pooling and average pooling for each pooling block. The pooling factor can be expressed by Equation (12).

$\beta = \dfrac{n_{max} - \bar{m}}{n_{max} + \bar{m}}$    (12)

where $\bar{m}$ is the average of all elements except the maximum element in the pooling region, and $n_{max}$ is the maximum element in the pooling region. The range of $\beta$ is [0, 1].
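A minimal NumPy sketch of Equations (11)-(12) on one pooling block is given below. How $\beta$ is obtained for a complex-valued block is not fully specified in the text; applying Equation (12) to the real and imaginary parts independently, and omitting the bias term, are our assumptions.

```python
import numpy as np

def self_adaptive_pool(block_re, block_im):
    """Self-adaptive pooling (Eqs. 11-12) on one N x N block of a complex
    feature map; bias term omitted. Assumes non-negative activations
    (post M-ReLU), so the denominator of beta is positive."""
    def beta(block):
        n_max = block.max()
        rest = np.delete(block.ravel(), block.argmax())
        m_bar = rest.mean()                        # mean of all but the max
        return (n_max - m_bar) / (n_max + m_bar)   # Eq. (12)
    b_re, b_im = beta(block_re), beta(block_im)    # assumption: per part
    out_re = (1 - b_re) * block_re.mean() + b_re * block_re.max()
    out_im = (1 - b_im) * block_im.mean() + b_im * block_im.max()
    return out_re, out_im
```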


Figure 5(b-d) shows three feature maps obtained from the different pooling models in the same multi-modal CNN. It can be seen that the building features obtained from the self-adaptive pooling model are pronounced, while the max-pooling model and the average-pooling model weaken the building features.

Figure 5. Feature maps from the different pooling models. (a) Original image. (b) The feature map obtained from the SAP model. (c) The feature map obtained from the average-pooling model. (d) The feature map obtained from the max-pooling model.

Over-fitting is a common problem when training very deep CNN structures (Kim et al., 2017). In traditional CNNs, regularization strategies are often used to address the over-fitting problem (Günnemann & Pfeffer, 2017). Fully connected layers can lead to over-fitting because they produce a large number of parameters (Wu et al., 2018), and in very deep CNN structures, especially as the number of fully connected layers increases, the capability of these layers to address over-fitting is weakened (Sainath et al., 2015). Global average pooling (GAP), in contrast, can greatly reduce the number of parameters (Lin et al., 2014); consequently, GAP is used to reduce the over-fitting problem in very deep CNN structures. Therefore, a simple GAP layer is used instead of fully connected layers in our CNN structure.

Generally speaking, high-resolution remote sensing platforms (e.g., SPOT-7, Ziyuan-3, and IKONOS) provide both multispectral (MS) and panchromatic (PAN) images (Huang et al., 2019). A PAN image has a high spatial resolution but only one spectral band, while the resolution of an MS image is often lower than that of the panchromatic image. A single CNN structure cannot extract multiscale spatial-spectral information from such multisource remote sensing data (Xu et al., 2017). To make full use of the spatial-spectral information in the input data (the PAN images and the MS images), the multi-modal CNN structure is created by fully connecting the two branches, integrating the advantages of the multispectral CNN and the panchromatic CNN structures.
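For reference, standard global average pooling collapses each feature map to a single value, which is why it needs no trainable parameters; a minimal NumPy sketch is below. Note that the network in Fig. 6 applies its GAP over small windows rather than whole maps, as described in the next paragraphs.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Standard global average pooling: collapse each feature map to one
    value, replacing parameter-heavy fully connected layers.
    feature_maps: array of shape [H, W, C] -> vector of length C."""
    return feature_maps.mean(axis=(0, 1))
```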


Fig. 6. The proposed multi-modal CNN structure for building extraction. The multispectral branch (CV-conv1-1 to CV-conv1-6, with SAP and GAP layers) processes the multispectral data, and the panchromatic branch (CV-conv2-1 to CV-conv2-6, with SAP and GAP layers) processes the panchromatic data; the two feature vectors are concatenated into a full feature vector and passed to a Softmax classifier.

Figure 6 shows the structure of the proposed multi-modal CNN. For the multispectral CNN (red box in Figure 6), first, to mine building contextual features at different sizes, we use six complex-valued convolutional layers with kernel sizes of 2×2, 4×4, 6×6, 8×8, 10×10, and 12×12, respectively. In the multispectral CNN, the stride of each complex-valued convolutional layer is one, except for the third layer, which has a stride of two. Then, to reduce the dimension of the feature maps, the self-adaptive pooling layer is applied over 2×2 spatial windows with a stride of two. Finally, to reduce the number of parameters in the multispectral CNN, global average pooling (Lin et al., 2014) is applied over 3×3 spatial windows with a stride of one. For the panchromatic CNN (yellow box in Figure 6), to extract spatial contextual information, we use six complex-valued convolutional layers with kernel sizes of 1×1, 3×3, 5×5, 7×7, 9×9, and 11×11, respectively. To reduce the dimension of the final feature map, we apply the self-adaptive pooling layer over 2×2 spatial windows with a stride of two (non-overlapping windows). To reduce the over-fitting problem, the final complex-valued convolutional layer is followed by global average pooling, applied over 5×5 spatial windows with a stride of one. The modified ReLU (M-ReLU) function is used at every stage of the network. The concatenated spectral-spatial contextual feature vector is fully connected: the two individual paths (multispectral and panchromatic) are combined into the final classification network using a fully connected layer. After global average pooling and the fully connected layer, the Softmax classifier is applied to the spectral-spatial contextual feature vectors to extract buildings.
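The late-fusion step described above can be sketched in a few lines of TensorFlow; the hidden-layer width (64) and the two-class output are illustrative assumptions, not the authors' sizes.

```python
import tensorflow as tf

def fuse_and_classify(ms_features, pan_features, num_classes=2):
    """Late fusion of the two branches (a sketch): the multispectral and
    panchromatic feature vectors are concatenated, passed through a
    fully connected layer, and classified with Softmax."""
    fused = tf.concat([ms_features, pan_features], axis=-1)
    hidden = tf.keras.layers.Dense(64, activation='relu')(fused)
    logits = tf.keras.layers.Dense(num_classes)(hidden)
    return tf.nn.softmax(logits)  # building vs. background probabilities
```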

2.3 Accuracy Evaluation

To objectively evaluate building extraction accuracy, we assess algorithm performance at the pixel level and the object level. The ground-truth buildings were manually drawn on the ENVI software platform. The common metrics of completeness (recall) and correctness (precision) are used to evaluate building extraction accuracy at the pixel level (Shufelt, 1999). Completeness is the ratio of correctly classified building pixels to all true building pixels; correctness is the ratio of correctly classified building pixels to all predicted building pixels.

$completeness = \dfrac{TP}{TP + FN}$    (13)

$correctness = \dfrac{TP}{TP + FP}$    (14)

where TP is the number of building pixels that are correctly classified, FN is the number of building pixels classified as non-building pixels, and FP is the number of non-building pixels classified as building pixels.

The pixel-level evaluation may be distorted because it ignores building shape. In the object-level evaluation, the critical problem is how to define the differences between the extracted buildings and the ground truth. The traditional object-based evaluation calculates the overlap areas between the extracted buildings and the ground truth (Song & Haithcoat, 2005); however, the shape differences between the detected building and the true building are not considered. Therefore, in this paper, a shape station difference index (SSDI) is applied to assess building extraction accuracy at the object level. For a single building, the shape station difference index can be expressed by Equation (15).

$SSDI = \dfrac{|SI_1 - SI_2|}{SI_1 + SI_2} + \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$    (15)

where $SI_1$ is the true building's shape index, $SI_2$ is the extracted building's shape index, $(x_1, y_1)$ is the true building's centroid, and $(x_2, y_2)$ is the extracted building's centroid. In our experiments, if the overlap area between an extracted building and the ground truth is larger than 50%, the total shape station difference index (TSSDI), $TSSDI = \sum_{i=1}^{n} SSDI_i$, where $n$ is the number of all extracted buildings, is calculated to assess the accuracy of all detected buildings at the object level.
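The evaluation metrics reduce to a short NumPy sketch; it assumes binary masks and the reconstruction of Equation (15) above with an absolute shape-index difference, and it omits the >50% overlap matching step.

```python
import numpy as np

def pixel_metrics(pred, truth):
    """Completeness (Eq. 13) and correctness (Eq. 14) from binary masks."""
    tp = np.sum((pred == 1) & (truth == 1))
    fn = np.sum((pred == 0) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    return tp / (tp + fn), tp / (tp + fp)

def ssdi(si_true, si_pred, c_true, c_pred):
    """Shape station difference index of Eq. (15) for one building;
    si_* are shape indices, c_* are (x, y) centroids."""
    shape_term = abs(si_true - si_pred) / (si_true + si_pred)
    centroid_term = np.hypot(c_true[0] - c_pred[0], c_true[1] - c_pred[1])
    return shape_term + centroid_term

# TSSDI: sum of SSDI over all extracted buildings with >50% overlap.
```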

3. Experimental results and analysis

3.1 Datasets and Network Training

The experimental region is located in Wuhan City, China. The experimental images are Gaofen-2 (GF-2) images with a multispectral image (spatial resolution of 3.4 m) and a panchromatic image (spatial resolution of 0.8 m). The FLAASH (Fast Line-of-Sight Atmospheric Analysis of Spectral Hypercubes) method is used for the atmospheric correction of the GF-2 images (Nazeer, Nichol, & Yung, 2014). The radiometric calibration coefficients for the GF-2 FLAASH atmospheric correction can be downloaded from http://www.cresda.com/CN/Downloads/dbcs/index.shtml. We collected training multispectral satellite images from the WHU building dataset (http://study.rsgis.whu.edu.cn/pages/download/), covering over 550 km2 with 2.7 m ground resolution (Ji, Wei, & Lu, 2018). We collected training panchromatic satellite images from five GF-2 panchromatic images; some of the panchromatic data are shown in Figure 7. The five GF-2 panchromatic images were taken on 9 December 2016, 21 August 2014, 21 July 2016, 25 May 2017, and 19 September 2015, respectively. We compiled a training panchromatic satellite dataset containing 2000 image tiles of size 512×512 pixels, containing more than 29085 buildings. Our test dataset consists of 1000 image tiles of size 512×512 pixels. Training images are processed into superpixels, and for each superpixel, an image patch of size 128×128 centered at its geometric center pixel is extracted. The ground truth for these images was manually drawn on the ENVI software platform.
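A minimal sketch of this per-superpixel patch extraction is shown below; border handling by clipping the window inside the image is our assumption, since the authors do not specify it.

```python
import numpy as np

def superpixel_patch(image, labels, sp_id, size=128):
    """Extract the training patch for one superpixel: a size x size
    window centered at the superpixel's geometric center pixel."""
    ys, xs = np.nonzero(labels == sp_id)
    cy, cx = int(ys.mean()), int(xs.mean())  # geometric center pixel
    half = size // 2
    # Clip so the window stays inside the image (an assumption).
    y0 = int(np.clip(cy - half, 0, image.shape[0] - size))
    x0 = int(np.clip(cx - half, 0, image.shape[1] - size))
    return image[y0:y0 + size, x0:x0 + size]
```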

Fig. 7. Some of the training panchromatic images

First, the multispectral CNN and the panchromatic CNN are pretrained separately in the network training stage. Then, the weights of the multi-modal deep CNN are initialized from the pretrained multispectral CNN and the pretrained panchromatic CNN. We trained the multi-modal deep CNN for 100 epochs using stochastic gradient descent (SGD) with a minibatch size of 128 patches. Following the weight update method of Krizhevsky, Sutskever, and Hinton (2012), the weight decay was set to 0.000001, the momentum to 0.9, and the learning rate to 0.01. We initialized all neuron biases to zero. The experiments were implemented using the TensorFlow software library. The designed multi-modal deep CNN was implemented in Python 3.5 on a personal computer with an E3-1505M v6 @ 3 GHz CPU, 32 GB of DDR4 memory, and an Nvidia Quadro M2200.
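The stated hyperparameters correspond to a per-step update in the style of Krizhevsky et al. (2012); the NumPy sketch below shows the update rule only, with gradient computation (backpropagation) omitted.

```python
import numpy as np

def sgd_update(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-6):
    """One SGD step with momentum and weight decay, using the
    hyperparameters reported above (a sketch of the update rule only)."""
    velocity = momentum * velocity - weight_decay * lr * w - lr * grad
    return w + velocity, velocity
```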

3.2 Performance Comparison of Different CNN

To verify the performance of the designed multi-modal deep CNN, we compare it with three single-branch CNNs: the single-branch CNN of Yuan (2018), the panchromatic CNN (yellow box in Figure 6), and the multispectral CNN (red box in Figure 6). The CNN architecture proposed by Yuan (2018) consists of seven regular ConvNet stages and a final stage for pixel-level prediction. In this study, the multispectral and panchromatic images are segmented into superpixels using the modified SLIC method, and the classes of these superpixels are then predicted using the multi-modal deep CNN to obtain the building extraction results.

To visually compare the building extraction results, Fig. 8 shows some building extraction results for example images generated by the different CNN structures. As shown in Fig. 8, our method can extract both small buildings and irregular buildings well, whereas small buildings and irregular buildings are not well detected by the single-branch CNN structures. In the third row of Fig. 8, some regular buildings can be extracted using the three single-branch networks; however, some adjacent buildings are also detected as one building, particularly smaller buildings. Overall, the proposed method is capable of successfully extracting different types of buildings based on the experimental comparisons.

Fig. 8. Building extraction results using different CNN structures. (a) Original multispectral image, (b) Original panchromatic image, (c) Ground-truth building map, (d) The output of the multispectral CNN (red box in Figure 6), (e) The output of the panchromatic CNN (yellow box in Figure 6), (f) The output of the CNN structure of Yuan (2018), and (g) The output of our CNN.

To assess the building extraction performance of the different CNNs, we calculated the average completeness, average correctness, and average total shape station difference index (TSSDI) on thirty test images. Table 1 summarizes the average scores of building mapping with the multispectral CNN, the panchromatic CNN, the CNN of Yuan (2018), and the proposed CNN, respectively. A good CNN structure has high values of average completeness and average correctness and a low value of average TSSDI; a high average correctness indicates the feasibility of deep CNNs in the building extraction field. From Table 1, it can be seen that the proposed CNN structure achieves good values of average completeness (0.937) and average correctness (0.943) in the pixel-level evaluation, because the proposed multi-modal CNN can deeply mine building spectral-spatial contextual information from panchromatic and multispectral imagery. Compared with the three single-branch CNNs (the CNN of Yuan (2018), the panchromatic CNN, and the multispectral CNN), the proposed CNN also has a low average TSSDI in the object-level evaluation. The proposed CNN thus outperforms the single-branch CNNs in both the object-level and pixel-level evaluations.

Table 1. Comparison of different CNN methods on thirty test images

Method                 Average-completeness    Average-correctness    Average-TSSDI
Multispectral CNN      0.791                   0.813                  10.3
Panchromatic CNN       0.785                   0.807                  13.7
(Yuan, 2018) method    0.839                   0.824                  7.3
The proposed CNN       0.937                   0.943                  2.3

3.3 Comparison of Other Building Extraction Methods

To verify its performance, we compared the proposed method to the SVM method (Sun et al., 2015), the OBIA method (Chima, Trodd, & Blackett, 2018), the HT method (Cui et al., 2012), the FCN method (Wu et al., 2018), and the CNN method (Alshehhi et al., 2017). The SVM training data contain 100 label images and 200 training images. The CNN method (Alshehhi et al., 2017) was originally used for three-class (building, road, and background) prediction; therefore, in this experiment, it is modified so that the output of each layer is changed to a two-class (building and background) prediction.


Fig. 9. Building extraction results using different methods. (a) Original multispectral image. (b) Original panchromatic image. (c) Ground-truth building map. (d) The output of SVM (Sun et al., 2015). (e) The output of the OBIA method (Chima, Trodd, & Blackett, 2018). (f) The output of the HT method (Cui et al., 2012). (g) The output of the FCN method (Wu et al., 2018). (h) The output of the CNN method (Alshehhi et al., 2017). (i) The output of the proposed method.

Fig. 9 shows some building extraction results using the different building extraction methods. As shown in Fig. 9, we can clearly see that the SVM method and the OBIA method (Fig. 9(d-e)) misclassify roads as buildings and tend to produce inconsistent extraction results with "salt-and-pepper" noise at the pixel level. The HT method cannot extract irregular buildings. Generally, all of the CNN-based methods can extract regular buildings well; however, the FCN method and the CNN method cannot extract small buildings and fail to separate adjacent buildings (see Fig. 9(g-h)). In comparison with the FCN method and the CNN method, the proposed method can extract different types of buildings well and solves the problem of separating adjacent buildings at the object level.

In this paper, we calculated the average completeness, average correctness, average total shape station difference index (TSSDI), and total extraction time of the building extraction results on twenty test images to evaluate the performance of the corresponding building extraction methods (see Table 2). Table 2 summarizes the average scores of building mapping with the proposed method, the SVM method, the FCN method, the OBIA method, the HT method, and the CNN method at the pixel and object levels. Comparing the CNN-based methods, the proposed method has high values of average completeness and average correctness and a low average TSSDI. The proposed CNN outperforms the existing building extraction methods in both the object-level and pixel-level evaluations. It can be observed that our method delivers better accuracy than the others with acceptable efficiency; in addition, through superpixel preprocessing of the CNN inputs, the computational complexity is greatly reduced.

Table 2. Comparison of different methods on twenty test images

Method                         Average-completeness    Average-correctness    Average-TSSDI    Time(s)
SVM (Sun et al., 2015)         0.601                   0.629                  50.1             220
OBIA (Chima, et al., 2018)     0.684                   0.703                  32.9             180
HT (Cui et al., 2012)          0.752                   0.769                  30.4             195
FCN (Wu et al., 2018)          0.773                   0.798                  20.6             310
CNN (Alshehhi et al., 2017)    0.829                   0.814                  10.3             305
Our method                     0.928                   0.939                  4.1              200

4. Conclusion

In this paper, an object-based multi-modal CNN is proposed for building extraction using panchromatic and multispectral imagery. Specifically, a multi-modal deep CNN architecture (a multispectral CNN and a panchromatic CNN) is designed that can mine multiscale spectral-spatial contextual information. To fully explore the rich information embedded in panchromatic and multispectral images, a complex-valued convolution layer is proposed in the complex-valued space, and to reduce the loss of building features during pooling, a new self-adaptive pooling layer is proposed based on max pooling and average pooling in the complex-valued space. Compared with recent building extraction methods, the proposed method has high values of average completeness and average correctness and a low average TSSDI; thus, the proposed CNN outperforms recent building extraction methods in the evaluations. In addition, the proposed method delivers better accuracy than the others with acceptable efficiency. In future work, we will combine high-resolution remote sensing imagery and SAR data (such as Seasat SAR, Almaz SAR, and JERS-1 SAR) for building extraction using CNNs.

Acknowledgment

This work was supported in part by the National Key Research and Development Plan of China (grant numbers 2017YFB0503604, 2016YFE0200400) and the National Natural Science Foundation of China (grant numbers 41971405, 41901394, 41671442, 41571430, 41271442).

Author Contributions
Yang Chen was responsible for the research design, experiments and analysis, and drafting of the manuscript. Luliang Tang made valuable suggestions to improve the quality of the paper. Xue Yang and Qingquan Li built the CNN model. Muhammad Bilal reviewed the paper. All authors reviewed the manuscript.

Declaration of Interest Statement
The author(s) declared no conflicts of interest with respect to the research, authorship, and publication of this paper.


References

Alshehhi, R., Marpu, P. R., Woon, W. L., & Dalla Mura, M. (2017). Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing, 130, 139-149.
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274-2282.
Al-Nahas, F., & Shafri, H. Z. M. (2018, June). Pixel-based and object-oriented classifications of airborne LiDAR and high resolution satellite data for building extraction. In IOP Conference Series: Earth and Environmental Science (Vol. 169, No. 1, p. 012032). IOP Publishing.
Chima, C. I., Trodd, N., & Blackett, M. (2018). Assessment of Nigeriasat-1 satellite data for urban land use/land cover analysis using object-based image analysis in Abuja, Nigeria. Geocarto International, 33(9), 893-911.
Cote, M., & Saeedi, P. (2013). Automatic rooftop extraction in nadir aerial imagery of suburban regions using corners and variational level set evolution. IEEE Transactions on Geoscience and Remote Sensing, 51(1), 313-32.
Cui, S., Yan, Q., & Reinartz, P. (2012). Complex building description and extraction based on Hough transformation and cycle detection. Remote Sensing Letters, 3(2), 151-159.
Chen, Y., Fan, R., Yang, X., Wang, J., & Latif, A. (2018). Extraction of Urban Water Bodies from High-Resolution Remote-Sensing Imagery Using Deep Learning. Water, 10(5), 585.
Chen, Y., Jiang, H., Li, C., Jia, X., & Ghamisi, P. (2016). Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10), 6232-6251.
Chen, Y., Fan, R., Bilal, M., Yang, X., Wang, J., & Li, W. (2018). Multilevel Cloud Detection for High-Resolution Remote Sensing Imagery Using Multiple Convolutional Neural Networks. ISPRS International Journal of Geo-Information, 7(5), 181.
Eskandarpour, R., & Khodaei, A. (2018). Leveraging Accuracy-Uncertainty Tradeoff in SVM to Achieve Highly Accurate Outage Predictions. IEEE Transactions on Power Systems, 33(1), 1139-1141.
Fulkerson, B., Vedaldi, A., & Soatto, S. (2009, September). Class segmentation and object localization with superpixel neighborhoods. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 670-677). IEEE.
Fang, L., Li, S., Kang, X., & Benediktsson, J. A. (2015). Spectral-spatial classification of hyperspectral images with a superpixel-based discriminative sparse model. IEEE Transactions on Geoscience and Remote Sensing, 53(8), 4186-4201.
Günnemann, N., & Pfeffer, J. (2017, October). Predicting Defective Engines using Convolutional Neural Networks on Temporal Vibration Signals. In First International Workshop on Learning with Imbalanced Domains: Theory and Applications (pp. 92-102).
Gao, L., Song, W., Dai, J., & Chen, Y. (2019). Road Extraction from High-Resolution Remote Sensing Imagery Using Refined Deep Residual Convolutional Neural Network. Remote Sensing, 11(5), 552.
Huang, X., & Zhang, L. (2011). A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery. Photogrammetric Engineering & Remote Sensing, 77(7), 721-732.
Hoffman, J., Gupta, S., Leong, J., Guadarrama, S., & Darrell, T. (2016, May). Cross-modal adaptation for RGB-D detection. In 2016 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5032-5039). IEEE.
Huang, X., et al. (2019). Monitoring ecosystem service change in the City of Shenzhen by the use of high-resolution remotely sensed imagery and deep learning. Land Degradation & Development, 30(12), 1490-1501.
Jin, X., Cheng, P., Chen, W. L., & Li, H. (2018). Prediction model of velocity field around circular cylinder over various Reynolds numbers by fusion convolutional neural networks based on pressure on the cylinder. Physics of Fluids, 30(4), 047105.
Ji, S., Wei, S., & Lu, M. (2018). Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data Set. IEEE Transactions on Geoscience and Remote Sensing, (99), 1-13.
Kim, J., Kim, J., Jang, G. J., & Lee, M. (2017). Fast learning method for convolutional neural networks using extreme learning machine and its application to lane detection. Neural Networks, 87, 109-121.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Li, W., He, C., Fang, J., & Fu, H. (2018, June). Semantic Segmentation based Building Extraction Method using Multi-source GIS Map Datasets and Satellite Imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA (pp. 18-22).
Li, J., Huang, X., & Gong, J. (2019). Deep neural network for remote sensing image interpretation: status and perspectives. National Science Review, online: https://doi.org/10.1093/nsr/nwz058.
Lunga, D., Yang, H. L., Reith, A., Weaver, J., Yuan, J., & Bhaduri, B. (2018). Domain-Adapted Convolutional Networks for Satellite Image Classification: A Large-Scale Interactive Learning Workflow. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(3), 962-977.
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
Lau, M. M., Lim, K. H., & Gopalai, A. A. (2015, July). Malaysia traffic sign recognition with convolutional neural network. In 2015 IEEE International Conference on Digital Signal Processing (DSP) (pp. 1006-1010). IEEE.
Nogueira, K., Penatti, O. A. B., & dos Santos, J. A. (2016). Towards Better Exploiting Convolutional Neural Networks for Remote Sensing Scene Classification. arXiv preprint arXiv:1602.01517.
Nazeer, M., Nichol, J. E., & Yung, Y. K. (2014). Evaluation of atmospheric correction models and Landsat surface reflectance product in an urban coastal environment. International Journal of Remote Sensing, 35(16), 6271-6291.
Qin, F., Guo, J., & Lang, F. (2015). Superpixel segmentation for polarimetric SAR imagery using local iterative clustering. IEEE Geoscience and Remote Sensing Letters, 12(1), 13-17.
Ramachandram, D., Lisicki, M., Shields, T. J., Amer, M. R., & Taylor, G. W. (2018). Bayesian optimization on graph-structured search spaces: Optimizing deep multimodal fusion architectures. Neurocomputing, 298, 80-89.
Sharma, A., Liu, X., Yang, X., & Shi, D. (2017). A patch-based convolutional neural network for remote sensing image classification. Neural Networks, 95, 19-28.
Sainath, T. N., Kingsbury, B., Saon, G., Soltau, H., Mohamed, A. R., Dahl, G., & Ramabhadran, B. (2015). Deep convolutional neural networks for large-scale speech tasks. Neural Networks, 64, 39-48.
Shufelt, J. A. (1999). Performance evaluation and analysis of monocular building extraction from aerial imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4), 311-326.
Song, W., & Haithcoat, T. L. (2005). Development of comprehensive accuracy assessment indexes for building footprint extraction. IEEE Transactions on Geoscience and Remote Sensing, 43(2), 402-404.
Sun, X., Li, L., Zhang, B., Chen, D., & Gao, L. (2015). Soft urban water cover extraction using mixed training samples and support vector machines. International Journal of Remote Sensing, 36(13), 3331-3344.
Sun, G., Huang, H., Zhang, A., Li, F., Zhao, H., & Fu, H. (2019). Fusion of multiscale convolutional neural networks for building extraction in very high-resolution images. Remote Sensing, 11(3), 227.
Vakalopoulou, M., Karantzalos, K., Komodakis, N., & Paragios, N. (2015, July). Building detection in very high-resolution multispectral data with deep learning features. In Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International (pp. 1873-1876). IEEE.
Weidner, U., & Förstner, W. (1995). Towards automatic building extraction from high-resolution digital elevation models. ISPRS Journal of Photogrammetry and Remote Sensing, 50(4), 38-49.
Wu, H., & Gu, X. (2015). Towards dropout training for convolutional neural networks. Neural Networks, 71, 1-10.
Wang, L., Chen, Y., Tang, L., Fan, R., & Yao, Y. (2018). Object-Based Convolutional Neural Networks for Cloud and Snow Detection in High-Resolution Multispectral Imagers. Water, 10(11), 1666.
Wu, Y., Hassner, T., Kim, K., Medioni, G., & Natarajan, P. (2018). Facial landmark detection with tweaked convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 3067-3074.
Wu, G., Shao, X., Guo, Z., Chen, Q., Yuan, W., Shi, X., & Shibasaki, R. (2018). Automatic Building Segmentation of Aerial Imagery Using Multi-Constraint Fully Convolutional Networks. Remote Sensing, 10(3), 407.
Xiang, D., Tang, T., Zhao, L., & Su, Y. (2013). Superpixel generating algorithm based on pixel intensity and location similarity for SAR image classification. IEEE Geoscience and Remote Sensing Letters, 10(6), 1414-1418.
Xu, Y., Bao, Y., Chen, J., Zuo, W., & Li, H. (2019). Surface fatigue crack identification in steel box girder of bridges by a deep fusion convolutional neural network based on consumer-grade camera images. Structural Health Monitoring, 18(3), 653-674.
Xu, X., Li, W., Ran, Q., Du, Q., Gao, L., & Zhang, B. (2017). Multisource remote sensing data classification based on convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing, 56(2), 937-949.
Yu, W., Wang, Y., Liu, H., & He, J. (2016). Superpixel-based CFAR target detection for high-resolution SAR images. IEEE Geoscience and Remote Sensing Letters, 13(5), 730-734.
Yuan, J. (2018). Learning building extraction in aerial scenes with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11), 2793-2798.
Zhang, J., & Wu, Y. (2018). Complex-valued unsupervised convolutional neural networks for sleep stage classification. Computer Methods and Programs in Biomedicine, 164, 181-191.


Yang Chen received the M.Eng. degree from Liaoning Technical University, Fuxin, China, in 2019. He was a jointly educated student with the China Academy of Surveying and Mapping, Beijing, China, from 2017 to 2019. He is currently pursuing the Ph.D. degree with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. His research interests include deep learning and intelligent remote sensing information processing.

Luliang Tang received the Ph.D. degree from Wuhan University, Wuhan, China, in 2007. He is currently a Professor with Wuhan University. His research interests include space–time GIS, deep learning, GIS for transportation, and change detection.

Xue Yang received the Ph.D. degree from Wuhan University, Wuhan, China, in 2018. She is currently an Assistant Professor with China University of Geosciences. Her research interests include intelligent transportation systems, spatiotemporal data analysis, and information mining.


Muhammad Bilal received the Ph.D. degree in Photogrammetry and Remote Sensing in 2014 from the Hong Kong Polytechnic University (PolyU), Hung Hom, Kowloon, Hong Kong, the M.S. degree in Meteorology (Specialization in Remote Sensing and GIS) in 2010 from COMSATS University Islamabad, Pakistan, and the B.Sc. (Hons.) degree in space science (Remote Sensing/GIS, Atmospheric Science) in 2008 from the University of the Punjab, Lahore, Pakistan. He worked at the PolyU as a Postdoctoral Fellow from 2014-2017 and joined the Nanjing University of Information Science and Technology (NUIST), Nanjing, China, as a Professor in October 2017. In June 2018, he was awarded the special title “Distinguished Professor” by the Jiangsu Province, China. His research interests include applications of satellite remote sensing for air quality monitoring, aerosol retrieval algorithms, and atmospheric correction. He has authored widely on these topics in top Aerosol Remote Sensing journals.

Qingquan Li received the Ph.D. degree in geographic information system (GIS) and photogrammetry from Wuhan Technical University of Surveying and Mapping, Wuhan, China, in 1998. He is currently a Professor with Shenzhen University, Guangdong, China, and Wuhan University, Wuhan. His research areas include dynamic data modeling in GIS, surveying engineering, and intelligent transportation system.
