ISPRS Journal of Photogrammetry and Remote Sensing 161 (2020) 179–193

Deep SAR-Net: Learning objects from signals

Zhongling Huang a,b,c,d,*, Mihai Datcu d,*, Zongxu Pan a,b,c, Bin Lei a,b,c

a Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
b School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Huairou District, Beijing 101408, China
c Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
d Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), 82234 Wessling, Germany

ARTICLE INFO

Keywords: Deep convolutional neural network; Complex-valued SAR images; Transfer learning; Time-frequency analysis; Physical properties

ABSTRACT

This paper introduces a novel Synthetic Aperture Radar (SAR) specific deep learning framework for complex-valued SAR images. Conventional methods based on deep convolutional neural networks usually take the amplitude information of single-polarization SAR images as the input to learn hierarchical spatial features automatically, and may therefore have difficulty discriminating objects with similar texture but distinctive scattering patterns. Our novel deep learning framework, Deep SAR-Net, takes complex-valued SAR images into consideration to learn both the spatial texture information and the backscattering patterns of objects on the ground. On the one hand, we transfer layers pre-trained on detected SAR images to extract spatial features from the intensity images. On the other hand, we dig into the Fourier domain to learn the physical properties of the objects by joint time-frequency analysis of the complex-valued SAR images. We evaluate the effectiveness of Deep SAR-Net on three complex-valued SAR datasets from the Sentinel-1 and TerraSAR-X satellites and demonstrate how it works better than conventional deep CNNs, especially on man-made object classes. The proposed datasets, the trained Deep SAR-Net model, and all code are provided.

1. Introduction

Synthetic Aperture Radar (SAR) image understanding is an important issue with many difficulties due to the special imaging mechanism of active microwave systems. Because the sensor measures the distance to features in slant range rather than the true horizontal distance along the ground, large geometric distortions such as foreshortening, layover, and shadowing occur in SAR images. Speckle, due to the interference of microwaves reflected from many elementary scatterers, makes SAR image interpretation more complicated. SAR images represent an estimate of the radar backscatter on the ground. In practice, the observed radar backscatter is very complex, combining a series of scattering mechanisms such as surface, volume, and double-bounce scattering. The backscatter is affected by various parameters related to the objects and surfaces (materials, shapes, roughness) and also to the SAR sensor (polarization, incident angle). Two SAR images of the same area, shown in Fig. 1(a), demonstrate that an object can behave very differently under different imaging conditions. Also, two entirely different objects may look quite similar in SAR images, as shown in Fig. 1(b). These examples indicate that understanding a SAR image can be very complicated. Deep learning based algorithms have been widely applied to SAR image understanding in recent years due to their advantage in

automatically learning hierarchical features from large amounts of data. For PolSAR images, the covariance matrix, which provides the complete polarization information, is processed to train deep networks, either unsupervised deep belief networks (DBNs) (Lv et al., 2015) and stacked auto-encoders (SAEs) (Zhang et al., 2016) or supervised convolutional neural networks (CNNs) (Wu et al., 2018; Zhou et al., 2016). For single-polarization SAR images, however, most deep learning applications employ the processed ground-range detected data or the amplitude of single-look complex (SLC) data to learn high-level features from the spatial content of SAR images (Zhao et al., 2017; Geng et al., 2015, 2017; Chen et al., 2016), as depicted in the red rectangle in Fig. 2. For detected data, speckle noise is restrained by multi-looking and slant-range distortion is corrected by mapping to ground range, making SAR images more visually understandable, but the phase information is lost at the same time. Only a few works use complex-valued networks to explore both the amplitude and the phase information of complex SAR imagery (Zhang et al., 2017), and only for polarimetric SAR (PolSAR). Although the phase information in single-polarized SAR images is generally not made available to a human observer, it allows certain additional features of targets to be recognized. As early as 2003, complex-valued SAR image spectral analysis with a series of sub-looks was used for target detection (Souyris et al., 2003). Spigai et al. (2011)

Corresponding author. E-mail addresses: [email protected] (Z. Huang), [email protected] (M. Datcu).

https://doi.org/10.1016/j.isprsjprs.2020.01.016 Received 14 June 2019; Received in revised form 14 November 2019; Accepted 10 January 2020 0924-2716/ © 2020 Published by Elsevier B.V. on behalf of International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).


Fig. 1. (a) The city center, Houston, USA. Left: Sentinel-1 Stripmap mode, acquisition time 20190310, BeamID S1, descending, incident angle 23.34°, polarization VV; Middle: Sentinel-1 Stripmap mode, acquisition time 20190305, BeamID S3, ascending, incident angle 32.17°, polarization HH. Right: Google Earth. The baseball field and high building marked with red boxes behave much differently in two SAR images with different observing conditions. (b) Some examples of different objects which look similar in SAR images. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Problem description.

proposed a continuous sub-band analysis for SLC SAR images to reveal the physical properties of targets on the ground, and Singh and Datcu (2012) then used multiple-sublook decomposition based on joint time–frequency analysis (JTFA) of complex-valued SAR images to emphasize the target characterization visually. As depicted in Fig. 2, we can infer from SLC SAR images the backscattering patterns, which are determined by a physical model with parameters of both the SAR sensor and the objects on the ground. Moreover, the

objects on the ground determine the specific land cover and land use class or target label. However, theoretical physical modeling based on radiative transfer theory is complex and difficult to specify, so in practice the targets must be simplified and assumptions must be made about the basic scattering processes. Generally, theory-driven physical modeling is regarded as a field separate from deep learning, which is data-driven. Reichstein et al. (2019) indicated that it is necessary to integrate physical models and data-driven learning approaches in


Earth sciences, in multiple ways: to provide theoretical constraints when learning models from data, or to replace a physical sub-model with a machine learning model when the physical formulation has little theoretical basis and is hard to specify. In practice, there is related research in chemistry (Willis and von Stosch, 2017) and atmospheric science (Gentine et al., 2018), but this remains a blank field in SAR image understanding. We therefore design a deep network to replace the inverse physical model, mapping SLC SAR images to backscattering patterns, object properties, and finally the specific land cover and land use classes. In this paper, we propose a novel deep learning framework specific to SLC SAR images, named Deep SAR-Net (DSN). To the best of our knowledge, it is the first deep learning method to make full use of single-polarization complex-valued SAR data. We transfer mid-level layers from a deep residual model pre-trained on TerraSAR-X detected data to extract spatial features. In parallel, we apply joint time-frequency analysis to obtain radar spectrograms and design a deep neural network to learn hierarchical frequency features related to the physical properties of targets. These frequency features are then spatially aligned according to the spatial information of the radar spectrogram, so that the final prediction is made jointly with the spatial texture features. We evaluate the proposed method with different complex-valued SAR data, including self-annotated datasets from the Sentinel-1 and TerraSAR-X satellites and the OpenSARShip dataset (Huang et al., 2018). The contributions of this paper are listed as follows:

• We generate two land cover and land use complex-valued SAR datasets using Sentinel-1 Stripmap data and TerraSAR-X Spotlight data for evaluation, including some natural land cover classes (water, forest, etc.) and man-made object classes (container, storage tank, etc.).
• We propose a novel deep learning framework (Deep SAR-Net) specific to complex-valued SAR images, aiming at learning more valuable information in both the spatial and the frequency domain.
• We demonstrate the superior performance of Deep SAR-Net against several baseline models on the proposed datasets and undertake an in-depth analysis of how Deep SAR-Net performs better than conventional CNNs, especially on man-made object classes.
• We open-source the project with the proposed datasets, the trained Deep SAR-Net model, and code.

In the following, we will introduce the time-frequency analysis for SAR images and the proposed Deep SAR-Net in Sections 2 and 3, respectively. Section 4 presents the proposed datasets for evaluation. The experiments and discussions are given in Section 5. Finally, we conclude this paper in Section 6 with a short summary.

2. Joint time-frequency analysis for SAR images

In previous studies, time-frequency analysis (TFA) in the azimuth direction has been applied for sub-aperture decomposition to obtain a set of sub-band images containing different parts of the SAR Doppler spectrum, although with reduced resolution. For some man-made objects with particular backscattering phenomena, the azimuth sub-band decomposition provides useful information for SAR image understanding due to the differences among different looks of the synthetic antenna. Sub-aperture analysis has been used for moving target detection (Renga et al., 2019; Tupin and Tison, 2004), urban area analysis (Tupin and Tison, 2004; Wu et al., 2013), and characterizing the scattering behavior of targets (Ferro-Famil et al., 2003). For high-resolution SAR, the antenna transmits linearly modulated chirp signals with a large bandwidth in the range direction. The sub-band decomposition in the range frequency domain allows a series of scene reflectivity behaviors to be obtained at different observation frequencies, which has been used to detect and characterize objects with frequency-sensitive responses (Ferro-Famil et al., 2005; Bovenga et al., 2014, 2011). Also, joint time–frequency analysis (JTFA) in two dimensions has been proposed to process the extended 2-D SAR image spectrum, with the chirp bandwidth in range and the Doppler bandwidth in azimuth, aiming at extracting backscattering variations from the 2-D frequency spectrum and characterizing the target properties (Ferro-Famil et al., 2005; Spigai et al., 2011). In Singh and Datcu (2013), the fractional Fourier transform based on rotated JTFA was proposed to generate a feature descriptor which allowed discovering the underlying backscattering phenomenon of the objects on the ground. Consequently, the JTFA method makes it possible to reveal the backscattering diversity versus range and azimuth frequencies that is unseen in SAR images, which inspires us to integrate it with deep learning algorithms for SAR interpretation.

There are three main JTFA methods in the literature: Fourier transforms, wavelet transforms, and the Wigner-Ville decomposition. Considering the computational complexity and the easy link between the frequency domain and sub-antennas, the JTFA method for SAR images applied in this paper is based on the short-time Fourier transform (STFT). Spigai et al. (2011) proposed a continuous sub-band analysis method based on a sliding bandpass filter in the Fourier domain, splitting the full aperture into continuous sub-looks in both the range and azimuth directions to derive a radar spectrogram featuring the range and azimuth scattering variations. The proposed processing chain transforms the 2-D complex-valued SAR image into a "hyperimage" in a 4-D space, called the radar spectrogram S, such that

S(x_0, y_0, f_{r0}, f_{a0}) = FFT^{-1} [ w_B(f_{r0} - f_r, f_{a0} - f_a) · FFT(C(x, y)) ] (x_0, y_0)        (1)

where the bandpass filter w_B is centered on (f_{r0}, f_{a0}) and C is an extracted segment of the SLC image centered on (x_0, y_0). The continuous sub-band analysis provides a 4-D array S with an increased data volume, which inspires us to use a data-driven method to reveal the characteristic backscattering mechanisms from the radar spectrogram. In Spigai et al. (2011), based on a visualized 2-D image of S(x, y, f_r, f_a) corresponding to the central pixel (x, y), four typical behaviors are defined: frequency-invariant, range-variant, azimuth-variant, and 2-D-variant, related to four characteristic backscattering patterns which indicate the physical properties of targets on the ground. In order to further understand the 4-D representation, Singh and Datcu (2012) proposed six projections of S to generate 2-D images for visualization and indicated their hidden physical significance. They also demonstrated that although two entirely different objects may look quite similar in SAR images, they can be distinguished in different projections by visual interpretation. Fig. 3 shows some examples of two such projections, S(x, y, f_{r0}, f_{a0}) and S(x_0, y_0, f_r, f_a), which indicate the low-resolution images seen from different (range and azimuth) sub-looks of the synthetic antenna and how a particular point on the ground is seen from all sub-looks, respectively. As a result, JTFA for SAR images provides another form of the SAR signal with valuable information about the backscattering mechanisms and physical properties of objects on the ground. It is applied in our method to assist the image content in making a better prediction.
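To make the sub-band decomposition of Eq. (1) concrete, the following is a minimal NumPy sketch, written under several assumptions: the SLC patch is square, the band-pass window is a 2-D Hamming window of half the full bandwidth, and the windowed band is cropped before the inverse transform (a common implementation shortcut that directly yields the reduced-resolution sub-looks; this decimation step is not spelled out in Eq. (1)). Function and parameter names are illustrative, not taken from the released code.

```python
import numpy as np

def radar_spectrogram(slc_patch, win=32, step=2):
    """Sketch of the continuous sub-band decomposition of Eq. (1).

    slc_patch : 2-D complex SLC patch (range x azimuth), e.g. 64 x 64.
    win       : size of the 2-D Hamming band-pass window in the spectrum.
    step      : stride between the window centres (f_r0, f_a0).
    Returns a 4-D complex array S with axes (x, y, f_r0, f_a0): one
    win x win low-resolution sub-look image per band-pass position.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(slc_patch))
    w2d = np.outer(np.hamming(win), np.hamming(win))   # band-pass filter w_B
    n_r, n_a = spectrum.shape
    r_starts = range(0, n_r - win + 1, step)
    a_starts = range(0, n_a - win + 1, step)
    S = np.zeros((win, win, len(r_starts), len(a_starts)), dtype=complex)
    for i, r0 in enumerate(r_starts):
        for j, a0 in enumerate(a_starts):
            # Window the spectrum around (f_r0, f_a0) and invert the cropped band.
            sub_band = spectrum[r0:r0 + win, a0:a0 + win] * w2d
            S[:, :, i, j] = np.fft.ifft2(np.fft.ifftshift(sub_band))
    return S
```

For a 64 × 64 patch, the slice S[:, :, i, j] is the sub-look image for one (f_{r0}, f_{a0}) couple, while |S[x, y, :, :]| is the radar spectrogram of the pixel (x, y) seen from all sub-looks, i.e. the two projections illustrated in Fig. 3.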

3. Deep SAR-Net framework

3.1. Problem description and framework overview

We first introduce some notation to define the problem and clarify the solution. The complex-valued SAR image is denoted as C(x, y), where x and y refer to the spatial positions in the range and azimuth directions, respectively. Generally, CNN-based SAR interpretation algorithms use the detected SAR data with only amplitude information, denoted as P(C(x, y)), where P represents the processing applied to complex-valued SAR images, including thermal noise removal, radiometric and


Fig. 3. JTFA provides a 4-D representation of the SAR image, which can be visualized as a 2-D image by projecting onto any two dimensions while keeping the other two fixed. Two projections are shown here, representing the sub-look image seen from a particular sub-look of the synthetic antenna and the radar spectrogram of a particular point seen from all sub-looks, respectively. Our proposed framework is inspired by these two kinds of projections.

geometric correction, multi-looking, and so on. Designing a deep network to learn the hierarchical features φ(P(C(x, y))) and the mapping to the label space L is the conventional way. However, the processing P has abandoned the phase information, which we believe is helpful for interpreting SAR images even in the single-polarization case. In this paper, we take full advantage of complex-valued SAR data to learn features from both the spatial and the frequency domain with deep convolutional neural networks. The overall architecture is shown in Fig. 4, which demonstrates the main idea of our proposed method. We can easily obtain the intensity part of C(x, y) and learn the spatial features φ_1(x, y) using a convolutional neural network G_S in general, that is

φ_1(x, y) = G_S(I(x, y), θ_S)        (2)

where θ_S denotes the parameters in G_S. Note that the size of I(x, y) for each patch is N_c × N_c and φ_1(x, y) has dimension [C_1, N_1, N_1]. The problem is how to make the frequency information spatially aligned so that it can be combined with the spatial features. S(x, y, f_r, f_a), with dimension [N_x, N_y, N_fr, N_fa] and referring to the 4-D representation of C(x, y), can be obtained by JTFA, where f_r and f_a refer to the range and Doppler frequencies. By abandoning the spatial information, we obtain a series of s(f_r, f_a)^{(x, y)}, each referring to the radar spectrogram of a particular point seen from all sub-looks. The embedding feature vector in the frequency domain φ_2(f_r, f_a), with size C_2, can be extracted by a frequency feature generator G_F, that is

φ_2(f_r, f_a) = G_F(s(f_r, f_a), θ_F)        (3)

where θ_F denotes the parameters in G_F. After that, we restore the spatial correspondence between φ_1 and φ_2 by stacking the φ_2(f_r, f_a) in the same way as s(f_r, f_a)^{(x, y)} aligns spatially in S(x, y, f_r, f_a), denoted as Φ(x, y, φ_2), that is

Φ(x, y, φ_2) = { φ_2(f_r, f_a)^{(x, y)} }_{x, y = 1}^{N_x, N_y}        (4)

where Φ(x, y, φ_2) has dimension [N_x, N_y, C_2]. Given the spatial feature φ_1(x, y) and the spatially aligned frequency feature Φ(x, y, φ_2), the combined information F(φ_1, Φ) is mapped to the label space L by a generator G, that is

L = G(F(φ_1, Φ), θ)        (5)

The above-mentioned G_S, G_F, F, and G build up the proposed novel deep learning framework, Deep SAR-Net, as presented in Fig. 4. In the following section, we will describe how to implement G_S, G_F, F, and G with deep convolutional neural networks, respectively.

Fig. 4. The overall architecture of Deep SAR-Net.
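The composition of Eqs. (2)–(5) can be sketched as a single module. The following PyTorch-style sketch shows only the data flow; the three sub-nets passed to the constructor are placeholders for G_S, G_F, and G, and the spatial sizes in the comments assume the N_c = 64 configuration described later. This is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DeepSARNet(nn.Module):
    """Data flow of Eqs. (2)-(5): spatial branch G_S, frequency branch G_F,
    fusion F, and post-learning classifier G (all supplied by the caller)."""

    def __init__(self, g_s, g_f, g_post):
        super().__init__()
        self.g_s, self.g_f, self.g_post = g_s, g_f, g_post
        self.pool = nn.MaxPool2d(2)

    def forward(self, intensity, spectrograms):
        # intensity:    [B, 1, Nc, Nc]          -> phi_1, Eq. (2)
        # spectrograms: [B, P, 1, Nfr, Nfa], P = Nx*Ny per-pixel spectrograms
        phi1 = self.g_s(intensity)                       # e.g. [B, 128, 16, 16]
        b, p, _, nfr, nfa = spectrograms.shape
        phi2 = self.g_f(spectrograms.reshape(b * p, 1, nfr, nfa))   # Eq. (3)
        nx = int(p ** 0.5)
        # Eq. (4): restore the spatial arrangement of the per-pixel features.
        Phi = phi2.reshape(b, nx, nx, -1).permute(0, 3, 1, 2)       # [B, C2, Nx, Nx]
        # Fusion F: down-sample, normalise, and concatenate with phi_1.
        Phi = nn.functional.normalize(self.pool(Phi), dim=1)
        fused = torch.cat([phi1, Phi], dim=1)            # e.g. [B, 256, 16, 16]
        return self.g_post(fused)                        # Eq. (5): scores over L
```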

3.2. Implementation

As shown in Fig. 5, Deep SAR-Net has four components, representing G_S, G_F, F, and G, respectively.

3.2.1. Transfer learning from TerraSAR-X MGD data
Given a complex-valued SAR image C(x, y) = A(x, y) + jB(x, y), the intensity is given by I(x, y)^2 = A(x, y)^2 + B(x, y)^2. I(x, y) is visualized similarly to the ground-range detected data P(C(x, y)), which has been calibrated and geometrically and radiometrically corrected, provided that the resolution and the pixel spacing along range and azimuth are approximately equal in the complex data, so that the distortion resulting from multi-looking and slant-to-ground-range re-projection can be ignored for visualization purposes.


Fig. 5. The detailed implementations of Deep SAR-Net framework.


Although in detected data the multi-looking procedure reduces speckle noise and degrades spatial resolution, the digital number of each pixel still reflects the backscattering intensity of the objects on the ground. Fig. 6 shows an example of an intensity image and a multi-looked ground-range detected image for different imaging modes of Sentinel-1. As a result, we propose to transfer the mid-level features of the deep residual convolutional neural network pre-trained on TerraSAR-X (TSX) Multi-looked Ground Detected (MGD) data (Huang et al., 2020; https://github.com/Alien9427/SAR_specific_models) to I(x, y). The pre-trained deep model is based on the ResNet-18 architecture (He et al., 2016), containing four types of residual blocks, each repeated twice, with 64, 128, 256, and 512 output feature maps, respectively. Each residual block (ResBlock) is made up of two convolutional layers with 3 × 3 convolution kernels, each followed by a batch normalization layer and a non-linear activation layer, as shown in Fig. 7. The model has been proved to transfer well to other SAR image interpretation tasks, and we believe that the MGD data and I(x, y) share some low- and mid-level features according to our previous study (Huang et al., 2019). As module 1 in Fig. 5 shows, the first four ResBlocks, G_S, are transferred to the intensity images, extracting mid-level features φ_1(x, y) with a feature map of 128 channels.

Fig. 6. Comparison between the intensity image of an SLC product and the multi-looked, speckle-filtered, ground-range image. Left: Sentinel-1 Stripmap mode; Beam ID: S3; slant range/azimuth resolution: 2.5 m/3.6 m; slant range/azimuth pixel spacing: 2.2 m/3.5 m. Right: Sentinel-1 Interferometric Wide Swath mode; Beam ID: IW1; slant range/azimuth resolution: 2.7 m/22.5 m; slant range/azimuth pixel spacing: 2.3 m/14.1 m.

Fig. 7. The architecture of the residual block (ResBlock) and the residual bottleneck block (ResBottleneckBlock).
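As a reference for the two block types in Fig. 7, the following is a minimal sketch in modern PyTorch. The basic block follows the description above (two 3 × 3 conv + BN layers and a skip connection); for the bottleneck block the internal width reduction by a factor of 4 follows Section 3.2.3, while strides and the shortcut projection are standard ResNet assumptions rather than details taken from the released model.

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch))

class ResBlock(nn.Module):
    """Basic block of Fig. 7: two 3x3 conv+BN layers with a skip connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.branch = nn.Sequential(
            conv_bn(in_ch, out_ch, 3, stride), nn.ReLU(inplace=True),
            conv_bn(out_ch, out_ch, 3))
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch
                     else conv_bn(in_ch, out_ch, 1, stride))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + self.skip(x))

class ResBottleneckBlock(nn.Module):
    """Bottleneck block of Fig. 7: 1x1 reduce, 3x3, 1x1 expand; the internal
    width is out_ch // 4, i.e. the features are reduced by 4 inside the block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 4
        self.branch = nn.Sequential(
            conv_bn(in_ch, mid, 1), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3), nn.ReLU(inplace=True),
            conv_bn(mid, out_ch, 1))
        self.skip = nn.Identity() if in_ch == out_ch else conv_bn(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + self.skip(x))
```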



3.2.2. Spectrogram auto-encoder
Given a complex image C(x, y) of size N_c × N_c, the 4-D radar spectrogram can be generated via JTFA by

S(x, y, f_r, f_a) = FFT^{-1} [ w(f_r, f_a) × FFT(C) ] (x, y)        (6)

with the Fourier transform FFT and the inverse Fourier transform FFT^{-1}. w(f_r, f_a) is a bandpass filter (we apply a Hamming window here) centered at (f_r, f_a) with bandwidths bw_r and bw_a in range and azimuth, respectively. We set bw_r = BW_r / 2 and bw_a = BW_a / 2, where BW_r and BW_a are the chirp bandwidth in range and the Doppler bandwidth in azimuth, respectively. For simplicity, we set up the parameters bw_r and bw_a so that N_x = N_y = N_fr = N_fa = N_c / 2. For a particular f_r = f_{r0} and f_a = f_{a0}, S(x, y, f_{r0}, f_{a0}) represents a sub-look image which relates to the backscattering behavior for the sub-look (f_{r0}, f_{a0}). Similarly, for a particular position (x, y), S(x, y, f_r, f_a) represents the spectrogram seen from all sub-looks, with information in the full frequency domain, which helps us to understand the physical properties of the objects on the ground (Spigai et al., 2011). Consequently, it is necessary to learn valuable features related to the frequency domain from the spectrogram so as to reveal the backscattering mechanisms. S(x, y, f_r, f_a) is complex-valued and we only use the amplitude here, because the goal is to find frequency characteristics of the "energy". For simplicity, all further notations of S(x, y, f_r, f_a) and s(f_r, f_a) denote the amplitude. Essentially, we take each s(f_r, f_a) as an image, providing visual information for people to understand the scattering mechanisms. As a result, abandoning the spatial information of S(x, y, f_r, f_a), a series of s(f_r, f_a) are fed into a stacked convolutional auto-encoder (SCAE) to learn the latent hierarchical features φ_2(f_r, f_a), depicted as module 2 in Fig. 5. G_F is the encoder part of the SCAE and G'_F denotes the decoder part. The optimization is conducted by minimizing the mean square error loss between the output of the decoder ŝ(f_r, f_a) and the input of the encoder s(f_r, f_a):

Loss_mse = || G'_F(G_F(s(f_r, f_a))) − s(f_r, f_a) ||^2        (7)

The encoder part G_F is made up of four convolutional layers, each followed by a batch normalization layer and a ReLU activation layer. The convolution kernel design of G_F is presented in Table 1 and the detailed architecture is shown in Fig. 5.
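A hedged sketch of such a spectrogram auto-encoder is given below. The encoder kernel sizes follow the G_F column of Table 1; the strides, paddings, and the mirrored decoder are assumptions needed to make the example self-contained, and the class name is illustrative.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, stride=2, pad=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, stride, pad),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class SpectrogramSCAE(nn.Module):
    """Encoder/decoder for 32x32 amplitude spectrograms s(f_r, f_a)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                    # G_F
            conv_bn_relu(1, 32, 5, stride=2, pad=2),     # 32 -> 16
            conv_bn_relu(32, 64, 5, stride=2, pad=2),    # 16 -> 8
            conv_bn_relu(64, 64, 3, stride=2, pad=1),    # 8  -> 4
            nn.Conv2d(64, 128, 4))                       # 4  -> 1, phi_2 in R^128
        self.decoder = nn.Sequential(                    # G'_F
            nn.ConvTranspose2d(128, 64, 4), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, 3, 2, 1, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 5, 2, 2, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 5, 2, 2, output_padding=1))  # back to 32x32

    def forward(self, s):
        z = self.encoder(s)                    # [B, 128, 1, 1]
        return self.decoder(z), z.flatten(1)   # reconstruction and phi_2

# Reconstruction objective of Eq. (7):
#   s_hat, _ = model(s); loss = nn.functional.mse_loss(s_hat, s)
```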

3.2.3. Feature fusion and post-learning
We take N_c = 64 as an example, which provides S(x, y, f_r, f_a) of dimension [32, 32, 32, 32]. The spatial feature map φ_1(x, y) extracted by the TSX pre-trained layers G_S has dimension [128, 16, 16]. For each non-spatial spectrogram s(f_r, f_a), a feature φ_2(f_r, f_a) of size 128 is obtained from the spectrogram auto-encoder G_F. The series of φ_2(f_r, f_a)^{(x, y)} are spatially aligned as Φ(x, y, φ_2) with size [32, 32, 128] and then transposed to dimension [128, 32, 32]. We implement F by down-sampling Φ(x, y, φ_2) with a max-pooling layer and concatenating it with φ_1 after normalization, forming a fused feature F(φ_1, Φ) with dimension [256, 16, 16], depicted as module 3 in Fig. 5.

Finally, the post-learning module, made up of two residual bottleneck blocks, maps F(φ_1, Φ) to the label space; it is denoted as G and shown as module 4 in Fig. 5. The bottleneck architecture (He et al., 2016) is used to reduce the number of features at each layer, leading to a large saving in computational cost. In our case, the number of features is reduced by a factor of 4 in each block, as shown in Fig. 7. Table 1 lists the specific kernel design of each convolutional component and Algorithm 1 gives the whole training procedure.

Table 1
Kernel design for convolution layers.

G_S (θ_S):
  conv 64 × 7 × 7
  ResBlk-64 (× 2): conv 64 × 3 × 3; conv 64 × 3 × 3
  ResBlk-128 (× 2): conv 128 × 3 × 3; conv 128 × 3 × 3

G_F (θ_F):
  conv 32 × 5 × 5; conv 64 × 5 × 5; conv 64 × 3 × 3; conv 128 × 4 × 4

G (θ):
  ResBtnBlk-128 (× 2): conv 128 × 1 × 1; conv 128 × 3 × 3; conv 512 × 1 × 1

The block names (ResBlk, ResBtnBlk) denote the residual and bottleneck blocks, while the convolution layers listed after each name give the composition of that block.

Algorithm 1 (Deep SAR-Net training procedure).
Input: training set D = {C_k(x, y), l_k}_{k=1}^{m}; pre-trained network G_S; Hamming window w(f_r, f_a)
Output: spectrogram stacked CAE G_F (θ_F); post-learning sub-net G (θ)

1:  function TRAIN_G_F(D, w(f_r, f_a))
2:    Training the stacked spectrogram CAE
3:    for all C_k(x, y) ∈ D do
4:      Use JTFA to calculate the 4-D representation of the SLC SAR image: S_k(x, y, f_r, f_a) = FFT^{-1}[w(f_r, f_a) × FFT(C_k)](x, y)
5:      Abandon the spatial information to obtain a set of radar spectrograms: D_k^F = {s_k^{(x, y)}(f_r, f_a)}_{x, y = 1}^{N_c / 2}
6:      for all s_k^i(f_r, f_a) ∈ D_k^F do
7:        Calculate the output of the CAE: ŝ_k^i(f_r, f_a) = G'_F(G_F(s_k^i(f_r, f_a)))
8:        Calculate the loss function: Loss_mse = || ŝ_k^i(f_r, f_a) − s_k^i(f_r, f_a) ||^2
9:        Back-propagate to update the parameters θ_F
10:     end for
11:   end for
12: end function
13: function TRAIN_G(D, D^F)
14:   Training the post-learning sub-net
15:   for all (C_k(x, y), l_k) ∈ D do
16:     Calculate the intensity image: I_k^2 = Real(C_k)^2 + Imag(C_k)^2
17:     Calculate the spatial feature: φ_1(x, y) = G_S(I_k)
18:     for all s_k^i(f_r, f_a) ∈ D_k^F do
19:       Calculate the feature in the Fourier domain: φ_2^i(f_r, f_a) = G_F(s_k^i)
20:     end for
21:     Calculate the spatially aligned feature in the Fourier domain Φ(x, y, φ_2)
22:     Down-sample with max-pooling
23:     Feature normalization
24:     Feature fusion F(φ_1, Φ)
25:     Calculate the output of the post-learning sub-net: l̂_k = G(F(φ_1, Φ))
26:     Calculate the objective function: Loss_cls = CrossEntropyLoss(l_k, l̂_k)
27:     Back-propagate to update the parameters θ
28:   end for
29: end function
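The two stages of Algorithm 1 translate directly into a short training loop. The sketch below assumes PyTorch data loaders that already yield the pre-computed spectrogram amplitudes and patch tensors, and a `dsn` object with `g_s`, `g_f`, and `g_post` attributes (as in the earlier composition sketch); the optimizer settings mirror those reported later in Section 5.1, everything else is illustrative.

```python
import torch
import torch.nn as nn

def train_dsn(scae, dsn, spect_loader, patch_loader, device="cpu"):
    """Two-stage training sketch following Algorithm 1.

    scae : spectrogram auto-encoder (its encoder plays the role of G_F).
    dsn  : module with attributes g_s (transferred, kept fixed), g_f
           (= scae.encoder) and g_post (the post-learning sub-net G).
    """
    scae, dsn = scae.to(device), dsn.to(device)

    # Stage 1: unsupervised reconstruction of radar spectrograms, Eq. (7).
    opt_f = torch.optim.SGD(scae.parameters(), lr=0.1, weight_decay=5e-4)
    for s in spect_loader:                     # s: [B, 1, 32, 32] amplitudes
        s = s.to(device)
        s_hat, _ = scae(s)
        loss = nn.functional.mse_loss(s_hat, s)
        opt_f.zero_grad()
        loss.backward()
        opt_f.step()

    # Stage 2: supervised post-learning; G_S and G_F stay off-the-shelf.
    for p in list(dsn.g_s.parameters()) + list(dsn.g_f.parameters()):
        p.requires_grad_(False)
    opt_g = torch.optim.Adam(dsn.g_post.parameters(), lr=0.01)
    for intensity, spectrograms, labels in patch_loader:
        logits = dsn(intensity.to(device), spectrograms.to(device))
        loss = nn.functional.cross_entropy(logits, labels.to(device))
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()
```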


4. Datasets description

The proposed DSN framework is evaluated with several datasets of complex-valued SAR images. In this section, we introduce the applied datasets: the Sentinel-1 dataset (S1), the TerraSAR-X dataset (TSX), and the OpenSARShip dataset (OPS).

4.1. Sentinel-1 dataset

The S1 dataset is collected from Sentinel-1 satellite (Torres et al., 2012) single-look complex (SLC) SAR images acquired in Stripmap imaging mode. The Sentinel-1 SAR mission was launched by the European Space Agency (ESA); it operates in C-band with horizontal and vertical polarizations and provides SAR data in four different modes. Following the analysis in Section 3.2.1 on sharing spatial texture features with detected data, we applied three SLC images located in three cities, acquired in StripMap mode by beams S3 and S4, with approximately equal resolution and pixel spacing in the slant range and azimuth directions. Since our main purpose is to evaluate DSN with complex-valued SAR data, only the HH channel of the dual-polarized products is used. Table 2 shows the detailed parameters of the selected Sentinel-1 SAR images.

Table 2
Sentinel-1 StripMap SLC images selected for the S1 dataset.

ImgName                         BBE7      4312      EBAD
Location                        Houston   Chicago   São Paulo
Polarization                    HH        HH        HH
BeamID                          S3        S4        S4
Slant Range Resolution (m)      2.5       3         3
Azimuth Resolution (m)          3.6       4.8       4.8
Slant Range Pixel Spacing (m)   2.2       2.6       2.6
Azimuth Pixel Spacing (m)       3.5       4.1       4.1
Incidence Angle (°)             31.2      36.4      36.4

Manually, we annotated eight classes, including five man-made land use classes (industrial building, storage tank, container, residential, and skyscraper) and three natural surface classes (forest, agriculture, and water), by cropping patches of 64 × 64 pixels from the SLC images, covering around 200 × 200 m² on the ground, which is similar to the TerraSAR-X detected data we used for pre-training. The number of patches in each class is listed in Fig. 8, together with the visualization of the intensity images and the corresponding Google Earth references. The S1 dataset has 2550 patches with a balanced distribution of around 300 samples per class, and the data are stored as 16-bit I/Q channels.

4.2. TerraSAR-X dataset

The TSX dataset is collected from TerraSAR-X satellite (Pitz and Miller, 2010) Single-look Slant-range Complex (SSC) data acquired in Spotlight imaging mode. Three TerraSAR-X images located in Houston and San Antonio are selected for the experiments, with a single polarization of HH or VV. The specific parameters of the experimental data are given in Table 3. Before annotation, the SSC images are multi-looked in the I and Q channels, respectively, in order to obtain a square pixel spacing of approximately 1 m. With a higher resolution than Sentinel-1, TerraSAR-X images provide more detailed textural information about objects on the Earth's surface. As a result, we annotate five man-made object classes for cropped patches of size 64 × 64 to generate the TSX dataset, including 352 patches of storage tanks, 341 patches of railways, 225 patches of ships, 305 patches of industrial facilities, and 390 patches of residential areas, as shown in Fig. 9. As can be seen, the TSX dataset mainly focuses on man-made targets. For each patch, we keep one specific object as far as possible. The high resolution of TerraSAR-X makes it possible to show very detailed structures of targets. We generate this dataset to see whether the proposed framework can capture the characteristic features of man-made objects and distinguish them even when the relevant context is missing.

Table 3
TerraSAR-X Spotlight SSC images selected for the TSX dataset.

ImgName                      1550      3142          3527
Location                     Houston   San Antonio   San Antonio
Polarization                 HH        VV            VV
Slant Range Resolution (m)   0.58      0.58          0.58
Azimuth Resolution (m)       0.23      0.23          0.23
Square Pixel Spacing (m)     0.8       1             1
Incidence Angle (°)          30        40            26.5

4.3. OpenSARShip dataset

OpenSARShip (Huang et al., 2018) is a SAR ship dataset collected from Sentinel-1, containing 11,346 ship patches from 41 Sentinel-1 SAR images. We select the SLC products of the interferometric wide swath (IW) mode with dual polarization VV/VH, where the real and imaginary part values are stored in the original data. In order to keep consistent with the other two datasets, we only keep the VV polarization data for the experiments. The original OpenSARShip dataset contains 17 types of ships in total, but with highly imbalanced numbers in each type (for example, 8470 in Cargo and 4 in Towing). To generate our OPS dataset for evaluating the proposed approach, we select three main ship classes, Container Ship, Bulk Carrier, and Tanker, the same classes used for evaluation in Huang et al. (2018). The selected patches are center-cropped to keep the entire object within a size of 64 × 64; patches with objects that are too large or too small are discarded. Fig. 10 shows some examples of each class and the corresponding optical images, together with some product parameters.

5. Experiments and discussions

5.1. Experimental setup

Thanks to the transferred layers pre-trained on TerraSAR-X detected data, G_S (θ_S) can be used off-the-shelf. We then implement JTFA by sliding a 32 × 32 Hamming spectral window centered on all range/azimuth frequency couples (f_r0, f_a0) in the frequency domain to obtain a set of continuous sub-bands S(x, y, f_r0, f_a0) with half the resolution in the spatial domain. When training the spectrogram SCAE, the number of s(f_r, f_a) is N_x × N_y = 1024 times larger than the number of C(x, y), which makes it possible to train the stacked deep architecture on the spectrum data with a reconstruction loss at the first layer rather than in a layer-wise fashion. To optimize the 8-layer spectrogram SCAE, we apply the SGD optimizer with batch size 200 and an initial learning rate of 0.1, with a weight decay of 5e-4 by default. When transferring G_S and G_F and training G, θ_S and θ_F are fixed as off-the-shelf layers, while θ is randomly initialized with the He initializer (He et al., 2015) and learned with the Adam optimizer. We set the initial learning rate to 0.01 in the post-learning process. In the following experiments, we use 90%, 70%, 50%, and 30% of the data of each dataset for training, respectively. To make the evaluation more convincing, we randomly split each dataset into training and testing sets five times in each experiment and record all the results. The code, written in Python 3.6 with the deep learning toolbox PyTorch 0.4.0, is available online (https://github.com/Alien9427/SAR_specific_models). All experiments are conducted on a workstation with a 64-bit Windows 7 operating system, 64 GB RAM, and an NVIDIA Quadro M2000M graphics card with 4 GB GDDR5 VRAM clocked at 1250 MHz.
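The repeated random-split protocol just described can be summarized by the short sketch below. It assumes scikit-learn utilities and stratified splitting, and macro-averages the F1-score; the paper does not name the library or these details, so treat them as illustrative choices. `train_and_predict` is a hypothetical placeholder for fitting DSN or a baseline and returning test predictions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def evaluate_protocol(samples, labels, train_and_predict,
                      fractions=(0.9, 0.7, 0.5, 0.3), repeats=5):
    """Five random train/test splits per training fraction (Section 5.1)."""
    results = {}
    for frac in fractions:
        scores = []
        for seed in range(repeats):
            x_tr, x_te, y_tr, y_te = train_test_split(
                samples, labels, train_size=frac,
                stratify=labels, random_state=seed)
            y_pred = train_and_predict(x_tr, y_tr, x_te)
            scores.append((accuracy_score(y_te, y_pred),
                           f1_score(y_te, y_pred, average="macro")))
        oa, f1 = np.mean(scores, axis=0)
        results[frac] = {"overall_accuracy": oa, "f1_score": f1}
    return results
```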

Fig. 8. S1 dataset overview.


Fig. 9. TSX dataset overview. Patches are cropped from TerraSAR-X SSC SAR images after multi-looking, keeping the complex values of the data; the intensity information is visualized in this figure. Five classes are annotated and some examples of cropped patches in the SAR image are shown.

Fig. 10. OPS dataset used in our experiments.

Fig. 11. The overall accuracy and F1-score performances of CNN, TL-CNN, and DSN on S1 dataset, with reducing training samples.

5.2. Baseline models

To verify the effectiveness of our proposed DSN, some baseline models are built with architectures similar to G_F, G_S, and G.

1. CNN: uses the intensity image I(x, y) of the complex-valued SAR data for training. Only G_S and G are kept and trained from scratch, with all of θ_S and θ randomly initialized and learned.
2. TL-CNN: uses the intensity image I(x, y) of the complex-valued SAR data for training. Only G_S and G are kept. G_S is transferred from the model pre-trained on TerraSAR-X detected data, with θ_S fixed as off-the-shelf layers, while only θ is learned with I(x, y).
3. F-CNN: uses the radar spectrograms s(f_r, f_a) for training. Only G_F and G are kept and trained from scratch, with all of θ_F and θ randomly initialized and learned.
4. CV-CNN: uses the complex-valued SAR data C(x, y) for training. The complex-valued CNN (CV-CNN) model proposed in Zhang et al. (2017) is applied. Only G_S and G are kept. Because the complex-valued convolution kernels in CV-CNN have real and imaginary parts, the number of parameters in the network is doubled; we therefore halve the number of feature maps in G_S and G to ensure the same parameter volume in CV-CNN and CNN.


Table 4
Experimental results - overall accuracy (%).

90% for training
Classes              DSN             TL-CNN          CNN
Forest               98 ± 2.74       99.50 ± 1.12    96.50 ± 1.37
Water                100 ± 0         100 ± 0         100 ± 0
Agriculture          93.50 ± 2.24    94.50 ± 2.09    95.00 ± 3.06
Industrialbuilding   90 ± 5.59       81 ± 8.77       79.50 ± 4.47
Residential          100 ± 0         97.50 ± 3.06    94.00 ± 5.18
Skyscraper           92.00 ± 3.26    91.50 ± 7.42    89.00 ± 5.76
Storagetank          85.50 ± 6.71    82.00 ± 4.11    72.50 ± 6.85
Container            84.50 ± 4.11    73.50 ± 10.98   69.50 ± 9.25
Average              92.94 ± 1.05    89.94 ± 1.75    87.00 ± 2.01

70% for training
Classes              DSN             TL-CNN          CNN
Forest               95.16 ± 2.84    98.11 ± 1.37    96.42 ± 0.94
Water                99.58 ± 0.58    99.79 ± 0.47    99.79 ± 0.47
Agriculture          94.32 ± 2.42    96.42 ± 1.76    99.79 ± 0.47
Industrialbuilding   85.68 ± 2.94    75.16 ± 5.29    80.00 ± 6.66
Residential          99.79 ± 0.47    96.84 ± 1.05    94.95 ± 1.73
Skyscraper           91.79 ± 2.92    87.79 ± 4.97    85.26 ± 6.00
Storagetank          83.79 ± 1.76    74.95 ± 4.90    64.21 ± 2.35
Container            85.05 ± 7.64    66.11 ± 5.78    64.00 ± 17.25
Average              92.21 ± 0.94    86.89 ± 0.64    84.87 ± 2.05

50% for training
Classes              DSN             TL-CNN          CNN
Forest               94.57 ± 1.92    96.67 ± 2.33    96.42 ± 2.74
Water                99.66 ± 0.31    99.77 ± 0.31    99.77 ± 0.31
Agriculture          92.32 ± 1.77    94.92 ± 2.31    91.35 ± 3.20
Industrialbuilding   86.67 ± 3.27    78.74 ± 3.75    74.59 ± 4.52
Residential          99.77 ± 0.31    96.93 ± 1.03    96.25 ± 1.78
Skyscraper           89.03 ± 2.57    88.39 ± 5.80    81.94 ± 4.48
Storagetank          81.64 ± 4       70.41 ± 6.75    61.52 ± 2.56
Container            82.78 ± 2.82    64.63 ± 7.21    55.93 ± 7.48
Average              90.80 ± 0.66    86.31 ± 0.61    82.22 ± 0.85

30% for training
Classes              DSN             TL-CNN          CNN
Forest               80.70 ± 4.07    95.68 ± 2.21    91.98 ± 1.77
Water                99.67 ± 0.18    99.43 ± 0.22    99.67 ± 0.34
Agriculture          90.27 ± 2.55    91.12 ± 3.31    90.04 ± 2.96
Industrialbuilding   86.03 ± 3.05    73.85 ± 1.15    74.87 ± 2.73
Residential          99.67 ± 0.18    94.72 ± 2.97    95.85 ± 2.12
Skyscraper           87.85 ± 2.63    82.92 ± 2.20    72.85 ± 4.41
Storagetank          78.66 ± 2.90    69.29 ± 2.49    64.69 ± 5.28
Container            80.26 ± 4.90    55.63 ± 6.98    53.25 ± 9.21
Average              87.89 ± 0.59    82.83 ± 1.13    80.40 ± 1.46

The bold represents the highest performance for each experiment.

Table 5
Experimental results - F1-score.

90% for training
Classes              DSN              TL-CNN           CNN
Forest               0.958 ± 0.011    0.959 ± 0.011    0.949 ± 0.033
Water                1 ± 0            1 ± 0            1 ± 0
Agriculture          0.959 ± 0.014    0.96 ± 0.018     0.955 ± 0.023
Industrialbuilding   0.907 ± 0.046    0.832 ± 0.065    0.804 ± 0.052
Residential          0.990 ± 0.010    0.958 ± 0.014    0.926 ± 0.037
Skyscraper           0.887 ± 0.029    0.844 ± 0.025    0.813 ± 0.045
Storagetank          0.858 ± 0.035    0.817 ± 0.036    0.741 ± 0.053
Container            0.873 ± 0.023    0.811 ± 0.075    0.762 ± 0.0457
Average              0.929 ± 0.011    0.898 ± 0.019    0.869 ± 0.022

70% for training
Classes              DSN              TL-CNN           CNN
Forest               0.957 ± 0.014    0.96 ± 0.014     0.945 ± 0.014
Water                0.998 ± 0.003    0.997 ± 0.003    0.998 ± 0.003
Agriculture          0.963 ± 0.012    0.957 ± 0.010    0.946 ± 0.022
Industrialbuilding   0.876 ± 0.013    0.770 ± 0.009    0.786 ± 0.033
Residential          0.974 ± 0.011    0.953 ± 0.005    0.930 ± 0.017
Skyscraper           0.877 ± 0.030    0.813 ± 0.030    0.785 ± 0.014
Storagetank          0.841 ± 0.015    0.737 ± 0.016    0.667 ± 0.033
Container            0.873 ± 0.023    0.749 ± 0.038    0.705 ± 0.096
Average              0.921 ± 0.009    0.867 ± 0.006    0.845 ± 0.022

50% for training
Classes              DSN              TL-CNN           CNN
Forest               0.930 ± 0.017    0.942 ± 0.007    0.934 ± 0.021
Water                0.998 ± 0.001    0.995 ± 0.002    0.996 ± 0.001
Agriculture          0.939 ± 0.017    0.952 ± 0.009    0.935 ± 0.023
Industrialbuilding   0.864 ± 0.014    0.785 ± 0.017    0.744 ± 0.013
Residential          0.982 ± 0.007    0.947 ± 0.006    0.929 ± 0.013
Skyscraper           0.871 ± 0.012    0.827 ± 0.006    0.767 ± 0.013
Storagetank          0.828 ± 0.017    0.741 ± 0.020    0.642 ± 0.016
Container            0.860 ± 0.012    0.726 ± 0.054    0.632 ± 0.058
Average              0.909 ± 0.006    0.865 ± 0.007    0.823 ± 0.009

30% for training
Classes              DSN              TL-CNN           CNN
Forest               0.844 ± 0.033    0.917 ± 0.025    0.900 ± 0.020
Water                0.997 ± 0.001    0.995 ± 0.001    0.995 ± 0.002
Agriculture          0.873 ± 0.025    0.926 ± 0.019    0.910 ± 0.019
Industrialbuilding   0.851 ± 0.011    0.735 ± 0.009    0.719 ± 0.017
Residential          0.971 ± 0.010    0.928 ± 0.017    0.915 ± 0.012
Skyscraper           0.860 ± 0.021    0.803 ± 0.009    0.739 ± 0.037
Storagetank          0.806 ± 0.016    0.690 ± 0.015    0.632 ± 0.035
Container            0.838 ± 0.025    0.640 ± 0.045    0.618 ± 0.059
Average              0.880 ± 0.005    0.829 ± 0.011    0.804 ± 0.014

The bold represents the highest performance for each experiment.

5.3. Comparison with intensity information based CNN

We apply several metrics, such as overall accuracy, F1-score, and ROC curves, to evaluate the proposed method. Firstly, we want to illustrate the effectiveness of utilizing more of the information extracted from complex-valued SAR data compared with only taking the intensity information into consideration. As shown in Fig. 11, the average overall accuracy and F1-score on the S1 dataset are given for four situations with decreasing numbers of training samples. The dots denote the results for the different training/testing splits and the lines connect the average values, with error bars denoting the standard deviation. It is obvious that DSN performs much better than the two baseline models on average, with increases in overall accuracy of 3% and 5.94% when training with 90% of the data, and 5.06% and 7.49% when training with 30% of the data, compared with TL-CNN and CNN,


Fig. 12. The accuracy and F1-score of DSN and CNN in each class, with 90% data of S1 dataset for training.

Fig. 13. The ROC curve of three natural surfaces and five man-made land use classes of S1 dataset, respectively.

respectively. Tables 4 and 5 record all the experimental results for overall accuracy and F1-score, respectively. Comparing CNN with TL-CNN, the results show the effectiveness of transferring the layers pre-trained on TerraSAR-X detected data to Sentinel-1 SLC intensity images despite the different sensors and imaging modes. Fig. 12 shows the accuracy and F1-score for each specific class of the S1 dataset with 90% of the data used for training, for both DSN and CNN. We can observe that for some classes DSN performs much better than CNN, such as industrial building and container. However, for some classes the performance is equally matched, such as water, forest, and agriculture, which are all natural surfaces; CNN even achieves a higher accuracy and F1-score than DSN on agriculture. We infer that our DSN model behaves better than the original CNNs on man-made objects but has

little advantage for natural surfaces. To verify this, we plot the ROC curves for the three natural surface classes and the five man-made land use classes predicted by the DSN and CNN models, as shown in Fig. 13. Apparently, the main advantage of DSN lies in recognizing the man-made land use classes, and the superiority is more obvious when training with limited data. However, for the natural surface classes, DSN shows little advantage and is even not as good as CNN, which can also be observed in Tables 4 and 5. The overall accuracy and F1-score results of CNN and DSN trained on the OPS and TSX datasets are shown in Fig. 14. Figs. 15 and 16 show the ROC curves for each class of the TSX dataset and the OPS dataset, respectively. The results indicate the superior performance of the DSN model in classifying man-made targets and fine-grained types of ships compared with


Fig. 14. The overall accuracy and F1-score of CNN and DSN (DSN) on TSX and OPS datasets, with reducing training samples.

Fig. 15. The ROC curve of each class in TSX dataset.

Fig. 16. The ROC curve of each class in OPS dataset.



Fig. 17. (a) The overall accuracy on the test data of the different datasets for F-CNN, CNN, and DSN, respectively. (b) The overall accuracy on the test data of the different datasets for CV-CNN and DSN.

Fig. 18. Feature visualization of the S1 dataset with the DSN and CNN models using t-SNE (Maaten and Hinton, 2008), respectively. The zoomed-in area A presents two samples of industrial building and container which are confused with the residential cluster, as predicted by TL-CNN. B and C show two different samples that are recognized as the same class by TL-CNN.

Fig. 19. Feature visualization of TSX dataset with DSN and CNN using t-sne (Maaten and Hinton, 2008), respectively.

intensity information based CNN models.

network will decide the latent semantic labels from complex-valued SAR images depends on the training data and learning process, by setting proper learning hyper-parameters and a good initialization of weights. It can be very useful when a large amount of labeled training data is available and well-chosen optimization approach is applied. While in DSN, some theory knowledge is given to the framework priorly so that the DSN framework is not a complete “black box” compared with CV-CNN. We design this framework by the idea that knowing the physical properties of objects in SAR images helps understand the semantic labels. Some parts in this framework can be obtained by known

5.4. Comparison with CV-CNN and F-CNN As shown in Fig. 17(a), the performances of DSN on S1, TSX, and OPS datasets are much better than CV-CNN. DSN and CV-CNN both utilize the complex-valued SAR data for training with similar network architecture and approximate parameter volume. However, CV-CNN learns complex-valued weights and bias in network directly from complex-valued data which is totally a data-driven approach. How the 190


physical models, for example, STFT-based JTFA, while the parts which are not easy to model with our knowledge are replaced by deep neural networks, for example, G_F, G_S, and G. In this way, we combine the theory-driven approach and the data-driven approach to understand SAR images, making the framework more explainable and understandable and reducing the demand for large amounts of training data at the same time.

Then, we compare the performance with feature fusion (DSN) and without feature fusion (CNN and F-CNN). CNN automatically learns hierarchical textural features in the spatial domain from the intensity information of the SAR images, while F-CNN only learns latent features in the frequency domain for classification from the complex-valued SAR images after JTFA processing. In Fig. 17(b), the overall accuracies on the test data of the S1, TSX, and OPS datasets demonstrate that DSN outperforms the methods without feature fusion, and that the texture-based CNN is better than the frequency-based F-CNN. As mentioned in Section 2, the JTFA of a complex-valued SAR image reveals the characteristic backscattering behaviors of objects in the 2-D frequency domain. Some elementary patterns of the backscattering behavior were proposed in Spigai et al. (2011), but they are limited to only four classes. Although we believe that there are more variations of the backscattering behaviors, it is still impossible to map them to semantic labels perfectly, since different semantic labels may contain objects with similar physical properties. As a result, the feature fusion allows the information in the frequency domain to assist the textural information in understanding the semantic meaning of SAR images.

5.5. Feature visualization and target analysis

Besides the outperformance of DSN on the evaluation metrics, we want to provide a further discussion and explanation of the results in this section. We visualize the output features of the last residual bottleneck block in G on the S1 test data and the TSX test data, as shown in Fig. 18 and Fig. 19, respectively. In Fig. 18, the yellow and red circles, representing agriculture and forest, are more clearly distinguished by CNN, except for a few mistaken samples mixed into the other cluster, while the feature boundary between agriculture and forest is less clear in the DSN model. By visualizing the spectrogram amplitudes of forest and agriculture in Fig. 20, we can observe that the forest and agriculture samples present similar backscattering patterns without a specific distinguishing mechanism. As a result, the features in the frequency domain cannot provide enough extra information on natural surfaces to support the interpretation of SAR images.

Fig. 20. The spectrogram amplitudes of two "forest" samples and two "agriculture" samples.

For the five man-made land use classes in the S1 and TSX datasets, the radar spectrogram helps a lot. Fig. 18 shows that residential can be perfectly separated from the other classes by DSN. However, the CNN cluster is less distinguishable from the other four man-made object classes, with some confused samples from "industrialbuilding" and "container", shown as A in Fig. 18. Due to the complex contents of the "container", "skyscraper", "industrialbuilding", and "storagetank" classes in SAR images, CNN has certain difficulty in recognizing them with intensity information alone. The CNN features of these four classes are severely mixed with each other. This also happens in the TSX dataset: Fig. 19 shows mixed feature distributions of "ships", "industrialfacilities", "storagetank", and "railways". Some samples from different classes show similar features in the spatial domain, as B and C in Fig. 18 and the two example images in Fig. 19 present. In contrast, DSN does much better in clustering the features of these four classes and makes them more separable. Next, we focus on the misleading samples of situations A, B, and C in Fig. 18 to analyze why DSN works well on man-made object classes.

Firstly, we take a look at situation A, shown in Fig. 21, where an "industrialbuilding" sample and a "container" sample are predicted as "residential" by TL-CNN with confidences of 78.16% and 71.77%, respectively, while DSN makes the right predictions with confidences of 75.47% and 76.06%, respectively. A nearby "residential" sample in this cluster is also shown, which is typical and correctly predicted by both CNN and DSN. The red circles point out the similar texture of a rectangle-like shape that the intensity-based CNN may have confused. By comparing the SAR intensity patches with the corresponding Google Earth images, we find that in the "residential" sample the rectangle-like shape represents the street block, and the bright backscattering inside the rectangle reflects the houses on the ground. In the "industrialbuilding" sample it represents the flat building roof, and the bright line on the left shows that the radar signal scatters more strongly on the roof edge at this particular incident angle. In the "container" sample, the brighter boundary of the rectangle may be due to multiple bounces from other containers next to it. As shown in Fig. 21, the red circles mark some similar shapes and textures in the intensity images which could confuse the CNN, but the spectrogram amplitudes shown on the 3-D axes present prominently different characteristics. With the dihedrals formed by building walls or cylindrical street lamps above rough soil in residential areas, the spectrogram shows a typical frequency-invariant behavior, which means the target has an isotropic backscattering pattern. In contrast, the industrial roof with its very regular structure shows a range-variant behavior in the spectrogram amplitude, and the "container" sample also shows a much different backscattering pattern compared with the "residential" sample. In situation B, both patches contain a strong scattering point with a cross of sidelobes in the intensity image, which may confuse the CNN due to the similar texture. However, in the frequency domain, the "skyscraper" sample presents a very complex backscattering pattern due to the multiple bounces among high buildings, rather than the simple frequency-invariant behavior of the "storagetank" sample. In situation C, DSN successfully predicts the first "container" sample with high confidence because of the typical range-variant scattering behavior shown in the spectrogram, while CNN has less confidence in making the prediction with only spatial information.

Fig. 21. Detailed explanation of cases A, B, and C shown in Fig. 18. In each case, the Google Earth image, the Sentinel-1 SLC intensity image, and the predicted patch are given, with the predictions of DSN and CNN. We circle the areas with similar texture which may confuse the image content based CNNs and display the corresponding radar spectrogram visualization to demonstrate the discriminative scattering patterns.

6. Conclusion

In this paper, we propose a novel SAR-specific deep learning framework named Deep SAR-Net, aiming at making full use of single-look complex SAR images. Two different forms of the signal, the intensity image and the radar spectrogram, are obtained from SLC SAR images to jointly learn the surfaces and objects on the ground. By transferring layers pre-trained on TerraSAR-X detected data, representative spatial features are extracted from the intensity images and are shown to be effective despite the different SAR sensors and imaging modes. JTFA provides a 4-D representation of the SLC SAR image with information in all sub-bands, revealing the backscattering diversity versus range and azimuth frequencies of objects on the ground. For each 2-D radar spectrogram, frequency features are extracted by a stacked CAE model and then aligned spatially according to the spatial information preserved by JTFA. The final decisions are made by fusing these features and applying a post-learning process. We generate a land cover and land use dataset from Sentinel-1 SLC SAR images, containing five man-made land use classes and three natural surface classes, for evaluation. The experiments and results demonstrate the superior performance of DSN in interpreting SAR images, especially for man-made objects, compared with the proposed CNN baseline models based only on intensity information. We believe that the novel SAR-specific deep learning framework is also applicable to other SAR interpretation tasks, such as ship velocity estimation and ship detection. Our future work will focus on the transferability of Deep SAR-Net to different polarizations and resolutions, as well as its practicality in other applications. Finally, the proposed datasets and the trained DSN model are open-sourced (https://github.com/Alien9427/SAR_specific_models).

Declaration of Competing Interest

We declare that we have no conflict of interest.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant 61701478, and the University of Chinese Academy of Sciences (UCAS) Joint PhD Training Program scholarship.

References

Bovenga, F., Giacovazzo, V., Refice, A., Nitti, D., Veneziani, N., 2011. Interferometric multi-chromatic analysis of high resolution X-band data. In: Proceedings of the Fringe 2011 Workshop, Frascati, Italy, pp. 19–23.
Bovenga, F., Derauw, D., Rana, F.M., Barbier, C., Refice, A., Veneziani, N., Vitulli, R., 2014. Multi-chromatic analysis of SAR images for coherent target detection. Remote Sens. 6 (9), 8822–8843. https://doi.org/10.3390/rs6098822.
Chen, S., Wang, H., Xu, F., Jin, Y.-Q., 2016. Target classification using the deep convolutional networks for SAR images. IEEE Trans. Geosci. Remote Sens. 54 (8), 4806–4817.
Ferro-Famil, L., Reigber, A., Pottier, E., Boerner, W., 2003. Scene characterization using subaperture polarimetric SAR data. IEEE Trans. Geosci. Remote Sens. 41 (10), 2264–2276. https://doi.org/10.1109/TGRS.2003.817188.
Ferro-Famil, L., Reigber, A., Pottier, E., 2005. Nonstationary natural media analysis from polarimetric SAR data using a two-dimensional time-frequency decomposition approach. Can. J. Remote Sens. 31 (1), 21–29. https://doi.org/10.5589/m04-062.
Geng, J., Fan, J., Wang, H., Ma, X., Li, B., Chen, F., 2015. High-resolution SAR image classification via deep convolutional autoencoders. IEEE Geosci. Remote Sens. Lett. 12 (11), 2351–2355.
Geng, J., Wang, H., Fan, J., Ma, X., 2017. Deep supervised and contractive neural network for SAR image classification. IEEE Trans. Geosci. Remote Sens. 55 (4), 2442–2459.
Gentine, P., Pritchard, M., Rasp, S., Reinaudi, G., Yacalis, G., 2018. Could machine learning break the convection parameterization deadlock? Geophys. Res. Lett. 45 (11), 5742–5751. https://doi.org/10.1029/2018GL078202.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: The IEEE International Conference on Computer Vision (ICCV).
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Huang, Z., Dumitru, C.O., Pan, Z., Lei, B., Datcu, M., 2020. Classification of large-scale high-resolution SAR images with deep transfer learning. IEEE Geosci. Remote Sens. Lett. 1–5. https://doi.org/10.1109/LGRS.2020.2965558. In press. https://arxiv.org/abs/2001.01425.
Huang, L., Liu, B., Li, B., Guo, W., Yu, W., Zhang, Z., Yu, W., 2018. OpenSARShip: a dataset dedicated to Sentinel-1 ship interpretation. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 11 (1), 195–208.
Huang, Z., Pan, Z., Lei, B., 2019. What, where, and how to transfer in SAR target recognition based on deep CNNs. IEEE Trans. Geosci. Remote Sens. 1–13. https://doi.org/10.1109/TGRS.2019.2947634. In press.
Lv, Q., Dou, Y., Niu, X., Xu, J., Xu, J., Xia, F., 2015. Urban land use and land cover classification using remotely sensed SAR data through deep belief networks. J. Sens.
Maaten, L.v.d., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Pitz, W., Miller, D., 2010. The TerraSAR-X satellite. IEEE Trans. Geosci. Remote Sens. 48 (2), 615–622. https://doi.org/10.1109/TGRS.2009.2037432.
Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., Prabhat, 2019. Deep learning and process understanding for data-driven Earth system science. Nature 566 (7743), 195–204. https://doi.org/10.1038/s41586-019-0912-1.
Renga, A., Graziano, M.D., Moccia, A., 2019. Segmentation of marine SAR images by sublook analysis and application to sea traffic monitoring. IEEE Trans. Geosci. Remote Sens. 57 (3), 1463–1477. https://doi.org/10.1109/TGRS.2018.2866934.
Singh, J., Datcu, M., 2012. SAR target analysis based on multiple-sublook decomposition: a visual exploration approach. IEEE Geosci. Remote Sens. Lett. 9 (2), 247–251.
Singh, J., Datcu, M., 2013. SAR image categorization with log cumulants of the fractional Fourier transform coefficients. IEEE Trans. Geosci. Remote Sens. 51 (12), 5273–5282. https://doi.org/10.1109/TGRS.2012.2230892.
Souyris, J., Henry, C., Adragna, F., 2003. On the use of complex SAR image spectral analysis for target detection: assessment of polarimetry. IEEE Trans. Geosci. Remote Sens. 41 (12), 2725–2734. https://doi.org/10.1109/TGRS.2003.817809.
Spigai, M., Tison, C., Souyris, J.-C., 2011. Time-frequency analysis in high-resolution SAR imagery. IEEE Trans. Geosci. Remote Sens. 49 (7), 2699–2711.
Torres, R., Snoeij, P., Geudtner, D., Bibby, D., Davidson, M., Attema, E., Potin, P., Rommen, B., Floury, N., Brown, M., et al., 2012. GMES Sentinel-1 mission. Remote Sens. Environ. 120, 9–24.
Tupin, F., Tison, C., 2004. Sub-aperture decomposition for SAR urban area analysis. In: EUSAR 2004, pp. 431–434.
Willis, M.J., von Stosch, M., 2017. Simultaneous parameter identification and discrimination of the nonparametric structure of hybrid semi-parametric models. Comput. Chem. Eng. 104, 366–376. https://doi.org/10.1016/j.compchemeng.2017.05.005.
Wu, W., Guo, H., Li, X., 2013. Man-made target detection in urban areas based on a new azimuth stationarity extraction method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 6 (3), 1138–1146. https://doi.org/10.1109/JSTARS.2013.2243700.
Wu, W., Li, H., Zhang, L., Li, X., Guo, H., 2018. High-resolution PolSAR scene classification with pretrained deep convnets and manifold polarimetric parameters. IEEE Trans. Geosci. Remote Sens. 56 (10), 6159–6168. https://doi.org/10.1109/TGRS.2018.2833156.
Zhang, L., Ma, W., Zhang, D., 2016. Stacked sparse autoencoder in PolSAR data classification using local spatial information. IEEE Geosci. Remote Sens. Lett. 13 (9), 1359–1363.
Zhang, Z., Wang, H., Xu, F., Jin, Y.-Q., 2017. Complex-valued convolutional neural network and its application in polarimetric SAR image classification. IEEE Trans. Geosci. Remote Sens. 55 (12), 7177–7188.
Zhao, Z., Jiao, L., Zhao, J., Gu, J., Zhao, J., 2017. Discriminant deep belief network for high-resolution SAR image classification. Pattern Recognit. 61, 686–701.
Zhou, Y., Wang, H., Xu, F., Jin, Y.-Q., 2016. Polarimetric SAR image classification using deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 13 (12), 1935–1939.
