Deep learning-based fusion of Landsat-8 and Sentinel-2 images for a harmonized surface reflectance product


Zhenfeng Shao (a,b), Jiajun Cai (a,b,c), Peng Fu (d,e,*), Leiqiu Hu (f), Tao Liu (g)

a State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, 430072, China
b Collaborative Innovation Center for Geospatial Technology, Wuhan University, Wuhan, 430072, China
c Department of Geography and Resource Management, The Chinese University of Hong Kong, 999077, Hong Kong, China
d Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
e Carl R. Woese Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
f Atmospheric Science Department, University of Alabama at Huntsville, Huntsville, AL, 35805, USA
g Geographic Data Science, Oak Ridge National Laboratory, Oak Ridge, TN, 37830, USA

ARTICLE INFO

Keywords: Deep learning; Data fusion; Landsat-8; Sentinel-2; Continuous monitoring

ABSTRACT

Landsat and Sentinel-2 sensors together provide the most widely accessible medium-to-high spatial resolution multispectral data for a wide range of applications, such as vegetation phenology identification, crop yield estimation, and forest disturbance detection. Improved timely and accurate observations of the Earth's surface and dynamics are expected from the synergistic use of Landsat and Sentinel-2 data, which entails bridging the spatial resolution gap between Landsat (30 m) and Sentinel-2 (10 m or 20 m) images. However, widely used data fusion techniques may not fulfil the community's needs for generating a temporally dense reflectance product at 10 m spatial resolution from combined Landsat and Sentinel-2 images because of their inherent algorithmic weaknesses. Inspired by recent advances in deep learning, this study developed an extended super-resolution convolutional neural network (ESRCNN) within a data fusion framework, specifically for blending Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multispectral Imager (MSI) data. Results demonstrated the effectiveness of the deep learning-based fusion algorithm in yielding a consistent and comparable dataset at 10 m from Landsat-8 and Sentinel-2. Further accuracy assessments revealed that the performance of the fusion network was influenced by both the number of input auxiliary Sentinel-2 images and the temporal interval (i.e., difference in image acquisition dates) between the auxiliary Sentinel-2 images and the target Landsat-8 image. Compared to the benchmark algorithm, area-to-point regression kriging (ATPRK), the deep learning-based fusion framework performed better in the quantitative assessment in terms of RMSE (root mean square error), correlation coefficient (CC), universal image quality index (UIQI), relative global-dimensional synthesis error (ERGAS), and spectral angle mapper (SAM). ESRCNN also better preserved the reflectance distribution of the original image than ATPRK, resulting in improved image quality. Overall, the developed data fusion network that blends Landsat-8 and Sentinel-2 images has the potential to help generate continuous reflectance observations at a higher temporal frequency than can be obtained from a single Landsat-like sensor.

1. Introduction

The consistent Landsat records of the Earth's surface and dynamics at 30 m spatial resolution represent the longest space-based observations to date, dating back to the 1970s (Roy et al., 2014). The opening of the Landsat archive in 2008 (Woodcock et al., 2008) has fostered many previously unimaginable environmental applications based on time series satellite images, such as near-real-time disturbance detection (Verbesselt et al., 2012) and continuous land cover change detection and classification (Zhu and Woodcock, 2014). It has now become the norm to use all available time series images (dating back to the 1980s for reflectance products) at a given location for a wide range of applications, from characterizing forest disturbance (Kim et al., 2014) and vegetation phenology (Senf et al., 2017) to revealing urbanization-induced land use and land cover changes (Fu and Weng, 2016a). Despite the popularity of time series analysis of Landsat images, their usability is often limited by the presence of clouds, shadows, and other adverse atmospheric effects (Fu and Weng, 2016b); thus, the actual

* Corresponding author: Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA. E-mail address: [email protected] (P. Fu).

https://doi.org/10.1016/j.rse.2019.111425 Received 23 June 2019; Received in revised form 8 September 2019; Accepted 14 September 2019 0034-4257/ © 2019 Elsevier Inc. All rights reserved.


Table 1. Band specifications for Landsat-8 and Sentinel-2 images.

Landsat-8:
  Band 1 (Coastal)    430–450 nm    30 m
  Band 2 (Blue)       450–515 nm    30 m
  Band 3 (Green)      525–600 nm    30 m
  Band 4 (Red)        630–680 nm    30 m
  Band 5 (NIR)        845–885 nm    30 m
  Band 6 (SWIR-1)     1560–1660 nm  30 m
  Band 7 (SWIR-2)     2100–2300 nm  30 m
  Band 8 (PAN)        503–676 nm    15 m

Sentinel-2:
  Band 1 (Coastal)        433–453 nm    60 m
  Band 2 (Blue)           458–523 nm    10 m
  Band 3 (Green)          543–578 nm    10 m
  Band 4 (Red)            650–680 nm    10 m
  Band 5 (Red Edge)       698–713 nm    20 m
  Band 6 (Red Edge)       733–748 nm    20 m
  Band 7 (Red Edge)       773–793 nm    20 m
  Band 8 (NIR)            785–900 nm    10 m
  Band 9 (Water vapor)    935–955 nm    60 m
  Band 10 (SWIR-Cirrus)   1360–1390 nm  60 m
  Band 11 (SWIR-1)        1565–1655 nm  20 m
  Band 12 (SWIR-2)        2100–2280 nm  20 m

temporal revisit cycle of usable Landsat images is not periodical and is sometimes longer than 16 days. The missing temporal information can hinder applications that require near-daily or multi-day imagery at medium spatial resolution (~30 m), e.g., crop yield estimation (Claverie et al., 2017), flood response (Skakun et al., 2014), vegetation phenology identification (Melaas et al., 2013), and forest disturbance detection (White et al., 2017).

Synergies between Landsat Operational Land Imager (OLI) and Sentinel-2 Multispectral Imager (MSI) data are promising for fulfilling the community's needs for high-temporal-resolution images at medium spatial scale. With the launch of the Sentinel-2A and -2B satellites in 2015 and 2017, respectively, the combined Landsat-8 OLI and Sentinel-2 MSI dual system can provide dense global observations at a nominal revisit interval of 2–3 days. Each MSI onboard the twin Sentinel-2 satellites (Sentinel-2A and -2B) acquires images covering thirteen spectral bands (Table 1) at spatial resolutions of 10 m (four visible and near-infrared bands), 20 m (six red edge and shortwave infrared bands), and 60 m (three atmospheric correction bands) every 10 days (Drusch et al., 2012; Malenovský et al., 2012). The MSI images are complementary to images captured by the Landsat OLI sensor since the two instruments share similarities in band specifications (Table 1). The combination of Landsat and Sentinel-2 reflectance products enriches the temporal information; for example, we can generate 5-day composite products, much more frequent than those from Landsat or Sentinel-2 individually. However, the disparity in spatial resolution between Landsat-8 and Sentinel-2 data remains unresolved. One common but rather simple approach is to sacrifice the spatial resolution of the reflectance products by resampling the finer-resolution data to match the coarser one (Wang et al., 2017). This resampling approach has also been adopted by the NASA Harmonized Landsat and Sentinel-2 (HLS) project to produce temporally dense reflectance products at 30 m. As a result, valuable information provided by the 10 m Sentinel-2 data is wasted. In this study, we aim to develop a data fusion approach that takes full advantage of the available temporal and spatial information in both Landsat-8 and Sentinel-2 images to generate a reflectance product at a finer spatial resolution of 10 m.

Various algorithms have been developed in the past for image fusion to obtain more frequent Landsat-like data, such as the spatial and temporal adaptive reflectance fusion model (STARFM) (Gao et al., 2006) and its variants (Hilker et al., 2009; Zhu et al., 2010, 2016), unmixing-based data fusion (Gevaert and García-Haro, 2015), wavelet transformation (Acerbi-Junior et al., 2006), sparse representation (Wei et al., 2015), and smoothing filter-based intensity modulation (SFIM) (Liu, 2000). Among these data fusion techniques, STARFM and its variants are probably the most popular fusion algorithms for generating synthetic surface reflectance at both high spatial and temporal resolutions due to their robust prediction performances (Emelyanova et al., 2013). Nevertheless, STARFM and its variants require at least one pair of fine and coarse spatial resolution images (e.g., Landsat and MODIS images) acquired on the same day as inputs to implement the downscaling process. This requisite makes the STARFM framework unsuitable for fusing Landsat-8 and Sentinel-2 images, which often revisit the same target on different days. Recently, Wang et al. (2017) used area-to-point regression kriging (ATPRK) to yield downscaled Landsat-8 images at 10 m and suggested that the ATPRK approach outperformed STARFM in downscaling reflectance. However, ATPRK, as a geostatistical fusion approach, involves complex semi-variogram modeling from the cokriging matrix, which is computationally unrealistic for a large domain because of its sheer size. Moreover, ATPRK may not be suitable for areas experiencing rapid land cover changes since its performance relies on input covariates (i.e., spectral bands) that may come from a subjectively selected image on a specific date. Within a short time period (e.g., 2 weeks), the collected Sentinel-2 and Landsat data for a given location may all contain valuable information for image fusion. In contrast, the ATPRK algorithm does not have the flexibility to accommodate a varying number of input images within a specific study period for reflectance prediction. With a varied number of input images, a new fusion algorithm is highly needed that can automatically select the best features from one or more images to perform reflectance prediction at high spatial resolution (Das and Ghosh, 2016).

Recent advances in deep learning make it a promising approach for addressing the spatial gap between Landsat-8 and Sentinel-2 data, potentially leading to improved image fusion performance over existing algorithms (Masi et al., 2016; Yuan et al., 2018). Deep learning is a fully data-driven approach and can automatically transform the representation at one level into a representation at a higher, slightly more abstract level (LeCun et al., 2015; Schmidhuber, 2015), thus facilitating data predictions at different spatial scales. In particular, convolutional neural networks (CNNs) consist of a series of convolution filters that can extract hierarchical contextual image features (Krizhevsky et al., 2012). As a popular form of deep learning networks, they have been widely used in image classification, object recognition, and natural language processing due to their powerful feature learning ability (Audebert et al., 2017; Hirschberg and Manning, 2015; Simonyan and Zisserman, 2014). Inspired by these successful applications of CNNs, this study extended a super-resolution CNN (SRCNN) to address the gap in spatial resolution between Landsat-8 and Sentinel-2 images. More specifically, the deep learning-based framework was used to downscale the Landsat-8 image of 30 m spatial resolution to 10 m by using Sentinel-2 spectral bands at 10 m and 20 m. Given the better performance of ATPRK in image fusion over previous pan-sharpening and spatiotemporal fusion algorithms (Wang et al., 2017), this study used ATPRK as the benchmark algorithm to assess the effectiveness of the deep learning-based fusion approach.



2. Methodology

2.1. Convolutional neural network

As a popular neural network form in deep learning, CNNs leverage three important ideas, namely sparse interactions, parameter sharing, and sub-sampling, to help improve a machine learning system (He et al., 2015; Krizhevsky et al., 2012; Ouyang et al., 2015; Zhang et al., 2014; Sun et al., 2014). In CNNs, it is impractical to connect all neurons across network layers for images of high dimensionality. Thus, neurons in CNNs are connected only to a local region of the original image (i.e., sparse interactions) through a hyperparameter called the receptive field of a neuron (equivalently, the filter size, as shown in Fig. 1). In other words, sparse interactions indicate that the value of each neuron within a convolution layer is calculated from the neuron values of its spatially neighboring region in the previous layer. Parameter sharing ensures that the weights of a convolutional kernel remain the same when they are applied to the input data at different locations; in this way, the number of parameters in the network decreases significantly. Fig. 1 shows the concepts of local receptive fields (red rectangles) and parameter sharing (i.e., weight values w1, w2, and w3 have the same value: w1 = w2 = w3).

Fig. 1. The schematic illustration of local receptive fields (red rectangles) and parameter sharing (weight values w1, w2, and w3 are equal) between layers in convolutional neural networks. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

A typical layer of a convolutional network includes three phases, as shown in Fig. 2. In the first phase, a series of convolutions are applied in parallel to produce linear activations. In the second phase, each linear activation from the first phase is passed through a nonlinear activation function, such as Sigmoid, Tanh, or ReLU (Rectified Linear Unit) (Nair and Hinton, 2010). In the third phase, a pooling function configured in a so-called sub-sampling layer performs local averaging and sub-sampling, which reduces both the sensitivity of the output to shifts and distortions and the computational complexity of the model. The output of each phase is called a feature map. Mathematically, the first two phases are expressed in equation (1):

y^{(j)} = f\left( b^{(j)} + \sum_i k^{(i)(j)} * x^{(i)} \right)    (1)

where x^{(i)} and y^{(j)} are the i-th input feature map and the j-th output feature map of a convolutional layer, k^{(i)(j)} is the convolutional kernel applied to x^{(i)}, and b^{(j)} denotes the bias. The convolutional operator is indicated by the symbol *, and f is the nonlinear activation function. The third phase, max pooling (or sub-sampling), is expressed in equation (2):

y_{r,c} = \max_{0 \le m, n \le s} x_{r+m, c+n}    (2)

where y_{r,c} is the neuron value at (r, c) in the output layer, and m and n index the pixel locations around the center neuron at (r, c) within a spatial extent of s × s (also known as an image patch). More specifically, equation (2) dictates that the value at location (r, c) is assigned the maximal value over the spatial extent of s × s in the input layer.

Fig. 2. The three phases, i.e., convolution, nonlinear activation, and pooling, in a typical convolutional neural network layer.

2.2. Extending super-resolution convolutional neural network (SRCNN) for image fusion

Although CNNs were initially designed to predict categorical variables (i.e., classification), they have recently been modified to output continuous variables (i.e., regression) for data fusion (e.g., Dong et al., 2016; Shao and Cai, 2018). Dong et al. (2016) proposed SRCNN to reconstruct a high-resolution image directly from a low-resolution image. As the fusion of Sentinel-2 and Landsat-8 images aims to improve the spatial resolution of Landsat-8 imagery from 30 m to 10 m, it shares some similarities with super-resolution reconstruction (SR). However, SR cannot reconstruct spatial details within image pixels when coarse resolution images are the only inputs for predicting images at fine spatial resolution. In this study, an extended SRCNN (ESRCNN) was therefore developed to fuse data with different resolutions, where the high-resolution Sentinel-2 images are treated as auxiliary data for downscaling the low-resolution Landsat-8 image.

The ESRCNN framework consists of three layers. The first layer takes inputs with N1 channels and calculates N2 feature maps using a k1 × k1 receptive field and the nonlinear activation ReLU. The second layer computes N3 feature maps with the aid of a k2 × k2 receptive field and ReLU. Finally, the third layer outputs fusion results with N4 channels based on a k3 × k3 receptive field. In summary, these layers can be expressed in equation (3):

f_1(x) = \max(0, b_1 + w_1 * x),        w_1: N_2 \times (k_1 \times k_1 \times N_1),   b_1: N_2 \times 1
f_2(x) = \max(0, b_2 + w_2 * f_1(x)),   w_2: N_3 \times (k_2 \times k_2 \times N_2),   b_2: N_3 \times 1
f_3(x) = b_3 + w_3 * f_2(x),            w_3: N_4 \times (k_3 \times k_3 \times N_3),   b_3: N_4 \times 1    (3)

where the max function returns the larger of the two values in brackets.
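To make the three-layer architecture concrete, the following is a minimal PyTorch sketch of an ESRCNN-style network, using the kernel sizes (9, 1, 5) and feature-map numbers (64, 32) given below in the workflow description. PyTorch itself, the padding choices, and the example channel counts are illustrative assumptions, not the authors' original implementation.

```python
import torch
import torch.nn as nn

class ESRCNN(nn.Module):
    """Three-layer SRCNN-style fusion network (illustrative sketch).

    in_channels : number of stacked input bands (e.g., 14 when one
                  auxiliary Sentinel-2 image is used with Landsat-8).
    out_channels: number of predicted bands (e.g., 7 Landsat-8 bands).
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.layer1 = nn.Conv2d(in_channels, 64, kernel_size=9, padding=4)
        self.layer2 = nn.Conv2d(64, 32, kernel_size=1, padding=0)
        self.layer3 = nn.Conv2d(32, out_channels, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.layer1(x))   # f1: feature extraction
        x = self.relu(self.layer2(x))   # f2: nonlinear mapping
        return self.layer3(x)           # f3: reconstruction (no activation)

# Example: multi-temporal fusion with one auxiliary Sentinel-2 image
# (6 Sentinel-2 bands + 8 Landsat-8 bands -> 7 fused Landsat-8 bands).
model = ESRCNN(in_channels=14, out_channels=7)
patch = torch.randn(1, 14, 32, 32)      # one 32x32 training patch
print(model(patch).shape)               # torch.Size([1, 7, 32, 32])
```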


Fig. 3. The workflow of the extended SRCNN (ESRCNN) for fusing Landsat-8 and Sentinel-2 images. Conv indicates the convolutional layer and the rectified linear unit (ReLU) represents the nonlinear activation layer.

The workflow of ESRCNN for fusing Landsat-8 and Sentinel-2 images is shown in Fig. 3. The downscaling procedure is divided into two parts: self-adaptive fusion of Sentinel-2 images and multi-temporal fusion of Landsat-8 and Sentinel-2 images. First, Sentinel-2 spectral bands 11 and 12 (SWIR-1/2) at 20 m are downscaled to 10 m by feeding ESRCNN with bands 2–4 (B, G, and R) and 8 (NIR) at 10 m, together with bands 11 and 12 resampled to 10 m using nearest neighbor interpolation. Second, Landsat-8 bands 1–7 are downscaled through ESRCNN by using the Landsat-8 panchromatic band (15 m) and the Sentinel-2 data sets (bands 2–4, 8, and 11–12 at 10 m). The fusion network in this step can accommodate a flexible number of Sentinel-2 images as auxiliary data sets. Inputs to ESRCNN include multi-temporal Sentinel-2 images (10 m) captured on days relatively close to the target Landsat-8 image, as well as Landsat-8 bands 1–7 resampled to 10 m with nearest neighbor interpolation. The incorporation of resampled bands in both steps ensures that information from the coarse resolution images is learned by the deep learning network and used for downscaling.

Overall, the ESRCNN fusion workflow illustrated in Fig. 3 can be summarized in four steps. (1) The 20 m Sentinel-2 bands 11–12 are resampled to 10 m using nearest neighbor interpolation; (2) the resampled Sentinel-2 bands 11–12 and bands 2–4 and 8 at 10 m are fed into the ESRCNN to generate downscaled Sentinel-2 bands 11–12 at 10 m (the self-adaptive fusion process); (3) the Landsat-8 bands 1–7 at 30 m and band 8 (PAN) at 15 m are resampled to 10 m using nearest neighbor interpolation; (4) the resampled Landsat-8 images and the Sentinel-2 images at 10 m are used as inputs to the ESRCNN to generate Landsat-8 bands 1–7 at 10 m (i.e., multi-temporal fusion of Landsat-8 and Sentinel-2 images).

In this study, the parameters of the fusion network were configured following SRCNN (Dong et al., 2016). More specifically, the kernel sizes (k1, k2, and k3) of the three layers were 9, 1, and 5, respectively, and the numbers of output feature maps (N2 and N3) in the first two layers were 64 and 32, respectively. The numbers of input and output feature maps in the self-adaptive fusion were 6 (bands 2–4, 8, and 11–12) and 2 (bands 11–12), respectively. For multi-temporal fusion of Sentinel-2 and Landsat-8 images, the number of output feature maps was 7 (Landsat-8 bands 1–7), while the number of input feature maps depended on the number of auxiliary Sentinel-2 images used in the fusion network. For instance, if only one Sentinel-2 image was used for downscaling, N1 would be 14 (Sentinel-2 bands 2–4, 8, 11–12 and Landsat-8 bands 1–8).

After the preparation of algorithm inputs and network configurations comes the training stage for image fusion. Assume x is the input of the network and y is the desired output (or known reference data); the training set can then be described as {x^{(i)}, y^{(i)}}_{i=1}^{M}, where M is the number of training samples. Network training learns a mapping function f: \hat{y} = f(x; w), where \hat{y} is the predicted output and w is the set of all parameters, including filter weights and biases. With the help of an optimization function (equation (4)), the fusion network learns to reduce the prediction error. For this deep learning network, the mean square error (equation (4)) is used as the optimization (loss) function:

L = \frac{1}{n} \sum_{i=1}^{n} \left\| y^{(i)} - f(x^{(i)}) \right\|^2    (4)

where n is the number of training samples used in each iteration, L refers to the prediction error, y^{(i)} is the reference (desired output), f(x^{(i)}) is the predicted output, and \|\cdot\| denotes the ℓ2-norm. Stochastic gradient descent with standard backpropagation (Lecun et al., 1998) is used for optimization. The weights are updated by equation (5):

\Delta_{t+1} = \beta \cdot \Delta_t - \alpha \cdot \frac{\partial L}{\partial w_t^l}, \qquad w_{t+1}^l = w_t^l + \Delta_{t+1}    (5)

where α and β are the learning rate and momentum, \partial L / \partial w_t^l is the derivative of the loss with respect to the layer weights, l is the layer index, and \Delta_t refers to the intermediate value in iteration t. Following Dong et al. (2016), α was set to 10^{-4} (except for the last layer, where α was set to 10^{-5} to accelerate convergence), β was set to 0.9, the patch size (the size of a local receptive field, or filter size) was 32 by 32 pixels (Supporting Information Table S1), and the batch size (the number of training samples in one iteration) was 128.
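As a concrete illustration of equations (4) and (5), below is a minimal training-loop sketch using MSE loss and SGD with momentum, with the learning rates, momentum, patch size, and batch size quoted above. The PyTorch framework, the placeholder tensors, and the way the per-layer learning-rate split is expressed are all assumptions about how the published setting could be reproduced, not the authors' code.

```python
import torch
import torch.nn as nn

# A stand-in three-layer network matching the ESRCNN configuration
# (kernel sizes 9/1/5, feature maps 64/32); see the sketch in Section 2.2.
model = nn.Sequential(
    nn.Conv2d(14, 64, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
    nn.Conv2d(32, 7, kernel_size=5, padding=2),
)

# SGD with momentum 0.9; learning rate 1e-4 except 1e-5 for the last layer.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-4},
        {"params": model[2].parameters(), "lr": 1e-4},
        {"params": model[4].parameters(), "lr": 1e-5},
    ],
    momentum=0.9,
)
loss_fn = nn.MSELoss()                   # mean square error, equation (4)

# Placeholder batch: 128 co-registered 32x32 patches.
patches = torch.randn(128, 14, 32, 32)   # stacked Landsat-8 + Sentinel-2 inputs
targets = torch.randn(128, 7, 32, 32)    # reference Landsat-8 bands 1-7

for step in range(100):                  # iteration count is illustrative
    optimizer.zero_grad()
    loss = loss_fn(model(patches), targets)
    loss.backward()                      # backpropagation
    optimizer.step()                     # momentum update, equation (5)
```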


2.3. ATPRK for fusion of Landsat-8 and Sentinel-2 images

ATPRK is among the first methods used to downscale Landsat-8 images from 30 m to 10 m spatial resolution. The ATPRK approach performs the downscaling primarily by introducing an area-to-point kriging (ATPK)-based residual downscaling scheme, which is used synergistically with regression-based overall trend estimation. The ATPRK approach can therefore be regarded as an extension of either regression kriging or ATPK; further technical details of this downscaling approach can be found in Wang et al. (2016). In this study, the effectiveness of the fusion network ESRCNN was benchmarked against ATPRK, which has outperformed other data fusion algorithms (Wang et al., 2017).

2.4. Evaluation metrics for data fusion

The performance of the deep learning-based fusion framework and ATPRK was assessed based on five indicators: the spectral angle mapper (SAM) (Alparone et al., 2007), root-mean-square error (RMSE), relative global-dimensional synthesis error (ERGAS) (Ranchin and Wald, 2000), correlation coefficient (CC), and universal image quality index (UIQI) (Wang and Bovik, 2002). These indicators have been widely used in assessing the performance of data fusion algorithms such as ATPRK (Wang et al., 2017).

The SAM (equation (6)) measures the spectral angle between two vectors:

\mathrm{SAM}(v, \hat{v}) = \cos^{-1}\left( \frac{\langle v, \hat{v} \rangle}{\|v\|_2 \, \|\hat{v}\|_2} \right)    (6)

where v is the pixel vector formed by the reference image and \hat{v} is the vector formed by the fused image. SAM is first calculated on a per-pixel basis, and then all SAM values are averaged to a single value for the whole image.

The RMSE (equation (7)) is used to evaluate the overall spectral differences between the reference image R and the fused image F:

\mathrm{RMSE}(R, F) = \sqrt{ \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - F(i, j)]^2 }    (7)

The ERGAS provides a global quality evaluation of the fusion result and is calculated in equation (8):

\mathrm{ERGAS} = 100 \, \frac{h}{l} \sqrt{ \frac{1}{k} \sum_{i=1}^{k} \left[ \mathrm{RMSE}(i) / \mathrm{Mean}(i) \right]^2 }    (8)

where h/l is the ratio between the spatial resolution of the fused Landsat-8 image and that of the original Landsat-8 image, k is the number of bands of the fused image, Mean(i) is the mean value of differences between the ith band of the reference image and that of the fused image, and RMSE(i) indicates the root mean squared error of the ith band between the reference and fused images.

The CC (equation (9)) indicates the spectral correlation between the reference image R and the fused image F:

\mathrm{CC}(R, F) = \frac{ \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - \mu(R)][F(i, j) - \mu(F)] }{ \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - \mu(R)]^2 \; \sum_{i=1}^{M} \sum_{j=1}^{N} [F(i, j) - \mu(F)]^2 } }    (9)

The UIQI (equation (10)) is a global fusion performance indicator:

Q(R, F) = \frac{ 4 \sigma_{RF} \, \mu(R) \, \mu(F) }{ (\sigma_R^2 + \sigma_F^2)(\mu_R^2 + \mu_F^2) }    (10)

where μ refers to the mean value and σ represents the standard deviation. To calculate UIQI for the whole image, a sliding window was used to increase the differentiation capability and measure the local distortion of a fused image; the final UIQI was acquired by averaging all Q values over the sliding windows. The ideal values of SAM, ERGAS, RMSE, CC, and UIQI are 0, 0, 0, 1, and 1, respectively, attained if and only if R_i = F_i for all i = 1, 2, …, N (where N is the number of pixels).
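For reference, a NumPy sketch of the five metrics in equations (6)–(10) is given below. It assumes reflectance arrays of shape (bands, rows, cols), computes SAM as the arccosine of the normalized inner product, evaluates UIQI globally (the paper averages it over sliding windows), and takes Mean(i) in ERGAS as the mean of the reference band, a common convention; treat it as an illustrative implementation rather than the authors' evaluation code.

```python
import numpy as np

def rmse(ref: np.ndarray, fused: np.ndarray) -> float:
    """Equation (7): root-mean-square error."""
    return float(np.sqrt(np.mean((ref - fused) ** 2)))

def cc(ref: np.ndarray, fused: np.ndarray) -> float:
    """Equation (9): correlation coefficient between reference and fused images."""
    r, f = ref - ref.mean(), fused - fused.mean()
    return float((r * f).sum() / np.sqrt((r ** 2).sum() * (f ** 2).sum()))

def uiqi_global(ref: np.ndarray, fused: np.ndarray) -> float:
    """Equation (10), computed globally instead of per sliding window."""
    mu_r, mu_f = ref.mean(), fused.mean()
    var_r, var_f = ref.var(), fused.var()
    cov = ((ref - mu_r) * (fused - mu_f)).mean()
    return float(4 * cov * mu_r * mu_f / ((var_r + var_f) * (mu_r ** 2 + mu_f ** 2)))

def sam_degrees(ref: np.ndarray, fused: np.ndarray) -> float:
    """Equation (6): per-pixel spectral angle, averaged over the image, in degrees."""
    dot = (ref * fused).sum(axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0)
    angles = np.arccos(np.clip(dot / norms, -1.0, 1.0))
    return float(np.degrees(angles).mean())

def ergas(ref: np.ndarray, fused: np.ndarray, h_over_l: float) -> float:
    """Equation (8): relative global-dimensional synthesis error."""
    terms = [(rmse(ref[b], fused[b]) / ref[b].mean()) ** 2 for b in range(ref.shape[0])]
    return float(100.0 * h_over_l * np.sqrt(np.mean(terms)))

# Example with random (bands, rows, cols) reflectance arrays;
# h/l = 30/90 for the degraded 90 m -> 30 m experiment described later.
ref = np.random.rand(7, 64, 64)
fus = ref + 0.01 * np.random.randn(7, 64, 64)
print(rmse(ref, fus), cc(ref, fus), sam_degrees(ref, fus), ergas(ref, fus, 30 / 90))
```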

3. Study area, satellite data, and synthetic data sets

The study area covers parts of Shijiazhuang, the capital and largest city of North China's Hebei Province. It is a city of agriculture and industry and has experienced dramatic growth in population and urban extent since the founding of the People's Republic of China in 1949. Two areas of Shijiazhuang, shown in Fig. 4 and Fig. 5, were used for training and testing the fusion network, respectively. The two areas contain abundant landforms (e.g., urban areas, cropland, and forest) and exhibited significant changes in spectral reflectance within the study period (June 15 to July 7, 2017). Fig. 4 shows the area with a spatial extent of 19.32 km × 25.20 km; i.e., the 10 m and 20 m Sentinel-2 bands have 1932 × 2520 and 966 × 1260 pixels, respectively, while the 30 m and 15 m Landsat-8 bands have 644 × 830 and 1288 × 1660 pixels, respectively. Fig. 5 shows the study area with a spatial extent of 12 km × 12 km; i.e., the 10 m and 20 m Sentinel-2 bands have 1200 × 1200 and 600 × 600 pixels, respectively, while the 30 m and 15 m Landsat-8 bands have 400 × 400 and 800 × 800 pixels, respectively. Three subareas (1, 2, and 3) shown in Fig. 4 and two subareas in Fig. 5 were selected for visual and quantitative assessment of the fusion network in downscaling Landsat-8 images from low to high spatial resolution.

Sentinel-2 is an Earth observation mission of the European Space Agency (ESA) developed under the European Union Copernicus Programme to acquire optical imagery at high spatial resolution. In this study, Sentinel-2 images acquired on June 20, June 27, and July 7, 2017 were used as auxiliary inputs for the fusion framework. These Sentinel-2 images were from the collection of Level-1C products, which provide top-of-atmosphere (TOA) reflectance with sub-pixel multispectral registration accuracy in the UTM/WGS84 projection system (Drusch et al., 2012; Claverie et al., 2017, 2018). Two scenes of NASA/USGS Landsat-8 TOA reflectance data (L1TP) acquired on June 15 and July 1, 2017 were also collected. The Landsat-8 images were processed for standard terrain correction in the UTM/WGS84 projection system (USGS, 2015). Both the Landsat-8 and Sentinel-2 images were captured under clear-sky conditions, and these dates were selected to illustrate the effectiveness of the fusion network in yielding downscaled Landsat-8 images at 10 m spatial resolution that can be combined with Sentinel-2 images to generate more frequent and consistent observations.

Naturally, for the self-adaptive fusion, the fusion network can downscale the Sentinel-2 bands 11 and 12 at 20 m spatial resolution to 10 m with the Sentinel-2 bands 2–4 and 8 as inputs; however, reference data at 10 m are not available to evaluate the performance of the fusion network. Thus, in this study, synthetic data sets were constructed per Wald's protocol (Wald et al., 1997). Specifically, the Sentinel-2 spectral bands were degraded by a factor of 2 using a Gaussian model-based degradation function (also known as the point spread function, PSF).
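The Wald-protocol degradation can be sketched as a Gaussian blur followed by decimation. The snippet below assumes scipy's gaussian_filter and an arbitrary blur width tied to the scale factor; it only illustrates the idea of simulating a coarser PSF and a nearest-neighbor resample back to the fine grid, not the exact degradation model used by the authors.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(band: np.ndarray, factor: int, sigma_per_factor: float = 0.5) -> np.ndarray:
    """Simulate a coarser sensor per Wald's protocol: Gaussian PSF + decimation.

    band   : 2-D reflectance array at the fine resolution.
    factor : integer scale factor (2 for 10 m -> 20 m, 3 for 30 m -> 90 m).
    sigma_per_factor : assumed width of the Gaussian PSF per unit of scale.
    """
    blurred = gaussian_filter(band, sigma=sigma_per_factor * factor)
    return blurred[::factor, ::factor]          # decimate to the coarse grid

def upsample_nearest(band: np.ndarray, factor: int) -> np.ndarray:
    """Nearest neighbor interpolation back to the fine grid (network input size)."""
    return np.kron(band, np.ones((factor, factor), dtype=band.dtype))

# Example: degrade a 10 m Sentinel-2 band to 20 m, then resample to 10 m so
# that it matches the spatial extent and pixel size of the other inputs.
b4_10m = np.random.rand(1932, 2520).astype(np.float32)
b4_20m = degrade(b4_10m, factor=2)
b4_back_to_10m = upsample_nearest(b4_20m, factor=2)
print(b4_20m.shape, b4_back_to_10m.shape)   # (966, 1260) (1932, 2520)
```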

Fig. 4. Landsat-8 and Sentinel-2 data sets used in the training phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2, and 3) were used for visual assessment. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 5. Landsat-8 and Sentinel-2 data sets used in the testing phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2) were used for evaluating the fusion network performance. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)



Fig. 6. The results of self-adaptive fusion on June 20, June 27, and July 7, 2017 for the subarea 1 in Fig. 4. (a) The degraded Sentinel-2 bands at 40 m, (b) the reference Sentinel-2 bands at 20 m, (c) the fused Sentinel-2 bands at 20 m, and (d) the fused Sentinel-2 bands at 10 m.

The resampled Sentinel-2 bands 2–4 and 8 at 20 m and bands 11–12 at 40 m were then fused in the proposed fusion network to yield bands 11–12 at 20 m, and the original Sentinel-2 bands 11–12 at 20 m were used for network training and validation. Note that the input data sets for the fusion framework should have the same size (spatial extent and pixel size) as each other; in other words, the Sentinel-2 bands at 40 m were interpolated (nearest neighbor interpolation) to 20 m and used as inputs for the proposed fusion algorithm. With data augmentation (rotation and scaling), 36864 patches were randomly selected for training and 9216 patches were randomly selected for validating the fusion network (further summarized in Supporting Information Table S2).

Following the self-adaptive fusion of Sentinel-2, the Sentinel-2 spectral bands 2–4, 8, 11, and 12 at 10 m were used as auxiliary data for downscaling the Landsat-8 spectral bands from 30 m to 10 m. All input images were degraded by a factor of 3 since reference Landsat-8 bands at 10 m were not available. Specifically, the Sentinel-2 bands at 10 m and the Landsat-8 spectral and panchromatic bands at 30 m and 15 m were resampled to 30 m, 90 m, and 45 m, respectively. These resampled data sets were then used as inputs to the fusion framework to yield Landsat-8 bands 1–7 at 30 m, and the original Landsat-8 bands 1–7 at 30 m were used for training and validating the fusion network. As in the self-adaptive fusion, all input images should have the same size (spatial extent and pixel size). In the multi-temporal fusion step, 12160 patches were randomly selected for training and 3040 patches were randomly selected for validating the fusion network (further summarized in Supporting Information Table S2).
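To show how these inputs are assembled, below is a small sketch that stacks the nearest-neighbor-resampled Landsat-8 bands with the auxiliary Sentinel-2 bands and cuts co-registered 32 × 32 training patches. The array names, the random patch sampler, and the single-auxiliary-image case (14 input channels) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patches(inputs: np.ndarray, target: np.ndarray,
                   n_patches: int, size: int = 32):
    """Randomly cut co-registered patches from stacked inputs and the target."""
    _, rows, cols = inputs.shape
    x, y = [], []
    for _ in range(n_patches):
        r = rng.integers(0, rows - size)
        c = rng.integers(0, cols - size)
        x.append(inputs[:, r:r + size, c:c + size])
        y.append(target[:, r:r + size, c:c + size])
    return np.stack(x), np.stack(y)

# Placeholder arrays, all already resampled to a common 10 m grid
# (here 1200 x 1200 pixels, matching the 12 km x 12 km test area).
landsat_bands_1_8 = np.random.rand(8, 1200, 1200).astype(np.float32)   # resampled to 10 m
sentinel_bands = np.random.rand(6, 1200, 1200).astype(np.float32)      # bands 2-4, 8, 11-12
reference = np.random.rand(7, 1200, 1200).astype(np.float32)           # Landsat-8 bands 1-7

stacked = np.concatenate([landsat_bands_1_8, sentinel_bands], axis=0)  # 14 channels
x_train, y_train = sample_patches(stacked, reference, n_patches=128)
print(x_train.shape, y_train.shape)   # (128, 14, 32, 32) (128, 7, 32, 32)
```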


Table 2. Comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11–12 from 40 m to 20 m on three dates (June 20, June 27, and July 7, 2017). The bold number indicates the best value for accuracy assessment.

                         June 20, 2017       June 27, 2017       July 7, 2017
Metric         Band      ATPRK    ESRCNN     ATPRK    ESRCNN     ATPRK    ESRCNN
RMSE           11        0.0178   0.0065     0.0316   0.0095     0.0304   0.0104
               12        0.0231   0.0093     0.0324   0.0110     0.0315   0.0119
CC             11        0.9786   0.9962     0.9804   0.9975     0.9769   0.9963
               12        0.9772   0.9954     0.9813   0.9970     0.9793   0.9960
UIQI           11        0.9556   0.9920     0.9575   0.9948     0.9540   0.9929
               12        0.9511   0.9918     0.9590   0.9942     0.9567   0.9928
ERGAS                    5.1905   2.0211     5.7884   1.8727     5.6605   2.0516
SAM (degree)             1.3528   0.6515     1.4396   0.7435     1.5220   0.7420

Table 3. Comparisons between the ESRCNN-fused and reference Landsat-8 images at 30 m.

Date            Band     RMSE     CC       UIQI     ERGAS    SAM (degree)
June 15, 2017   Band 1   0.0243   0.9860   0.9710   1.1817   0.7042
                Band 2   0.0242   0.9869   0.9737
                Band 3   0.0260   0.9861   0.9726
                Band 4   0.0258   0.9875   0.9755
                Band 5   0.0405   0.9569   0.9415
                Band 6   0.0263   0.9848   0.9738
                Band 7   0.0263   0.9864   0.9757
July 01, 2017   Band 1   0.0193   0.9899   0.9683   1.0741   0.7904
                Band 2   0.0192   0.9904   0.9732
                Band 3   0.0210   0.9894   0.9731
                Band 4   0.0217   0.9897   0.9749
                Band 5   0.0399   0.9599   0.9420
                Band 6   0.0254   0.9830   0.9698
                Band 7   0.0254   0.9854   0.9727

4. Results

4.1. Self-adaptive fusion of Sentinel-2 images

Fig. 6 shows the self-adaptive fusion results of the Sentinel-2 data acquired on June 20, June 27, and July 7, 2017 for subarea 1 in Fig. 4. Visually, the fused 20 m Sentinel-2 bands 11–12 (Fig. 6 (c)) were almost the same as the original 20 m Sentinel-2 bands 11–12 (Fig. 6 (b)), indicating the good performance of the proposed fusion framework. In addition, the fusion framework revealed much finer spatial details in the fused 10 m Sentinel-2 bands 11–12 (Fig. 6 (d)). Table 2 presents the accuracy assessment of the self-adaptive fusion results for the training area (Fig. 4). Clearly, ESRCNN outperformed ATPRK in downscaling Sentinel-2 bands 11–12 from 40 m to 20 m spatial resolution. The evaluation indicators RMSE, CC, and UIQI provided by ESRCNN were much closer to their corresponding ideal values (0, 1, and 1) than those provided by ATPRK. SAM and ERGAS between the ESRCNN-downscaled and reference images were approximately 0.8 and 2.0 for the three dates, less than the approximately 2.0 and 5.6 between the ATPRK-downscaled and reference images. The two fusion approaches were then used to generate the fused 10 m Sentinel-2 bands 11–12, which were subsequently used as inputs for downscaling Landsat-8 images from 30 m spatial resolution to 10 m.

Fig. 7. The multi-temporal fusion results for the subarea 1 in Fig. 4. (a)–(c) are the resampled Sentinel-2 images at 30 m on June 20, June 27, and July 7, 2017. (d)–(f) are the resampled Landsat image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on June 15, 2017. (g)–(i) are the resampled Landsat image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on July 1, 2017.


Fig. 8. The multi-temporal fusion results for the subarea 2 in Fig. 4 (bands 5, 4, 3 as RGB). (a)–(c) are three Sentinel-2 images at 10 m acquired on June 20, June 27, and July 7, 2017. (d)–(f) are the original Landsat-8 reference at 30 m and the fused Landsat-8 image at 30 m and 10 m on June 15, 2017. (g)–(i) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on July 1, 2017. The yellow circle highlights the land use and land cover (LULC) changes due to planting of crops (a–c). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

4.2. Multi-temporal fusion of Landsat-8 and Sentinel-2 images

Fig. 7 shows the multi-temporal fusion results for subarea 1 in Fig. 4. The Landsat-8 images on June 15 and July 1, 2017 (Fig. 7 (e) and (h)) were individually combined with the three Sentinel-2 images on June 20, June 27, and July 7, 2017 (Fig. 7 (a), (b), and (c)) to yield fusion results. Visually, the resampled Landsat-8 images at 90 m (Fig. 7 (d) and (g)) were sharply improved in the fused Landsat-8 images at 30 m (Fig. 7 (f) and (i)), which were visually similar to the original Landsat-8 images at 30 m (Fig. 7 (e) and (h)). Table 3 lists the values of the five indices for accuracy assessment over the entire training area. For the fused images of these two dates, the fusion network produced an RMSE less than 0.05, CC larger than 0.95, UIQI greater than 0.94, ERGAS less than 1.2, and SAM less than 0.8. These metrics clearly showed that the fused images were spectrally and spatially consistent with the original images.

The proposed fusion framework also yielded good performance in downscaling Landsat-8 images from low to high spatial resolution in areas experiencing land use and land cover (LULC) changes. For example, the yellow circles in Fig. 8 (a) and (c) highlight LULC changes due to the planting of crops within subarea 2 from June 20 to July 7, 2017. Although the Sentinel-2 image acquired on July 7, 2017 was input to the fusion framework, the final fused image for June 15, 2017 (Fig. 8 (f)) did not incorporate these LULC changes. In addition, the Landsat-8 image on July 1, 2017 was more similar to the Sentinel-2 image on July 7, 2017 than to the other two Sentinel-2 images. Despite the obvious LULC changes among the three Sentinel-2 images used as inputs to the fusion network, the final fused image at 30 m on July 1, 2017 still did not incorporate these LULC changes (Fig. 8 (i)). These findings suggest that the fusion network can identify areas experiencing LULC changes and then remove features associated with spectral changes that are not consistent with the target image.

4.3. Image fusion with a flexible number of auxiliary Sentinel-2 images

In the previous section, the downscaling of Landsat-8 images was accomplished with three auxiliary Sentinel-2 images. However, it may not always be possible to have three or more Sentinel-2 images as inputs to the fusion framework due to clouds, shadows, and snow contamination. Thus, the impact of the number of auxiliary Sentinel-2 images on the downscaling performance was evaluated in this section.


Table 4. The impact of the number of auxiliary Sentinel-2 images input to the fusion network on the downscaling performance. The bold number in each row indicates the best value for accuracy assessment.

Metric        Band     D0       D1       D2       D3       D1+D2    D2+D3    D1+D3    D1+D2+D3
RMSE          Band 1   0.0223   0.0147   0.0154   0.0170   0.0144   0.0147   0.0145   0.0141
              Band 2   0.0223   0.0143   0.0151   0.0166   0.0138   0.0142   0.0139   0.0136
              Band 3   0.0221   0.0159   0.0171   0.0174   0.0154   0.0159   0.0156   0.0153
              Band 4   0.0225   0.0158   0.0170   0.0174   0.0153   0.0158   0.0156   0.0151
              Band 5   0.0610   0.0265   0.0286   0.0347   0.0245   0.0264   0.0264   0.0240
              Band 6   0.0325   0.0185   0.0202   0.0222   0.0175   0.0186   0.0181   0.0173
              Band 7   0.0310   0.0180   0.0201   0.0222   0.0173   0.0184   0.0178   0.0172
CC            Band 1   0.9865   0.9942   0.9936   0.9923   0.9945   0.9942   0.9944   0.9947
              Band 2   0.9870   0.9947   0.9941   0.9829   0.9951   0.9948   0.9950   0.9952
              Band 3   0.9883   0.9940   0.9930   0.9928   0.9944   0.9940   0.9942   0.9945
              Band 4   0.9888   0.9945   0.9936   0.9934   0.9948   0.9946   0.9946   0.9950
              Band 5   0.9034   0.9826   0.9796   0.9697   0.9851   0.9827   0.9827   0.9856
              Band 6   0.9720   0.9910   0.9893   0.9871   0.9920   0.9909   0.9914   0.9922
              Band 7   0.9781   0.9926   0.9909   0.9888   0.9933   0.9923   0.9929   0.9934
UIQI          Band 1   0.9619   0.9813   0.9802   0.9755   0.9822   0.9813   0.9820   0.9829
              Band 2   0.9670   0.9857   0.9843   0.9807   0.9866   0.9858   0.9863   0.9871
              Band 3   0.9712   0.9851   0.9832   0.9825   0.9865   0.9853   0.9858   0.9869
              Band 4   0.9733   0.9876   0.9855   0.9847   0.9884   0.9876   0.9876   0.9886
              Band 5   0.8503   0.9751   0.9706   0.9565   0.9791   0.9755   0.9754   0.9798
              Band 6   0.9510   0.9841   0.9823   0.9791   0.9864   0.9851   0.9853   0.9869
              Band 7   0.9614   0.9868   0.9842   0.9810   0.9882   0.9867   0.9874   0.9885
ERGAS                  3.0491   1.7754   1.9167   2.1033   1.6972   1.7785   1.7474   1.6768
SAM (degree)           4.6004   2.4788   2.5986   2.9878   2.3462   2.4563   2.4394   2.3162

Note: A total of eight different data sets were prepared as inputs to the fusion network: (1) D0 (Landsat-8 image on July 1, 2017 and no auxiliary Sentinel-2 image); (2) D1 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 27, 2017); (3) D2 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on July 7, 2017); (4) D3 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 20, 2017); (5) D1+D2; (6) D2+D3; (7) D1+D3; (8) D1+D2+D3.

Here a series of data sets with a varying number of Sentinel-2 images as inputs to the fusion framework was prepared: (1) D0 (Landsat-8 image on July 1, 2017 and no auxiliary Sentinel-2 image); (2) D1 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image, acquired on June 27, 2017); (3) D2 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image, acquired on July 7, 2017); (4) D3 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image, acquired on June 20, 2017); (5) D1+D2; (6) D2+D3; (7) D1+D3; (8) D1+D2+D3. Note that even if no auxiliary Sentinel-2 image is input into the fusion framework (D0), the fusion network can still reconstruct a high-resolution image at 10 m from the Landsat-8 image at 30 m.

Table 4 presents the accuracy assessment of the downscaling results from the eight different data sets against the original images at 30 m. The accuracy assessment was performed for the entire training area (Fig. 4). The fused image generated from D0 had the least accurate results for all bands. Since the downscaling scheme D0 solely used the Landsat-8 PAN band without any Sentinel-2 information, the results suffered from an inherent pan-sharpening problem (Alparone et al., 2007; P. Liu et al., 2016): the PAN band is captured only at 15 m, and its wavelength range covers only Landsat-8 bands 2–4, not bands 5–7.

The performance of the fusion network degraded as the time interval between the Sentinel-2 and Landsat-8 images increased. This finding was evidenced by the more accurate fusion results from D1 than from D2 or D3 (Table 4) when only one auxiliary Sentinel-2 image was used for downscaling the Landsat-8 image; the time intervals between the Landsat-8 image and the Sentinel-2 images in D1, D2, and D3 were 4, 6, and 11 days, respectively. When two auxiliary Sentinel-2 images were used, the performance of the fusion network improved as the total time interval between the Landsat-8 image and the two Sentinel-2 images decreased. Table 4 shows that the fused image generated from D1+D2 (total time interval of 10 days) was more accurate than those generated from D2+D3 (17 days) and D1+D3 (15 days). Additionally, the fused image generated from D1+D3 had a higher accuracy level than that generated from D2+D3. Lastly, the fusion accuracy improved as the number of auxiliary Sentinel-2 images input into the fusion network increased.

Further visual examination of the downscaled Landsat-8 images at 10 m suggested that the performances of the fusion networks trained with the eight different data sets were not consistent in areas experiencing LULC changes. Fig. 9 shows the LULC changes caused by temporary housing (houses temporarily built near construction sites, i.e., the bright dots marked by the red rectangle) in subarea 3. As shown in Fig. 9, the number of temporary houses (i.e., bright dots) in the Landsat-8 image on July 1, 2017 and in the Sentinel-2 images on June 20, June 27, and July 7, 2017 was 4, 2, 3, and 5, respectively. For the fused image derived from D1 (Fig. 10 (a)), the number of bright dots was the same as that in Fig. 9 (a). However, the dots in the green rectangle (Fig. 10 (a)) were relatively blurry compared to the other dots in the fused image. The number of bright dots in the fused image derived from D2 was five, while the top two dots in the fused image derived from D3 were relatively blurry. In contrast, the fusion results derived from two auxiliary Sentinel-2 images (D1+D2, D2+D3, and D1+D3) were much better, although blurry dots were still observed. The best fusion result was observed for the image derived from D1+D2+D3 in both the qualitative (visual) and quantitative assessments (Table 4). These findings revealed that the fusion network could learn LULC changes given a sufficient number of auxiliary Sentinel-2 images as inputs.

Fig. 9. Subarea 3 in Fig. 4 experienced changes in temporary housing (bright dots; band 2 gray-scale images are shown). The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image. The number of bright dots (temporary housing) in the Landsat-8 image on (c) July 1, 2017, and the three Sentinel-2 images on (a) June 20, (b) June 27, and (d) July 7, 2017 was 4, 2, 3, and 5, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)


Fig. 10. The downscaled Landsat-8 image at 10 m (bands 7, 6, 2 as RGB) on July 1, 2017 derived from eight different data sets, including (a) D1, (b) D2, (c) D3, (d) D1+D2, (e) D2+D3, (f) D1+D3, (g) D1+D2+D3. The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

4.4. Comparison between ATPRK and ESRCNN for image fusion

In the previous sections, we described the steps to train the fusion network and presented its performance in downscaling Sentinel-2 and Landsat-8 images for the training area (Fig. 4). In this section, the trained network was applied to a different domain within Shijiazhuang (12 km by 12 km, as shown in Fig. 5). The performance of the fusion network in downscaling Sentinel-2 bands 11–12 from 40 m to 20 m and Landsat-8 images from 30 m to 10 m for this study area was benchmarked against ATPRK. Table 5 shows the comparisons between ATPRK and ESRCNN for downscaling the Sentinel-2 bands 11–12, acquired on June 20, June 27, and July 7, 2017, from 40 m to 20 m. Clearly, the evaluation results revealed that ESRCNN outperformed ATPRK in the self-adaptive fusion of Sentinel-2 images. The downscaled Sentinel-2 bands 11–12 at 10 m (resampled to 30 m) were further used in downscaling the Landsat-8 images from 90 m to 30 m.

Table 5. Comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11–12 from 40 m to 20 m for the test area (Fig. 5). The bold number indicates the best value for accuracy assessment.

                         June 20, 2017       June 27, 2017       July 07, 2017
Metric         Band      ATPRK    ESRCNN     ATPRK    ESRCNN     ATPRK    ESRCNN
RMSE           11        0.0252   0.0135     0.0424   0.0218     0.0468   0.0248
               12        0.0325   0.1869     0.0480   0.0278     0.0571   0.0322
CC             11        0.9531   0.9822     0.9545   0.9842     0.9491   0.9808
               12        0.9479   0.9774     0.9484   0.9764     0.9397   0.9721
UIQI           11        0.9143   0.9693     0.9156   0.9713     0.9081   0.9662
               12        0.9088   0.9614     0.9071   0.9584     0.8964   0.9548
ERGAS                    5.8925   3.3045     6.6753   3.6917     7.2484   3.9988
SAM (degree)             1.3316   0.8220     1.6193   1.0906     1.9884   1.1695

Four groups of data sets were used to compare the performance of the two algorithms in downscaling Landsat-8 images: (1) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 20, 2017; (2) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 27, 2017; (3) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 27, 2017; and (4) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 20, 2017. Each group included only one Sentinel-2 image since ATPRK can take only one Sentinel-2 image as input for the downscaling process; the use of only one Sentinel-2 image in each group therefore ensured a fair comparison between ATPRK and ESRCNN. The original Landsat-8 and Sentinel-2 images in each group were resampled by a factor of 3 and then input into the two algorithms to yield fused images at 30 m. The temporal intervals between the Sentinel-2 and Landsat-8 images in the four groups were 5, 12, 4, and 11 days, respectively.

The proposed ESRCNN framework outperformed ATPRK in downscaling the Landsat-8 images based on all evaluation metrics (Table 6). Fig. 11 and Fig. 12 further show that the fused reflectance yielded by ESRCNN matched the reference reflectance more closely than that yielded by ATPRK. In addition, the evaluation metrics for Group 1 (or 3) were better than those for Group 2 (or 4), further stressing the importance of the temporal interval in image acquisition time between the auxiliary Sentinel-2 and target Landsat-8 images (as shown in Section 4.3).


Table 6. Comparisons between ATPRK and ESRCNN for downscaling the Landsat-8 images on June 15 and July 1, 2017 from 90 m to 30 m for the entire test area (Fig. 5). The bold numbers in each row indicate the best values for accuracy assessment.

                      Group 1 (June 15, 2017)  Group 2 (June 15, 2017)  Group 3 (July 1, 2017)   Group 4 (July 1, 2017)
Metric       Band     ATPRK      ESRCNN        ATPRK      ESRCNN        ATPRK      ESRCNN        ATPRK      ESRCNN
RMSE         Band 1   –          0.0426        –          0.0434        –          0.0317        –          0.0319
             Band 2   0.0800     0.0426        0.0843     0.0442        0.0694     0.0338        0.0732     0.0349
             Band 3   0.0864     0.0444        0.0908     0.0468        0.0777     0.0392        0.0776     0.0421
             Band 4   0.0975     0.0457        0.09555    0.0491        0.0863     0.0406        0.0806     0.0449
             Band 5   0.0851     0.0415        0.0835     0.0452        0.0903     0.0524        0.0941     0.0448
             Band 6   0.0827     0.0458        0.0824     0.0466        0.0874     0.0462        0.0912     0.0492
             Band 7   0.0901     0.0461        0.0904     0.0490        0.0877     0.0441        0.0915     0.0498
             Mean     0.0870     0.0455        0.0878     0.0463        0.0831     0.0412        0.0847     0.0425
CC           Band 1   –          0.9513        –          0.9493        –          0.9712        –          0.9708
             Band 2   0.9131     0.9539        0.9089     0.9504        0.9194     0.9679        0.9139     0.9657
             Band 3   0.9130     0.9547        0.9091     0.9493        0.9171     0.9601        0.9200     0.9539
             Band 4   0.9072     0.9581        0.9118     0.9511        0.9078     0.9583        0.9197     0.9483
             Band 5   0.8251     0.9395        0.8320     0.9236        0.8443     0.9188        0.8358     0.9412
             Band 6   0.9100     0.9520        0.9102     0.9495        0.8904     0.9432        0.8831     0.9347
             Band 7   0.9159     0.9588        0.9152     0.9525        0.9025     0.9508        0.8966     0.9365
             Mean     0.8974     0.9469        0.8979     0.9465        0.8969     0.9529        0.8948     0.9502
UIQI         Band 1   –          0.9195        –          0.9134        –          0.9224        –          0.9173
             Band 2   0.8372     0.9224        0.8259     0.9136        0.8180     0.9260        0.8050     0.9163
             Band 3   0.8335     0.9256        0.8228     0.9139        0.8298     0.9256        0.8322     0.9095
             Band 4   0.8178     0.9286        0.8243     0.9137        0.8184     0.9279        0.8379     0.9070
             Band 5   0.7401     0.8949        0.7490     0.8790        0.7429     0.8486        0.7290     0.8935
             Band 6   0.8353     0.9204        0.8355     0.9150        0.8105     0.9090        0.7982     0.8936
             Band 7   0.8360     0.9294        0.8348     0.9164        0.8239     0.9229        0.8130     0.8969
             Mean     0.8166     0.9115        0.8154     0.9093        0.8073     0.9118        0.8025     0.9049
ERGAS                 12.7934    6.8257        12.9690    6.9992        12.6265    6.2570        12.8198    6.5302
SAM (degree)          8.5484     6.8257        8.6148     4.6881        8.4717     4.3759        8.6556     4.2710

Note: Four groups of data sets were used: (1) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 20, 2017, (2) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 27, 2017, (3) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 27, 2017, and (4) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 20.

Examination of the downscaled images yielded by ATPRK revealed spectral distortions. For example, over-sharpened building boundaries were observed in Fig. 13. As suggested by the distribution of pixel values from all spectral bands in Fig. 14, ATPRK tended to change the distribution of reflectance values: compared to the distribution curves in Fig. 14 (a, b, d, e), those in Fig. 14 (c, f) were shorter and flatter (based on the number of pixels at each gray level). In contrast, the fusion network was able to preserve the data distribution of the original image. Table 7 shows that ESRCNN also exhibited better accuracy than ATPRK in downscaling Landsat-8 images over areas experiencing LULC changes (mostly due to changes in vegetation coverage, as shown in Fig. 15 (a) and (b)). Both ESRCNN and ATPRK can identify LULC changes and reproduce the original spectral information when downscaling Landsat-8 images (Fig. 15 (a), (c), and (d)); however, the fused image generated by ATPRK is over-sharpened compared to that generated by ESRCNN (Fig. 15 (c) and (d)).

5. Discussion

Presently, there is a need for satellite data of higher temporal resolution than can be provided by a single imaging sensor (e.g., Landsat-8 or Sentinel-2) to better understand land surface changes and their causal agents (Claverie et al., 2018). The synergistic use of Landsat-8 and Sentinel-2 data provides a promising avenue to satisfy this scientific demand, yet it requires bridging the spatial resolution gap between the two sensors. To this end, this study presented a deep learning-based fusion approach to downscale Landsat-8 images from 30 m spatial resolution to 10 m. The downscaled Landsat-8 images at 10 m spatial resolution, together with Sentinel-2 data at 10 m, can provide temporally dense and routine information at a short time interval suitable for environmental applications such as crop monitoring (Veloso et al., 2017).

Compared to data fusion algorithms such as STARFM and its variants (Gao et al., 2006; Hilker et al., 2009; Zhu et al., 2010, 2016), the fusion network does not require as input one or more pairs of satellite images acquired on the same date, which would be a harsh prerequisite to fulfil when blending Landsat-8 and Sentinel-2 images. Instead, the proposed fusion network can accommodate a flexible number of auxiliary Sentinel-2 images to enhance the target Landsat-8 image to a higher spatial resolution. This flexibility makes it possible for the fusion network to leverage several Sentinel-2 images acquired on dates relatively close to the target Landsat-8 image for reflectance prediction. ATPRK, as a geostatistical approach for data fusion (Wang et al., 2016, 2017), does not have this flexibility to digest several auxiliary Sentinel-2 images simultaneously because of the computational difficulty of calculating a cokriging matrix containing spectral bands from all input images for cross-semivariogram modeling. This difference in the flexibility to use multi-temporal Sentinel-2 images also explains the better performance of the fusion network in generating downscaled Landsat-8 images at high spatial resolution (Table 5). Examination of the downscaled Landsat-8 images further highlights the advantage of the fusion network over ATPRK in preserving the distribution of spectral values of the original image. The high-fidelity reflectance values and the preservation of the statistical distribution of reflectance values provided by the fusion network hold important merits for time series analysis of satellite images. For example, Zhu et al. (2014) and Zhu et al. (2016) showed that continuous land cover change detection and classification is sensitive to anomalous observations that may affect the accurate modeling of time series reflectance values.


Fig. 11. Comparisons between the reference and fused reflectance produced by ATPRK and ESRCNN for bands 2–7 on June 15, 2017. (a)–(f) show fused reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 20, 2017. (g)–(l) show fused reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 20, 2017. The color scheme indicates point density. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)



Fig. 12. Comparisons between the reference and fused reflectance produced by ATPRK and ESRCNN for bands 2–7 on July 1, 2017. (a)–(f) show fused reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 27, 2017. (g)–(l) show fused reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 27, 2017. The color scheme indicates point density. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)



Fig. 13. The multi-temporal fusion results for the subarea 1 indicated by the red rectangle in Fig. 5 (bands 4, 3, 2 as RGB). The original Landsat-8 images at 30 m on (a) June 15 and (e) July 1, 2017. The resampled Landsat-8 images at 90 m on (b) June 15 and (f) July 1. The fused Landsat-8 images (ESRCNN) at 30 m on (c) June 15 and (g) July 1, 2017. The fused Landsat-8 images (ATPRK) at 30 m on (d) June 15 and (h) July 1, 2017. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

In practice, multiple auxiliary Sentinel-2 images acquired close in time to the target Landsat-8 image may be available as inputs for the fusion network. The performance of the fusion network in downscaling Landsat-8 images from 30 m spatial resolution to 10 m improved as the number of auxiliary Sentinel-2 images input into the fusion network increased. In addition, the temporal interval (i.e., difference in acquisition date) between the target Landsat-8 image and the auxiliary Sentinel-2 images also affected the downscaling performance of the fusion network (Table 5). For example, when only one auxiliary Sentinel-2 image was fed into the fusion network, the best downscaling performance was achieved for the data set D1 (Table 5), in which the temporal interval between the target Landsat-8 image and the Sentinel-2 image was the smallest compared to that in D2 and D3. Even if no auxiliary Sentinel-2 images are available (e.g., because of contamination by clouds, shadows, or snow), the fusion network can still yield a high-resolution image at 10 m for a target date with a high accuracy level (Table 4). As such, the proposed fusion network is more suitable than existing fusion algorithms for practical applications requiring dense time series of images. In addition, the fusion network can generate the Landsat-8 band 1 at 10 m by learning complex relationships across all spectral bands within and among images.

One critical issue in the spatiotemporal fusion of satellite images is that LULC changes may occur at the temporal scale of days or weeks. Currently, only a few data fusion algorithms can deal with the downscaling process while considering LULC changes. For example, learning-based data fusion methods such as the sparse-representation-based spatiotemporal reflectance fusion model (Huang and Song, 2012) and the extreme learning machine-based fusion method (X. Liu et al., 2016) have proved able to capture LULC changes. However, these learning-based data fusion methods are only suitable for relatively homogeneous areas and thus may not be directly applicable to the study area in this work, which contains diverse landforms and, particularly in cities, high spatial heterogeneity. The most recent variant of STARFM, the Flexible Spatiotemporal DAta Fusion method (FSDAF) (Zhu et al., 2016), has the potential to capture both gradual and abrupt LULC changes for data fusion by using spectral unmixing analysis and thin plate spline interpolation. It is worth noting that the ability of FSDAF to capture LULC changes depends heavily on whether LULC changes are detectable in coarse resolution images. As Landsat-8 has a 16-day revisit cycle, LULC changes identified between two temporally neighboring Landsat images may miss changes occurring at a scale of a few days (e.g., crop phenology). The results in this study revealed that the fusion network was able to identify areas experiencing LULC changes within a few days and to yield fusion results consistent with the original image. When the number of auxiliary Sentinel-2 images was less than three (Fig. 10), the fusion network might not yield good fusion results for areas experiencing LULC changes. Thus, it was suggested in this study that three or more auxiliary Sentinel-2 images should be input into the fusion network for the downscaling process. In the future, research efforts will be made to explore whether there is a threshold in temporal interval or an optimal number of auxiliary Sentinel-2 images fed into the fusion network that results in the best downscaling performance. As such, high-performance computing environments that leverage parallel processing, such as NASA Earth Exchange, Google Earth Engine, and Amazon Web Services (Gorelick et al., 2017), are necessary and should be employed to help perform sensitivity analyses of the fusion network.

In addition, the fusion network requires sampled pixels that are representative of different types of changes within a study area for training. Thus, there is a need to collect various types of LULC changes (beyond the temporary housing and planting of crops shown in this study) to further train the fusion network. Validation of the fusion network using images at 10 m collected from airborne- or ground-based sensing platforms, rather than synthetic data sets, may further help in understanding the algorithm's performance. Although the fusion network was developed to blend Landsat-8 and Sentinel-2 images for a harmonized reflectance data set, it can be used to enhance the spatial resolution of images from other satellite sensors such as Landsat-7 ETM+, the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), and the Moderate Resolution Imaging Spectroradiometer (MODIS). Attention should be paid to using surface reflectance products from these sensors, rather than the TOA reflectance products evaluated in this study, to reduce prediction uncertainties induced by atmospheric conditions. The fusion of Sentinel-2 data with images acquired by these satellite sensors at coarse spatial resolutions ranging from 60 m to 500 m may also entail reconfiguration of the fusion network. In particular, the fusion of Sentinel-2 and Landsat-7 images requires special efforts to fill gaps in the Landsat-7 Scan Line Corrector (SLC)-off images.

Recently, NASA launched an initiative, the HLS project (https://hls.gsfc.nasa.gov/), to produce harmonized Landsat-8 and Sentinel-2 reflectance products to fulfil the community's needs for imagery at both high spatial and temporal resolutions. The project aims to produce reflectance products based on a series of algorithms, including atmospheric correction, cloud and cloud-shadow masking, spatial co-registration and a common gridding system, Bi-directional Reflectance Distribution Function (BRDF) normalization, and spectral bandpass adjustment, to resolve differences between Landsat-8 and Sentinel-2 images. To address the gap in spatial resolution between Sentinel-2 and Landsat-8 images, the HLS project adopted the strategy of resampling Sentinel-2 images at 10 m spatial resolution to 30 m. However, this resampling strategy discards valuable information provided by the 10 m Sentinel-2 data. The proposed fusion network thus provides a promising approach to help generate a better harmonized Landsat-8 and Sentinel-2 dataset at 10 m spatial resolution. Further training and evaluation of the fusion network to downscale Landsat-8 images for other areas are necessary, particularly since deep learning-based methods are normally data-hungry for robust performance (LeCun et al., 2015).
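The sensitivity analysis proposed above could be organized as a simple grid over the number and acquisition dates of the auxiliary Sentinel-2 images. The sketch below only outlines this bookkeeping; train_and_fuse and evaluate are hypothetical placeholders for the ESRCNN training/fusion step and the accuracy assessment, not functions provided by this study.

```python
from itertools import combinations

def sensitivity_analysis(target_l8, reference_10m, aux_s2_by_date,
                         train_and_fuse, evaluate, max_aux=4):
    """Evaluate fusion accuracy for every subset of auxiliary Sentinel-2 images.

    aux_s2_by_date : dict mapping acquisition date -> Sentinel-2 image array
    train_and_fuse : callable(target_l8, list_of_aux) -> fused 10 m image
    evaluate       : callable(fused, reference_10m) -> dict of metrics
    Returns a list of (n_aux, dates, metrics) records for later inspection,
    e.g., to plot accuracy against the number of images or temporal interval.
    """
    results = []
    dates = sorted(aux_s2_by_date)
    for n_aux in range(0, max_aux + 1):                 # 0 = no auxiliary image
        for combo in combinations(dates, n_aux):
            aux = [aux_s2_by_date[d] for d in combo]
            fused = train_and_fuse(target_l8, aux)      # downscale 30 m -> 10 m
            metrics = evaluate(fused, reference_10m)
            results.append((n_aux, combo, metrics))
    return results
```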


Fig. 14. The distribution of pixel values from all spectral bands in Landsat-8 and Sentinel-2 images. (a)–(c) are the Landsat-8 bands 1–7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on June 15. (d)–(f) are the Landsat-8 bands 1–7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on July 1, 2017.
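A distribution check such as the one summarized in Fig. 14 can be reproduced by plotting per-band histograms of the original and fused images. The sketch below assumes hypothetical GeoTIFF file names and uses rasterio and matplotlib; it is a generic illustration rather than the plotting routine used for the figure.

```python
import numpy as np
import rasterio
import matplotlib.pyplot as plt

# Hypothetical file names; any co-registered multiband reflectance rasters work.
paths = {"original L8": "L8_20170615_30m.tif",
         "ESRCNN fused": "L8_20170615_esrcnn_30m.tif",
         "ATPRK fused": "L8_20170615_atprk_30m.tif"}

fig, axes = plt.subplots(1, len(paths), figsize=(12, 3), sharey=True)
for ax, (label, path) in zip(axes, paths.items()):
    with rasterio.open(path) as src:
        data = src.read().astype("float32")            # (bands, rows, cols)
    for b in range(data.shape[0]):
        counts, edges = np.histogram(data[b].ravel(), bins=256)
        ax.plot(edges[:-1], counts, lw=0.8, label=f"band {b + 1}")
    ax.set_title(label)
    ax.set_xlabel("reflectance (gray level)")
axes[0].set_ylabel("number of pixels")
axes[0].legend(fontsize=6)
plt.tight_layout()
plt.show()
```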

6. Conclusion

Table 7
Comparisons between ATPRK and ESRCNN for downscaling the Landsat-8 image on June 15, 2017 (Sentinel-2 image on July 7, 2017 as auxiliary) from 90 m to 30 m for the area experiencing LULC changes (Fig. 15). The bold number indicates the best value for accuracy assessment.

Metrics    RMSE      CC        UIQI      ERGAS     SAM (degree)
ATPRK      0.0871    0.8830    0.8054    11.4530   7.5243
ESRCNN     0.0492    0.9131    0.8895    6.6506    4.1589


The combination of Landsat-8 and Sentinel-2 satellite images has the potential to provide temporally dense observations of land surfaces at a short time interval (e.g., a 5-day composite) suitable for environmental applications such as monitoring of agricultural management and conditions. However, the spatial resolution gap between the two satellite sensors needs to be coordinated. In this study, a deep learning-based data fusion network was developed to downscale Landsat-8 images from 30 m spatial resolution to 10 m with inputs of auxiliary Sentinel-2 images acquired on dates close to those of the target Landsat-8 images. The Landsat-8 images downscaled by the deep learning-based network retained rich spatial information and showed high quality, suggesting great potential for applications that require a series of satellite observations with both high temporal and high spatial resolution. Overall, the major advantages of the proposed fusion network can be



Fig. 15. Fusion results for subarea 2 in Fig. 5 experiencing changes in vegetation coverage (bands 5, 4, 3 as RGB). (a) The 30 m Landsat-8 image on June 15, 2017, (b) the 30 m Sentinel-2 image on July 7, 2017, (c) the 90 m Landsat-8 image to be downscaled on June 15, 2017, (d) the ESRCNN downscaled image at 30 m, and (e) the ATPRK downscaled image at 30 m.

summarized as the following points.

(1) Compared to ATPRK, which outperforms other data fusion algorithms in downscaling, ESRCNN can accommodate a flexible number of auxiliary Sentinel-2 images to enhance Landsat-8 images from 30 m spatial resolution to 10 m with higher accuracy.
(2) By leveraging information from auxiliary Sentinel-2 images, ESRCNN has the potential to identify LULC changes for data fusion, which generally cannot be handled by existing data fusion algorithms.
(3) ESRCNN is superior to ATPRK in preserving the distribution of reflectance values in the downscaled Landsat-8 images.

Code availability

The code developed for fusing Landsat-8 and Sentinel-2 images is available at https://github.com/MrCPlusPlus/ESRCNN-for-Landsat8Sentinel2-Fusion, or available upon request to the corresponding author.

Acknowledgements

This work was supported in part by the National Key Research and Development Plan of China on strategic international scientific and technological innovation cooperation special project under Grant 2016YFE0202300, the National Natural Science Foundation of China under Grants 61671332, 41771452, and 41771454, and the Natural Science Fund of Hubei Province in China under Grant 2018CFA007. The authors are also grateful to the editors and three anonymous reviewers who helped improve the earlier version of this manuscript through their comments and suggestions.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.rse.2019.111425.

References

Acerbi-Junior, F.W., Clevers, J.G.P.W., Schaepman, M.E., 2006. The assessment of multi-sensor image fusion using wavelet transforms for mapping the Brazilian Savanna. Int. J. Appl. Earth Obs. Geoinf. 8, 278–288.
Alparone, L., Wald, L., Chanussot, J., Thomas, C., Gamba, P., Bruce, L.M., 2007. Comparison of pansharpening algorithms: outcome of the 2006 GRS-S data fusion contest. IEEE Trans. Geosci. Remote Sens. 45, 3012–3021.
Audebert, N., Le Saux, B., Lefèvre, S., 2018. Beyond RGB: very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogrammetry Remote Sens. 140, 20–32.
Claverie, M., Masek, J.G., Ju, J., Dungan, J.L., 2017. Harmonized Landsat-8 Sentinel-2 (HLS) Product User's Guide. National Aeronautics and Space Administration (NASA), Washington, DC, USA.
Claverie, M., Ju, J., Masek, J.G., Dungan, J.L., Vermote, E.F., Roger, J.C., et al., 2018. The Harmonized Landsat and Sentinel-2 surface reflectance data set. Remote Sens. Environ. 219, 145–161.
Das, M., Ghosh, S.K., 2016. Deep-step: a deep learning approach for spatiotemporal prediction of remote sensing data. IEEE Geosci. Remote Sens. Lett. 13 (12), 1984–1988.
Dong, C., Loy, C.C., He, K., Tang, X., 2016. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38, 295–307.
Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., Meygret, A., Spoto, F., Sy, O., Marchese, F., Bargellini, P., 2012. Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sens. Environ. 120, 25–36.
Emelyanova, I.V., McVicar, T.R., Van Niel, T.G., Li, L.T., van Dijk, A.I., 2013. Assessing the accuracy of blending Landsat–MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: a framework for algorithm selection. Remote Sens. Environ. 133, 193–209.
Fu, P., Weng, Q., 2016a. A time series analysis of urbanization induced land use and land cover change and its impact on land surface temperature with Landsat imagery. Remote Sens. Environ. 175, 205–214.
Fu, P., Weng, Q., 2016b. Consistent land surface temperature data generation from irregularly spaced Landsat imagery. Remote Sens. Environ. 184, 175–187.
Gao, F., Masek, J., Schwaller, M., Hall, F., 2006. On the blending of the Landsat and MODIS surface reflectance: predicting daily Landsat surface reflectance. IEEE Trans. Geosci. Remote Sens. 44, 2207–2218.
Gevaert, C.M., García-Haro, F.J., 2015. A comparison of STARFM and an unmixing-based algorithm for Landsat and MODIS data fusion. Remote Sens. Environ. 156, 34–44.
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R., 2017. Google Earth Engine: planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 202, 18–27.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1904–1916.
Hilker, T., Wulder, M.A., Coops, N.C., Linke, J., McDermid, G., Masek, J.G., Gao, F., White, J.C., 2009. A new data fusion model for high spatial- and temporal-resolution mapping of forest disturbance based on Landsat and MODIS. Remote Sens. Environ. 113, 1613–1627.
Hirschberg, J., Manning, C.D., 2015. Advances in natural language processing. Science 349 (6245), 261.
Huang, B., Song, H., 2012. Spatiotemporal reflectance fusion via sparse representation. IEEE Trans. Geosci. Remote Sens. 50 (10), 3707–3716.


Kim, D.-H., Sexton, J.O., Noojipady, P., Huang, C., Anand, A., Channan, S., Feng, M., Townshend, J.R., 2014. Global, Landsat-based forest-cover change from 1990 to 2000. Remote Sens. Environ. 155, 178–193.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, pp. 1097–1105.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436.
Liu, J.G., 2000. Smoothing Filter-based Intensity Modulation: a spectral preserve image fusion technique for improving spatial details. Int. J. Remote Sens. 21 (18), 3461–3472.
Liu, P., Xiao, L., Tang, S., 2016. A new geometry enforcing variational model for pan-sharpening. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 9 (12), 5726–5739.
Liu, X., Deng, C., Wang, S., Huang, G.B., Zhao, B., Lauren, P., 2016. Fast and accurate spatiotemporal fusion based upon extreme learning machine. IEEE Geosci. Remote Sens. Lett. 13 (12), 2039–2043.
Malenovský, Z., Rott, H., Cihlar, J., Schaepman, M.E., García-Santos, G., Fernandes, R., Berger, M., 2012. Sentinels for science: potential of Sentinel-1, -2, and -3 missions for scientific observations of ocean, cryosphere, and land. Remote Sens. Environ. 120, 91–101.
Masi, G., Cozzolino, D., Verdoliva, L., Scarpa, G., 2016. Pansharpening by convolutional neural networks. Remote Sens. 8 (7), 594.
Melaas, E.K., Friedl, M.A., Zhu, Z., 2013. Detecting interannual variation in deciduous broadleaf forest phenology using Landsat TM/ETM+ data. Remote Sens. Environ. 132, 176–185.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning, pp. 807–814.
Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.-C., Tang, X., 2015. DeepID-Net: deformable deep convolutional neural networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412.
Ranchin, T., Wald, L., 2000. Fusion of high spatial and spectral resolution images: the ARSIS concept and its implementation. Photogramm. Eng. Remote Sens. 66, 49–61.
Roy, D.P., Wulder, M.A., Loveland, T.R., Woodcock, C.E., Allen, R.G., Anderson, M.C., ... Zhu, Z., 2014. Landsat-8: science and product vision for terrestrial global change research. Remote Sens. Environ. 145, 154–172.
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117.
Senf, C., Pflugmacher, D., Heurich, M., Krueger, T., 2017. A Bayesian hierarchical model for estimating spatial and temporal variation in vegetation phenology from Landsat time series. Remote Sens. Environ. 194, 155–160.
Shao, Z., Cai, J., 2018. Remote sensing image fusion with deep convolutional neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11 (5), 1656–1669.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Skakun, S., Kussul, N., Shelestov, A., Kussul, O., 2014. Flood hazard and flood risk assessment using a time series of satellite images: a case study in Namibia. Risk Anal. 34 (8), 1521–1537.
Sun, Y., Chen, Y., Wang, X., Tang, X., 2014. Deep learning face representation by joint identification-verification. In: International Conference on Neural Information Processing Systems, pp. 1988–1996.
USGS, 2015. Using the USGS Landsat 8 product. [Online]. Available: http://landsat.usgs.gov/Landsat8_Using_Product.php.
Veloso, A., Mermoz, S., Bouvet, A., Le Toan, T., Planells, M., Dejoux, J.-F., Ceschia, E., 2017. Understanding the temporal behavior of crops using Sentinel-1 and Sentinel-2-like data for agricultural applications. Remote Sens. Environ. 199, 415–426.
Verbesselt, J., Zeileis, A., Herold, M., 2012. Near real-time disturbance detection using satellite image time series. Remote Sens. Environ. 123, 98–108.
Wald, L., Ranchin, T., Mangolini, M., 1997. Fusion of satellite images of different spatial resolution: assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 63, 691–699.
Wang, Q., Shi, W., Li, Z., Atkinson, P.M., 2016. Fusion of Sentinel-2 images. Remote Sens. Environ. 187, 241–252.
Wang, Q., Blackburn, G.A., Onojeghuo, A.O., Dash, J., Zhou, L., Zhang, Y., Atkinson, P.M., 2017. Fusion of Landsat-8 OLI and Sentinel-2 MSI data. IEEE Trans. Geosci. Remote Sens. 55, 3885–3899.
Wang, Z., Bovik, A.C., 2002. A universal image quality index. IEEE Signal Process. Lett. 9, 81–84.
Wei, Q., Bioucas-Dias, J., Dobigeon, N., Tourneret, J.Y., 2015. Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Trans. Geosci. Remote Sens. 53 (7), 3658–3668.
White, J.C., Wulder, M.A., Hermosilla, T., Coops, N.C., Hobart, G.W., 2017. A nationwide annual characterization of 25 years of forest disturbance and recovery for Canada using Landsat time series. Remote Sens. Environ. 194, 303–321.
Woodcock, C.E., Allen, R., Anderson, M., Belward, A., Bindschadler, R., Cohen, W., Wynne, R., 2008. Free access to Landsat imagery. Science 320 (5879), 1011.
Yuan, Q., Wei, Y., Meng, X., Shen, H., Zhang, L., 2018. A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11 (3), 978–989.
Zhang, N., Donahue, J., Girshick, R., Darrell, T., 2014. Part-based R-CNNs for fine-grained category detection. In: European Conference on Computer Vision, pp. 834–849.
Zhu, X., Chen, J., Gao, F., Chen, X., Masek, J.G., 2010. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sens. Environ. 114, 2610–2623.
Zhu, X., Helmer, E.H., Gao, F., Liu, D., Chen, J., Lefsky, M.A., 2016. A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sens. Environ. 172, 165–177.
Zhu, Z., Woodcock, C.E., 2014. Continuous change detection and classification of land cover using all available Landsat data. Remote Sens. Environ. 144, 152–171.
