Integrating Google Earth imagery with Landsat data to improve 30-m resolution land cover mapping


Remote Sensing of Environment 237 (2020) 111563



Weijia Li a,b,c, Runmin Dong a,b, Haohuan Fu a,b,⁎, Jie Wang d, Le Yu a,b, Peng Gong a,b

a Ministry of Education Key Laboratory for Earth System Modeling, Department of Earth System Science, Tsinghua University, Beijing 100084, China
b Joint Center for Global Change Studies (JCGCS), Beijing 100084, China
c CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Hong Kong, China
d State Key Laboratory of Remote Sensing Science, Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100101, China

⁎ Corresponding author at: Ministry of Education Key Laboratory for Earth System Modeling, Department of Earth System Science, Tsinghua University, Beijing 100084, China. E-mail address: [email protected] (H. Fu).

https://doi.org/10.1016/j.rse.2019.111563
Received 4 August 2019; Received in revised form 19 November 2019; Accepted 25 November 2019
0034-4257/ © 2019 Published by Elsevier Inc.

ARTICLE INFO

Edited by Marie Weiss

ABSTRACT

Land use and land cover maps provide fundamental information that has been used in a wide range of studies, from climate change to city planning. However, despite substantial efforts in recent decades, large-scale 30-m land cover maps still suffer from relatively low accuracy in discriminating land cover types (especially the vegetation and impervious types), owing to limitations in the data, methods, and workflow design. In this work, we improved land cover classification accuracy by integrating free and public high-resolution Google Earth images (HR-GEI) with Landsat Operational Land Imager (OLI) and Enhanced Thematic Mapper Plus (ETM+) imagery. Our major innovation is a hybrid approach with three major components: (1) a deep convolutional neural network (CNN)-based classifier that extracts high-resolution features from Google Earth imagery; (2) traditional machine learning classifiers (i.e., Random Forest (RF) and Support Vector Machine (SVM)) based on spectral features extracted from 30-m Landsat data; and (3) an ensemble decision maker that takes all of the different features into account. Experimental results show that our proposed method achieves a classification accuracy of 84.40% on the entire validation dataset in China, improving the previous state-of-the-art accuracies obtained by RF and SVM by 4.50% and 4.20%, respectively. Moreover, our proposed method reduces misclassifications between certain vegetation types and improves identification of the impervious type. Evaluation over an area of around 14,000 km2 confirms little improvement for land cover types (e.g., forest) whose classification accuracies already exceed 80% with traditional machine learning approaches, yet accuracy improvements of 7% for cropland and shrubland, 9% for grassland, 23% for impervious, and 25% for wetland were achieved compared with traditional machine learning approaches. The results demonstrate the great potential of integrating features from datasets at different resolutions and the possibility of producing more reliable land cover maps.

Keywords: 30-m land cover mapping; high-resolution Google Earth imagery; Landsat; data fusion; deep learning

1. Introduction

Land use and land cover maps provide fundamental information for various earth system studies, e.g., climate change, carbon cycling, biodiversity, and public health (Gong et al., 2013; Yang et al., 2013). As an important input for numerical models and an important dataset for different analysis scenarios, land use and land cover maps provide essential information for the planning and management of urban areas, natural resources, and ecosystems (Yu et al., 2014; Zhao et al., 2016; Li et al., 2016; Chen et al., 2015). Large-scale land cover mapping based on remote sensing data has been widely studied for decades. Early global land cover (GLC) mapping studies used coarse-resolution satellite data

to produce land cover products at resolutions of 300 m to 1 km, e.g., the University of Maryland (UMD) land-cover map (Hansen et al., 2000), GLC 2000 (Bartholome and Belward, 2005), and GlobCover 2009 (Bontemps et al., 2011). In 2013, the first 30-m resolution global land cover products based on Landsat Thematic Mapper (TM) and Enhanced Thematic Mapper Plus (ETM+) data were reported in the Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC) project, obtaining the highest classification accuracy of 64.89% on the global validation dataset and 66.23% on the validation dataset in China (Gong et al., 2013). The GLC classification accuracy was further improved to 67.08% through a segmentation-based approach (FROM-GLC-seg), using Landsat data combined with Moderate Resolution


Imaging Spectroradiometer (MODIS) data (with high-temporal-frequency information) and several auxiliary datasets (e.g., bioclimatic, digital elevation model (DEM), and soil-water conditions) (Yu et al., 2013). In 2014, FROM-GLC-agg was proposed to aggregate FROM-GLC, FROM-GLC-seg, and additional coarse-resolution land cover map products, improving the accuracy of 30-m resolution GLC maps to 69.50% (Yu et al., 2014). In 2017, the accuracy of 30-m resolution global land cover classification was improved to 70.17% by expanding the FROM-GLC sample dataset to an all-season sample dataset and using a sample spatial-temporal partition strategy (Li et al., 2017). The global land cover mapping accuracy was further improved to 77.3% by extracting more useful features from multi-source remote sensing data in FROM-GLC-2015 (Gong et al., 2017). There are also studies that have used coarse- or medium-resolution remote sensing data for country-scale land cover mapping in China (Liu et al., 2003), the United States (Homer et al., 2004), France (Inglada et al., 2017), etc.

Although substantial efforts have been made towards improving land cover mapping results, the classification accuracy and quality of existing land cover maps still cannot satisfy the demands of many applications (Yu et al., 2014; Zhao et al., 2016). In general, existing large-scale 30-m resolution land cover mapping studies are based on traditional machine learning methods (e.g., Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and Maximum Likelihood (MLC) classifiers), using 30-m or coarser resolution remote sensing imagery (e.g., Landsat and MODIS imagery). Most of their efforts focus on expanding the training and validation sample datasets, extracting useful features from more remote sensing data sources, and improving the pre-processing or post-processing methods (Gong et al., 2016). The Landsat imagery used in these studies has a relatively high spectral resolution, which contributes to the superior classification accuracies of the Water, Bare land, Snow/Ice, and Cloud types in existing large-scale land cover mapping studies. Its high temporal resolution and free accessibility also make Landsat imagery the main data source for large-scale land cover mapping and change studies. However, in previous large-scale land cover mapping studies based on Landsat images, the vegetation types (i.e., cropland, forest, grassland, and shrubland) and the impervious type often suffer from serious confusion and inferior classification accuracies. The spectral features extracted from 30-m spatial resolution data are not sufficient for accurately classifying these land cover types, due to the spectral similarity of the vegetation types, the spatial heterogeneity of the impervious type, and other challenges of large-scale land cover mapping (Gong et al., 2016). As a number of the factors that limit the current accuracy relate to the inherent constraints of Landsat data, a natural idea for improvement is to explore the benefits of integrating other data sources with more information.

In region-scale studies, high-resolution commercial satellite and aerial images have been widely used for generating higher-accuracy and finer-resolution land cover and land use maps (Zhao et al., 2017; Marcos et al., 2018; Sidike et al., 2019; Li et al., 2019; Maggiori et al., 2017; Mahdianpari et al., 2018). For instance, Zhang et al. (2018a, 2018b, 2019) used high-resolution aerial images collected from the UK (with a total area of about 177.5 km2) for land use and land cover mapping based on various deep learning methods. Similarly, WorldView satellite images were used for urban land use mapping in a 143 km2 region of Hong Kong and a 25 km2 region of Shenzhen (Huang et al., 2018). These studies applied a CNN (or a combination of a CNN and a Multilayer Perceptron (MLP)) to high-resolution single-source images. There are also studies that have integrated multi-source remote sensing data to improve land cover mapping accuracy (Gessner et al., 2015; Chen et al., 2017; Toure et al., 2018), e.g., using a combination of high-resolution WorldView-2 images and medium-resolution Landsat time-series data for mapping tree species across the entire state of Bavaria in Germany (Immitzer et al., 2018), and using a combination of the Canadian Digital Surface Model (CDSM), high-resolution RapidEye images, and Landsat 8 images for wetland mapping across an area of about 700 km2 (Amani et al., 2017). Commercial satellite images (e.g., Quickbird,

RapidEye, and WorldView) and the aerial images used in these studies have both high spatial resolution and multiple spectral bands, which contribute to high classification accuracies based on both spatial and spectral features. However, the cost of data acquisition constrains such studies to the city level, usually an area of a few hundred square kilometers or less. To minimize the cost of acquiring data at a global scale, in our study we chose to use high-resolution imagery from Google Earth, which provides free and public high-resolution remote sensing images of most of the land surface at a global scale (Yu and Gong, 2012; Liang et al., 2018). While there are a number of existing efforts (Cheng and Han, 2016; Cheng et al., 2016; Yu et al., 2016; Hu et al., 2015; Cheng et al., 2017; Xia et al., 2017) that apply deep learning methods to Google Earth imagery-based object detection (e.g., building, vehicle, ship, and airplane detection) and scene classification (e.g., residential, industrial, school, farmland, and forest type classification), the locations of the samples are generally not provided and the scenarios are constrained to limited regions. Moreover, using Google Earth images alone also has its own limitations, such as providing only visible bands (in contrast to multispectral Landsat and MODIS data), a limited number of historical time points, and a varying available resolution for different locations on the earth.

Based on the above considerations, in this research we designed a novel land cover mapping approach that integrates high-resolution Google Earth images with medium-resolution data (e.g., Landsat and Digital Elevation Model (DEM)) to improve 30-m resolution land cover mapping results based on deep learning. Our proposed approach includes land cover classification based on sample datasets and land cover mapping based on large-scale remote sensing images. First, we trained and evaluated two traditional machine learning classifiers (i.e., RF and SVM) based on the features extracted from the 30-m resolution data. Second, we designed a CNN classifier for both land cover classification and feature extraction, using high-resolution Google Earth images (HR-GEI). Third, we built an SVM-fusion classifier that combines the spectral features extracted from the 30-m resolution data with the spatial features extracted from the HR-GEI-based CNN classifier. Finally, we designed a rule-based decision fusion method to integrate the land cover classification results of the four classifiers. In the large-scale land cover mapping process, we classified large-scale remote sensing images based on the optimized RF, SVM, CNN, and SVM-fusion models, and employed the proposed decision fusion method to integrate the results of the classifiers and obtain the final land cover mapping results. Our proposed method improves the previous state-of-the-art accuracies (obtained by Gong et al., 2017) by 4.20% and 4.50% and achieves more accurate large-scale land cover maps. We also analyze the land cover classification results of the CNN-based method without using high-resolution images, the effect of our proposed method on different land cover types, the effect of different HR-GEI patch sizes on the land cover mapping results, and other important aspects of large-scale land cover mapping in this paper.

2. Datasets

2.1. Sample collection and classification system

The training and validation samples used in this study were selected from the all-season sample dataset used for global land cover mapping (Li et al., 2017), covering the entire area of China. The sample coordinates of the all-season sample dataset (Li et al., 2017) were taken from the FROM-GLC sample dataset (Gong et al., 2013). All of the samples were collected through human interpretation based on Landsat imagery, high-resolution Google Earth imagery, MODIS time series imagery, etc. As samples from different seasons are included in the training and validation datasets to capture seasonal land cover changes, one sample coordinate corresponds to samples from multiple seasons (at most four seasons and one growing season). Each sample was interpreted as one of eleven categories, including ten land cover types (Cropland, Forest, Grassland, Shrubland, Wetland, Water, Tundra, Impervious, Bare land, and Snow/Ice) and one Cloud type. The 'Snow/Ice' type refers to temporary snow and ice in this study. The sampling scheme used in this study is the same as that used by Li et al. (2017). All samples were collected systematically in China. For training sample collection, about 10–20 samples were selected from each Landsat scene, whose land cover types should be as representative of the image scene as possible. For validation sample collection, the entire study area was partitioned using a hexagonal scheme (Gong et al., 2013), and five sample coordinates were randomly selected from each hexagon. In this way, the proportions of samples in the different categories are similar to the actual proportions of the different land covers in China, leading to large differences in sample size between categories. The numbers of training samples, validation samples, training coordinates, and validation coordinates for each land cover type are listed in Table 1. The distributions of the training and validation sample coordinates for each land cover type are shown in Fig. 1.

Fig. 1. The distributions of training and validation sample coordinates for each land cover type.


Table 1
The number of samples and coordinates for each land cover type.

Land cover type   Training samples   Validation samples   Training coordinates   Validation coordinates
Cropland          3231               1446                 949                    421
Forest            2778               1593                 829                    476
Grassland         2373               1288                 781                    411
Shrubland         581                350                  168                    107
Wetland           362                53                   158                    26
Water             2813               106                  904                    38
Tundra            19                 2                    9                      1
Impervious        1905               213                  542                    58
Bare land         3988               2237                 1166                   647
Snow/Ice          2251               677                  1443                   458
Cloud             2027               796                  1590                   653
Total             22,328             8761                 8539                   3296

Table 2
The number of Landsat scenes for sample collection in different acquisition years.

Acquisition year      Before 2012   2013   2014   2015   Total
# Training scenes     8             50     1223   790    2071
# Validation scenes   8             40     1126   730    1904

2.2. Thirty-meter resolution datasets

In this research, we used 30-m resolution Landsat images (Landsat Surface Reflectance) as the primary datasets for both land cover classification based on sample datasets and land cover mapping based on large-scale remote sensing images. We used 2071 Landsat scenes for collecting training samples and 1904 Landsat scenes for collecting validation samples. Most of the Landsat images (about 99%) used in this study were collected from Landsat 8. Only a small percentage of sample locations used Landsat 5 images as a substitute for Landsat 8 images, where all available Landsat 8 images had high cloud coverage (more than 50%). Table 2 lists the number of Landsat scenes used for sample collection in each acquisition year. We used Landsat 5 for all scenes acquired before 2012 and Landsat 8 for all scenes acquired in 2013, 2014, and 2015. Moreover, for each pixel in a Landsat image, we calculated the normalized difference vegetation index (NDVI) from the multispectral Landsat imagery and obtained the elevation and slope from DEM data. We used DEM data from the Shuttle Radar Topography Mission for pixels at latitudes below ±60° and DEM data from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) for pixels at latitudes above ±60°. We also recorded the longitude, latitude, day of the year (DOY), the maximum NDVI among the NDVIs observed at the same location on different dates, the Landsat image of the maximum NDVI, and the DOY of the maximum NDVI. For all of the above 30-m resolution datasets, we took a single pixel at each given location for feature extraction, which is introduced in detail in Section 3.2.1.

Fig. 2 shows the Landsat spectral reflectance curves of all training samples of each land cover type. For each type and each band, the curve was obtained by calculating the average and standard deviation of the spectral reflectance (× 10,000) of all training samples belonging to that type. We found that the spectral characteristics were highly similar among the Cropland, Forest, Grassland, Shrubland, Wetland, Tundra, and Impervious types. The spectral characteristics of Water, Bare land, Snow/Ice, and Cloud were relatively distinguishable from those of the above land cover types, contributing to the superior classification accuracies of these four types in previous large-scale land cover mapping studies.

Fig. 2. Landsat spectral reflectance curves for the training samples of each type. The curve of each type is obtained by calculating the average and standard deviation of the spectral reflectance (× 10,000) of all training samples belonging to that type for each band.

2.3. High-resolution datasets

Besides the 30-m resolution datasets, high-resolution Google Earth images (HR-GEI) were utilized in our proposed method to improve the land cover classification accuracy and the large-scale land cover mapping results. Fig. 3 shows some examples of HR-GEI for different land cover types. The HR-GEI of the sample datasets were downloaded through the Google Maps Static API (https://developers.google.com/maps/documentation/maps-static/dev-guide). The HR-GEI of each sample can be freely downloaded according to its center location (latitude, longitude) and a given zoom level. For large-scale land cover mapping, the HR-GEI were first downloaded through the Google Earth Pro software and then reconstructed into large-scale images (Cao et al., 2018). Although the zoom level was set to 18 for all HR-GEI used in this study, the spatial resolution varied slightly with the latitude of each image.

The spatial resolutions of the HR-GEI used in this study were between 0.35 m and 0.6 m, so the sizes in meters of our 128 × 128 pixel sample images were between 44.8 × 44.8 m and 76.8 × 76.8 m. As the acquisition time of the available HR-GEI often differs from the actual acquisition time of the corresponding sample in the manually interpreted datasets, images of the Snow/Ice and Cloud types are not shown in Fig. 3. We found that the spatial characteristics of HR-GEI were more distinguishable than the spectral characteristics of Landsat imagery for the Cropland, Forest, Grassland, Shrubland, and Impervious types. On the other hand, the spatial characteristics of HR-GEI had no obvious advantage over the spectral characteristics of Landsat imagery for Wetland, Water, Tundra, and Bare land, as only visible bands are available in HR-GEI. Details of the pre-processing of HR-GEI are described in Section 3.3.1.
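Since Google's imagery tiles follow the Web Mercator scheme, the ground resolution at a fixed zoom level shrinks with the cosine of latitude. The minimal sketch below illustrates this standard relationship (the constant is the Web Mercator equatorial resolution at zoom 0; the specific latitudes are illustrative, not values from the paper):

```python
import math

def ground_resolution(lat_deg: float, zoom: int) -> float:
    """Approximate Web Mercator ground resolution in meters/pixel."""
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)

# At zoom 18, the resolution across China's latitude range stays within ~0.35-0.6 m/pixel,
# so a 128 x 128 pixel patch covers roughly 44.8-76.8 m on a side.
for lat in (20.0, 35.0, 53.0):
    print(f"lat {lat:5.1f}: {ground_resolution(lat, 18):.3f} m/pixel")
```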

3. Methods

3.1. Overview of the proposed method

In this research, we designed a novel CNN-based land cover mapping approach that integrates high-resolution Google Earth images with medium-resolution Landsat data to improve 30-m resolution land cover mapping in China. Our proposed method comprises land cover classification based on sample datasets and land cover mapping based on large-scale remote sensing images. For land cover classification based on sample datasets, we first used the 30-m resolution sample dataset to train and evaluate two traditional classifiers (i.e., RF and SVM). Second, we designed a CNN classifier for land cover classification based on high-resolution Google Earth images. Moreover, we built an SVM-fusion classifier based on the fusion of the original 30-m resolution features with the high-resolution features extracted from the CNN classifier. Finally, we proposed a decision fusion method to integrate the classification results of the four classifiers. For large-scale land cover mapping, we first prepared and pre-processed the 30-m resolution datasets for the traditional classifiers and the high-resolution Google Earth images for the CNN and SVM-fusion classifiers. Second, we applied the optimized RF, SVM, CNN, and SVM-fusion models to the pre-processed multi-resolution datasets and obtained large-scale land cover maps for each classifier. Finally, we applied the proposed decision fusion method to integrate the land cover maps predicted by each classifier and obtained the final land cover mapping results. The source code of our proposed method can be found at https://github.com/liweijia/landcover-mapping-China. Details of our proposed method are described in the following sections.

3.2. Land cover classification based on 30-m resolution sample datasets

3.2.1. Feature extraction and pre-processing

For each sample (corresponding to a single pixel) in the training and validation datasets, a total of 24 original features were extracted from the 30-m resolution datasets (described in Section 2.2). To enable two samples from adjacent dates (e.g., DOY = 1 and DOY = 365) to have similar feature values, we transformed the original DOY into two DOY features using trigonometric functions, i.e., cos(2π × DOY / 366) and sin(2π × DOY / 366). The 24 features consist of the following parts: (1) the Landsat surface reflectance of the seven spectral bands, the corresponding NDVI, and the corresponding two DOY features (10 features); (2) the seven surface reflectance values of the Landsat image with the maximum NDVI (described in Section 2.2), the maximum NDVI itself, and the corresponding two DOY features (10 features); (3) the longitude and latitude (2 features); and (4) the elevation and slope (2 features). In addition, as the value range varied greatly among the different types of features, we normalized the 24 features of each sample to the range 0 to 1 and used the normalized features as the input features of the traditional classifiers.
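A minimal sketch of the DOY encoding and min-max normalization described above (the function names are illustrative; the paper normalizes each feature to [0, 1]):

```python
import numpy as np

def encode_doy(doy: np.ndarray) -> np.ndarray:
    """Map day-of-year (1..366) onto the unit circle so DOY 1 and DOY 365 end up close."""
    angle = 2.0 * np.pi * doy / 366.0
    return np.stack([np.cos(angle), np.sin(angle)], axis=-1)

def minmax_normalize(features: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1]; columns with zero range map to 0."""
    fmin = features.min(axis=0)
    frange = features.max(axis=0) - fmin
    frange[frange == 0] = 1.0
    return (features - fmin) / frange

# Example: two samples from adjacent dates get nearly identical DOY features.
print(encode_doy(np.array([1, 365])))
```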


Fig. 3. Examples of high-resolution Google Earth images of different land cover types. The sizes in meters of these images (in 128 × 128 pixels) are between 44.8 × 44.8 m and 76.8 × 76.8 m.

3.2.2. Classifiers and parameter setting

In this research, we selected Random Forest (RF) and Support Vector Machine (SVM) as the classifiers for land cover classification based on the 30-m resolution sample datasets. RF and SVM have been widely utilized in the remote sensing community for decades owing to their excellent classification accuracies in various large-scale land cover mapping studies (Gong et al., 2013; Yu et al., 2013; Yu et al., 2014; Li et al., 2017). For both RF and SVM, we used the 22,328 training samples for training and the 8761 validation samples for calculating classification accuracies. Each sample possesses 24 pre-processed input features. We tuned the number of trees for RF (from 10 to 500) and the value of C (cost) for SVM (from 1 to 100) to obtain the parameter settings with the highest classification accuracy on the validation samples (i.e., #trees = 200 for RF and C = 10 for SVM). Other parameters were assigned the default values provided by Scikit-learn, an efficient tool for machine learning and data analysis (Pedregosa et al., 2011). The optimized RF and SVM models (with the highest classification accuracy on the 8761 validation samples) were saved and utilized in large-scale land cover mapping, as described in Section 3.5.
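A minimal scikit-learn sketch of this setup. The random arrays are stand-ins for the real normalized 24-feature vectors described in Section 3.2.1; the reported parameter values (n_estimators=200, C=10) come from the paper, while everything else keeps Scikit-learn defaults:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholders for the real normalized feature arrays and 11-class labels.
X_train, y_train = rng.random((500, 24)), rng.integers(0, 11, 500)
X_val, y_val = rng.random((200, 24)), rng.integers(0, 11, 200)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
svm = SVC(C=10, probability=True)  # probability estimates feed the decision fusion later

for name, clf in (("RF", rf), ("SVM", svm)):
    clf.fit(X_train, y_train)
    print(name, "validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```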

3.3. Land cover classification based on the high-resolution sample dataset

3.3.1. Pre-processing of the sample dataset

As described in Section 2.3, we downloaded a high-resolution Google Earth image for each training and validation sample coordinate. As the acquisition time of the available Google Earth imagery often differs from the actual acquisition time of the corresponding sample in the manually interpreted datasets, we removed the high-resolution samples of types (i.e., temporary Snow/Ice and Cloud) with high diversity among different time points at the same location. Moreover, we removed the high-resolution sample images of the Water, Wetland, and Bare land types because these types are easier to interpret from multispectral Landsat data than from HR-GEI with only visible bands. We also removed the high-resolution sample images of the Tundra type because of the limited number of sample coordinates (9 training coordinates and 1 validation coordinate). For some sample coordinates, high-resolution images were not available from Google Earth, and these were removed as well. Consequently, 4536 high-resolution sample images (3134 training samples and 1402 validation samples) of five land cover types (Cropland, Forest, Grassland, Shrubland, and Impervious) were used for the deep learning-based land cover classification. Different types of samples had the same image size.


Fig. 4. The architecture of the CNN proposed in this study. Each sample is 128 × 128 pixels and has 3 bands. The numbers on the max-pooling and convolution layers denote the size of feature maps and the number of convolution kernels (e.g., 128 × 128 pixels and 64 kernels in the first convolution layer). The numbers on the fully connected layers denote the number of hidden units.

In addition, all sample images were cropped into three groups of sizes (i.e., 64 × 64, 128 × 128, and 256 × 256 pixels, with the three sample images sharing the same center) for training our proposed CNN independently; the optimal image size for classifying the land cover type of the center coordinate is analyzed in Section 5.1.
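A minimal sketch of cropping co-centered patches at the three sizes from a larger downloaded image (the array shapes are illustrative):

```python
import numpy as np

def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Crop a size x size patch around the center of an (H, W, C) image."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

image = np.zeros((512, 512, 3), dtype=np.uint8)  # a downloaded HR-GEI tile (illustrative)
patches = {s: center_crop(image, s) for s in (64, 128, 256)}  # three co-centered crops
```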

3.3.2. The architecture of the CNN proposed in this study

In our proposed method, the CNN is applied only to Google Earth imagery. The architecture of the CNN proposed in this study is based on the VGG-16 architecture (Simonyan and Zisserman, 2014), including 13 convolutional layers, 5 max-pooling layers, and 3 fully connected layers. The size of each convolution kernel is 3 × 3, and ReLU is used as the activation function. Fig. 4 shows the architecture of the CNN proposed in this study, taking an input image size of 128 × 128 pixels as an example. As the features extracted from the CNN (i.e., the hidden units in the second fully connected layer (FC7)) are fused with the 24 original features and utilized in the land cover classification based on feature fusion and decision fusion, we modified the number of hidden units in the FC7 layer from 4096 to 24 to balance the contributions of the high-resolution features and the 30-m resolution features. We also modified the number of hidden units in the last fully connected layer (FC8) from 1000 to 5, because five types of high-resolution sample images are used in our proposed CNN. The process of the CNN training is described in Section 3.3.3.
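The paper implements this network in Caffe; the following is a hedged PyTorch sketch of the same idea, starting from torchvision's VGG-16 and shrinking FC7 to 24 units and FC8 to 5 classes (the layer widths follow the paper; the torchvision starting point and everything else are assumptions of this sketch):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Start from ImageNet-pretrained VGG-16; the 13 convolutional layers are kept as-is.
model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)

# Replace the classifier: FC6 stays 4096-wide, FC7 shrinks to 24 units
# (these 24 activations double as the "high-resolution features"), FC8 outputs 5 classes.
model.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 24),          nn.ReLU(inplace=True),
    nn.Linear(24, 5),
)

x = torch.randn(1, 3, 128, 128)   # one 128 x 128 RGB Google Earth patch
logits = model(x)                 # shape: (1, 5)
```

In this sketch, the 24-dimensional FC7 activations used as high-resolution features in Section 3.4.1 can be read out by applying model.classifier[:5] to the flattened convolutional features.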

3.3.3. Training and evaluation of the CNN

In this study, a total of 3134 training samples and 1402 validation samples were used for CNN training and accuracy evaluation. To avoid potential problems resulting from the limited number of samples, we used the VGG-16 model pre-trained on the large-scale ImageNet dataset to initialize the weights of our proposed CNN, and we used the land cover sample dataset to fine-tune it. As the input image size of the land cover dataset differs from that of the ImageNet-based VGG-16 model, the size of the feature map obtained in the last convolutional layer differs from that of the ImageNet-based VGG-16. Thus, the pre-trained weights of the 3 fully connected layers were ignored, and only the weights of the 13 convolutional layers of the pre-trained VGG-16 model were used to initialize the corresponding weights of our proposed CNN. As with RF and SVM, the optimized CNN model (with the highest classification accuracy on the 1402 high-resolution validation samples) was used for large-scale land cover mapping, as described in Section 3.5.
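Continuing the PyTorch sketch above, fine-tuning keeps the pretrained convolutional weights and trains the freshly initialized fully connected layers together with them. The optimizer, learning rate, epoch count, and the random stand-in dataset below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# `model` comes from the architecture sketch above; the random tensors stand in
# for the 3134 HR-GEI training patches and their 5-class labels.
data = TensorDataset(torch.randn(32, 3, 128, 128), torch.randint(0, 5, (32,)))
train_loader = DataLoader(data, batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # illustrative

model.train()
for epoch in range(2):  # illustrative epoch count
    for patches, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(patches), labels).backward()
        optimizer.step()
```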

3.4. Land cover classification based on feature fusion and decision fusion

3.4.1. Fusion of 30-m resolution features and high-resolution features

Besides the 30-m resolution dataset-based traditional classifiers and the high-resolution dataset-based CNN, we built another dataset based on the fusion of the 24 original features (extracted from the 30-m resolution datasets) and the 24 spatial features (extracted from the high-resolution image-based CNN) to train another SVM classifier. For each sample in the original 30-m resolution validation dataset, we searched for its corresponding high-resolution spatial features in the entire high-resolution validation dataset according to its coordinate (longitude and latitude). If the coordinate is found, the land cover type of the validation sample is predicted by the feature fusion-based SVM classifier (SVM-fusion) using the 48 combined features; otherwise, it is determined by the original RF and SVM classifiers using the 24 original features, according to the rule described in Section 3.4.2. The optimized SVM-fusion model was saved and utilized for large-scale land cover mapping, as explained in Section 3.5.
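A minimal sketch of the 48-feature fusion, with random stand-ins for the 24 Landsat/DEM features (Section 3.2.1) and the 24 FC7 activations matched by (longitude, latitude); note that C = 10 mirrors the plain SVM setting, since the paper does not report a separate value for SVM-fusion:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for the matched features of N samples over the five HR-GEI types.
features_30m = rng.random((300, 24))   # normalized 30-m resolution features
cnn_features = rng.random((300, 24))   # FC7 activations for the same coordinates
labels = rng.integers(0, 5, 300)

fused = np.concatenate([features_30m, cnn_features], axis=1)  # (N, 48)
svm_fusion = SVC(C=10, probability=True)
svm_fusion.fit(fused, labels)
```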

3.4.2. Decision fusion of four classifiers

For each of the above classifiers (i.e., RF, SVM, CNN, and SVM-fusion), we obtained the predicted probability of each land cover type for each validation sample. To improve the overall land cover classification results, we proposed a rule-based decision fusion method to integrate the classification results of the classifiers. For each sample in the original validation dataset (8761 samples), if its land cover type is predicted as Cloud or Snow/Ice by RF and SVM, or its coordinate cannot be found in the high-resolution sample dataset, then its predicted land cover type is determined by the RF and SVM classifiers. Otherwise, the predicted land cover type of the validation sample is determined by all four classifiers. In both cases, we calculated the average probability over the two or four classifiers for each land cover type. The final prediction is the land cover type with the highest average probability. This decision fusion rule was also applied to large-scale land cover mapping, as explained in Section 3.5.
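A minimal sketch of this fusion rule for one sample. It assumes each classifier's output is an 11-vector over a shared class order, with the CNN and SVM-fusion 5-class probabilities padded to 11 entries with zeros, and it reads the "predicted as Cloud or Snow/Ice by RF and SVM" condition off the two-way average; both of these readings are assumptions, since the paper does not spell out these details:

```python
import numpy as np

CLASSES = ["Cropland", "Forest", "Grassland", "Shrubland", "Wetland", "Water",
           "Tundra", "Impervious", "Bare land", "Snow/Ice", "Cloud"]

def fuse(p_rf, p_svm, p_cnn=None, p_svmf=None):
    """Rule-based decision fusion for one sample; each p_* is an 11-vector over CLASSES.

    p_cnn / p_svmf are None when no HR-GEI patch exists for the coordinate.
    """
    two_way = np.mean([p_rf, p_svm], axis=0)
    rf_svm_type = CLASSES[int(np.argmax(two_way))]
    # Fall back to RF+SVM if the HR patch is missing or the 30-m classifiers
    # call the sample Cloud or Snow/Ice.
    if p_cnn is None or p_svmf is None or rf_svm_type in {"Cloud", "Snow/Ice"}:
        return rf_svm_type
    four_way = np.mean([p_rf, p_svm, p_cnn, p_svmf], axis=0)
    return CLASSES[int(np.argmax(four_way))]
```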

3.5. Large-scale land cover mapping using multi-resolution datasets

Our proposed large-scale land cover mapping method consists of land cover prediction using the four classifiers and result integration based on the proposed decision fusion method. For traditional classifier-based land cover prediction, we used the optimized RF and SVM models obtained in Section 3.2.2 to predict the land cover type of each pixel coordinate in the Landsat image, based on the 24 original features. For CNN-based land cover prediction, we used the optimized CNN model obtained in Section 3.3.3 to predict the land cover type of each pixel coordinate in the Landsat image, based on its corresponding image patch extracted from a high-resolution Google Earth image. For SVM-fusion-based land cover prediction, we integrated the 24 original features (described in Section 3.2.1) with the 24 features extracted from the CNN for each pixel coordinate in the Landsat image, and then made a prediction for the pixel coordinate using the optimized SVM-fusion model obtained in Section 3.4.1. For decision fusion-based result integration, we integrated the land cover types predicted by each method according to the decision fusion rule described in Section 3.4.2. The workflow of our proposed method for land cover mapping is summarized in Fig. 5.
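A minimal sketch of the batched RF + SVM branch of this workflow for one tile; the CNN and SVM-fusion branches follow the per-sample fusion sketch above. The feature array is a random stand-in, and rf/svm are assumed from the Section 3.2.2 sketch:

```python
import numpy as np

H, W = 2019, 1995   # tile size shown in Fig. 5
# One normalized 24-feature vector per pixel, built as in Section 3.2.1 (stand-in here).
features = np.random.default_rng(0).random((H * W, 24))

p_rf = rf.predict_proba(features)      # (H*W, 11)
p_svm = svm.predict_proba(features)    # (H*W, 11)
label_map = np.argmax((p_rf + p_svm) / 2.0, axis=1).reshape(H, W).astype(np.uint8)
```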


Fig. 5. The workflow of our proposed method for land cover mapping. The size of HR-GEI is 128 × 128 pixels. The size of 30-m images and the land cover map is 2019 × 1995 pixels.


4. Experimental results

4.1. The land cover classification results of the CNN-based method without using high-resolution images

In this section, we analyze the experimental results of the land cover classification based on the training and validation sample datasets. Before analyzing the experimental results of the CNN-based method using multi-resolution images (high-resolution Google Earth images and 30-m resolution data), we first analyze the results of the CNN-based method when using only single-resolution (30-m) images (denoted CNN-SRI). The RF and SVM classifiers of CNN-SRI are the same as those described in Section 3.2. For the CNN classifier, we extracted image patches of 25 × 25 pixels, with 7 Landsat spectral bands (surface reflectance) for each sample. Had we used larger image patches, most patches would have contained multiple land cover types, interfering with identification of the land cover type of the center coordinate. As each image patch has 7 spectral bands, we could not use the pre-trained VGG-16 model based on the ImageNet dataset to initialize the weights of our proposed CNN. The results obtained by the VGG-16-based CNN were very poor for the 30-m resolution dataset, which might be due to the limited number of sample images. Thus, we selected an AlexNet-based CNN (including 5 convolutional layers and 3 fully connected layers, with the max-pooling layers removed due to the limited size of the input images) for the 30-m resolution dataset and optimized its main hyper-parameters (the number of convolution kernels and the number of hidden neurons in the first fully connected layer). As in the high-resolution image-based CNN, the number of hidden neurons in the second fully connected layer was set to 24 to balance the contributions of the CNN-extracted features and the original 30-m resolution features in the SVM-fusion classifier. The CNN achieves its highest accuracy of 69.94% when the numbers of convolution kernels in the five convolutional layers and of neurons in the three fully connected layers are 20-45-35-35-35-50-24-11, which is slightly higher than the accuracies obtained by RF (67.82%) and SVM (69.64%) when using only the 7 spectral bands of each single pixel as input features. The SVM-fusion classifier and the rule-based decision fusion of the CNN-SRI method are the same as those described in Section 3.4. We compared the classification results obtained by the CNN-SRI method with those obtained by the RF and SVM methods, which have been widely used and have achieved higher accuracies than other traditional classifiers (e.g., the maximum likelihood and decision tree classifiers) in many large-scale land cover mapping studies (Gong et al., 2013; Yu et al., 2013; Yu et al., 2014; Li et al., 2017). The input features and parameter settings of the RF-based method were the same as those used in FROM-GLC-2015 (Gong et al., 2017). We also calculated the classification result obtained by the ensemble of RF and SVM (a part of our proposed method, denoted RF + SVM) for reference.

Table 3
The confusion matrix of the classification results obtained by the RF method. Rows are reference types and columns are predicted types. UA denotes the User's Accuracy and PA denotes the Producer's Accuracy (for Tables 3–9).

Name        Cropland  Forest  Grassland  Shrubland  Wetland  Water  Tundra  Impervious  Bare land  Snow/Ice  Cloud  SUM    UA (%)
Cropland    1094      153     144        11         1        2      0       27          4          0         10     1446   75.66
Forest      125       1381    42         17         0        0      0       3           0          18        7      1593   86.69
Grassland   166       57      826        14         0        0      0       24          184        5         12     1288   64.13
Shrubland   68        145     88         18         1        1      0       11          18         0         0      350    5.14
Wetland     21        3       1          0          2        15     0       4           4          2         1      53     3.77
Water       0         2       0          0          1        100    0       2           0          1         0      106    94.34
Tundra      0         0       2          0          0        0      0       0           0          0         0      2      0.00
Impervious  82        3       12         0          1        0      0       112         2          0         1      213    52.58
Bare land   0         5       49         1          0        12     0       24          2133       2         11     2237   95.35
Snow/Ice    2         9       3          0          0        4      0       1           19         601       38     677    88.77
Cloud       11        4       5          0          0        2      0       2           25         14        733    796    92.09
SUM         1569      1762    1172       61         6        136    0       210         2389       643       813    8761
PA (%)      69.73     78.38   70.48      29.51      33.33    73.53  0.00    53.33       89.28      93.47     90.16
OA (%)      79.90


Table 4
The confusion matrix of the classification results obtained by the SVM method. Rows are reference types and columns are predicted types.

Name        Cropland  Forest  Grassland  Shrubland  Wetland  Water  Tundra  Impervious  Bare land  Snow/Ice  Cloud  SUM    UA (%)
Cropland    1141      153     101        11         2        2      0       20          5          0         11     1446   78.91
Forest      141       1388    25         8          0        0      0       1           0          19        11     1593   87.13
Grassland   164       67      789        17         1        0      0       17          214        6         13     1288   61.26
Shrubland   74        148     94         12         1        0      0       10          10         0         1      350    3.43
Wetland     19        0       2          0          6        17     0       2           4          2         1      53     11.32
Water       2         1       1          0          2        98     0       0           1          1         0      106    92.45
Tundra      0         0       2          0          0        0      0       0           0          0         0      2      0.00
Impervious  64        3       5          1          1        2      0       133         3          0         1      213    62.44
Bare land   2         10      36         1          1        9      0       26          2122       9         21     2237   94.86
Snow/Ice    2         11      1          0          0        4      0       1           19         613       26     677    90.55
Cloud       9         2       9          0          0        2      0       4           30         16        724    796    90.95
SUM         1618      1783    1065       50         14       134    0       214         2408       666       809    8761
PA (%)      70.52     77.85   74.08      24.00      42.86    73.13  0.00    62.15       88.12      92.04     89.49
OA (%)      80.20

Table 5
The confusion matrix of the classification results obtained by the CNN-SRI method. Rows are reference types and columns are predicted types.

Name        Cropland  Forest  Grassland  Shrubland  Wetland  Water  Tundra  Impervious  Bare land  Snow/Ice  Cloud  SUM    UA (%)
Cropland    1144      148     104        12         0        3      0       16          8          0         11     1446   79.11
Forest      130       1385    32         15         0        0      0       0           0          22        9      1593   86.94
Grassland   150       62      816        10         0        0      0       17          215        6         12     1288   63.35
Shrubland   62        143     103        13         1        0      0       10          16         0         2      350    3.71
Wetland     20        0       1          0          5        20     0       3           1          2         1      53     9.43
Water       0         0       1          0          1        103    0       0           0          1         0      106    97.17
Tundra      0         0       2          0          0        0      0       0           0          0         0      2      0.00
Impervious  67        3       5          0          0        0      0       134         3          0         1      213    62.91
Bare land   2         5       28         0          0        8      0       13          2162       9         10     2237   96.65
Snow/Ice    2         10      1          0          0        3      0       1           15         624       21     677    92.17
Cloud       5         2       8          0          0        2      0       3           28         10        738    796    92.71
SUM         1582      1758    1101       50         7        139    0       197         2448       674       805    8761
PA (%)      72.31     78.78   74.11      26.00      71.43    74.10  0.00    68.02       88.32      92.58     91.68
OA (%)      81.31

Table 6
Summary of the results obtained by the RF, SVM, RF + SVM and CNN-SRI methods. AA denotes the Average Accuracy and OA denotes the Overall Accuracy (for Tables 6, 8, and 9).

                    RF                        SVM                       RF + SVM                  CNN-SRI
Land cover type     UA (%)   PA (%)  AA (%)   UA (%)   PA (%)  AA (%)   UA (%)   PA (%)  AA (%)   UA (%)   PA (%)  AA (%)
Cropland            75.66    69.73   72.69    78.91    70.52   74.71    79.25    72.67   75.96    79.11    72.31   75.71
Forest              86.69    78.38   82.53    87.13    77.85   82.49    87.19    78.43   82.81    86.94    78.78   82.86
Grassland           64.13    70.48   67.30    61.26    74.08   67.67    63.12    73.91   68.52    63.35    74.11   68.73
Shrubland           5.14     29.51   17.33    3.43     24.00   13.71    3.71     27.08   15.40    3.71     26.00   14.86
Wetland             3.77     33.33   18.55    11.32    42.86   27.09    9.43     71.43   40.43    9.43     71.43   40.43
Water               94.34    73.53   83.93    92.45    73.13   82.79    95.28    74.26   84.77    97.17    74.10   85.64
Tundra              0.00     0.00    0.00     0.00     0.00    0.00     0.00     0.00    0.00     0.00     0.00    0.00
Impervious          52.58    53.33   52.96    62.44    62.15   62.30    61.03    64.68   62.85    62.91    68.02   65.47
Bare land           95.35    89.28   92.32    94.86    88.12   91.49    96.11    88.15   92.13    96.65    88.32   92.48
Snow/Ice            88.77    93.47   91.12    90.55    92.04   91.29    91.88    91.74   91.81    92.17    92.58   92.38
Cloud               92.09    90.16   91.12    90.95    89.49   90.22    92.21    91.29   91.75    92.71    91.68   92.20
OA (%)              79.90                     80.20                     81.08                     81.31

Tables 3–5 show the confusion matrices of the land cover classification results obtained by the RF, SVM, and CNN-SRI methods, in which the User's Accuracy (UA) represents the number of correctly classified samples divided by the total number of samples in the ground truth, and the Producer's Accuracy (PA) represents the number of correctly classified samples divided by the total number of samples classified as the given land cover type. Table 6 summarizes the UA, PA, the average accuracy (AA, the average of UA and PA) of each land cover type, and the overall accuracy (OA, the number of correctly classified validation samples divided by the total number of validation samples) obtained by the RF, SVM, RF + SVM, and CNN-SRI methods. The CNN-SRI method improves the OA by only 1.41%, 1.11%, and 0.23% compared with RF, SVM, and RF + SVM, respectively. Moreover, the accuracy obtained by the SVM-fusion classifier (based on the fusion of the 24 original features and the 24 features extracted from the CNN) is 78.61%, which is even lower than the accuracies obtained by the RF and SVM classifiers (based on the 24 original features). We can conclude that the effect of the proposed CNN-based method is very limited when only 30-m resolution data are used: the spatial features extracted from the 30-m resolution Landsat images provide little useful information for improving land cover classification results in China.
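A minimal sketch of these metrics computed from a confusion matrix laid out as in Tables 3–5 (rows are reference types, columns are predicted types), following this paper's UA/PA convention:

```python
import numpy as np

def accuracy_metrics(cm: np.ndarray):
    """cm[i, j] = number of samples of reference type i predicted as type j."""
    diag = np.diag(cm).astype(float)
    ua = diag / np.maximum(cm.sum(axis=1), 1)   # per-type UA: correct / reference total
    pa = diag / np.maximum(cm.sum(axis=0), 1)   # per-type PA: correct / predicted total
    aa = (ua + pa) / 2.0                        # average accuracy per type
    oa = diag.sum() / cm.sum()                  # overall accuracy
    return ua, pa, aa, oa
```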


4.2. The land cover classification results of the CNN-based method using multi-resolution images

In this section, we analyze the land cover classification results of our proposed CNN-based method using multi-resolution images (CNN-MRI). Section 3 describes the process of CNN-MRI in detail. Besides the overall accuracy on the entire validation sample dataset, we also calculated the accuracies of the four classifiers on the samples of the five selected land cover types (Cropland, Forest, Grassland, Shrubland, and Impervious). The accuracies obtained by the RF and SVM classifiers are 74.58% and 75.83%, respectively. The accuracies obtained by the CNN (based only on high-resolution images) and SVM-fusion (based on the fusion of the 24 original features and the 24 features extracted from the CNN) classifiers are 72.54% and 76.03%. We can conclude that combining 30-m resolution data with high-resolution Google Earth images effectively improves the land cover classification results compared with using single-source data alone. Table 7 shows the confusion matrix of the land cover classification results obtained by the CNN-MRI method. Table 8 summarizes the UA, PA, AA, and OA obtained by the RF, SVM, RF + SVM, and CNN-MRI methods. We summarize the average accuracy of each type (except Tundra) obtained by RF, SVM, and our proposed methods in Fig. 6, to clearly demonstrate the effect of our proposed method on different land cover types. The proposed CNN-MRI method achieves a classification accuracy of 84.40%, outperforming the RF, SVM, and RF + SVM methods by 4.50%, 4.20%, and 3.32%, respectively. We can conclude that the accuracy improvement obtained by our proposed method is remarkable when integrating 30-m resolution data with high-resolution Google Earth images, compared with using 30-m resolution data alone. The CNN-MRI method achieves the highest average accuracy among the four methods for all land cover types except Tundra, whose classification accuracy is always zero due to the limited number of samples. For Forest, Water, Bare land, Snow/Ice, and Cloud, the average accuracies of the CNN-MRI method are slightly better than those of the RF and SVM methods. For Cropland, Grassland, Shrubland, Wetland, and Impervious, the CNN-MRI method improves the average accuracies by 7.34%, 9.40%, 3.83%, 25.22%, and 23.00% compared with the RF method, and by 5.31%, 9.04%, 7.44%, 16.68%, and 13.67% compared with the SVM method. The accuracy improvements of the Cropland, Forest, Grassland, Shrubland, and Impervious types benefit from the integration with high-resolution images. The improvement of the Forest type is not as significant as that of the other four types, as its average accuracy is already over 80% with the RF and SVM classifiers. For the Wetland, Water, Bare land, Snow/Ice, and Cloud types, the accuracy improvements benefit from the ensemble of different classifiers: although no high-resolution images are used for these land cover types, the improvements of the four vegetation types and the impervious type lead to fewer misclassifications among land cover types and thus contribute to the accuracy improvements of these types to some extent. In the definition of the land cover types (Gong et al., 2013), shrubland and forest are distinguished only by height: a vegetation sample higher than 5 m is defined as forest; otherwise, it is defined as shrubland. Moreover, in many places in China, shrubs are mixed with or spatially adjacent to forests or grasslands. Consequently, there is serious confusion between shrubland and forests or grasslands, resulting in the low classification accuracy of shrubland in the land cover mapping results obtained by all methods.


Table 7
The confusion matrix of the classification results obtained by the CNN-MRI method. Rows are reference types and columns are predicted types.

Name        Cropland  Forest  Grassland  Shrubland  Wetland  Water  Tundra  Impervious  Bare land  Snow/Ice  Cloud  SUM    UA (%)
Cropland    1158      107     109        21         0        0      0       39          0          0         12     1446   80.08
Forest      73        1404    35         43         0        0      0       7           0          23        8      1593   88.14
Grassland   120       65      1002       27         0        0      0       16          41         6         11     1288   77.80
Shrubland   46        138     123        41         0        0      0       0           1          0         1      350    11.71
Wetland     20        0       5          0          4        17     0       1           3          2         1      53     7.55
Water       2         1       2          0          1        99     0       0           0          1         0      106    93.40
Tundra      0         0       2          0          0        0      0       0           0          0         0      2      0.00
Impervious  22        3       7          0          0        0      0       180         0          0         1      213    84.51
Bare land   0         7       30         1          0        8      0       20          2150       10        11     2237   96.11
Snow/Ice    2         10      1          0          0        3      0       1           15         622       23     677    91.88
Cloud       5         2       9          1          0        2      0       3           27         13        734    796    92.21
SUM         1448      1737    1325       134        5        129    0       267         2237       677       802    8761
PA (%)      79.97     80.83   75.62      30.60      80.00    76.74  0.00    67.42       96.11      91.88     91.52
OA (%)      84.40

Table 8
Summary of the results obtained by the RF, SVM, RF + SVM and CNN-MRI methods.

                    RF                        SVM                       RF + SVM                  CNN-MRI
Land cover type     UA (%)   PA (%)  AA (%)   UA (%)   PA (%)  AA (%)   UA (%)   PA (%)  AA (%)   UA (%)   PA (%)  AA (%)
Cropland            75.66    69.73   72.69    78.91    70.52   74.71    79.25    72.67   75.96    80.08    79.97   80.03
Forest              86.69    78.38   82.53    87.13    77.85   82.49    87.19    78.43   82.81    88.14    80.83   84.48
Grassland           64.13    70.48   67.30    61.26    74.08   67.67    63.12    73.91   68.52    77.80    75.62   76.71
Shrubland           5.14     29.51   17.33    3.43     24.00   13.71    3.71     27.08   15.40    11.71    30.60   21.16
Wetland             3.77     33.33   18.55    11.32    42.86   27.09    9.43     71.43   40.43    7.55     80.00   43.77
Water               94.34    73.53   83.93    92.45    73.13   82.79    95.28    74.26   84.77    93.40    76.74   85.07
Tundra              0.00     0.00    0.00     0.00     0.00    0.00     0.00     0.00    0.00     0.00     0.00    0.00
Impervious          52.58    53.33   52.96    62.44    62.15   62.30    61.03    64.68   62.85    84.51    67.42   75.96
Bare land           95.35    89.28   92.32    94.86    88.12   91.49    96.11    88.15   92.13    96.11    96.11   96.11
Snow/Ice            88.77    93.47   91.12    90.55    92.04   91.29    91.88    91.74   91.81    91.88    91.88   91.88
Cloud               92.09    90.16   91.12    90.95    89.49   90.22    92.21    91.29   91.75    92.21    91.52   91.87
OA (%)              79.90                     80.20                     81.08                     84.40


Fig. 6. The average accuracy of each type obtained by RF, SVM and our proposed methods.



Fig. 7. The locations and Landsat false color images of the five regions for evaluating the land cover mapping results. The black squares in each region denote the locations of five zoomed areas for result evaluation. The land cover maps of these regions are shown in Figs. 8 and 9.

4.3. The large-scale land cover mapping results of each method

To evaluate the performance of our proposed method for large-scale land cover mapping, we compared and analyzed the land cover maps of five regions (selected from across China) obtained by the RF, SVM, and our proposed CNN-MRI methods. Fig. 7 shows the locations and Landsat false color images (the combination of bands 3, 4, and 5 of Landsat 8) of the five selected regions (with a total area of about 14,000 km2); the corresponding land cover maps can be found in Fig. 8. We also selected five zoomed areas for comparing the results of each method in detail (denoted by black squares in Fig. 7); their land cover maps and the corresponding Google Earth images can be found in Fig. 9. From Figs. 8 and 9, we can see that the land cover mapping results of our proposed method are more consistent with the reference satellite images than those obtained by RF and SVM. The results are compared and analyzed in detail as follows. In the land cover maps obtained by our method, there are fewer misclassifications between Cropland and Forest (the red polygons in zoomed areas (a) and (c), the top right polygon in zoomed area (e), and the left polygon in zoomed area (d)), between Cropland and Grassland (the left polygon in zoomed area (e)), and between Cropland and Impervious (the red polygons in zoomed area (b) and the right polygon in zoomed area (d)). On the other hand, owing to the difference between patch-based and pixel-based classification, the RF and SVM methods outperform our proposed method in some small areas whose land cover types differ from the surrounding areas. For instance, narrow roads cannot be correctly distinguished from cropland in some regions (the blue polygons in zoomed area (b)), and the impervious type tends to be over-classified at its boundaries with other land cover types (the blue polygons in zoomed area (a)). These issues should be addressed in future research.

5. Discussion

5.1. The land cover classification results using different sizes of sample images

In this section, we analyze the impact of the sample image size on the land cover classification results of our proposed CNN-MRI method. Table 9 shows the classification results of the CNN-MRI method when the high-resolution (0.35–0.6 m) sample images are cropped to 64 × 64, 128 × 128, and 256 × 256 pixels, which cover about 1 × 1, 2 × 2, and 4 × 4 times the actual ground extent of the corresponding pixel in the 30-m resolution sample image, respectively. The classification accuracies of the CNN-MRI method are 84.08%, 84.40%, and 85.24% for sample image sizes of 64 × 64, 128 × 128, and 256 × 256 pixels, respectively. Fig. 10 shows the land cover mapping results obtained with the different patch sizes; Google Earth images and the land cover mapping results of RF and SVM are also displayed for reference. From Table 9 and Fig. 10, we can conclude that if the image patch size is too small, it may be hard to identify the land cover type from the limited information in the patch. For instance, some cropland image patches are relatively easy to identify in large sample images but difficult to distinguish from grassland in 64 × 64 pixel patches. On the other hand, if the image patch size is too large, a patch is more likely to contain multiple land cover types, which interferes with identifying the land cover type of the center coordinate. For instance, a 256 × 256 pixel patch with cropland at its center (i.e., of the cropland type) but buildings in its corner may be mistakenly classified as impervious, resulting in serious over-classification of the impervious type. Taking both the land cover classification accuracy and the quality of the land cover mapping results into consideration, we selected 128 × 128 pixels as the sample image size for land cover classification and the patch size for land cover mapping in this study. The sizes in meters of our sample images are between 44.8 × 44.8 m and 76.8 × 76.8 m.


Fig. 8. Landsat false color images and land cover maps of RF, SVM and our method. The locations of these images can be found in Fig. 7. The scales of these images are the same as those shown in Fig. 7.


5.2. The land cover classification results using different combinations of classifiers

In this section, we analyze the impact of different combinations of classifiers on the land cover classification accuracy for the 8761 validation samples. Fig. 11 shows the overall classification accuracies obtained by different combinations of classifiers; descriptions of the RF, SVM, CNN, and SVMF (SVM-fusion) classifiers can be found in Sections 3.2, 3.3, and 3.4. As the CNN and SVM-fusion classifiers are based on samples of five land cover types, these two classifiers need to be combined with at least one of the RF and SVM classifiers to calculate the overall classification accuracy on the entire validation dataset. Each combination of classifiers employs a decision fusion rule similar to that described in Section 3.4.2. Taking SVM + CNN + SVMF as an example, for each sample in the validation dataset, if its land cover type is predicted as Cloud or Snow/Ice by SVM, or its coordinate cannot be found in the high-resolution sample dataset, then its land cover type is determined by the SVM classifier alone; otherwise, it is determined by the SVM, CNN, and SVM-fusion classifiers. In both cases, we calculated the average probability over the one or three classifiers for each land cover type; the final prediction is the land cover type with the highest average probability. Fig. 11 shows that RF + SVM obtains the lowest classification accuracy (81.08%) and our proposed method (RF + SVM + CNN + SVMF) obtains the highest classification accuracy (84.40%) among all combinations. The classification accuracies of all combinations are higher than those obtained by the individual RF (79.90%) and SVM (80.20%) methods. The accuracies obtained by combinations including CNN or SVMF are clearly higher than that of RF + SVM, even with the same number of classifiers. We can conclude that increasing the diversity of classifiers and features in decision fusion effectively improves the land cover classification results.

5.3. The computational efficiency of our proposed method

In this study, the training and land cover mapping of RF, SVM, and SVM-fusion were based on the Scikit-learn library (Pedregosa et al., 2011) and a 12-core CPU hardware platform. The training and land cover mapping of the CNN were based on the Caffe deep learning framework (Jia et al., 2014) and a single NVIDIA Titan V GPU card. For sample dataset-based land cover classification, it takes less than 2 h to train all classifiers. For land cover mapping based on large-scale remote sensing images, it takes about one day to classify an image of about 3000 km2 using our proposed CNN-MRI method; most of this time is spent on file reading and writing rather than on land cover type prediction. The implementation of CNN-MRI will be further optimized to reduce the land cover mapping time in our future work. Despite the long classification time of the current implementation, the land cover mapping time can remain the same when scaling up to the whole of China given sufficient hardware, as the classification of each image is independent and multiple images can easily be processed in parallel.
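Because each tile is classified independently, scaling out is embarrassingly parallel. A minimal sketch using Python's standard library (classify_tile's body and the tile list are illustrative assumptions; the worker count matches the 12-core platform described above):

```python
from concurrent.futures import ProcessPoolExecutor

def classify_tile(tile_path: str) -> str:
    """Classify one tile with the four classifiers + decision fusion; return the output path."""
    # ... load the tile, predict per pixel, apply the fusion rule, write the map ...
    return tile_path.replace(".tif", "_landcover.tif")

if __name__ == "__main__":
    tile_paths = ["tile_001.tif", "tile_002.tif"]  # illustrative tile list
    with ProcessPoolExecutor(max_workers=12) as pool:
        for out in pool.map(classify_tile, tile_paths):
            print("wrote", out)
```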

5.4. Potential strategies for further improving the large-scale land cover classification results

In this section, we discuss some potential issues related to our proposed method and several strategies for improving the large-scale land cover classification results. The first issue concerns the inconsistent acquisition dates of the sample images. The training and validation samples used in this study were selected from the all-season sample dataset used for global land cover mapping (Li et al., 2017), covering the whole of China. For many samples, the acquisition date of the available high-resolution Google Earth image differs from the actual acquisition date of the corresponding 30-m resolution Landsat image. Although we removed the high-resolution samples of types (i.e., Snow/Ice and Cloud) that vary strongly with time, there still exist samples for which the land cover type of the Google Earth image differs from the one interpreted from the 30-m resolution Landsat image. This issue may introduce some errors into the CNN classifier. The large-scale land cover classification results could be further improved if high-resolution images acquired on the same date as the Landsat images were used in our proposed method. The second issue concerns the interpretation of the sample dataset. In our sample dataset, the training and validation samples are distributed discretely across the whole of China. Each sample consists of one pixel in a Landsat image, interpreted as one land cover type. To identify the single land cover type of each sample pixel with the CNN classifier, we need to collect its corresponding image patch at a suitable size with enough features, and such a patch may contain multiple land cover types. This issue may aggravate misclassification at the boundaries between land cover types. If pixel-wise interpreted datasets become available for large-scale land cover studies in the future, more state-of-the-art deep learning algorithms (e.g., semantic segmentation models) could be explored to reduce misclassification at land cover boundaries and improve large-scale land cover mapping results. The third issue concerns the design of our proposed CNN-MRI method. In the CNN classifier of our proposed method, we selected 128 × 128 pixels as the sample image size for land cover classification and the patch size for land cover mapping, taking both the classification accuracy and the quality of the land cover mapping results into consideration. Using multi-scale image patches might improve the land cover classification accuracy to some extent, but it would also increase the processing time of land cover mapping. In the feature fusion and classifier aggregation of our proposed method, we integrated the 30-m resolution features with the same number of high-resolution features obtained by the CNN, and we used the unweighted average of the probabilities obtained by the different classifiers. The land cover classification accuracy might be improved if different feature fusion and decision fusion strategies were utilized. These aspects will be explored in our future research.

5.4. Potential strategies for further improving the large-scale land cover classification results

In this section, we discuss some potential issues related to our proposed method and several strategies for further improving the large-scale land cover classification results.

The first issue concerns the inconsistent acquisition dates of the sample images. The training and validation samples used in this study were selected from the all-season sample dataset used for global land cover mapping (Li et al., 2017), covering the whole area of China. For many samples, the acquisition date of the available high-resolution Google Earth image differs from the acquisition date of its corresponding 30-m resolution Landsat image. Although we removed the high-resolution samples of types that vary strongly with time (i.e., Snow/Ice and Cloud), there still exist some samples for which the land cover type of the Google Earth image differs from the one interpreted from the 30-m resolution Landsat image. This issue might introduce some errors into the CNN classifier. The large-scale land cover classification results could be further improved if high-resolution images acquired on the same date as the Landsat image were used in our proposed method.

The second issue concerns the interpretation of the sample dataset. In our sample dataset, the training and validation samples are distributed discretely across the whole area of China, and each sample consists of one pixel in a Landsat image interpreted as a single land cover type. In order to identify the land cover type of each sample pixel with the CNN classifier, we need to collect its corresponding image patch at a size large enough to provide sufficient contextual features, and such a patch might contain multiple land cover types (a sketch of this patch collection follows this section). This issue might aggravate the misclassification at the boundaries between different land cover types. If pixel-wise interpreted datasets become available for large-scale land cover studies in the future, more state-of-the-art deep learning algorithms (e.g., semantic segmentation models) could be explored to reduce the misclassification at land cover boundaries and improve large-scale land cover mapping results.

The third issue concerns the design of our proposed CNN-MRI method. In the CNN classifier, we selected 128 × 128 pixels as the optimized sample image size for land cover classification and as the patch size for land cover mapping, taking both the accuracy of land cover classification and the quality of the land cover mapping results into consideration. Using multi-scale image patches might improve the land cover classification accuracy to some extent, but it would also increase the processing time of land cover mapping. In the feature fusion and classifier aggregation of our proposed method, we integrated the 30-m resolution features with the same number of high-resolution features obtained by the CNN, and we used the unweighted average of the probabilities obtained by the different classifiers. The land cover classification accuracy might be further improved if different feature fusion and decision fusion strategies were utilized. The above-mentioned aspects will be explored in our future research.
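As a concrete illustration of the patch collection discussed under the second issue, the sketch below crops a fixed-size high-resolution window centered on one 30-m sample pixel. The scale factor, array layout, and border handling are our assumptions for illustration and are not taken from the paper.

```python
# A minimal sketch of extracting a high-resolution patch centred on one
# 30-m sample pixel; the scale factor and border handling are assumptions.
import numpy as np

def extract_patch(hr_image, row_30m, col_30m, scale=60, patch=128):
    """Crop a patch x patch window of the co-registered Google Earth image
    centred on the Landsat pixel at (row_30m, col_30m).

    scale: number of high-resolution pixels per 30-m pixel
    (e.g., about 60 for ~0.5-m imagery).
    """
    # Centre of the 30-m pixel in high-resolution pixel coordinates.
    cy = row_30m * scale + scale // 2
    cx = col_30m * scale + scale // 2
    half = patch // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    window = hr_image[y0:y0 + patch, x0:x0 + patch]
    # Pad at scene borders so the CNN always receives a full-size patch;
    # near class boundaries the window may still mix several cover types.
    pad_y, pad_x = patch - window.shape[0], patch - window.shape[1]
    return np.pad(window, ((0, pad_y), (0, pad_x), (0, 0)), mode="edge")
```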

6. Conclusions

In this paper, we proposed a novel deep CNN-based method for improving 30-m resolution land cover mapping in China using multi-source remote sensing images (i.e., Landsat images, Digital Elevation Model data, and high-resolution Google Earth images). In the land cover classification based on the training and validation datasets, we first used the 30-m resolution sample dataset to train and evaluate the RF and SVM classifiers. Second, we designed a CNN classifier for land cover classification and feature extraction based on high-resolution images. Moreover, we built an SVM-fusion classifier based on the integration of the original 30-m resolution features with the high-resolution features extracted from the CNN classifier. Finally, we proposed a rule-based decision fusion method to aggregate the classification results obtained by each classifier. In the large-scale land cover mapping process, we first prepared the 30-m resolution datasets for the RF and SVM classifiers and the high-resolution Google Earth images for the CNN and SVM-fusion classifiers. Second, we utilized the optimized RF, SVM, CNN, and SVM-fusion models to predict the large-scale land cover maps of each classifier. Finally, we applied the proposed decision fusion method to the land cover maps of each classifier and obtained the final land cover mapping results.

Compared with the RF and SVM methods, our proposed CNN-MRI method achieves a higher classification accuracy on the entire validation dataset in China (4.50% and 4.20% higher than RF and SVM, respectively), and a higher or equal AA for all land cover types. In the land cover maps obtained by our method, there are far fewer misclassifications among the Cropland, Forest, Grassland, and Impervious types. In addition, we analyzed the land cover classification results of the CNN-based method without using high-resolution images, the effect of our proposed CNN-MRI method on different land cover types, and the influence of different combinations of classifiers and different patch sizes of HR-GEI on the land cover classification results.

In our future research, we will improve our proposed method in order to reduce the errors of the current land cover maps in some regions (e.g., narrow roads and boundaries between two land cover types). The potential of other deep learning algorithms (e.g., semantic segmentation models) for large-scale land cover mapping will be explored as well. Moreover, in order to apply our proposed method to broader or even global areas, we will improve its generalization ability by utilizing training and validation datasets collected from broader areas, and improve the computational efficiency of large-scale land cover mapping.

Fig. 9. Google Earth images and land cover maps of RF, SVM and our method in five zoomed areas. The locations of these areas can be found in Fig. 7. The red polygons denote some examples where our method outperforms RF and SVM. The blue polygons denote some examples where RF and SVM outperform our method. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 9
Classification results of the CNN-MRI method using different sizes of sample images.

Land cover type    64 × 64 pixels               128 × 128 pixels             256 × 256 pixels
                   UA (%)   PA (%)   AA (%)     UA (%)   PA (%)   AA (%)     UA (%)   PA (%)   AA (%)
Cropland           80.98    76.74    78.86      80.08    79.97    80.03      83.96    79.76    81.86
Forest             87.26    83.08    85.17      88.14    80.83    84.48      86.13    84.38    85.25
Grassland          76.48    75.83    76.15      77.80    75.62    76.71      80.36    78.00    79.18
Shrubland          13.71    32.00    22.86      11.71    30.60    21.16      21.43    35.38    28.40
Wetland             7.55    80.00    43.77       7.55    80.00    43.77       7.55    80.00    43.77
Water              93.40    76.74    85.07      93.40    76.74    85.07      94.34    77.52    85.93
Tundra              0.00     0.00     0.00       0.00     0.00     0.00       0.00     0.00     0.00
Impervious         83.10    58.22    70.66      84.51    67.42    75.96      81.22    67.32    74.27
Bare land          96.11    96.50    96.30      96.11    96.11    96.11      96.24    96.42    96.33
Snow/Ice           90.99    92.35    91.67      91.88    91.88    91.88      91.14    92.23    91.68
Cloud              91.21    93.08    92.14      92.21    91.52    91.87      91.08    92.83    91.96
OA (%)             84.08                        84.40                        85.24

Fig. 10. The land cover mapping results of RF, SVM and our proposed method when using different sizes of image patch. Google Earth images are displayed for reference.


Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2017YFA0604401), the National Natural Science Foundation of China (Grant Nos. 51761135015 and U1839206), the Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao), the Cyrus Tang Foundation, and a donation by Delos LLC to Tsinghua University.

Fig. 11. The overall classification accuracies obtained by different combinations of classifiers.

References

Amani, M., Salehi, B., Mahdavi, S., Granger, J.E., Brisco, B., Hanson, A., 2017. Wetland classification using multi-source and multi-temporal optical remote sensing data in Newfoundland and Labrador, Canada. Can. J. Remote. Sens. 43 (4), 360–373.
Bartholome, E., Belward, A.S., 2005. GLC2000: a new approach to global land cover mapping from Earth observation data. Int. J. Remote Sens. 26, 1959–1977.
Bontemps, S., Defourny, P., Van Bogaert, E., Arino, O., Kalogirou, V., Perez, J.R., 2011. GLOBCOVER 2009 Products description and validation report. URL. http://ionia1.esrin.esa.int/docs/GLOBCOVER2009_Validation_Report_2.2.pdf.
Cao, X., Liu, Y., Liu, Q., Cui, X., Chen, X., Chen, J., 2018. Estimating the age and population structure of encroaching shrubs in arid/semiarid grasslands using high spatial resolution remote sensing imagery. Remote Sens. Environ. 216, 572–585.
Cheng, G., Han, J., 2016. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 117, 11–28.
Chen, J., Chen, J., Liao, A., Cao, X., Chen, L., Chen, X., He, C., Han, G., Peng, S., Lu, M., Zhang, W., 2015. Global land cover mapping at 30 m resolution: a POK-based operational approach. ISPRS J. Photogramm. Remote Sens. 103, 7–27.
Chen, B., Huang, B., Xu, B., 2017. Multi-source remotely sensed data fusion for improving land cover classification. ISPRS J. Photogramm. Remote Sens. 124, 27–39.
Cheng, G., Zhou, P., Han, J., 2016. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 54 (12), 7405–7415.
Cheng, G., Han, J., Lu, X., 2017. Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105 (10), 1865–1883.
Gessner, U., Machwitz, M., Esch, T., Tillack, A., Naeimi, V., Kuenzer, C., Dech, S., 2015. Multi-sensor mapping of West African land cover using MODIS, ASAR and TanDEM-X/TerraSAR-X data. Remote Sens. Environ. 164, 282–297.
Gong, P., Wang, J., Yu, L., Zhao, Y., Zhao, Y., Liang, L., Niu, Z., Huang, X., Fu, H., Liu, S., Li, C., 2013. Finer resolution observation and monitoring of global land cover: first mapping results with Landsat TM and ETM+ data. Int. J. Remote Sens. 34 (7), 2607–2654.
Gong, P., Yu, L., Li, C., Wang, J., Liang, L., Li, X., Ji, L., Bai, Y., Cheng, Y., Zhu, Z., 2016. A new research paradigm for global land cover mapping. Ann. GIS 22 (2), 87–102.
Gong, P., Wang, J., Ji, L., Yu, L., 2017. Landsat Based Land Cover Product for 2015 (FROM-GLC 2015 v0.1). https://doi.org/10.6084/m9.figshare.5362774.v1.
Hansen, M.C., DeFries, R.S., Townshend, J.R., Sohlberg, R., 2000. Global land cover classification at 1 km spatial resolution using a classification tree approach. Int. J. Remote Sens. 21 (6–7), 1331–1364.
Homer, C., Huang, C., Yang, L., Wylie, B., Coan, M., 2004. Development of a 2001 national land-cover database for the United States. Photogramm. Eng. Remote Sens. 70 (7), 829–840.
Hu, F., Xia, G.S., Hu, J., Zhang, L., 2015. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 7 (11), 14680–14707.
Huang, B., Zhao, B., Song, Y., 2018. Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery. Remote Sens. Environ. 214, 73–86.
Immitzer, M., Böck, S., Einzmann, K., Vuolo, F., Pinnel, N., Wallner, A., Atzberger, C., 2018. Fractional cover mapping of spruce and pine at 1 ha resolution combining very high and medium spatial resolution satellite imagery. Remote Sens. Environ. 204, 690–703.
Inglada, J., Vincent, A., Arias, M., Tardy, B., Morin, D., Rodes, I., 2017. Operational high resolution land cover map production at the country scale using satellite image time series. Remote Sens. 9 (1), 95.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., 2014. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia. ACM, pp. 675–678.
Li, W., Fu, H., Yu, L., Gong, P., Feng, D., Li, C., Clinton, N., 2016. Stacked autoencoder-based deep learning for remote-sensing image classification: a case study of African land-cover mapping. Int. J. Remote Sens. 37 (23), 5632–5646.
Li, C., Gong, P., Wang, J., Zhu, Z., Biging, G.S., Yuan, C., Hu, T., Zhang, H., Wang, Q., Li, X., Liu, X., 2017. The first all-season sample set for mapping global land cover with Landsat-8 data. Sci. Bull. 62 (7), 508–515.
Li, W., He, C., Fang, J., Zheng, J., Fu, H., Yu, L., 2019. Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source GIS data. Remote Sens. 11 (4), 403.



Liang, J., Gong, J., Li, W., 2018. Applications and impacts of Google Earth: a decadal review (2006–2016). ISPRS J. Photogramm. Remote Sens. 146, 91–107.
Liu, J.Y., Zhuang, D.F., Luo, D., Xiao, X.M., 2003. Land-cover classification of China: integrated analysis of AVHRR imagery and geophysical data. Int. J. Remote Sens. 24 (12), 2485–2500.
Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P., 2017. Convolutional neural networks for large-scale remote-sensing image classification. IEEE Trans. Geosci. Remote Sens. 55 (2), 645–657.
Mahdianpari, M., Salehi, B., Rezaee, M., Mohammadimanesh, F., Zhang, Y., 2018. Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery. Remote Sens. 10 (7), 1119.
Marcos, D., Volpi, M., Kellenberger, B., Tuia, D., 2018. Land cover mapping at very high resolution with rotation equivariant CNNs: towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 145, 96–107.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Vanderplas, J., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Sidike, P., Sagan, V., Maimaitijiang, M., Maimaitiyiming, M., Shakoor, N., Burken, J., Mockler, T., Fritschi, F.B., 2019. dPEN: deep Progressively Expanded Network for mapping heterogeneous agricultural landscape using WorldView-3 satellite imagery. Remote Sens. Environ. 221, 756–772.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint (arXiv:1409.1556).
Toure, S.I., Stow, D.A., Shih, H.C., Weeks, J., Lopez-Carr, D., 2018. Land cover and land use change analysis using multi-spatial resolution data and object-based image analysis. Remote Sens. Environ. 210, 259–268.
Xia, G.S., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., Zhang, L., Lu, X., 2017. AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 55 (7), 3965–3981.
Yang, J., Gong, P., Fu, R., Zhang, M., Chen, J., Liang, S., Xu, B., Shi, J., Dickinson, R., 2013. The role of satellite remote sensing in climate change studies. Nat. Clim. Chang. 3 (10), 875.
Yu, L., Gong, P., 2012. Google Earth as a virtual globe tool for Earth science applications at the global scale: progress and perspectives. Int. J. Remote Sens. 33 (12), 3966–3986.
Yu, L., Wang, J., Gong, P., 2013. Improving 30 m global land-cover map FROM-GLC with time series MODIS and auxiliary data sets: a segmentation-based approach. Int. J. Remote Sens. 34 (16), 5851–5867.
Yu, L., Wang, J., Li, X., Li, C., Zhao, Y., Gong, P., 2014. A multi-resolution global land cover dataset through multisource data aggregation. Sci. China Earth Sci. 57 (10), 2317–2329.
Yu, Y., Guan, H., Zai, D., Ji, Z., 2016. Rotation-and-scale-invariant airplane detection in high-resolution satellite images based on deep-Hough-forests. ISPRS J. Photogramm. Remote Sens. 112, 50–64.
Zhang, C., Pan, X., Li, H., Gardiner, A., Sargent, I., Hare, J., Atkinson, P.M., 2018a. A hybrid MLP-CNN classifier for very fine resolution remotely sensed image classification. ISPRS J. Photogramm. Remote Sens. 140, 133–144.
Zhang, C., Sargent, I., Pan, X., Li, H., Gardiner, A., Hare, J., Atkinson, P.M., 2018b. An object-based convolutional neural network (OCNN) for urban land use classification. Remote Sens. Environ. 216, 57–70.
Zhang, C., Sargent, I., Pan, X., Li, H., Gardiner, A., Hare, J., Atkinson, P.M., 2019. Joint Deep Learning for land cover and land use classification. Remote Sens. Environ. 221, 173–187.
Zhao, Y., Feng, D., Yu, L., Wang, X., Chen, Y., Bai, Y., Hernández, H.J., Galleguillos, M., Estades, C., Biging, G.S., Radke, J.D., 2016. Detailed dynamic land cover mapping of Chile: accuracy improvement by integrating multi-temporal data. Remote Sens. Environ. 183, 170–185.
Zhao, W., Du, S., Emery, W.J., 2017. Object-based convolutional neural network for high-resolution imagery classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 10 (7), 3386–3396.
