Estimating grassland LAI using the Random Forests approach and Landsat imagery in the meadow steppe of Hulunber, China


Journal of Integrative Agriculture 2017, 16(2): 286–297 Available online at www.sciencedirect.com

ScienceDirect

RESEARCH ARTICLE

LI Zhen-wang, XIN Xiao-ping, TANG Huan, YANG Fan, CHEN Bao-rui, ZHANG Bao-hui

National Hulunber Grassland Ecosystem Observation and Research Station, Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China

Abstract

Leaf area index (LAI) is a key parameter for describing vegetation structure and is closely associated with vegetative photosynthesis and energy balance. Accurate retrieval of LAI is important when modeling biophysical processes of vegetation and the productivity of earth systems. The Random Forests (RF) method aggregates an ensemble of decision trees to improve prediction accuracy and has shown more robust performance than other regression methods. This study evaluated the RF method for predicting grassland LAI using ground measurements and remote sensing data. Parameter optimization and variable reduction were conducted before model prediction. Two variable reduction methods were examined: the Variable Importance Value method and the principal component analysis (PCA) method. Finally, the sensitivity of RF to highly correlated variables was tested. The results showed that the RF parameters have a small effect on the performance of RF, and a satisfactory prediction was acquired with a root mean square error (RMSE) of 0.1956. The two variable reduction methods produced different results: variable reduction based on the Variable Importance Value method achieved nearly the same prediction accuracy as the full variable set, whereas variable reduction using the PCA method clearly degraded the result, which may have been caused by the loss of subtle variations and the fusion of noise information. After removing highly correlated variables, the relative variable importance remained steady, and the variables selected based on the best-performing vegetation index performed better than either the full set of variables or the variables selected based on the most important variable. The results of this study demonstrate the practical and powerful ability of the RF method in predicting grassland LAI. The method can also be applied to the estimation of other vegetation traits as an alternative to conventional empirical regression models, and to the selection of relevant variables used in ecological models.

Keywords: leaf area index, Random Forests, grassland, remote sensing, Hulunber

Received 15 October, 2015; Accepted 20 January, 2016
LI Zhen-wang, Tel: +86-10-82109618, E-mail: lizhenwang@126.com; Correspondence ZHANG Bao-hui, Tel: +86-10-82109618, E-mail: [email protected]
© 2017, CAAS. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
doi: 10.1016/S2095-3119(15)61303-X

1. Introduction

Grassland covers approximately one third of the global terrestrial surface (Fan et al. 2003; Lemaire et al. 2005), and plays an important role in the interactions among earth’s atmosphere, hydrosphere and continental surface. The study of grassland is necessary to understand global climatic change and terrestrial carbon cycling (Lemaire et al. 2005).

LI Zhen-wang et al. Journal of Integrative Agriculture 2017, 16(2): 286–297

Leaf area index (LAI) is a key parameter for describing vegetation structure and is closely associated with vegetative photosynthesis and energy balance (Running et al. 1989; Sellers et al. 1997). This index is adopted in a variety of ecosystem models of vegetation biophysical processes and earth system productivity. Remote sensing technology is an effective way to monitor terrestrial seasonal and inter-annual variability within regional to global domains, and provides a useful means for the timely monitoring and quantitative assessment of vegetation dynamics. The retrieval of LAI using remotely sensed satellite and aerial images has shown outstanding potential and has been widely employed in different fields.

A number of LAI estimation methods have been developed from remotely sensed data, each of which presents unique advantages and limitations. Among these methods, empirical regression linking LAI to spectral observations or to a combination of spectral observations (vegetation indices, VIs) is the most popular and commonly used approach. However, due to their empirical nature, these regression models are site- and sensor-specific, and their performance can be hampered by factors such as differences in surface properties and viewing geometry (Verrelst et al. 2008, 2010). Alternatively, machine learning methods, such as the artificial neural network (ANN) and the decision tree, are increasingly used; they make full use of the spectral information and minimize the estimation error through an adaptive learning process. These algorithms can describe intricate non-linear relationships and incorporate more ancillary information to find the best solutions (Verrelst et al. 2010). Physical approaches simulate the radiative transfer process in vegetation and describe the canopy spectral variation as a function of canopy, leaf and soil background characteristics. This explicit physical basis has led to widespread use of these approaches (Jacquemoud and Baret 1990; Jacquemoud et al. 2000; Fan et al. 2010; Fan et al. 2014). A drawback of physically based models is the ill-posed nature of model inversion. The complex parameterization and optimization procedures also hinder the application of these models (Combal et al. 2002; Atzberger 2004; Verrelst et al. 2010).

The Random Forests (RF) approach, introduced by Breiman (2001), is a relatively new approach developed from the decision tree. It combines an ensemble of decision trees to improve prediction accuracy and is more robust against overfitting and noisy data (Heung et al. 2014). By randomly varying the predictors and the training data for each decision tree, the algorithm increases the diversity of the ensemble and computes a prediction that is generally more accurate than that of any individual tree. Additionally, along with the RF prediction, an importance value is provided for each variable, so RF can also be used as a variable selection tool to identify informative variables (Genuer et al. 2010; Hapfelmeier and Ulm 2013).

The objectives of this study were to: (1) optimize the RF parameters for grassland LAI prediction from field measurements and Landsat imagery; (2) perform variable reduction to eliminate noise information using the Variable Importance Value method and principal component analysis (PCA); and (3) test the stability of the algorithm by removing highly correlated predictors. The approach used in this study can be applied to the generation of regional remote sensing products and to the selection of informative variables for model prediction. The results are also meaningful for the use of RF in other applications.

2. Materials and methods

2.1. Study site

The field campaigns were conducted at the Hulunber Grassland Ecosystem Observation and Research Station (Hulunber Station) (49°20′24″N, 119°59′44″E), Inner Mongolia, China (Fig. 1). The site covers an area of 3 km×3 km surrounding the eddy covariance flux tower. It has an average elevation of 650 m and a flat surface that does not vary by more than 20 m in elevation. The regional climate is characterized as semiarid steppe, with an annual average rainfall of 350–400 mm that falls mainly from July to September; the annual mean temperature at the site ranges from –3 to 1°C. The land cover is meadow steppe dominated by Leymus chinensis and Stipa baicalensis. There are three pasture types (Fig. 1): grazing pasture, which feeds cattle; cutting pasture, which is used for silage; and fenced pasture, which is enclosed by a fence and has grown naturally without any external influence.

2.2. Field sampling

A well-designed ground sampling strategy is important for accurately representing ground conditions, and several sampling methods have been developed for various applications (Morisette et al. 2006; Ge et al. 2012, 2015; Wang et al. 2014). In this study, a two-scale sampling strategy designed by the VALERI (validation of land European remote sensing instruments) project was used to collect ground LAI (Baret et al. 2005; De Kauwe et al. 2011). The two scales of VALERI sampling were the site scale (at least 3 km×3 km, representing the entire experiment site) and the elementary sampling unit (ESU) scale (30 m×30 m in this study, corresponding to the Landsat pixel size). For the site, the 3 km×3 km area was divided into nine 1 km×1 km grids, and 3 to 5 ESUs were randomly chosen in each grid. In total, 29 ESUs were chosen across the entire site (Fig. 1), and our field measurements were collected from these ESUs. At each ESU, LAI was measured at five points organized in a “cross” pattern in which each sample point was 15 m from the next point, and the five local LAI values were averaged to obtain a mean value for the ESU.

Fig. 1 Overview of the study site. A, site location. B, sample points in the site (background: Landsat-8 operational land imager (OLI) false-color composite image). Map labels: Hulunber City, Hulunber Station, Xing’an League, Heilongjiang Province, Mongolia, Russia. Legend: county, province and country boundaries; cutting, grazing and fenced pasture; eddy tower; elementary sampling unit; dividing line; 3 km×3 km study site.

Sampling was carried out during the growing seasons of 2014 and 2015. In 2014, the experiment dates were 28 July and 3 August; in 2015, the experiment dates were 19 June, 30 June and 7 July. The ground effective LAI was measured with an LAI-2200C plant canopy analyzer (Li-Cor, Lincoln, Nebraska, USA), an indirect non-contact instrument that measures the gap fraction of the diffuse radiation transmitted through the canopy (Welles and Norman 1991; Chen et al. 2002). For each LAI collection at a point, one above-canopy and six below-canopy LAI-2200C measurements were taken to obtain one local LAI value. For low grass, a narrow hole was dug under the grass and the LAI-2200C was placed in the hole to measure LAI (Liu et al. 2011). The measurements were collected near sunrise or sunset to obtain nearly uniform sky illumination. In addition, the GPS location of each ESU, accurate to 2 m, was recorded at the center point to ensure that the measurements for each campaign were collected at the same location. From the five field campaigns, 89 ESU measurements were collected.
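The two-scale design described above can be illustrated with a minimal Python sketch. It draws 3 to 5 ESU centres at random inside each of the nine 1 km×1 km grid cells and averages five local LAI readings per ESU. The function names, the random seed and the rule that a whole 30 m ESU must fit inside its cell are assumptions made for illustration; the text does not state how the authors drew their 29 ESUs.

```python
import random

SITE_KM = 3   # site edge length in km (3 km x 3 km, from the text)
CELL_KM = 1   # grid cell edge length in km (1 km x 1 km, from the text)
ESU_M = 30    # ESU edge length in metres (Landsat pixel size)


def draw_esus(rng, min_per_cell=3, max_per_cell=5):
    """Return {(col, row): [(x_m, y_m), ...]} with random ESU centres per cell."""
    cell_m = CELL_KM * 1000
    n_cells = SITE_KM // CELL_KM
    design = {}
    for col in range(n_cells):
        for row in range(n_cells):
            k = rng.randint(min_per_cell, max_per_cell)  # 3 to 5 ESUs per cell
            centres = []
            for _ in range(k):
                # Assumption: keep the full 30 m ESU inside its grid cell.
                x = col * cell_m + rng.uniform(ESU_M / 2, cell_m - ESU_M / 2)
                y = row * cell_m + rng.uniform(ESU_M / 2, cell_m - ESU_M / 2)
                centres.append((x, y))
            design[(col, row)] = centres
    return design


def esu_mean_lai(point_values):
    """Average the five local LAI readings taken in one ESU."""
    if len(point_values) != 5:
        raise ValueError("expected five local LAI values per ESU")
    return sum(point_values) / len(point_values)


design = draw_esus(random.Random(42))
total_esus = sum(len(centres) for centres in design.values())
```

A random draw of this kind yields between 27 and 45 ESUs in total, which brackets the 29 ESUs reported for the site.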

2.3. Landsat imagery and auxiliary predictors

Landsat imagery

Two Landsat7 ETM+ (enhanced thematic mapper) images and three Landsat8 operational land imager (OLI) images were collected in this study to correspond with the field experiment dates (Table 1). The images were downloaded from the United States Geological Survey (USGS) (http://glovis.usgs.gov/); all had received level 2 processing (after radiometric and systematic geometric correction) and were projected in universal transverse mercator (UTM) coordinates (world geodetic system 1984 (WGS84) datum, zone 50N). The images were then atmospherically corrected using the FLAASH (fast line-of-sight atmospheric analysis of hypercubes) program embedded in the ENVI 4.8 software. Two important parameters used by FLAASH for atmospheric correction, aerosol optical depth and water vapor column, were obtained using a Microtops II Sunphotometer (Solar Light Company, PA, USA) during each field experiment. Finally, geometric correction was performed on all images using ground points collected in the field around the site. Because the short wave infrared (SWIR) band is susceptible to atmospheric aerosol and water vapor, the SWIR1 band (1.55–1.75 μm for Landsat7 ETM+ and 1.56–1.66 μm for Landsat8 OLI) was abandoned in this study after checking the reflectance. The remaining 30 m resolution blue, green, red, near-infrared (NIR) and SWIR2 bands (bands 1, 2, 3, 4 and 7 from the Landsat7 ETM+ images and the corresponding bands 2, 3, 4, 5 and 7 from the Landsat8 OLI images) were used as predictors for RF estimation.

Vegetation indices

Seven vegetation indices that have been found to perform well in predicting grassland LAI (Atzberger et al. 2015; Liang et al. 2015) were calculated from the geometrically and atmospherically corrected canopy spectral reflectance images (Table 2). The vegetation indices include: the normalized difference vegetation index (NDVI), green chlorophyll index (CIgreen), atmospherically resistant vegetation index (ARVI), modified triangular vegetation index (MTVI), optimized soil adjusted vegetation index (OSAVI), simple ratio (SR), and wide dynamic range vegetation index (WDRVI).

Auxiliary digital elevation model (DEM) derivation

The topographic wetness index (TWI), derived from advanced spaceborne thermal emission and reflection radiometer (ASTER) global digital elevation model (GDEM) data, was also used as an auxiliary predictor in RF to predict grassland LAI. The ASTER GDEM data had a spatial resolution of 30 m and were downloaded from the ASTER GDEM distribution site (http://gdem.ersdac.jspacesystems.or.jp/). The TWI is defined as:

$$\mathrm{TWI} = \ln\left(\frac{\alpha}{\tan\beta}\right) \qquad (1)$$

where α is the upslope contributing area per unit contour length and β is the local gradient at the point. TWI is the most popular DEM-based index for reflecting soil moisture (Kopecký and Čížková 2010). Previous studies showed that it performs well in quantifying topographic control on hydrological processes and in indicating the spatial distribution of soil moisture and surface saturation (Qin et al. 2007; Kilibarda et al. 2014). Recently, TWI has also been applied in vegetation ecology to model species distribution (Van Niel et al. 2004; Evans and Cushman 2009), predict vegetation types (Dobrowski et al. 2008) and model tree lines (Bader and Ruijten 2008).
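As an illustration of how these predictors are derived, the following Python sketch computes the seven vegetation indices with the formulas listed in Table 2 and the TWI of Eq. 1. It is a minimal per-pixel sketch, not the authors' processing chain: the function names and the guard against flat cells are assumptions, and reflectances are assumed to be scaled to 0–1.

```python
import math


def vegetation_indices(blue, green, red, nir):
    """Compute the seven indices of Table 2 from surface reflectances (0-1)."""
    rb = 2.0 * red - blue  # red-blue term used by ARVI
    return {
        "NDVI": (nir - red) / (nir + red),
        "CIgreen": nir / green - 1.0,
        "ARVI": (nir - rb) / (nir + rb),
        "MTVI": 1.2 * (1.2 * (nir - green) - 2.5 * (red - green)),
        "OSAVI": (1 + 0.16) * (nir - red) / (nir + red + 0.16),
        "SR": nir / red,
        "WDRVI": (0.1 * nir - red) / (0.1 * nir + red) + 0.9 / 1.1,
    }


def twi(upslope_area, slope_rad):
    """Eq. 1: TWI = ln(alpha / tan(beta)), with beta given in radians."""
    tan_beta = math.tan(slope_rad)
    if upslope_area <= 0 or tan_beta <= 0:
        # Assumption: flat or invalid cells are rejected rather than clamped.
        raise ValueError("TWI is undefined for non-positive area or slope")
    return math.log(upslope_area / tan_beta)


# Hypothetical reflectances for a single vegetated pixel.
indices = vegetation_indices(blue=0.04, green=0.08, red=0.06, nir=0.40)
wetness = twi(upslope_area=100.0, slope_rad=math.atan(0.1))
```

In practice the same expressions would be applied band-wise to whole reflectance rasters; the scalar form is used here only to keep the example self-contained.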

Table 1 Collection of remote sensing images

Experiment date (yr-mon-d)    Landsat sensor1)    Image date (yr-mon-d)
2014-07-28                    Landsat7 ETM+       2014-07-26
2014-08-03                    Landsat8 OLI        2014-08-03
2015-06-19                    Landsat8 OLI        2015-06-19
2015-06-30                    Landsat7 ETM+       2015-06-27
2015-07-07                    Landsat8 OLI        2015-07-05

1) ETM+, enhanced thematic mapper; OLI, operational land imager.

Table 2 Vegetation indices used in the study

Index1)    Equation2)    Reference
NDVI       (ρNIR–ρred)/(ρNIR+ρred)    Rouse et al. (1974)
CIgreen    ρNIR/ρgreen–1    Gitelson et al. (2003a, b)
ARVI       (ρNIR–(2×ρred–ρblue))/(ρNIR+(2×ρred–ρblue))    Kaufman and Tanre (1992)
MTVI       1.2×(1.2×(ρNIR–ρgreen)–2.5×(ρred–ρgreen))    Haboudane et al. (2004)
OSAVI      (1+0.16)×(ρNIR–ρred)/(ρNIR+ρred+0.16)    Rondeaux et al. (1996)
SR         ρNIR/ρred    Jordan (1969)
WDRVI      (0.1×ρNIR–ρred)/(0.1×ρNIR+ρred)+0.9/1.1    Gitelson (2004)

1) NDVI, normalized difference vegetation index; CIgreen, green chlorophyll index; ARVI, atmospherically resistant vegetation index; MTVI, modified triangular vegetation index; OSAVI, optimized soil adjusted vegetation index; SR, simple ratio; WDRVI, wide dynamic range vegetation index.
2) ρNIR, the reflectance of near-infrared band; ρred, the reflectance of red band; ρgreen, the reflectance of green band; ρblue, the reflectance of blue band.

2.4. Random Forests

RF was developed from the decision tree and is similar to bagging trees. It generates an ensemble of trees (ntree) whose outputs are aggregated to produce accurate predictions that do not overfit the data (Breiman 1996, 2001). Unlike bagging trees, RF selects the split at each node from a randomly chosen subset of the predictors (mtry), and each tree is allowed to grow fully without pruning. Each tree in the RF is grown independently to its maximum size from a bootstrap sample of the training dataset (approximately two-thirds of the samples); the remaining one-third, called the out-of-bag (OOB) samples, are used to calculate an unbiased OOB error rate and the variable importance (measured as the percent increase in the mean square error when the OOB data for each variable are permuted) (Breiman 2001; Prasad et al. 2006). At each binary split, the predictor that produces the best split is chosen from a random subset of size mtry of the entire predictor set (p); mtry is recognized as the main tuning parameter of RF and should therefore be optimized (Svetnik et al. 2003; Heung et al. 2014). Using the OOB samples, the prediction error (OOB error) for each individual tree is obtained using the following equation:

$$\mathrm{Error}_{\mathrm{OOB}} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2 \qquad (2)$$

where $\hat{y}_i$ is the predicted output of an OOB sample, $y_i$ is the actual output and n is the total number of OOB samples.

2.5. Prediction assessment

To estimate the accuracy of the RF predictions, a validation dataset of 24 ground ESU measurements was randomly selected from the original 89 samples. The root mean squared error (RMSE), mean absolute error (MAE) and coefficient of determination (R2) between measured and predicted values were used to assess the model performance. RMSE and MAE were calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n}}$$

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|}{n}$$

where $\hat{y}_i$ is the estimated LAI value, $y_i$ is the measured LAI value and n is the number of measured values in the validation data. Moreover, the leave-one-out cross validation (CV) error rate and the OOB error were also used to assess RF parameter optimization. Leave-one-out cross validation is an iterative process: at each step one observation is excluded, and the remaining observations are used to fit the model and predict the excluded one. All analyses in this study were carried out using the randomForest package (Liaw and Wiener 2002) within the statistical software R 3.2.0 (https://www.r-project.org). The flowchart of this study is shown in Fig. 2.

Fig. 2 Flowchart of leaf area index (LAI) prediction using Random Forests (RF) model in this study. ESU, the elementary sampling unit. Flowchart elements: Landsat raw data (radiometric calibration, geometric correction, atmospheric correction) giving reflectance bands and vegetation indices (VI); digital elevation model (DEM) giving the topographic wetness index; ground measured LAI (89 ESUs) split into training LAI (65 ESUs) and a validation dataset (24 ESUs); RF model with parameter optimization, variable reduction (Variable Importance Value method and principal component analysis (PCA) method) and sensitivity to highly correlated predictors (the most important variable-based and the best-performing VI-based correlation analyses), each followed by RF prediction.

3. Results and discussion

3.1. Ground measurements

The detailed summary statistics of the LAI measurements are shown in Table 3. Overall, the effective LAI values at the site ranged from 0.4 to 2.9 during the growing season. In late June and early July, the LAI values ranged from 0.41 to 2.19, with a comparatively stable mean LAI of 1.17–1.27. At the peak of the growing season (3 August), both the mean and the variability of grass LAI were higher, with a mean LAI of 1.85 and a standard deviation of 0.63.

Table 3 Descriptive statistics of the measured leaf area index (LAI) dataset

Date (yr-mon-d)    No. of ESUs1)    Mean    Minimum    Maximum    SD
2014-07-28         11               1.17    0.69       2.33       0.57
2014-08-03         14               1.85    1.03       2.90       0.63
2015-06-19         24               1.27    0.77       2.03       0.38
2015-06-30         19               1.17    0.41       2.19       0.57
2015-07-07         21               1.24    0.51       2.01       0.55
Total              89               1.32    0.41       2.90       0.57

1) ESU, elementary sampling unit.

3.2. Optimization of RF parameters

Thirteen predictors were selected in this study for RF prediction: blue band (b1), green band (b2), red band (b3), near-infrared (NIR) band (b4), SWIR2 band (b6), NDVI, CIgreen, ARVI, MTVI, OSAVI, SR, WDRVI, and TWI. Before executing the algorithm with these predictors, two important user-defined parameters of RF, ntree and mtry, should be optimized to minimize the generalization error. Fig. 3-A shows the OOB error in response to the number of trees from 1 to 1000 using the default mtry (p/3) set by RF. As the number of trees grew from 1 to 50, the OOB error decreased and reached a minimum at 50 trees, after which it fluctuated and increased until approximately 500 trees. Beyond that point the OOB error remained fairly constant and showed no obvious improvement, so ntree=500 was chosen for RF prediction in our study. Based on 20 replicate trials of executing RF with mtry ranging from 1 to 13, the variation of the OOB error with mtry is shown in Fig. 3-B. The OOB error decreased as mtry increased from 1 to 13 and reached its minimum at mtry=13. However, the decrease in the OOB error was minor, approximately 0.01, which indicates that mtry optimization yields only minimal improvement in RF predictions. Similar results were reported by Genuer et al. (2010), Heung et al. (2014) and Hultquist et al. (2014). To retain more of the ‘randomness’ in RF’s randomized variable selection process, a smaller value of mtry=6 was used in this study.
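The OOB error used in this section depends on how bootstrap sampling leaves part of the training data out of each tree. The Python sketch below illustrates only that mechanism and the mean squared error form of Eq. 2; it does not grow any trees and is not the randomForest implementation. The function names, the seed and the choice of 65 samples (the size of the training set in Fig. 2) are assumptions for illustration.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob


def expected_oob_fraction(n):
    """Probability that one sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def oob_error(y_true, y_pred):
    """Mean squared error over OOB samples, the form of Eq. 2."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n


# Simulate the bootstrap draws behind a 500-tree forest on 65 training samples.
rng = random.Random(0)
n_train = 65
fractions = []
for _ in range(500):
    _, oob = bootstrap_split(n_train, rng)
    fractions.append(len(oob) / n_train)
mean_oob_fraction = sum(fractions) / len(fractions)
```

For 65 samples the expected OOB share is a little over one third, in line with the "approximately two-thirds" in-bag figure quoted in Section 2.4.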

3.3. Variable reduction

Another widely used application of RF is the selection of important variables for ecological or other models (Genuer et al. 2008; Hapfelmeier and Ulm 2013; Liu et al. 2014; Ramoelo et al. 2015), which enables variable reduction to remove potentially irrelevant predictor variables from models with numerous inputs and massive data. This study tested variable reduction for RF prediction to examine its performance and to select the comparatively important variables for grassland LAI prediction. Two methods were used to reduce the predictor variables; one was based on the Variable Importance Value, and the other was based on PCA.

Variable importance

Before reducing the variables of RF for LAI prediction, an optimized total number of predictor variables (p) should be determined. This was accomplished by using 5-fold CV and the calculated error rate to assess the performance for each value of p adopted in the model. Moreover, for each p, a set of mtry values calculated as functions of p was also tested; the mtry functions were mtry=p, p/2, p/3 (default setting), p/4, p/5, and SQRT(p). From 20 replicates of 5-fold CV using different numbers of predictors and this set of mtry functions, the results (Fig. 4) showed that the choice of mtry function only slightly affected the RF performance, except for mtry=p with fewer than 8 variables, for which a larger CV error rate was generated. As p was reduced, the CV error rate remained fairly constant until the number of predictors reached 8, and it increased by 16% when p was reduced to 7. The results indicated that variable reduction for RF to eliminate irrelevant variables is feasible and will not worsen the RF performance with respect to the CV error rate. The variable importance (Fig. 5-A) showed that five VIs, ARVI, OSAVI, WDRVI, SR and NDVI, ranked as the most important for predicting grassland LAI; b3 was more important than the other spectral bands for LAI estimation, and MTVI, TWI and b4 were the three least important variables. However, when each variable was removed from the predictors in turn (Fig. 5-B), a slightly different picture emerged with the CV error rate as the indicator. Removing SR or OSAVI resulted in a larger increase in the error than removing the other predictors, and removing b6 or TWI resulted in only a minimal increase in the CV error. Considering both results together and using the variable selection approach summarized by Hapfelmeier and Ulm (2013), b4, TWI, MTVI, CIgreen and b6 were considered the least relevant variables.
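The two ingredients of this experiment, the mtry functions of p and the 5-fold partition, can be sketched in Python as follows. This is an illustrative sketch only: rounding each mtry value down with a minimum of 1 is an assumption (the text does not state how fractional values were handled), and the function names, seed and use of 65 samples are hypothetical.

```python
import math
import random


def mtry_candidates(p):
    """Evaluate the six mtry functions of p named in the text.

    Assumption: fractional values are rounded down, with a minimum of 1.
    """
    raw = {
        "p": p,
        "p/2": p / 2,
        "p/3": p / 3,
        "p/4": p / 4,
        "p/5": p / 5,
        "sqrt(p)": math.sqrt(p),
    }
    return {name: max(1, math.floor(value)) for name, value in raw.items()}


def kfold_indices(n, k, rng):
    """Shuffle n sample indices and return k (train, test) index pairs."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    pairs = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        pairs.append((sorted(train), sorted(fold)))
    return pairs


candidates = mtry_candidates(13)
splits = kfold_indices(65, 5, random.Random(7))
```

Repeating `kfold_indices` with 20 different seeds would reproduce the structure of the 20 replicate runs, with the model fit itself left to whichever RF implementation is used.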
Principal component analysis (PCA)

The PCA method is a matrix transformation that converts the original multi-dimensional data into orthogonal variables (principal components, PCs); these components are ordered such that the first few PCs explain most of the variation in the original data (Hotelling 1933; Jolliffe 2002). In this study, PCA was conducted on both the original 13 variables and the optimized eight variables. The results showed that the first five PCs accounted for 99.6% of the information in the original 13 variables, and the first three PCs accounted for 99.4% of the information in the optimized eight variables, which meant that these leading PCs could reasonably substitute for the original data with little loss of information.

Fig. 3 RF parameter optimization of the out-of-bag (OOB) error variation changing with the number of trees (ntree) (A) and the number of predictors at each node (mtry) (B). Axes: OOB error against number of trees (A, 0 to 1 000) and against number of predictors tried at each node (B, 1 to 13).

Fig. 4 Cross validation (CV) error rate with predictors being removed at each step using various mtry functions. Axes: CV error rate against number of variables (p, 1 to 13). Series: mtry=p, p/2, p/3, p/4, p/5 and SQRT(p).

Fig. 5 Variable reduction using Variable Importance Value method. A, variable importance value. B, CV error rate when each predicted variable was removed. ARVI, atmospherically resistant vegetation index; OSAVI, optimized soil adjusted vegetation index; WDRVI, wide dynamic range vegetation index; SR, simple ratio; NDVI, normalized difference vegetation index; CIgreen, green chlorophyll index; MTVI, modified triangular vegetation index; TWI, topographic wetness index; b1, blue band; b2, green band; b3, red band; b4, near-infrared (NIR) band; b6, SWIR2 band. The same as below. Variable order in panel A: ARVI, OSAVI, WDRVI, SR, NDVI, b3, b6, b1, b2, CIgreen, MTVI, TWI, b4. Variable order in panel B: SR, OSAVI, b3, WDRVI, ARVI, b1, MTVI, b4, NDVI, b2, CIgreen, b6, TWI, all included.

3.4. Prediction assessment

Using the RF model with optimized parameters, LAI maps were generated from the 13 original variables, the eight reduced variables, five PCs from the 13 original variables and three PCs from the eight reduced variables. Representative LAI maps from the four predictor datasets for July 7, 2015 are shown in Fig. 6. The predicted grassland LAI maps were then assessed using the validation dataset (Fig. 7). For the RF with reduced variables using the two methods, the two user-defined parameters were set to ntree=500 and mtry=p/2. The results showed that, compared with RF using the original 13 variables, RF using the reduced variables performed nearly the same; only a very slight reduction in accuracy was observed (Fig. 6-A and B, Fig. 7-A and B). Similar results were also found by Svetnik et al. (2003) and Heung et al. (2014). The results indicated that RF is insensitive to the presence of irrelevant predictors and that proper variable reduction will not reduce RF performance (Heung et al. 2014). However, for variable reduction using the PCA method, smaller LAI values (Fig. 6-C) and an obvious degradation of the RF prediction (Fig. 7-C) were observed, which may have resulted from irrelevant information being fused into the PCs. After eliminating the irrelevant variables, the performance of RF with the first three PCs improved, but smaller LAI values (Fig. 6-D) and a larger error (Fig. 7-D) were still generated. This may result from discarding the five low-ranking PCs, which loses subtle variations that RF could otherwise capture.

Fig. 6 Predicted LAI maps of July 7, 2015. A, optimized RF with 13 variables. B, RF with reduced 8 variables. C, RF with principal component analysis (PCA) for the original 13 variables. D, RF with PCA for the reduced 8 variables using ground measured LAI. Color-scale ranges printed in the panels (in extraction order): 0.53–2.55, 0.52–2.55, 0.60–2.04, 0.57–2.34.

3.5. Sensitivity to highly correlated predictors

The effect of highly correlated variables on the performance of RF has been previously studied by Genuer et al. (2008, 2010) using variable importance as the indicator. Their results showed that the relative importance between variables remained steady in the presence of highly correlated variables. This study conducted a similar experiment on the highly correlated VIs and bands, and the performance of RF was assessed using both variable importance and validation data. Based on the correlation coefficients among all the variables (data not shown), two groups of variables were selected: one was built around ARVI, the variable with the highest importance value calculated by RF (ARVI group), and the other around WDRVI, the VI that performed best against grassland LAI in simple linear regression (WDRVI group). Variables with a correlation coefficient larger than 0.85 with the base VI were abandoned. The ARVI group contained the variables b1, b2, b3, b4, b6, TWI, CIgreen and ARVI. The WDRVI group contained the variables b1, b2, b4, b6, TWI and WDRVI.

The results (Fig. 8) showed that the importance of each retained variable increased, and that the relative variable importance was steady for the WDRVI group after removing all highly correlated variables, except for b4, whose increased importance may result from the removal of the highly correlated MTVI. For the ARVI group, the importance of CIgreen and b4 also increased, and both ranked as more important. The relative position of the other variables was steady. When ground measurements were used to validate the RF performance of the two groups (Fig. 9), the prediction error increased for the ARVI group. The decreased accuracy may result from the irrelevant variable CIgreen, which ranked as more important and participated more in RF prediction after the highly correlated variables were removed. Using the variables selected based on WDRVI for RF prediction, the validation result showed an improvement in RF prediction. For predicting grassland LAI, each VI performs differently. For example, NDVI was more suitable for detecting changes in LAI values below 2 but showed clear saturation when LAI exceeded 3 (Haboudane et al. 2004; Viña et al. 2011), and the sensitivity to vegetation interference factors (such as canopy structure, soil background and diffuse radiation) also varied between VIs (Haboudane et al. 2004; Liang et al. 2015). Eliminating the highly correlated VIs and using the best-performing VI will result in better performance.
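The grouping rule used in this section, dropping every variable whose correlation with a base VI exceeds 0.85, can be sketched in Python as below. This is an illustration under stated assumptions: the Pearson coefficient and the use of its absolute value are assumptions (the text only says "correlation coefficient larger than 0.85"), the function names are hypothetical, and the five-sample columns are synthetic toy values, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, base, threshold=0.85):
    """Keep `base` plus every variable whose |r| with `base` is <= threshold.

    Assumption: the absolute value of r is compared with the threshold.
    """
    kept = [base]
    for name, values in columns.items():
        if name == base:
            continue
        if abs(pearson(values, columns[base])) <= threshold:
            kept.append(name)
    return kept


# Synthetic toy columns (five samples each), for illustration only.
toy = {
    "WDRVI": [0.1, 0.2, 0.3, 0.4, 0.5],
    "NDVI": [0.2, 0.39, 0.61, 0.8, 1.0],  # almost perfectly correlated with WDRVI
    "b3": [0.5, 0.4, 0.3, 0.2, 0.1],      # perfectly anti-correlated with WDRVI
    "TWI": [5.0, 3.0, 6.0, 2.0, 4.0],     # weakly correlated with WDRVI
}
kept = drop_correlated(toy, base="WDRVI")
```

Whether negative correlations should count toward the threshold changes which bands survive, so that choice would need to be checked against the authors' procedure before reusing the sketch.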

[Fig. 7 plot area: four scatter panels (A–D) of predicted LAI against measured LAI, both axes from 0 to 3. Regression statistics printed in the panels (panel order not preserved): y=0.8716x+0.1452, R2=0.8809, RMSE=0.1956, MAE=0.1333; y=0.5125x+0.6396, R2=0.6872, RMSE=0.3372, MAE=0.2267; y=0.8937x+0.1025, R2=0.8798, RMSE=0.1985, MAE=0.1348; y=0.7034x+0.3806, R2=0.8309, RMSE=0.2448, MAE=0.1621.]

Fig. 7 Validation, using ground-measured LAI, of grassland LAI estimated from optimized RF with 13 variables (A), RF with the reduced 8 variables (B), RF with PCA for the original 13 variables (C), and RF with PCA for the reduced 8 variables (D). RMSE, root mean square error; MAE, mean absolute error.

[Fig. 8 plot area: bar chart of variable importance (vertical axis, 0 to 50) for the variables b4, TWI, MTVI, CIgreen, b2, b1, b6, b3, NDVI, SR, WDRVI, OSAVI and ARVI (horizontal axis, labeled "Predicted variables"), with three series: all variables, variables selected based on ARVI, and variables selected based on WDRVI.]

Fig. 8 Variable importance in the presence of different sets of variables.

[Fig. 9 plot area: two scatter panels (A, B) of predicted LAI against measured LAI, both axes from 0 to 3. Regression statistics printed in the panels (panel order not preserved): y=0.8045x+0.2335, R2=0.8478, RMSE=0.2180, MAE=0.1464; y=0.8323x+0.2051, R2=0.8900, RMSE=0.1868, MAE=0.1204.]

Fig. 9 Validation of grassland LAI estimated from RF with variables selected based on ARVI (A) and WDRVI (B).

4. Conclusion
This study used the RF method to estimate grassland LAI from Landsat spectral bands and auxiliary predictors. After optimizing the RF parameters and reducing variables, the RF predictions were validated using an independent validation dataset. The sensitivity of RF to highly correlated variables was also tested to examine its stability. The results showed that mtry, an important RF parameter, had only a small effect on the performance of RF; thus, the method can be used in automatically operated models in which minimal user intervention is needed. In terms of variable reduction, the Variable Importance Value method is recommended over the PCA method; although variable reduction barely improved the RF outputs here, it has greater potential for predictions with numerous inputs and massive data. The results also indicated the insensitivity of RF to irrelevant information and demonstrated its ability to select valuable variables. After removing highly correlated variables, the relative variable importance was steady, but the spectral bands gained importance, substituting for the removed highly correlated VIs, and participated more in RF prediction. Thus, this study recommends selecting variables based on the best-performing VI rather than the most important one. The study area in this work is relatively small, so the results represent only the regional performance of the RF method. For large-scale application, the method should be assessed over a larger area in the future. Nevertheless, the results of this study illustrate the basic features of the RF model and will provide guidance for its improved application.
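The validation statistics reported in Figs. 7 and 9 (fitted line, R2, RMSE and MAE) can be computed from paired measured and predicted LAI values. The following is a minimal Python sketch, not the authors' code: the function name `validation_stats` and the sample values are hypothetical, and R2 is assumed to be the squared Pearson correlation between measured and predicted values, which the source does not specify.

```python
import math


def validation_stats(measured, predicted):
    """Return slope, intercept, R2, RMSE and MAE for paired LAI values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    sxy = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    sxx = sum((m - mean_m) ** 2 for m in measured)
    syy = sum((p - mean_p) ** 2 for p in predicted)

    # Least-squares line: predicted = slope * measured + intercept.
    slope = sxy / sxx
    intercept = mean_p - slope * mean_m

    # Assumption: R2 is the squared Pearson correlation of the two series.
    r2 = sxy ** 2 / (sxx * syy)

    # Error measures are taken on the raw differences, not on the fitted line.
    errors = [p - m for m, p in zip(measured, predicted)]
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return {"slope": slope, "intercept": intercept, "r2": r2, "rmse": rmse, "mae": mae}


# Hypothetical values for illustration only (not the study's measurements).
measured_lai = [0.5, 1.0, 1.5, 2.0, 2.5]
predicted_lai = [0.6, 1.1, 1.4, 1.9, 2.4]

stats = validation_stats(measured_lai, predicted_lai)
print(stats)
```

A slope below 1 combined with a positive intercept, as in the fitted lines printed in the figures, indicates that low LAI values tend to be overestimated and high values underestimated.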

Acknowledgements
The study was funded by the Key Technologies Research and Development Program of China (2013BAC03B02, 2012BAC19B04), the International Science and Technology Cooperation Project of China (2012DFA31290), and the Earmarked Fund for Modern Agro-industry Technology Research System, China (CARS-35).

References
Atzberger C. 2004. Object-based retrieval of biophysical canopy variables using artificial neural nets and radiative transfer models. Remote Sensing of Environment, 93, 53–67.
Atzberger C, Darvishzadeh R, Immitzer M, Schlerf M, Skidmore A, Le Maire G. 2015. Comparative analysis of different retrieval methods for mapping grassland leaf area index using airborne imaging spectroscopy. International Journal of Applied Earth Observation and Geoinformation, 43, 19–31.
Bader M Y, Ruijten J J. 2008. A topography-based model of forest cover at the alpine tree line in the tropical Andes. Journal of Biogeography, 35, 711–723.
Baret F, Weiss M, Allard D, Garrigues S, Leroy M, Jeanjean H, Fernandes R, Myneni R, Privette J, Morisette J. 2005. VALERI: A network of sites and a methodology for the validation of medium spatial resolution land satellite products. [2013-5-18]. http://w3.avignon.inra.fr/valeri/documents/VALERI-RSESubmitted.pdf
Breiman L. 1996. Bagging predictors. Machine Learning, 24, 123–140.
Breiman L. 2001. Random forests. Machine Learning, 45, 5–32.
Chen J M, Pavlic G, Brown L, Cihlar J, Leblanc S G, White H P, Hall R J, Peddle D R, King D J, Trofymow J A, Swift E, Van der Sanden J, Pellikka P K E. 2002. Derivation and validation of Canada-wide coarse-resolution leaf area index maps using high-resolution satellite imagery and ground measurements. Remote Sensing of Environment, 80, 165–184.
Combal B, Baret F, Weiss M. 2002. Improving canopy variables estimation from remote sensing data by exploiting ancillary information. Case study on sugar beet canopies. Agronomie, 22, 205–215.
Dobrowski S Z, Safford H D, Cheng Y B, Ustin S L. 2008. Mapping mountain vegetation using species distribution modeling, image-based texture analysis, and object-based classification. Applied Vegetation Science, 11, 499–508.
Evans J S, Cushman S A. 2009. Gradient modeling of conifer species using random forests. Landscape Ecology, 24, 673–683.
Fan J, Zhong H, Liang B, Shi P, Yu G. 2003. Carbon stock in grassland ecosystem and its affecting factors. Grassland of China, 25, 51–58. (in Chinese)
Fan W, Liu Y, Xu X, Chen G, Zhang B. 2014. A new FAPAR analytical model based on the law of energy conservation: A case study in China. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7, 3945–3955.
Fan W J, Xu X R, Liu X C, Yan B Y, Cui Y K. 2010. Accurate LAI retrieval method based on PROBA/CHRIS data. Hydrology and Earth System Sciences, 14, 1499–1507.
Ge Y, Bai H, Wang J, Cao F. 2012. Assessing the quality of training data in the supervised classification of remotely sensed imagery: A correlation analysis. Journal of Spatial Science, 57, 135–152.
Ge Y, Wang J H, Heuvelink G B M, Jin R, Li X, Wang J F. 2015. Sampling design optimization of a wireless sensor network for monitoring ecohydrological processes in the Babao River basin, China. International Journal of Geographical Information Science, 29, 92–110.
Genuer R, Poggi J M, Tuleau-Malot C. 2010. Variable selection using random forests. Pattern Recognition Letters, 31, 2225–2236.
Genuer R, Poggi J M, Tuleau C. 2008. Random Forests: some methodological insights. [2015-9-18]. http://arxiv.org/pdf/0811.3619v1.pdf
Gitelson A A. 2004. Wide dynamic range vegetation index for remote quantification of biophysical characteristics of vegetation. Journal of Plant Physiology, 161, 165–173.
Gitelson A A, Verma S B, Viña A A, Rundquist D C, Keydan G, Leavitt B, Arkebauer T J, Burba G G, Suyker A E. 2003a. Novel technique for remote estimation of CO2 flux in maize. Geophysical Research Letters, 30, 1486–1489.
Gitelson A A, Viña A A, Arkebauer T J, Rundquist D C, Keydan G, Leavitt B. 2003b. Remote estimation of leaf area index and green leaf biomass in maize canopies. Geophysical Research Letters, 30, 1248–1251.
Haboudane D, Miller J R, Pattey E, Zarco-Tejada P J, Strachan I B. 2004. Hyperspectral vegetation indices and novel algorithms for predicting green LAI of crop canopies: Modeling and validation in the context of precision agriculture. Remote Sensing of Environment, 90, 337–352.
Hapfelmeier A, Ulm K. 2013. A new variable selection approach using random forests. Computational Statistics & Data Analysis, 60, 50–69.
Heung B, Bulmer C E, Schmidt M G. 2014. Predictive soil parent material mapping at a regional-scale: A random forest approach. Geoderma, 214–215, 141–154.
Hotelling H. 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417.
Hultquist C, Chen G, Zhao K. 2014. A comparison of Gaussian process regression, random forests and support vector regression for burn severity assessment in diseased forests. Remote Sensing Letters, 5, 723–732.
Jacquemoud S, Bacour C, Poilvé H, Frangi J P. 2000. Comparison of four radiative transfer models to simulate plant canopies reflectance: Direct and inverse mode. Remote Sensing of Environment, 74, 471–481.
Jacquemoud S, Baret F. 1990. PROSPECT: A model of leaf optical properties spectra. Remote Sensing of Environment, 34, 75–91.
Jolliffe I. 2002. Principal Component Analysis. John Wiley & Sons, Ltd. New York, United States of America.
Jordan C F. 1969. Derivation of leaf-area index from quality of light on the forest floor. Ecology, 50, 663–666.
Kaufman Y J, Tanre D. 1992. Atmospherically resistant vegetation index (ARVI) for EOS-MODIS. IEEE Transactions on Geoscience and Remote Sensing, 30, 261–270.
De Kauwe M G, Disney M I, Quaife T, Lewis P, Williams M. 2011. An assessment of the MODIS collection 5 leaf area index product for a region of mixed coniferous forest. Remote Sensing of Environment, 115, 767–780.
Kilibarda M, Hengl T, Heuvelink G, Graler B, Pebesma E, Tadic M P, Bajat B. 2014. Spatio-temporal interpolation of daily temperatures for global land areas at 1 km resolution. Journal of Geophysical Research (Atmospheres), 119, 2294–2313.
Kopecký M, Čížková Š. 2010. Using topographic wetness index in vegetation ecology: Does the algorithm matter? Applied Vegetation Science, 13, 450–459.
Lemaire G, Wilkins R, Hodgson J. 2005. Challenges for grassland science: managing research priorities. Agriculture, Ecosystems & Environment, 108, 99–108.
Liang L, Di L, Zhang L, Deng M, Qin Z, Zhao S, Lin H. 2015. Estimation of crop LAI using hyperspectral vegetation indices and a hybrid inversion method. Remote Sensing of Environment, 165, 123–134.
Liaw A, Wiener M. 2002. Classification and regression by randomForest. [2015-9-18]. https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf
Liu M, Liu X, Li J, Ding C, Jiang J. 2014. Evaluating total inorganic nitrogen in coastal waters through fusion of multitemporal RADARSAT-2 and optical imagery using random forest algorithm. International Journal of Applied Earth Observation and Geoinformation, 33, 192–202.
Liu Y, Ju W, Zhu G, Chen J, Xing B, Zhu J, Zhou Y. 2011. Retrieval of leaf area index for different grasslands in Inner Mongolia prairie using remote sensing data. Acta Ecologica Sinica, 39, 5159–5170. (in Chinese)
Morisette J T, Baret F, Privette J L, Myneni R B, Nickeson J E, Garrigues S, Shabanov N V, Weiss M, Fernandes R A, Leblanc S G, Kalacska M, Sanchez-Azofeifa G A, Chubey M, Rivard B, Stenberg P, Rautiainen M, Voipio P, Manninen T, Pilant A N, Lewis T E, et al. 2006. Validation of global moderate-resolution LAI products: A framework proposed within the CEOS land product validation subgroup. IEEE Transactions on Geoscience and Remote Sensing, 44, 1804–1817.
Prasad A M, Iverson L R, Liaw A. 2006. Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems, 9, 181–199.
Qin C, Zhu A X, Yang L, Li B, Pei T. 2007. Topographic wetness index computed using multiple flow direction algorithm and local maximum downslope gradient. In: The 7th International Workshop of Geographical Information System. September 12–14, 2007. Beijing, China.
Ramoelo A, Cho M A, Mathieu R, Madonsela S, Van de Kerchove R, Kaszta Z, Wolff E. 2015. Monitoring grass nutrients and biomass as indicators of rangeland quality and quantity using random forest modelling and WorldView-2 data. International Journal of Applied Earth Observation and Geoinformation, 43, 43–54.
Rondeaux G, Steven M, Baret F. 1996. Optimization of soil-adjusted vegetation indices. Remote Sensing of Environment, 55, 95–107.
Rouse Jr J W, Haas R, Schell J, Deering D. 1974. Monitoring vegetation systems in the Great Plains with ERTS. NASA Special Publication, 351, 309.
Running S W, Nemani R R, Peterson D L, Band L E, Potts D F, Pierce L L, Spanner M A. 1989. Mapping regional forest evapotranspiration and photosynthesis by coupling satellite data with ecosystem simulation. Ecology, 70, 1090–1101.
Sellers P J, Dickinson R E, Randall D A, Betts A K, Hall F G, Berry J A, Collatz G J, Denning A S, Mooney H A, Nobre C A, Sato N, Field C B, Henderson-Sellers A. 1997. Modeling the exchanges of energy, water, and carbon between continents and the atmosphere. Science, 275, 502–509.
Svetnik V, Liaw A, Tong C, Culberson J C, Sheridan R P, Feuston B P. 2003. Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43, 1947–1958.
Van Niel K P, Laffan S W, Lees B G. 2004. Effect of error in the DEM on environmental variables for predictive vegetation modelling. Journal of Vegetation Science, 15, 747–756.
Verrelst J, Schaepman M E, Koetz B, Kneubühler M. 2008. Angular sensitivity analysis of vegetation indices derived from CHRIS/PROBA data. Remote Sensing of Environment, 112, 2341–2353.
Verrelst J, Schaepman M E, Malenovský Z, Clevers J G. 2010. Effects of woody elements on simulated canopy reflectance: Implications for forest chlorophyll content retrieval. Remote Sensing of Environment, 114, 647–656.
Viña A A, Gitelson A A, Nguy-Robertson A L, Peng Y. 2011. Comparison of different vegetation indices for the remote assessment of green leaf area index of crops. Remote Sensing of Environment, 115, 3468–3478.
Wang J, Ge Y, Heuvelink G, Zhou C. 2014. Spatial sampling design for estimating regional GPP with spatial heterogeneities. IEEE Geoscience and Remote Sensing Letters, 11, 539–543.
Welles J M, Norman J M. 1991. Instrument for indirect measurement of canopy architecture. Agronomy Journal, 83, 818–825.

(Managing editor SUN Lu-juan)