On-line vis-NIR spectroscopy prediction of soil organic carbon using machine learning

Soil & Tillage Research 190 (2019) 120–127 Contents lists available at ScienceDirect Soil & Tillage Research journal homepage: www.elsevier.com/loca...



Soil & Tillage Research journal homepage: www.elsevier.com/locate/still

S. Nawar, A.M. Mouazen

Department of Environment, Ghent University, Coupure 653, 9000 Gent, Belgium

ARTICLE INFO

Keywords: Vis-NIR spectroscopy; Organic carbon; Spiking; Random forest

ABSTRACT

Accurate on-line visible and near-infrared (vis-NIR) spectroscopy prediction of soil organic carbon (OC) is essential for food security and environmental management. This paper couples on-line vis-NIR spectra with random forest (RF) modelling to predict soil OC, comparing single-field (SF), non-spiked UK multiple-field (NSUK) and spiked UK multiple-field (SUK) calibration models. Fresh soil samples collected from six fields in the UK (including two target fields) were scanned with a fibre-type vis-NIR spectrophotometer (tec5 Technology for Spectroscopy, Germany) with a spectral range of 305–2200 nm. After the spectra were divided into calibration and independent validation sets, RF was run on the calibration set to develop OC calibration models for the three datasets. Results showed that prediction performance depends on the dataset used and varies between fields. Prediction was less accurate on-line than in the laboratory (samples scanned under non-mobile conditions), and less accurate for non-spiked than for spiked models. The best performance in both laboratory and on-line predictions was obtained when samples from the SF were spiked into the UK samples, with coefficient of determination (R2) values of 0.80 to 0.84 and 0.74 to 0.75, root mean square error of prediction (RMSEP) values of 0.14% and 0.17 to 0.18%, and ratio of prediction deviation (RPD) values of 2.30 to 2.5 and 1.98 to 2.04, respectively. These results suggest that RF modelling coupled with spiking provides high prediction performance for OC under both non-mobile laboratory and on-line field scanning conditions.

Corresponding authors. E-mail addresses: [email protected] (S. Nawar), [email protected] (A.M. Mouazen).

https://doi.org/10.1016/j.still.2019.03.006
Received 19 September 2018; Received in revised form 13 February 2019; Accepted 11 March 2019
0167-1987/ © 2019 Elsevier B.V. All rights reserved.

1. Introduction

Estimating the carbon status of soil is crucial from both agricultural and environmental points of view. Soil organic carbon (OC) has long been identified as an important factor for soil fertility and crop production (Kucharik et al., 2001; Muñoz and Kravchenko, 2011). It is well known that soil organic matter, and consequently OC, is one of the most important constituents of soil because it affects soil structure, soil quality, water availability and nutrient cycling (Wang et al., 2015), in addition to increasing soil resistance to erosion (Bresson et al., 2001). Therefore, spatial predictions of soil OC content are needed for a wide range of agricultural and environmental applications (Muñoz and Kravchenko, 2011; Wang et al., 2013). Traditional laboratory methods for OC analysis are laborious, time consuming, costly and destructive (McDowell et al., 2012; Wang et al., 2015). Therefore, there is a growing demand for a quick, cost-effective, non-destructive and sufficiently accurate approach for predicting OC in situ (Wang et al., 2015). Proximal soil sensing (PSS) techniques have the potential to eliminate the aforementioned constraints (Viscarra Rossel et al., 2011; Kuang et al., 2012). As a common PSS technique, visible and near-infrared (vis-NIR) diffuse reflectance spectroscopy has attracted increasing interest among soil scientists in recent times and has been proposed as a possible method of soil analysis. It offers the possibility of collecting high spatial resolution data, compared with conventional laboratory analyses (Shepherd and Walsh, 2002; Wetterlind et al., 2010). This technique allows for both in-field (in situ) non-mobile (Viscarra Rossel and Chen, 2011; Brodský et al., 2013; Morgan et al., 2009; Ge et al., 2014; Ackerson et al., 2015, 2017) and on-line (tractor-driven) measurements (Maleki et al., 2008; Kuang and Mouazen, 2013a). However, compared to laboratory analysis, which is done under controlled conditions, field spectroscopy is affected by ambient and experimental conditions that need to be overcome for accurate prediction to be achieved (Mouazen et al., 2007; Stenberg et al., 2010). One way to reduce these negative influences is to adopt advanced data mining modelling techniques, particularly approaches that correct for ambient conditions such as moisture and handle the non-linearity in spectral responses to variations in concentrations.

As a linear multivariate analysis, partial least squares regression (PLSR) is the most commonly used technique for soil spectral analysis (Viscarra Rossel et al., 2006; Vohland et al., 2011; Conforti et al., 2015). However, the accuracy of PLSR tends to decrease when the relationship between spectral data and the dependent variable is non-linear (Araújo et al., 2014). Therefore, non-linear methods were introduced in the literature as better alternatives to PLSR for spectroscopic analyses of soils (Morellos et al., 2016; Nawar et al., 2016). Among these models, artificial neural networks (ANN) (Mouazen et al., 2010), support vector regression (SVR) (Vohland et al., 2011), boosted regression trees (Brown et al., 2007), Cubist (Stevens et al., 2013), multivariate adaptive regression splines (MARS) (Nawar et al., 2016) and random forests (RF) (Viscarra Rossel and Behrens, 2010) have been reported to provide improved prediction performance. Recently, RF has received growing attention in vis-NIR spectral analyses in different domains. It is an ensemble learning technique introduced by Breiman (2001) as a combination of tree predictors. It has many advantages, such as resistance to noise, applicability even when predictor variables outnumber observations, low susceptibility to overfitting, and an assessment of variable importance (Díaz-Uriarte and Alvarez de Andrés, 2006; Prasad et al., 2006; Ishwaran, 2007). Accordingly, RF can handle non-linear and hierarchical behaviours when introducing variability to the general spectral library for predicting local samples.

Although a few studies on the use of RF for the analysis of soil properties under laboratory non-mobile measurement conditions have been reported (Viscarra Rossel and Behrens, 2010), to the best of our knowledge no study of RF modelling of on-line collected vis-NIR spectra can be found in the literature. The aim of this paper was to evaluate the predictive performance of RF models for OC using laboratory and on-line scanned spectra, comparing datasets of single field (SF), non-spiked UK multiple fields (NSUK), and UK multiple fields spiked with selected target field samples (SUK).

2. Materials and methods

2.1. Experimental sites

Two experimental fields were used in this study: Hessleskew (longitudes -0.590° and -0.586 °W, and latitudes 53.844° and 53.844 °N), with a total area of about 12 ha, and Hagg (longitudes -1.172° and -1.166 °W, and latitudes 53.936° and 53.941 °N), with a total area of about 21 ha, both located in Yorkshire, UK. Hessleskew field is cultivated with cereal crops in rotation, whereas Hagg field is cultivated with vegetable crops (leeks, cabbage, carrots and onions). Soils in both fields are well developed, with a dominant texture of sandy loam in Hagg and clay loam to clay in Hessleskew (Table 1), according to the United States Department of Agriculture (USDA) soil textural classification system (Soil Survey Staff, 1999). A dense presence of stones was observed in Hessleskew field.

2.2. On-line soil measurement and collection of soil samples

Both fields were scanned using the on-line system designed and developed by Mouazen (2006). It consists of a subsoiler, which penetrates the soil to any depth (5–50 cm), making a trench whose bottom is smoothed by the downforce acting on the subsoiler (Mouazen et al., 2005). The subsoiler was retrofitted with the optical unit and attached to a frame, which was mounted onto the three-point linkage of a tractor (Mouazen et al., 2005). An AgroSpec mobile, fibre-type vis-NIR spectrophotometer (Tec5 Technology for Spectroscopy, Germany) with a measurement range of 305–2200 nm was used to measure soil spectra in diffuse reflectance mode. The sampling interval of the instrument was 1 nm. A differential global positioning system (DGPS) (EZ-Guide 250, Trimble, USA) was used to record the position of the on-line measured spectra with sub-metre accuracy. The on-line measurements were carried out in 2015 for Hessleskew and in 2016 for Hagg, with the sensor pulled along transects spaced 12 m apart at an average depth of 15 cm. A total of 122 and 139 soil samples were collected during the on-line measurement from Hessleskew and Hagg fields, respectively (Fig. 1).

2.3. Laboratory chemical and optical measurements

Fresh soil samples were used in the laboratory spectral analysis. Each soil sample was placed in a glass container and mixed well. Plant residues and large stones were removed manually. Each sample was used to fill three Petri dishes, 2 cm in diameter and 2 cm deep, representing three replicate measurements. The soil in each Petri dish was pressed gently and levelled with a spatula to ensure a smooth surface, and therefore maximum light reflection and a large signal-to-noise ratio (Mouazen et al., 2005). Soil samples were scanned with the same spectrometer used in the on-line measurements. Ten scans were collected from each replicate, and these were averaged into one spectrum.

The second part of each sample was air dried before it was analysed for OC using the combustion method (British Standard BS 7755 Section 3.8, 1995). For the determination of OC content, any carbonates present are removed in advance by treating the soil with hydrochloric acid. In this method, the carbon present in the soil is oxidised to carbon dioxide (CO2) by heating the soil to 900 °C. The amount of carbon dioxide released is then measured by a thermal conductivity detector (TCD).

2.4. Spectra pretreatment

The same pretreatment was applied to the on-line and laboratory soil spectra for all soil properties investigated, using R packages (Stevens and Ramirez, 2013). First, the spectral range outside 370–1979 nm was cut to remove noise at both edges. Then, a moving average filter with a window size of five successive wavelengths was used to reduce noise. Maximum normalization followed, which is typically used to bring all data to approximately the same scale. Spectra were then subjected to first derivation using the gap–segment derivative (gapDer) algorithm (Norris, 2001) with a second-order polynomial

Table 1. Information on the experimental fields. The general UK sample set (240 samples) was collected from the first four fields, whereas the target fields were Hessleskew and Hagg.

| Field | Date | Area (ha) | Crop | Sample No. | Texture | MC (%) | Parent material |
|---|---|---|---|---|---|---|---|
| Silsoe, Bedford | 2013 | 10 | Wheat | 94 | Clay loam to clay | 9.0–17.85 | Mudstone |
| Horns End, Bedford | 2014 | 10 | Wheat | 46 | Clay loam to clay | 12.0–25.5 | Mudstone |
| Cotton End, Bedford | 2016 | 11 | Wheat | 40 | Loam to sandy clay loam | 8.0–19.40 | Mudstone |
| Great BW, Suffolk | 2016 | 19 | Onion | 60 | Loam to sandy loam | 8.0–18.0 | Sandstone |
| Hessleskew, Yorkshire | 2015 | 12 | Wheat | 122 | Clay loam to clay | 12.0–25.5 | Mudstone |
| Hagg, Yorkshire | 2016 | 21 | Leeks, cabbage, carrots, and onions | 139 | Sandy loam | 6.0–18.0 | Sandstone |


Fig. 1. Maps of Hessleskew and Hagg fields showing the on-line measurement transects, the location of soil samples and the selection of the calibration and validation sample sets.

Hagg fields, respectively, were used. In this case, the variability of the target field was not introduced into the UK dataset, so that the performance of non-spiking could be compared with that of spiking, explained in the following dataset.

3- SUK dataset, where selected samples from each target field (Hessleskew: 85 and Hagg: 101) were spiked into the NSUK samples. Spiking was used to introduce the local variability of the two target fields into the UK dataset (i.e., NSUK), consisting of 4 fields with a total of 240 samples (Table 1). Spiking resulted in two SUK calibration sets with a total of 325 and 341 samples for the Hessleskew and Hagg fields, respectively. The same validation sets (as for 1 and 2 above) of 37 and 38 samples collected from Hessleskew and Hagg fields, respectively, were used.

All soil samples were collected with the same method at the same depth (0–20 cm). Soil moisture content ranged between 8 and 25.5% for the UK samples, between 6.0 and 18.0% for Hagg and between 12 and 25.5% for Hessleskew. Detailed information about the UK samples is shown in Table 2. A schematic flowchart illustrating the different steps of the analysis is shown in Fig. 2.

The three calibration sets detailed above were subjected to RF analysis. RF is a non-parametric and non-linear classification and regression algorithm first proposed by Ho (1998) and further developed by Breiman (2001). It is based on an ensemble learning strategy that generates many classifiers and aggregates their results. According to its algorithm, RF does not need any data pretreatment, which is one of its main advantages. RF also runs very fast, which is an important factor for on-line and in situ measurements.
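The spiking step described for the SUK dataset amounts to appending the local calibration samples of a target field to the general multi-field set. A minimal sketch of that bookkeeping is given below, assuming samples are held in plain Python lists. The function name `spike` and the sample identifiers are hypothetical; only the sample counts (240, 85 and 101) come from the text.

```python
def spike(general_samples, local_samples):
    """Return a spiked calibration set: general samples plus local target-field samples."""
    return list(general_samples) + list(local_samples)


# Hypothetical sample identifiers; only the counts are taken from the text.
nsuk = [("UK", i) for i in range(240)]            # non-spiked UK set (four fields)
hess_cal = [("Hessleskew", i) for i in range(85)]  # Hessleskew calibration samples
hagg_cal = [("Hagg", i) for i in range(101)]       # Hagg calibration samples

suk_hess = spike(nsuk, hess_cal)
suk_hagg = spike(nsuk, hagg_cal)
print(len(suk_hess), len(suk_hagg))  # 325 341
```

The validation samples (37 and 38) stay outside every calibration set, which is what keeps the validation independent in all three cases.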
Tree diversity, which guarantees RF model stability, is achieved by two means: (1) a random subset of predictor (p) variables is chosen to grow each tree and (2) each tree is based on a different random data subset, created by bootstrapping, i.e., sampling with replacement (Efron, 1979). For each tree grown on a bootstrapped set, a modified algorithm splits each node using a random subset of the variables instead of testing the performance of all p variables. The default mtry (number of p variables used to split the nodes at each partitioning) value is the square root of the total number of variables (Abdel Rahman et al., 2014). Therefore, ntrees (number of trees to be grown) needs to be set sufficiently high (500 in this case). Consequently, RFs do not overfit when

approximation. Finally, Savitzky-Golay smoothing was carried out to remove noise from the spectra and to reduce the detrimental effect on the signal-to-noise ratio that conventional finite-difference derivatives would have.
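The first pretreatment steps (range trimming, moving average, maximum normalization and a first derivative) can be sketched in plain Python as follows. This is an illustrative sketch only: the authors used R packages, and the function names, the edge handling of the moving average and the simplified gap derivative with a gap of one point are assumptions made here. The Savitzky-Golay step is not reproduced.

```python
def moving_average(values, window=5):
    """Centred moving average; near the edges only the part of the window that fits is used."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def max_normalize(values):
    """Divide by the largest absolute value so that spectra share a common scale."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak else list(values)


def gap_first_derivative(values, gap=1):
    """Simplified gap derivative (assumption): difference of points `gap` steps either side."""
    return [(values[i + gap] - values[i - gap]) / (2 * gap)
            for i in range(gap, len(values) - gap)]


def pretreat(wavelengths, reflectance, lo_nm=370, hi_nm=1979):
    """Trim to 370-1979 nm, smooth, normalize and differentiate one spectrum."""
    kept = [r for w, r in zip(wavelengths, reflectance) if lo_nm <= w <= hi_nm]
    return gap_first_derivative(max_normalize(moving_average(kept, window=5)))


# Synthetic spectrum on the instrument's 305-2200 nm grid at 1 nm steps.
wavelengths = list(range(305, 2201))
reflectance = [0.1 + 0.0001 * (w - 305) for w in wavelengths]
treated = pretreat(wavelengths, reflectance)
```

Trimming to 370–1979 nm keeps 1610 points at 1 nm spacing, and the gap derivative with a gap of one drops one point at each end.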

2.5. Datasets and random forest modelling

Three different calibration models were used in this work, based on three different datasets (Table 2):

1- SF dataset (n = 122 for Hessleskew and n = 139 for Hagg), where samples from a single field, either Hessleskew or Hagg, were used. To ensure independent validation, each target field was divided into two parts (Fig. 1): samples outside the rectangle were used as calibration samples (70%) and those inside the rectangle formed the validation set (30%), similar to the approach of Brown et al. (2005). The calibration and validation sets contained 85 and 37 samples for Hessleskew, and 101 and 38 samples for Hagg, respectively.

2- NSUK dataset (n = 240), where samples collected from four fields in the UK (Table 1) were used as the calibration set, whereas the same validation sets of 37 and 38 samples collected from Hessleskew and

Table 2. Descriptive statistics for soil organic carbon (OC, %) for Hessleskew field, Hagg field, and the UK datasets.

| Dataset | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max | SD |
|---|---|---|---|---|---|---|---|
| Hessleskew, Cal (n = 85) | 1.5 | 1.84 | 2.01 | 2.07 | 2.20 | 3.04 | 0.37 |
| Hessleskew, Val (n = 37) | 1.4 | 1.91 | 2.18 | 2.11 | 2.30 | 3.00 | 0.34 |
| Hagg, Cal (n = 101) | 1.37 | 1.65 | 1.87 | 1.90 | 2.06 | 3.10 | 0.33 |
| Hagg, Val (n = 38) | 1.34 | 1.59 | 1.70 | 1.81 | 1.90 | 3.06 | 0.35 |
| UK dataset (n = 240) | 0.94 | 1.21 | 1.60 | 1.52 | 1.90 | 2.79 | 0.54 |
| UK+Hess* (n = 325) | 0.94 | 1.50 | 1.80 | 1.69 | 2.04 | 3.04 | 0.52 |
| UK+Hagg** (n = 341) | 0.94 | 1.50 | 1.70 | 1.60 | 1.96 | 3.10 | 0.49 |

SD = Standard deviation.


Fig. 2. A schematic flowchart explaining the different steps of calibration and validation performed during the study.

3. Results and discussion

more trees are added, but produce a limited generalisation error (Prasad et al., 2006; Peters et al., 2007). The optimal mtry can be determined with the function tuneRF implemented in the R package randomForest. Here, mtry and ntrees were set to 2 and 500, respectively. All models were developed in R using the package randomForest version 4.6–12 (Liaw and Wiener, 2015), which is based on Breiman and Cutler's Fortran code (Breiman, 2001). The resulting calibration models were validated using the independent validation sets detailed above, with spectra collected under both stationary laboratory and on-line field conditions.
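The two sources of tree diversity named in this section, bootstrap sampling and a random subset of mtry predictors, can be illustrated with a small pure-Python sketch. This is not the randomForest implementation used by the authors: each "tree" here is a single-split regression stump, which is an assumption made to keep the example short, and the names `fit_stump`, `fit_forest` and `predict` are hypothetical. Only the ideas (sampling with replacement, mtry candidate variables per split, averaging over ntrees trees) come from the text.

```python
import random


def fit_stump(X, y, features):
    """Best single split (feature, threshold) among `features`, chosen by squared error."""
    best = None
    for f in features:
        # Excluding the largest value keeps both sides of every split non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, ml, mr)
    if best is None:  # every candidate feature was constant in this bootstrap sample
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def fit_forest(X, y, ntrees=500, mtry=2, seed=0):
    """Grow `ntrees` stumps, each on a bootstrap sample with `mtry` random predictors."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(ntrees):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap: sampling with replacement
        features = rng.sample(range(p), min(mtry, p))  # random subset of predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    return sum(ml if row[f] <= t else mr for f, t, ml, mr in forest) / len(forest)


# Toy data (not from the study): predictor 0 drives the response, predictor 2 is constant.
X = [[i, (i * 7) % 10, 5] for i in range(40)]
y = [1.0 if i < 20 else 3.0 for i in range(40)]
forest = fit_forest(X, y, ntrees=50, mtry=2, seed=1)
low, high = predict(forest, [0, 0, 5]), predict(forest, [39, 0, 5])
```

A full random forest grows deep trees and redraws the mtry candidate variables at every node; with stumps there is only one node per tree, so drawing the subset once per tree is equivalent.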

3.1. Laboratory measured soil properties and spectra

The descriptive statistics showed that OC content in Hessleskew field ranged between 1.5 and 3.04% for the calibration set and between 1.4 and 3.0% for the validation set, with mean and median values of 2.07 and 2.01% for the calibration set and 2.11 and 2.18% for the validation set. OC in Hagg field covered a slightly larger range, between 1.37 and 3.10% for the calibration set and between 1.34 and 3.06% for the validation set (Table 2). The mean OC of the UK dataset was 1.52%, with a range of 0.94 to 2.79%. The data in Table 2 indicate that the range and SD of the calibration and validation sets are comparable for both Hessleskew and Hagg fields, as well as for the UK datasets. It is well known that the concentration range or SD of the samples affects model accuracy. The range of the calibration set should be equal to or larger than that of the prediction set. However, a larger range or SD will introduce a higher RMSEP (Stenberg et al., 2010).

2.6. Evaluation of model accuracy

To evaluate model performance for the prediction of OC, the same validation samples (Table 2) were used for the laboratory and on-line scanned sample sets. Prediction performance was evaluated using the coefficient of determination (R2), the root mean square error of prediction (RMSEP) and the ratio of prediction deviation (RPD), which is the standard deviation of the laboratory wet analysis results divided by RMSEP. Viscarra Rossel et al. (2006) classified RPD values into six classes: excellent (RPD > 2.5), very good (RPD = 2.0–2.5), good (RPD = 1.8–2.0), fair (RPD = 1.4–1.8), poor (RPD = 1.0–1.4), and very poor (RPD < 1.0). This classification was adopted in this study to compare the different RF models in laboratory and on-line prediction.
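The three evaluation statistics and the RPD classes can be sketched as below. Two points are assumptions rather than statements from the paper: R2 is computed here as the squared Pearson correlation between measured and predicted values, and class boundaries are assigned to the higher class (for example, an RPD of exactly 2.0 is treated as very good). The function names are hypothetical.

```python
import math


def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured))


def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted values (assumed definition)."""
    mx = sum(measured) / len(measured)
    my = sum(predicted) / len(predicted)
    sxy = sum((m - mx) * (p - my) for m, p in zip(measured, predicted))
    sxx = sum((m - mx) ** 2 for m in measured)
    syy = sum((p - my) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)


def rpd(measured, predicted):
    """Sample standard deviation of the reference values divided by RMSEP."""
    mean = sum(measured) / len(measured)
    sd = math.sqrt(sum((m - mean) ** 2 for m in measured) / (len(measured) - 1))
    return sd / rmsep(measured, predicted)


def rpd_class(value):
    """RPD classes after Viscarra Rossel et al. (2006); boundary handling is an assumption."""
    if value > 2.5:
        return "excellent"
    if value >= 2.0:
        return "very good"
    if value >= 1.8:
        return "good"
    if value >= 1.4:
        return "fair"
    if value >= 1.0:
        return "poor"
    return "very poor"
```

For instance, `rpd_class(1.98)` returns "good" and `rpd_class(2.04)` returns "very good" under these boundary assumptions.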

3.2. Performance of calibration models for laboratory prediction

RF calibration models were validated under laboratory conditions using laboratory scanned spectra of the prediction sets. The best predictions were obtained with the SUK dataset, followed by the SF and NSUK datasets, in that order (Table 3). This is true for both the Hagg field (R2 = 0.84, RMSEP = 0.14%, and RPD = 2.55) and the Hessleskew field (R2 = 0.81, RMSEP = 0.14%, and RPD = 2.30), followed by the SF dataset (R2 = 0.75, RMSEP = 0.17%, and RPD = 2.05 for Hagg field, and R2 = 0.65, RMSEP = 0.20%, and RPD = 1.72 for Hessleskew field). The poorest result was obtained with the NSUK-based model in Hessleskew field, with R2 = 0.20, RMSEP = 0.30%, and RPD = 1.13. The results reveal that RF models based on spiking outperformed the local and non-spiked UK models in both fields. These results are supported by the findings of Nawar and Mouazen (2017), who showed that multivariate adaptive regression splines (MARS) provided robust predictions of total carbon (TC) with a general model (European spectral library of 529 samples spiked with local samples), with R2, RMSEP, and ratio of performance to interquartile distance (RPIQ) of 0.88, 0.19%, and 5.94, respectively. Moreover, these results are in line with those reported by Kuang and Mouazen (2013a), who spiked laboratory scanned data into a general European dataset of 425 samples collected from four different European countries and reported a very good laboratory prediction (RMSEP = 0.07–0.1% and RPD = 2.28–2.38).

Comparing the prediction performance between the two target fields, one may conclude that the prediction is better in the Hagg field than in the Hessleskew field (Table 3). This is true for all SF, NSUK and SUK models. In Hagg field, even the NSUK model provided a fair prediction (RPD = 1.61), according to the classification criterion suggested by Viscarra Rossel et al. (2006) for model performance based on RPD values. Examining the OC range (Table 2), only a minor difference can be observed between the two fields, with a slightly larger range for Hagg. This suggests a negligible effect of the range on prediction accuracy. The authors believe that the only reason for the lower prediction accuracy in Hessleskew field is its heavy clay to clay loam texture, compared with the light sandy loam texture of Hagg field. A previous study (Kuang and Mouazen, 2013b) observed that the negative influence of moisture content on the prediction performance of vis-NIR spectroscopy is amplified in heavy soils.

Table 3. Results of visible and near infrared (vis-NIR) spectroscopy prediction of soil organic carbon (OC) in Hessleskew and Hagg fields obtained with random forest (RF) models, comparing laboratory and on-line scanning, and single field (SF), spiked UK (SUK), and non-spiked UK (NSUK) datasets.

| Field | Scanning | SF R2 | SF RMSEP (%) | SF RPD | SUK R2 | SUK RMSEP (%) | SUK RPD | NSUK R2 | NSUK RMSEP (%) | NSUK RPD |
|---|---|---|---|---|---|---|---|---|---|---|
| Hessleskew | Laboratory | 0.65 | 0.20 | 1.72 | 0.81 | 0.14 | 2.30 | 0.20 | 0.30 | 1.13 |
| Hessleskew | On-line | 0.65 | 0.21 | 1.70 | 0.74 | 0.18 | 1.98 | 0.12 | 0.33 | 1.08 |
| Hagg | Laboratory | 0.75 | 0.17 | 2.05 | 0.84 | 0.14 | 2.55 | 0.60 | 0.22 | 1.61 |
| Hagg | On-line | 0.71 | 0.18 | 1.89 | 0.75 | 0.17 | 2.04 | 0.41 | 0.27 | 1.32 |

R2 = Coefficient of determination; RMSEP = Root mean square error of prediction; RPD = Ratio of prediction deviation.

3.3. Performance of calibration models for on-line prediction

The on-line collected spectra were used to predict soil OC using the calibration models developed in advance, as explained above. The on-line prediction based on the SUK dataset resulted in the best predictions, with R2 = 0.75, RMSEP = 0.17%, and RPD = 2.04 for Hagg field, and R2 = 0.74, RMSEP = 0.18%, and RPD = 1.98 for Hessleskew field (Table 3). The prediction results based on the SF dataset were less accurate than those of the SUK dataset. Again, poor results were produced by the NSUK dataset, with the lowest result for Hessleskew field (Table 3). The on-line prediction accuracy for Hagg field was again better than that for Hessleskew field, which can be attributed to the lighter soil texture of the former and to the high stone content of the latter. The on-line prediction of OC in this research shows a good to very good performance (RPD = 1.98–2.04 obtained with the SUK model). A larger RMSEP of 0.31% was reported by Nawar and Mouazen (2017) for the on-line measurement of OC, based on a general European dataset (529 samples) spiked with local UK samples using a MARS model. Importantly, Kuang and Mouazen (2013b) reported very good prediction results (RMSEP = 0.13–0.19%) for OC based on PLSR following spiking of local samples into a European dataset of 425 samples (range 0.45–3.48%), which is a similar error range to that of the SUK model reported in the present work (Table 3). Although fewer samples were used in the present research (240 compared to 425), prediction accuracies are comparable, thanks to the ability of RF to account for non-linearity between spectra and OC concentration.

Although there are significant absorption peaks associated with C–O, C–H + C–H and C–H + C–C overtones and combinations in the NIR spectral range, only a few successful cases of on-line measurement of TC and OC have been reported so far (Christy, 2008; Kodaira and Shibusawa, 2013; Kuang and Mouazen, 2013a; Aliah Baharom et al., 2015; Nawar and Mouazen, 2017). This might be attributed to several environmental factors (e.g., moisture content, surface roughness, texture, plant residues and gravels) affecting the on-line measurement (Mouazen et al., 2007; Stenberg et al., 2010). Moisture and soil texture were reported to have the largest effect on OC measurement with vis-NIR spectroscopy (Tekin et al., 2012), particularly for clay soils under wet conditions (Kuang and Mouazen, 2013b). This may also explain the lower prediction results in Hessleskew field, which is characterized by a heavy clay texture (clay content > 60%) combined with high moisture content (> 25%). The good results obtained with spiking in this study indicate the high performance of RF models in the laboratory and on-line predictions of OC. Because several ambient factors affect the on-line measurement (Mouazen et al., 2005), the prediction performance of the on-line measurement is lower than that of the laboratory measurement. Very good RF prediction results for OC and other soil properties have been reported for non-mobile measurement by Wang et al. (2015), with RPD = 3.23 and RMSE = 0.019%, which can be attributed to the ability of RF to handle the non-linear relationship that exists between the response variable and predictor variables (Lark, 1999; Guo et al., 2015). Moreover, the RF model requires no assumptions about relationships between the response variable and predictor variables, so it can handle non-linear relationships (Guo et al., 2015; Zhang et al., 2017).

3.4. Influence of dataset variability and size on models' prediction performance

It has been reported that soil variability affects the prediction accuracy of soil properties including OC (Kuang and Mouazen, 2011). Examining the range of OC contents in both SF datasets reveals that considerable variability exists in each field (Table 2), explaining the successful predictions (Table 3). Spiking local samples into the UK samples improved the model prediction performance, compared with that obtained using the SF and NSUK datasets (Table 3). The improvement in prediction accuracy was significant compared to the NSUK model, which can be explained by the inclusion of the target field variability in the SUK dataset. However, the difference in prediction accuracy between SUK and SF models was much smaller (Table 3), for both laboratory and on-line predictions. There are two potential explanations for the latter improvement, namely increased variability and/or dataset size. The SUK dataset for both fields covers larger concentration ranges than the single-field datasets (Table 2), particularly at the low end of the range. However, the differences in range and SD between the SF and SUK datasets are smaller than those reported in other studies. For example, Kuang and Mouazen (2011) reported an increase not only in R2 and RPD but also in RMSEP with increasing variability in the dataset, e.g., for a spiked European dataset compared to SF data. A different trend is obtained in the present work, where the increase in range and SD due to spiking of the UK dataset has resulted in increased R2 and RPD, while RMSEP decreases (Figs. 3 and 4). These different trends can be attributed to the large difference in variability between the SF and European datasets in the work of Kuang and Mouazen (2011), compared to the small difference in variability between the SF and SUK datasets in the present work (Table 2). Therefore, a large increase in variability would lead to increases not only in R2 and RPD but also in RMSEP, whereas a slight increase in variability of the spiked dataset over an SF dataset can lead to a decrease in RMSEP. Brown (2007) reported a great reduction in RMSE for predictions of SOC in upland soil samples from a catchment in Uganda by adding local samples to a global library. This is true particularly if the main dataset (to be spiked) is selected from a limited number of fields with textures similar to those of the target fields, as in the current work (Table 1). Therefore, vis-NIR model performance depends to a large extent on the degree of variability encountered in the dataset, including soil texture and moisture (Stenberg et al., 2010; Wang et al., 2010).

In addition to the effect of variability on prediction quality, the dataset size (i.e., sample number) has shown a considerable influence on the prediction performance for OC (Sankey et al., 2008; Guerrero et al., 2016; Nawar and Mouazen, 2017). The general trend is that prediction capability increases with sample number. The larger sample size of the SUK dataset might also have contributed to the improved prediction performance of SUK models compared to SF models. It was reported that a small dataset size has a negative effect that is difficult to measure and may result in very poor performance (Klement et al., 2008). Guerrero et al. (2016) suggested avoiding a large sample set, noting that it is not necessary for achieving better results and recognising the increased cost associated with a larger sample set. Compared to the prediction performance of a MARS model (Nawar and Mouazen, 2017) obtained with a spiked European dataset (RMSEP = 0.31%), and an ANN model (Kuang and Mouazen, 2013a) obtained with a spiked European dataset (RMSEP = 0.13–0.19%), the results for the on-line prediction of OC obtained in this study (RMSEP = 0.17%) are comparable, although the models were established with the smallest dataset (SUK). The current work confirms previous findings and provides additional evidence that advanced data mining methods (e.g., RF in the current work) can improve the on-line prediction performance of vis-NIR spectroscopy. Moreover, the concept of spiking a multi-field dataset with samples from the measured target field is a successful calibration procedure for both laboratory and on-line vis-NIR measurement of soil OC.

Fig. 3. Variation of ratio of prediction deviation (RPD) for different visible and near infrared (vis-NIR) datasets used to establish random forest (RF) models for the prediction of organic carbon (OC): (a) laboratory validation set, (b) on-line validation set. SD = standard deviation.

4. Conclusions In this study, random forest (RF) was utilized for modelling the visible and near infrared (vis-NIR) spectral data collected under stationary laboratory and on-line (mobile) scanning conditions for the prediction of soil organic carbon (OC) in two ﬁelds (e.g., Hesselskew and Hagg) using three dataset of single ﬁeld (SF), spiked UK (SUK) and non-spiked UK (NSUK). Generally, the performance of RF models varied in accordance with the dataset used, with diﬀerent OC concentration and dataset size. Spiking of selected samples collected from a target ﬁeld into a spectral library (SUK samples), resulted in the best laboratory and on-line predictions. The on-line prediction accuracies were classiﬁed as good to very good and were found to be comparable to those achieved so far with other researches based on diﬀerent nonlinear calibrations but using larger spectral library. Results also suggested that a moderate increase in variability in the spiked dataset is suﬃcient to increase R2 and RPD and reduce RMSEP, simultaneously.

Fig. 4. Variation of coefficient of determination (R2) and root mean square error of prediction (RMSEP) for different visible and near infrared (vis-NIR) datasets used to establish random forest (RF) models for the prediction of organic carbon (OC): (a) laboratory validation set; (b) on-line validation set. SD = standard deviation.


Therefore, the RF modelling approach coupled with spiking is recommended for high prediction performance of OC under both non-mobile laboratory and on-line field scanning conditions. Further work is being undertaken to improve the prediction accuracy of the system by accounting for the effect of other influencing parameters, e.g., soil moisture content (MC) and texture, on the results obtained. Moreover, important work is needed on optimizing the selection of the calibration dataset when spiking target field samples into an existing spectral library. The concept of spiking general calibration models needs to be tested for other soil properties and with datasets other than those reported in this study.
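The spiking step recommended above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the study: the spiking samples are drawn at random here, whereas this section does not say how they were selected, and the RF model fitting that would follow is omitted. The function name, the sample records and the number of spiked samples are all hypothetical.

```python
import random


def spike_library(library, target_field, n_spike, seed=0):
    """Add n_spike target-field samples to a spectral library.

    Returns (spiked_calibration_set, validation_set). The target-field
    samples that are not used for spiking are kept for validation, so no
    sample appears in both sets. Random selection is an assumption.
    """
    if not 0 < n_spike < len(target_field):
        raise ValueError("n_spike must leave at least one validation sample")
    rng = random.Random(seed)
    shuffled = list(target_field)
    rng.shuffle(shuffled)
    spiking, validation = shuffled[:n_spike], shuffled[n_spike:]
    return list(library) + spiking, validation


# Hypothetical records of the form (sample_id, OC %).
library = [("lib-%d" % i, 1.0 + 0.1 * i) for i in range(20)]
target = [("field-%d" % i, 1.5 + 0.05 * i) for i in range(10)]
calibration, validation = spike_library(library, target, n_spike=4)
print(len(calibration), len(validation))
```

Keeping the unspiked target-field samples apart for validation means the spiked model is always tested on samples it has not seen during calibration.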

Acknowledgements

The authors acknowledge the financial support received through the TruNject project (Nr. 36428-267209), which was jointly sponsored by Innovate UK and the Biotechnology and Biological Sciences Research Council (BBSRC). The authors also acknowledge the FWO funded Odysseus SiTeMan Project (Nr. G0F9216N).

References

Abdel Rahman, A.M., Pawling, J., Ryczko, M., Caudy, A.A., Dennis, J.W., 2014. Targeted metabolomics in cultured cells and tissues by mass spectrometry: method development and validation. Anal. Chim. Acta 845, 53–61.
Ackerson, J.P., Demattê, J.M., Morgan, C.L.S., 2015. Predicting clay content on field-moist intact tropical soils using a dried, ground VisNIR library with external parameter orthogonalization. Geoderma 259–260, 196–204.
Ackerson, J.P., Morgan, C.L.S., Ge, Y., 2017. Penetrometer-mounted VisNIR spectroscopy: application of EPO-PLS to in situ VisNIR spectra. Geoderma 286, 131–138.
Aliah Baharom, S.N., Shibusawa, S., Kodaira, M., Kanda, R., 2015. Multiple-depth mapping of soil properties using a visible and near infrared real-time soil sensor for a paddy field. Eng. Agric. Environ. Food 8, 13–17.
Araújo, S.R., Wetterlind, J., Demattê, J.A.M., Stenberg, B., 2014. Improving the prediction performance of a large tropical vis-NIR spectroscopic soil library from Brazil by clustering into smaller subsets or use of data mining calibration techniques. Eur. J. Soil Sci. 65, 718–729.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Bresson, L.M., Koch, C., Le Bissonnais, Y., Barriuso, E., Lecomte, V., 2001. Soil surface structure stabilization by municipal waste compost application. Soil Sci. Soc. Am. J. 65, 1804.
British Standard BS 7755 Section 3.8, 1995. Soil Quality. Chemical Methods. Determination of Organic and Total Carbon After Dry Combustion (Elementary Analysis). Equivalent to ISO 10694:1995. The British Standards Institution, London, UK.
Brodský, L., Vašát, R., Klement, A., Zádorová, T., Jakšík, O., 2013. Uncertainty propagation in VNIR reflectance spectroscopy soil organic carbon mapping. Geoderma 199, 54–63.
Brown, D.J., 2007. Using a global VNIR soil-spectral library for local soil characterization and landscape modeling in a 2nd-order Uganda watershed. Geoderma 140, 444–453.
Brown, D.J., Bricklemyer, R.S., Miller, P.R., 2005. Validation requirements for diffuse reflectance soil characterization models with a case study of VNIR soil C prediction in Montana. Geoderma 129, 251–267.
Christy, C.D., 2008. Real-time measurement of soil attributes using on-the-go near infrared reflectance spectroscopy. Comput. Electron. Agric. 61, 10–19.
Conforti, M., Castrignanò, A., Robustelli, G., Scarciglia, F., Stelluti, M., Buttafuoco, G., 2015. Laboratory-based Vis-NIR spectroscopy and partial least square regression with spatially correlated errors for predicting spatial variation of soil organic matter content. Catena 124, 60–67.
Díaz-Uriarte, R., Alvarez de Andrés, S., 2006. Gene selection and classification of microarray data using random forest. BMC Bioinf. 7, 1–13.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7, 1–26.
Ge, Y., Morgan, C.L.S., Ackerson, J.P., 2014. VisNIR spectra of dried ground soils predict properties of soils scanned moist and intact. Geoderma 221–222, 61–69.
Guo, P.T., Li, M.F., Luo, W., Tang, Q.F., Liu, Z.W., Lin, Z.M., 2015. Digital mapping of soil organic matter for rubber plantation at regional scale: an application of random forest plus residuals kriging approach. Geoderma 237–238, 49–59.
Guerrero, C., Wetterlind, J., Stenberg, B., Mouazen, A.M., Gabarrón-Galeote, M.A., Ruiz-Sinoga, J.D., Zornoza, R., Viscarra Rossel, R.A., 2016. Do we really need large spectral libraries for local scale SOC assessment with NIR spectroscopy? Soil Tillage Res. 155, 501–509.
Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844.
Ishwaran, H., 2007. Variable importance in binary regression trees and forests. Electron. J. Stat. 1, 519–537.
Klement, S., Madany Mamlouk, A., Martinetz, T., 2008. Reliability of cross-validation for SVMs in high-dimensional, low sample size scenarios. Artificial Neural Networks ICANN 2008. Springer, Berlin, Heidelberg, pp. 41–50.
Kodaira, M., Shibusawa, S., 2013. Using a mobile real-time soil visible-near infrared sensor for high resolution soil property mapping. Geoderma 199, 64–79.
Kuang, B., Mouazen, A.M., 2011. Calibration of visible and near infrared spectroscopy for soil analysis at the field scale on three European farms. Eur. J. Soil Sci. 62, 629–636.
Kuang, B., Mouazen, A.M., 2013a. Effect of spiking strategy and ratio on calibration of online visible and near infrared soil sensor for measurement in European farms. Soil Tillage Res. 128, 125–136.
Kuang, B., Mouazen, A.M., 2013b. Non-biased prediction of soil organic carbon and total nitrogen with vis-NIR spectroscopy, as affected by soil moisture content and texture. Biosyst. Eng. 114, 249–258.
Kuang, B., Mahmood, H.S., Quraishi, M.Z., Hoogmoed, W.B., Mouazen, A.M., van Henten, E.J., 2012. Sensing soil properties in the laboratory, in situ, and on-line. A review. Adv. Agron. 114, 155–223.
Kucharik, C.J., Brye, K.R., Norman, J.M., Foley, J.A., Gower, S.T., Bundy, L.G., 2001. Measurements and modeling of carbon and nitrogen cycling in agroecosystems of southern Wisconsin: potential for SOC sequestration during the next 50 years. Ecosystems 4, 237–258.
Lark, R.M., 1999. Soil-landform relationships at within-field scales: an investigation using continuous classification. Geoderma 92, 141–165.
Liaw, A., Wiener, M., 2015. Breiman and Cutler's Random Forests for Classification and Regression. R Package Version 4.6-12. At: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf (Accessed 28 April 2016).
Maleki, M.R., Mouazen, A.M., De Ketelaere, B., Ramon, H., De Baerdemaeker, J., 2008. On-the-go variable-rate phosphorus fertilisation based on a visible and near-infrared soil sensor. Biosyst. Eng. 99, 35–46.
McDowell, M.L., Bruland, G.L., Deenik, J.L., Grunwald, S., Knox, N.M., 2012. Soil total carbon analysis in Hawaiian soils with visible, near-infrared and mid-infrared diffuse reflectance spectroscopy. Geoderma 189–190, 312–320.
Morellos, A., Pantazi, X.E., Moshou, D., Alexandridis, T., Whetton, R., Tziotzios, G., Wiebensohn, J., Bill, R., Mouazen, A.M., 2016. Machine learning based prediction of soil total nitrogen, organic carbon and moisture content by using VIS-NIR spectroscopy. Biosyst. Eng. 152, 1–13.
Morgan, C.L.S., Waiser, T.H., Brown, D.J., Hallmark, C.T., 2009. Simulated in situ characterization of soil organic and inorganic carbon with visible near-infrared diffuse reflectance spectroscopy. Geoderma 151, 249–256.
Mouazen, A.M., 2006. Soil Sensing Device. International Publication, Published Under the Patent Cooperation Treaty (PCT). World Intellectual Property Organization, International Bureau, International Publication Number: WO2006/015463; PCT/BE2005/000129; G01N21/00.
Mouazen, A.M., De Baerdemaeker, J., Ramon, H., 2005. Towards development of on-line soil moisture content sensor using a fibre-type NIR spectrophotometer. Soil Tillage Res. 80, 171–183.
Mouazen, A.M., Karoui, R., Deckers, J., De Baerdemaeker, J., Ramon, H., 2007. Potential of visible and near-infrared spectroscopy to derive colour groups utilising the Munsell soil colour charts. Biosyst. Eng. 97, 131–143.
Mouazen, A.M., Kuang, B., De Baerdemaeker, J., Ramon, H., 2010. Comparison among principal component, partial least squares and back propagation neural network analyses for accuracy of measurement of selected soil properties with visible and near infrared spectroscopy. Geoderma 158, 23–31.
Muñoz, J.D., Kravchenko, A., 2011. Soil carbon mapping using on-the-go near infrared spectroscopy, topography and aerial photographs. Geoderma 166, 102–110.
Nawar, S., Mouazen, A.M., 2017. Predictive performance of mobile vis-near infrared spectroscopy for key soil properties at different geographical scales by using spiking and data mining techniques. Catena 151, 118–129.
Nawar, S., Buddenbaum, H., Hill, J., Kozak, J., Mouazen, A.M., 2016. Estimating the soil clay content and organic matter by means of different calibration methods of vis-NIR diffuse reflectance spectroscopy. Soil Tillage Res. 155, 510–522.
Norris, K., 2001. Applying Norris Derivatives. Understanding and correcting the factors which affect diffuse transmittance spectra. NIR News 12, 6–9.
Peters, J., De Baets, B., Verhoest, N.E.C., Samson, R., Degroeve, S., De Becker, P., Huybrechts, W., 2007. Random forests as a tool for ecohydrological distribution modelling. Ecol. Modell. 207, 304–318.
Prasad, A.M., Iverson, L.R., Liaw, A., 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9, 181–199.
Sankey, J.B., Brown, D.J., Bernard, M.L., Lawrence, R.L., 2008. Comparing local vs. global visible and near-infrared (VisNIR) diffuse reflectance spectroscopy (DRS) calibrations for the prediction of soil clay, organic C and inorganic C. Geoderma 148, 149–158.
Shepherd, K.D., Walsh, M.G., 2002. Development of reflectance spectral libraries for characterization of soil properties. Soil Sci. Soc. Am. J. 66, 988–998.
Soil Survey Staff, 1999. Soil taxonomy - a basic system of soil classification for making and interpreting soil surveys. Agricultural Handbook 436; Natural Resources Conservation Service, second edition. USDA, Washington DC, USA.
Stenberg, B., Viscarra Rossel, R.A., Mouazen, A.M., Wetterlind, J., 2010. Visible and near infrared spectroscopy in soil science. Adv. Agron. 107, 163–215.
Stevens, A., Ramirez, L., 2013. An Introduction to the Prospectr Package. At: https://cran.r-project.org/web/packages/prospectr/vignettes/prospectr-intro.pdf (Accessed 22 April 2016).
Stevens, A., Nocita, M., Tóth, G., Montanarella, L., van Wesemael, B., 2013. Prediction of soil organic carbon at the European scale by visible and near infrared reflectance spectroscopy. PLoS One 8, 1–113.
Tekin, Y., Tumsavas, Z., Mouazen, A.M., 2012. Effect of moisture content on prediction of organic carbon and pH using visible and near-infrared spectroscopy. Soil Sci. Soc. Am. J. 76, 188–198.
Viscarra Rossel, R.A., Behrens, T., 2010. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma 158, 46–54.
Viscarra Rossel, R.A., Chen, C., 2011. Digitally mapping the information content of visible–near infrared spectra of surficial Australian soils. Remote Sens. Environ. 115, 1443–1455.
Viscarra Rossel, R.A., Walvoort, D.J.J., McBratney, A.B., Janik, L.J., Skjemstad, J.O., 2006. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 131, 59–75.
Viscarra Rossel, R.A., Adamchuk, V.I., Sudduth, K.A., McKenzie, N.J., Lobsey, C., 2011. Proximal soil sensing: an effective approach for soil measurements in space and time. Adv. Agron. 113, 237–282.
Vohland, M., Besold, J., Hill, J., Fründ, H.C., 2011. Comparing different multivariate calibration methods for the determination of soil organic carbon pools with visible to near infrared spectroscopy. Geoderma 166, 198–205.
Wang, J., He, T., Lv, C., Chen, Y., Jian, W., 2010. Mapping soil organic matter based on land degradation spectral response units using Hyperion images. Int. J. Appl. Earth Obs. Geoinf. 12, S171–S180.
Wang, K., Zhang, C., Li, W., 2013. Predictive mapping of soil total nitrogen at a regional scale: a comparison between geographically weighted regression and cokriging. Appl. Geogr. 42, 73–85.
Wang, D., Chakraborty, S., Weindorf, D.C., Li, B., Sharma, A., Paul, S., Ali, M.N., 2015. Synthesized use of VisNIR DRS and PXRF for soil characterization: total carbon and total nitrogen. Geoderma 243–244, 157–167.
Wetterlind, J., Stenberg, B., Söderström, M., 2010. Increased sample point density in farm soil mapping by local calibration of visible and near infrared prediction models. Geoderma 156, 152–160.
Zhang, H., Wu, P., Yin, A., Yang, X., Zhang, M., Gao, C., 2017. Prediction of soil organic carbon in an intensively managed reclamation zone of eastern China: a comparison of multiple linear regressions and the random forest model. Sci. Total Environ. 592, 704–713.
