Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield



Computers and Electronics in Agriculture 153 (2018) 213–225

Contents lists available at ScienceDirect

Computers and Electronics in Agriculture journal homepage: www.elsevier.com/locate/compag

Original papers





Sami Khanal (a,*), John Fulton (b), Andrew Klopfenstein (b), Nathan Douridas (c), Scott Shearer (b)

(a) Department of Food, Agricultural and Biological Engineering, Ohio State University, Wooster, OH 44691, USA
(b) Department of Food, Agricultural and Biological Engineering, Ohio State University, Columbus, OH 43210, USA
(c) Farm Science Review, Ohio State University, London, OH 43140, USA

ARTICLE INFO

Keywords: Remote sensing; Soil; DEM; Yield; Mapping

ABSTRACT

Widespread adoption of precision agriculture requires timely acquisition of low-cost, high quality soil and crop yield maps. Integration of remotely sensed data and machine learning algorithms offers a cost- and time-effective approach for spatial prediction of soil properties and crop yield compared to conventional approaches. The objectives of this study were to: (i) evaluate the role of remotely sensed images; (ii) compare the performance of various machine learning algorithms; and (iii) identify the importance of remotely sensed image-derived variables, in spatial prediction of soil properties and corn yield. This study integrated field-based data on five soil properties (i.e., soil organic matter (SOM), cation exchange capacity (CEC), magnesium (Mg), potassium (K), and pH) and yield monitor-based corn yield data with multispectral aerial images and topographic data, both collected in 2013, from seven fields at the Molly Caren Farm near London, Ohio. Digital elevation model data, at a resolution of 1 m, were used to derive topographic properties of the fields. Multispectral images collected under bare-soil conditions, at a resolution of 0.30 m, were used to derive soil and vegetation indices. Models developed for prediction of soil properties and corn yield using linear regression (LM) and five machine learning algorithms (i.e., Random Forest (RF); Neural Network (NN); Support Vector Machine (SVM) with radial and linear kernel functions; Gradient Boosting Model (GBM); and Cubist (CU)) were evaluated in terms of coefficient of determination (R²) and root mean square error (RMSE). Machine learning algorithms outperformed the LM algorithm in most cases, with a higher R² and lower RMSE. Based on models for seven fields, on average, NN provided the highest accuracy for SOM (R² = 0.64, RMSE = 0.44) and CEC (R² = 0.67, RMSE = 2.35); SVM for K (R² = 0.21, RMSE = 0.49) and Mg (R² = 0.22, RMSE = 4.57); and GBM for pH (R² = 0.15, RMSE = 0.62). For corn yield, RF consistently outperformed other models and provided higher accuracy (R² = 0.53, RMSE = 0.97). Soil and vegetation indices based on bare-soil imagery played a more significant role in demonstrating in-field variability of corn yield and soil properties than topographic variables. The accuracy of the models developed for prediction of soil properties and corn yield observed in this study suggests that the approach of integrating remotely sensed data and machine learning algorithms is promising for mapping soil properties and corn yield at a local scale, which can be useful in locating areas of potential concern and implementing site-specific farming practices.

* Corresponding author. E-mail address: [email protected] (S. Khanal).

https://doi.org/10.1016/j.compag.2018.07.016
Received 9 January 2018; Received in revised form 17 April 2018; Accepted 8 July 2018
0168-1699/ © 2018 Elsevier B.V. All rights reserved.

1. Introduction

Accurate and detailed information on soil properties and crop health is essential for optimization of farm management practices for sustainable production of agricultural goods and services (Souza et al., 2016; Yao et al., 2016), as well as for environmental modeling, and environmental risk assessment and management. High resolution maps of soil properties and crop yields enable producers and the agricultural community to identify in-field variability in soil and crop health and target areas within the field for soil fertility interventions, improved crop productivity, and better economic outcomes.

Traditional approaches for mapping soil properties and crop yield have mostly relied on field surveys and the use of costly equipment. Soil sampling and laboratory analyses are conducted for evaluating soil health, and harvester-mounted yield monitors are used for understanding the spatial variability in crop yield. These approaches, however, are time consuming and expensive, especially when mapping needs to be done at regional, national, and global scales (Mulder et al., 2011; Yang et al., 2014). Furthermore, these approaches have several limitations. For example, yield monitor-based data can only be collected at harvest and, thus, cannot be used for in-season crop management. Also, these data are spatially coarse and fail to capture in-field variability in soil and crop health (Souza et al., 2016).

Remotely sensed images have the potential to overcome the limitations of traditional approaches and improve the spatial coverage of soil and crop yield data (Peng et al., 2015; Stevens et al., 2013; Yao et al., 2016). Studies have demonstrated that many soil properties can be estimated by integrating georeferenced field-collected soil and crop data with spectral properties of soil acquired by sensors onboard satellites and aircraft. Dobos et al. (2001) found the Advanced Very High Resolution Radiometer (AVHRR) satellite data and DEM-derived terrain variables to be powerful in characterizing soil-forming environments and delineating soil patterns on a regional scale. Scudiero et al. (2014) found multi-year spectral reflectance data from Landsat to be a reliable indicator of soil salinity in the western San Joaquin Valley in California, USA. Several studies have also focused on crop yield mapping by integrating remotely sensed images acquired from satellite (Lobell et al., 2015), aircraft (Yang et al., 2014), and unmanned aerial vehicles (Geipel et al., 2014; Shi et al., 2016).

Despite prior efforts, further exploration of the application of remotely sensed data for mapping of soil properties and crop yield is needed. The success in prediction and mapping of soil properties, and crop health and yield, using remotely sensed data depends to a large extent on the availability, quality, and timing of remotely sensed data collection (Blasch et al., 2015), as well as the approaches used for model development (Forkuor et al., 2017; Morellos et al., 2016). Prior studies have mostly focused on estimating crop yield and soil properties at regional scales rather than for individual fields (Lobell et al., 2015). These studies used satellite-acquired remotely sensed images with coarse spatial resolution. Mapping of soil properties and crop yield at coarse resolution is of limited use for resource assessment and management at a field scale, whereas maps at high resolution can help the agricultural and environmental community to cost-effectively detect and characterize the extent of soil and crop health issues. This information can be used for prescription-based farming that helps improve the economic outcomes and environmental footprint associated with agricultural practices.

A linear regression algorithm is the most commonly used approach to estimate crop yield and soil properties (Geipel et al., 2014; Lobell et al., 2015). However, it has limitations in handling the non-linear relationships between response and predictor variables that usually exist in heterogeneous agricultural landscapes. There are several machine learning algorithms that can overcome this limitation and provide better prediction of soil variables and crop yield. However, comparisons of the traditional linear regression algorithm to machine learning algorithms for prediction of soil properties and crop yield are limited. In addition to understanding the performance of various models in mapping soil properties and crop yield, there is a need to identify the relative importance of variables for enhancing the predictive ability of the models.

The objectives of this study were to: (i) examine the role of remotely sensed images; (ii) evaluate the performance of linear regression and machine learning algorithms; and (iii) identify the importance of remotely sensed image-derived variables, for prediction and mapping of soil properties and corn yield. Seven statistical models were developed for predicting corn yield and soil properties. Soil properties examined in this study included soil organic matter (SOM), cation exchange capacity (CEC), potassium (K), magnesium (Mg), and pH. Prior studies (Forkuor et al., 2017; Morellos et al., 2016) have used remotely sensed data for mapping of soil properties; however, this is to our knowledge the first evaluation of remotely sensed images of bare soil surface at a spatial resolution < 1 m from multiple fields for prediction and mapping of both soil properties and corn yield.

Table 1. Basic characteristics of the fields studied, including field size, slope, dominant soil map unit, dominant soil order, number of soil samples, and field management practices.

| Field | Size (ha) | Slope (%) | Soil map unit | Dominant soil order | Sample number | Tillage | Crop rotation |
|---|---|---|---|---|---|---|---|
| 1B | 11 | 4.37 | Ochraqualfs (40.7%), Argiaquolls (31%), Epiaqualfs (18%), Argiudolls (10.3%) | Alfisols | 27 | NT | C-C-S |
| 1C | 5.3 | 5.86 | Ochraqualfs (74%), Argiaquolls (26%) | Alfisols | 17 | CT | C-S-C |
| 1D | 6.5 | 4.35 | Ochraqualfs (94.8%), Argiaquolls (5.2%) | Alfisols | 20 | NT | C-S-C |
| 9A | 13.3 | 5.7 | Ochraqualfs (58%), Argiaquolls (42%) | Alfisols | 39 | CT | C-S-C |
| 12D | 17.5 | 4.98 | Argiaquolls (46%), Hapludalfs (27.9%), Ochraqualfs (23.8%) | Mollisols | 49 | CT | W-S-C |
| MISD | 12 | 9.26 | Ochraqualfs (82.5%), Argiaquolls (18.5%) | Alfisols | 36 | CT | S-W-C |
| PENIN | 3.8 | 9.6 | Ochraqualfs (98%) | Alfisols | 12 | NT | C-C-S |

Tillage: NT, no till; CT, conventional tillage (i.e., field cultivator was used prior to planting the crop). Crop rotation: C, corn; S, soybean; W, wheat.

2. Materials and methods

2.1. Study area

Fields examined in this study are located in the northwest part (83°26′14.3″–83°26′49.24″W, 39°56′37.82″–39°57′28.7″N) of Madison County, Ohio, USA. The dominant soil types in these fields are Ochraqualfs (Crosby-Lewisburg Complex), Argiaquolls (Kokomo Silty Clay Loam, Westland silty clay loam), and Hapludalfs (Miamian Silt Loam, Eldean silt loam, Thackery variant silt loam) (Table 1). These fields are gently rolling, with the mean slope ranging from 4.35 to 9.26%. The average elevation of the fields is 311 m. The mean annual rainfall (1981–2016) is 998 mm, with approximately 58% of annual rainfall occurring between April and September. The mean annual temperature is 10.9 °C, with daily temperatures ranging from −6.7 (minimum) to 29.2 °C (maximum). A strong spatial variability in soil properties was observed in the study area. Soil properties were characterized by a large range and high standard deviation, with SOM in the range of 1.2–4.9 (%), CEC of 6–27.3 (meq/100 g), K of 1.2–5.9 (%), Mg of 10.2–36.7 (%), and pH of 5–7.8 (Table 2).

2.2. Data

2.2.1. Soil and crop data

A total of 200 soil samples were collected from seven bare fields (Table 1) on October 1, 2013. In each field, samples were taken at a depth of 18 cm on 1-acre intervals. The samples were air-dried at 49 °C (120 °F) for 24 h, sieved, and sent to the Spectrum Analytic lab (Spectrum Analytic, 2017) for soil analyses. As field 12D has very different soil map units compared to the six other fields (Table 1), soil samples were classified into two dominant soil orders (Alfisols and Mollisols), and a “group” was introduced as an independent variable for model development. Corn yield data were available for only one field (i.e., 12D), and thus the models for corn yield prediction were focused on this field only. Corn yield data were recorded by a John Deere yield monitoring system during harvest. The yield monitor was calibrated before and


2.2.2. Remotely sensed data

Remotely sensed data used in this study included high spatial resolution multispectral images collected from bare fields and a digital elevation model (DEM). Multispectral images (Fig. 1) were obtained in May of 2013 under the Ohio Statewide Imagery Program. They were collected with a Leica ADS80 digital camera onboard an aircraft, rectified using LiDAR data, and have visible (red, green, and blue) and near-infrared wavebands at 0.30 m spatial resolution. Six soil and vegetation indices that were found useful in digital soil mapping (Ray et al., 2004) were calculated using combinations of spectral bands in the multispectral images. Table 3 provides further details on the spectral indices considered in the study. Terrain variables (Table 4) were extracted using DEM data, with 0.76 m resolution, available from the Ohio Geographically Referenced Information Program. Prior to the calculation of terrain variables for analyses, the DEM was pre-processed to generate a depression-free DEM. To ensure proper integration between the datasets, the DEM was resampled to 0.30 m resolution, the resolution of the multispectral images, using the bilinear interpolation method. To minimize the potential variance among pixels that might have been introduced by various factors, such as microtopography, image processing, and scanning, a low-pass filter with a 5 by 5 cell mask was applied to each band of the multispectral images and the DEM (Hively et al., 2011). Bare soil imagery for field 12D was classified into three soil color classes (dark, medium, and light) using supervised algorithms including support vector machine (with linear and radial kernels), random forest, and neural network in R software. Among these algorithms, support vector machine with the radial kernel provided the highest classification accuracy of 81%. Details of these algorithms are provided in Section 2.3.1.
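The 5 by 5 low-pass filtering step can be illustrated with a minimal Python sketch (the study itself used R and ArcGIS). It assumes the filter is a simple moving average and that windows are truncated at the raster edge; neither detail is stated in the text, and the function name `low_pass_filter` is hypothetical.

```python
def low_pass_filter(grid, size=5):
    """Replace each cell by the mean of the size x size window centred on it.

    Assumption: a plain moving-average kernel; windows are clipped at the
    raster boundary, so edge cells average fewer than size*size values.
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for rr in range(max(0, r - half), min(rows, r + half + 1)):
                for cc in range(max(0, c - half), min(cols, c + half + 1)):
                    total += grid[rr][cc]
                    count += 1
            out[r][c] = total / count
    return out


# Hypothetical 5 x 5 band with a single bright pixel in the centre.
band = [[0.0] * 5 for _ in range(5)]
band[2][2] = 25.0
smoothed = low_pass_filter(band)
print(smoothed[2][2])  # the spike is spread over the 25-cell window
```

A smoothing step of this kind trades local detail for lower pixel-to-pixel noise, which is the stated purpose of the filter in the text.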
Spectral bands, spectral indices, and terrain properties were extracted at locations used for collecting soil samples and yield data using ArcGIS software, and related with soil properties and corn yield to establish the relationship between remotely sensed and field-measured

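Table 2. Summary of soil properties and corn yield for the study area.

| Dataset | Variable | Minimum | Maximum | Mean | Standard deviation |
|---|---|---|---|---|---|
| All data | SOM (%) | 1.20 | 4.90 | 2.31 | 0.71 |
| All data | CEC (meq/100 g) | 6.00 | 27.30 | 15.31 | 4.18 |
| All data | K (%) | 1.20 | 5.90 | 2.28 | 0.56 |
| All data | Mg (%) | 10.20 | 36.70 | 26.24 | 5.24 |
| All data | pH | 5 | 7.8 | 6.75 | 0.65 |
| All data | Corn yield (t/ha) | 6.4 | 18.0 | 14.39 | 1.47 |
| Train data | SOM (%) | 1.2 | 4.9 | 2.32 | 0.68 |
| Train data | CEC (meq/100 g) | 6 | 27.3 | 15.35 | 4.17 |
| Train data | K (%) | 1.2 | 5.9 | 2.29 | 0.58 |
| Train data | Mg (%) | 10.2 | 36.7 | 26.44 | 5.10 |
| Train data | pH | 5 | 7.8 | 6.76 | 0.65 |
| Train data | Corn yield (t/ha) | 6.11 | 18.0 | 14.4 | 1.47 |
| Test data | SOM (%) | 1.2 | 4.6 | 2.26 | 0.77 |
| Test data | CEC (meq/100 g) | 7.1 | 22.5 | 15.1 | 4.15 |
| Test data | K (%) | 1.6 | 3.6 | 2.2 | 0.4 |
| Test data | Mg (%) | 11.7 | 36.7 | 25.36 | 5.6 |
| Test data | pH | 5.4 | 7.7 | 6.71 | 0.66 |
| Test data | Corn yield (t/ha) | 6.9 | 17.9 | 14.41 | 1.44 |

Note: Train and test data indicate soil samples and corn yield observations used for model development and validation, respectively.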
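The 4:1 random split into training and test sets described in Section 2.3 can be sketched in Python as follows. This is only an illustration: the SOM values are synthetic, the seed and the helper name `split_train_test` are hypothetical, and the text does not say how similarity of the means between the two sets was enforced, so the sketch merely reports the two means for inspection.

```python
import random
from statistics import mean


def split_train_test(values, train_fraction=0.8, seed=42):
    """Shuffle indices with a fixed seed and cut them at the 4:1 boundary."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(values)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(values))
    train = [values[i] for i in indices[:n_train]]
    test = [values[i] for i in indices[n_train:]]
    return train, test


# Synthetic stand-in for 200 SOM measurements (%), drawn from the range in Table 2.
data_rng = random.Random(0)
som = [round(data_rng.uniform(1.2, 4.9), 2) for _ in range(200)]

train, test = split_train_test(som)
print(len(train), len(test))  # 160 training and 40 test samples
print(round(mean(train), 2), round(mean(test), 2))  # compare the two means
```

If the two means differed markedly, one simple option would be to redraw the split with another seed; whether the authors did this or used a stratified procedure is not stated.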

during the harvest to minimize the potential error in yield estimates. Despite proper calibration, a yield monitor is likely to provide erroneous yield estimates at field edges because of changes in machine orientation and speed (Lyle et al., 2014). Thus, a 20 m buffer was established inside the field edges, and the yield data from that buffer were excluded from the analyses. Additionally, yield data were checked for errors, and errors were removed using the workflow discussed by Sudduth and Drumm (2007).

Fig. 1. Seven fields used in the study. Circles and stars indicate spatial locations of soil samples used for model development and validation, respectively. Field 12D was used for corn yield prediction. The background image is a multispectral image displayed with a combination of red, green and blue wavebands. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Table 3. Soil and vegetation indices considered for the analyses.

| Indices | Formula | Index property | Reference |
|---|---|---|---|
| Brightness Index (BI) | $\left(\frac{R^2 + G^2 + B^2}{3}\right)^{0.5}$ | Average reflectance magnitude | Ray et al. (2004) |
| Saturation Index (SI) | $\frac{R - B}{R + B}$ | Spectral slope | Ray et al. (2004) |
| Hue Index (HI) | $\frac{2R - G - B}{G - B}$ | Primary colors | Ray et al. (2004) |
| Coloration Index (CI) | $\frac{R - G}{R + G}$ | Soil color | Ray et al. (2004) |
| Redness Index (RI) | $\frac{R^2}{B \times G^3}$ | Hematite content | Ray et al. (2004) |
| Normalized Difference Vegetation Index (NDVI) | $\frac{NIR - R}{NIR + R}$ | Health and amount of vegetation | Ray et al. (2004) |
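As an illustration of how the Table 3 indices follow from the four wavebands, the short Python sketch below evaluates them for a single pixel. The function name `soil_indices` and the reflectance values are hypothetical, and the formulas are read as BI = ((R² + G² + B²)/3)^0.5, SI = (R − B)/(R + B), HI = (2R − G − B)/(G − B), CI = (R − G)/(R + G), RI = R²/(B × G³), and NDVI = (NIR − R)/(NIR + R).

```python
import math


def soil_indices(red, green, blue, nir):
    """Return the six Table 3 indices for one pixel.

    Assumption: inputs are positive band values on a common scale; no guard
    is included for zero denominators (e.g., green == blue in HI).
    """
    return {
        "BI": math.sqrt((red ** 2 + green ** 2 + blue ** 2) / 3.0),
        "SI": (red - blue) / (red + blue),
        "HI": (2 * red - green - blue) / (green - blue),
        "CI": (red - green) / (red + green),
        "RI": red ** 2 / (blue * green ** 3),
        "NDVI": (nir - red) / (nir + red),
    }


# Hypothetical bare-soil reflectances for a single pixel.
pixel = soil_indices(red=0.30, green=0.20, blue=0.10, nir=0.40)
for name, value in pixel.items():
    print(name, round(value, 3))
```

In practice the same arithmetic would be applied band-wise to whole rasters rather than to individual pixels.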

Table 4. Terrain variables considered in the study.

| Parameters | Definition | Units |
|---|---|---|
| Elevation (Elev) | Height above sea level | Meter |
| Slope | Inclination of the land surface | Degree |
| Aspect | Direction the slope faces | Degree |
| Roughness (Rough) | Difference between maximum and minimum elevation | – |
| Terrain Ruggedness Index (TRI) | Amount of elevation difference between neighboring areas | – |
| Topographic Position Index (TPI) | Measure of where a location is in the overall landscape | – |
| Flow Direction (FlowDir) | Path of water flow | – |

References: Allen et al. (2014); Davy and Koen (2014); Wilson et al. (2007); Riley (1999); Wilson et al. (2007); Kitchingman and Lai (2004).
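To make the Table 4 definitions of roughness, TRI, and TPI concrete, the following Python sketch computes them for one 3 by 3 elevation window. It assumes one common set of definitions: roughness as the range of the window, TRI as the mean absolute difference between the centre cell and its eight neighbours, and TPI as the centre elevation minus the neighbour mean. The text does not state the window size or exact formulas used, other TRI formulations exist, and the function name and elevations are hypothetical.

```python
def terrain_metrics(window):
    """Return (roughness, TRI, TPI) for the centre cell of a 3 x 3 window."""
    center = window[1][1]
    cells = [value for row in window for value in row]
    neighbors = cells[:4] + cells[5:]  # the eight cells around the centre
    roughness = max(cells) - min(cells)
    tri = sum(abs(value - center) for value in neighbors) / 8.0
    tpi = center - sum(neighbors) / 8.0
    return roughness, tri, tpi


# Hypothetical elevations (m) around a cell near the 311 m field average.
window = [
    [310.0, 311.0, 312.0],
    [310.0, 311.0, 313.0],
    [309.0, 311.0, 314.0],
]
roughness, tri, tpi = terrain_metrics(window)
print(roughness, tri, tpi)  # 5.0 1.25 -0.25
```

A negative TPI, as here, indicates that the cell sits slightly below its surroundings.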

data. To extract information from images and relate it to corn yield, a rectangle with the length equal to the harvester swath width (6.32 m) and width equal to the distance between logged data points (2 m) in a row was drawn around each logged yield data point. This was done because the size of the area represented by each logged point in a yield monitor is proportional to the header width and the distance travelled by the combine harvester between logged data points. Image-related information was extracted at these polygons using the zonal statistics function of the Spatial Analyst tool in ArcGIS. Each rectangle included 97 pixels from the images with 0.30 m resolution, and an average of these pixel values was related to a yield observation. It is our understanding that the DEM and multispectral images used in this study are the highest spatial resolution datasets ever used in mapping soil properties.

2.3. Statistical analyses

All statistical analyses were performed using R software. Six statistical algorithms, namely linear regression (LM), random forest regression (RF), support vector machine (SVM), stochastic gradient boosting model (GBM), neural network (NN), and cubist (CU), were used to develop models for predicting soil properties and corn yield; SVM was fitted with both linear and radial kernels, giving seven models in total. The performance of these models was then compared to determine the best model. For the model development, the statistical package “caret” was used. The “caret” package allows fitting and comparison of numerous linear and nonlinear regression models under a unified framework (Kuhn and Johnson, 2013). To provide an unbiased sense of model effectiveness, the total data were randomly split into training and test sets at a 4:1 ratio, where the training set was used for model calibration and the test set was used for model evaluation. Although the splitting of data was random, mean values of corn yield, SOM, CEC, K, Mg, and pH between the training and test sets were ensured to be similar so that the calibrated models were well trained to predict the range of soil properties and crop yield in the test dataset (Table 2).

2.3.1. Statistical models

During the model design, soil properties and corn yield were the dependent variables, and spectral, soil color class, and terrain variables were the independent or predictor variables. One soil property was modeled at a time as the dependent variable against all predictor variables. For each model, the adjusted R² and residual standard error were considered. During model development, random numbers are used for splitting data for resampling and parameter estimation (Kuhn and Johnson, 2013). To control this randomness, the same random number seed was set prior to the development and training of all the models, so that the same resampling sets were used during cross-validation and results were reproducible and comparable between models. Variables were scaled prior to model runs to ensure that all variables were on the same scale. Seven different models (briefly discussed below) were developed for each soil parameter and for corn yield estimation. Details on the background and mathematical functions of these models can be found in Kuhn and Johnson (2013).

Table 5. Explanatory variables selected for modeling of each soil parameter and corn yield after stepwise regression.

| Parameter | Spectral variables | Other variables |
|---|---|---|
| Soil Organic Matter (SOM) | Red, blue, NIR, NDVI, BI, CI, RI, SI | TRI, Group |
| Cation Exchange Capacity (CEC) | Red, green, blue, NDVI, BI | FlowDir |
| Potassium (K) | Green, blue, NIR, NDVI, BI, CI, HI, RI, SI | Elev, TRI, Slope |
| Magnesium (Mg) | NIR, BI, CI, HI, RI | Elev, Group, Slope, TRI, FlowDir |
| pH | Red, BI, HI, RI | Group, FlowDir, Slope, TRI, Rough |
| Yield (Approach 1) | Red, green, blue, NIR, NDVI, BI, CI, HI, RI, SI, soil class | Elev, Aspect, TRI, TPI, FlowDir |
| Yield (Approach 2) | – | SOM, CEC, Mg, K, pH |


absolute error. This model is easy to understand and interpret, and was found to have good predictive power (Minasny and McBratney, 2008). The model was tuned with 10 committees and 5 neighbors to reduce the root mean square error.

The performance of the machine learning algorithms was enhanced by tuning several parameters specific to each model (discussed above) using tenfold cross-validation with five repetitions in the “caret” package. A grid search strategy was used to optimize the model-specific parameters. The parameters selected for model tuning are provided in the supporting document (Table S1). To allow comparison among models, the same set of predictors (Table 5) was maintained for all other models. For corn yield prediction, two approaches were used. The first approach used only remotely sensed image-derived independent variables, and the second approach used only soil variables, which were predicted using remotely sensed data.

2.3.1.1. Linear regression model (LM). A linear regression model explains the dependent variable by means of a linear combination of predictor variables. For the LM analysis, the “lm” function was used. A stepwise regression was conducted to address the problem of multi-collinearity in the regression model. Stepwise regression identifies a subset of predictors based on their statistical significance, using approaches such as forward selection, backward elimination, or a combination of the two. For stepwise regression, the “stepAIC” function, based on both the forward and backward search methods, was used. This function is available in the “MASS” package of the R software, and it uses the AIC statistic as the criterion for variable selection. Table 5 provides the list of predictors identified by stepwise regression.

2.3.1.2. Random forest (RF). Random forest is an ensemble learning method that is used for both classification and regression problems. It operates by constructing multiple decision trees and outputting either the class (for classification) or the mean prediction (for regression) of the individual trees. Each tree in the forest is independently constructed using a unique bootstrap sample of the training data. At each node, the best split is selected from a randomly chosen subset of predictors. Unlike linear regression, it requires no assumption about the probability distribution of the predictor variables, and is robust against nonlinearity and overfitting. In this study, the RF model was developed using the “rf” method available through the “randomForest” package. The model’s performance was optimized by tuning parameters such as the number of predictors that are randomly sampled as candidates for each split (i.e., mtry) and the number of trees to grow in the forest (i.e., ntree).

2.3.2. Model assessment and validation

The performance of the statistical models was assessed using a repeated k-fold cross-validation resampling technique on the training set, and then validated with the test set. In k-fold cross-validation, samples are randomly partitioned into k sets of roughly equal size. A model is fit using all samples except the first subset, and the held-out samples are used to estimate performance measures. The first subset is then returned to the training set, and the process repeats with the second subset held out, and so on. This approach tests the performance of a model on every instance in the available data set without having used it in the training phase. In this study, a 10-fold cross-validation was repeated five times, resulting in 50 different held-out subsets for testing model efficacy. The results were then aggregated and summarized. Statistics, including adjusted R square (hereafter referred to as “R²”) and root mean squared error (RMSE), were used to evaluate the models’ capabilities in predicting soil properties. R² can be interpreted as the proportion of the information in the data that is explained by the model. It is a measure of correlation, not accuracy. A model with a high R² value may not necessarily lead to accurate prediction, and could systematically and significantly over- and/or under-predict the data. Thus, RMSE (Eq. (1)) was also used. RMSE is a function of the model residuals (i.e., observed values minus model predictions) and represents how far, on average, the residuals are from zero, that is, the average distance between observed values (O) and model predictions (P).
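The repeated k-fold procedure described above can be sketched in Python as an index generator. This is an illustration rather than the resampling code of the “caret” package: the function name `repeated_kfold` and the seed are hypothetical, and the figure of 160 training samples is simply 80% of the 200 soil samples.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=5, seed=1):
    """Yield (repeat, training indices, held-out indices) for each resample."""
    rng = random.Random(seed)  # one seed controls every shuffle
    for repeat in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into k folds of roughly equal size.
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            training = [i for i in indices if i not in held]
            yield repeat, training, held_out


splits = list(repeated_kfold(160, k=10, repeats=5))
print(len(splits))  # 10 folds x 5 repeats = 50 resamples
```

Within one repeat every sample is held out exactly once, so each model is evaluated only on samples it was not fitted to; the repeats then average out the luck of any single partition.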

2.3.1.3. Support Vector Machine (SVM). This is a machine-learning method that constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification or regression. A good separation between hyperplanes is achieved through different types of kernel functions, such as linear, radial, sigmoid, and polynomial. For simplicity, only linear and radial kernel functions were selected in this study. The SVM models were tuned based on bandwidth cost parameter and insensitive loss function.

2.3.1.4. Stochastic gradient boosting (SGB). This is another data mining approach that combines the advantages of nonparametric tree-based methods and the strengths of boosting algorithms. Instead of using the complete training data, it performs boosting on only a fraction of the training data at each step, leading to a gradual improvement in prediction accuracy. In this study, the model’s performance was optimized by tuning parameters such as tree depth, number of trees, and shrinkage.

2.3.1.5. Neural network (NN). This is a powerful nonlinear regression approach designed to model or mimic some properties of biological neural networks. It consists of interconnected processing elements called nodes or neurons that work together to produce an output function. The connections between nodes are described by weights, which are randomly chosen at the beginning but are adjusted iteratively if the predicted output does not match the output of the training dataset. The resilient backpropagation algorithm (rprop) was used because of its promising capabilities compared to other algorithms (Riedmiller and Braun, 1993). NN models were optimized by tuning parameters such as the number of hidden layers and the decay rate.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2} \qquad (1)$$
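Eq. (1) and the remark that R² measures correlation rather than accuracy can be illustrated with a small Python sketch. R² is computed here as the squared Pearson correlation between observed and predicted values, which is an assumption about how the statistic was obtained; an adjusted R² would additionally correct for the number of predictors. The function names and data are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error as in Eq. (1)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observations and predictions."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical predictions that are perfectly correlated with the
# observations but biased upward by one unit.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(r_squared(observed, predicted))  # perfect correlation
print(rmse(observed, predicted))       # yet every prediction is off by 1
```

The example shows why both statistics were reported: the systematic over-prediction is invisible to R² but appears directly in the RMSE.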

2.3.3. Variable importance

A variable importance measure was estimated to understand the relative importance of predictors to the outcome of the various models. The variable importance measure is estimated using a method specific to each model (Kuhn, 2017). For instance, in LM, it is computed based on the absolute value of the t-statistic of each model parameter. In RF, it is computed based on two common measures: increased mean square error (IncMSE) and increased impurity index (IncNodePurity). IncMSE measures the change in predictive power by constructing trees with and without a predictor. IncNodePurity measures the total decrease in node impurity from splitting on a predictor in the tree construction process, and is averaged over all trees. For this study, the IncMSE measure was used. In NN, variable importance is estimated based on the combination of the absolute values of weights. In CU, it is estimated based on the percentage of times each variable is used in a condition and/or linear model. The varImp function in the “caret” package of the R software was used for this purpose.

2.3.1.6. Cubist model (CUB). This is a data-mining technique for generating data-driven, rule-based predictive models. It works in a similar way to decision tree regression models. A tree is created where the terminal leaves contain linear regression models. At each step of the tree, there are intermediate linear models. The tree is then reduced to a set of rules that initially are paths from the top of the tree to the bottom, and the linear model is then adjusted to reduce the


Fig. 2. Correlation among 24 variables for all seven fields. The abbreviations of the variables are provided in Tables 3 and 4. Red text indicates correlation values that are not significant at p < 0.10. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

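Table 6. Model performance for estimation of soil properties for all seven fields.

Cross-validation with training dataset

| Model | SOM R² | SOM RMSE | CEC R² | CEC RMSE | K R² | K RMSE | Mg R² | Mg RMSE | pH R² | pH RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| LM | 0.55 | 0.47 | 0.65 | 2.40 | 0.23 | 0.55 | 0.21 | 4.69 | 0.14 | 0.63 |
| RF | 0.56 | 0.47 | 0.61 | 2.53 | 0.19 | 0.50 | 0.10 | 4.96 | 0.13 | 0.63 |
| SVML | 0.54 | 0.48 | 0.65 | 2.40 | 0.18 | 0.50 | 0.22 | 4.57 | 0.14 | 0.63 |
| SVMR | 0.56 | 0.47 | 0.62 | 2.50 | 0.21 | 0.49 | 0.11 | 4.85 | 0.09 | 0.64 |
| GBM | 0.58 | 0.46 | 0.63 | 2.45 | 0.18 | 0.50 | 0.08 | 4.99 | 0.15 | 0.62 |
| NN | 0.61 | 0.44 | 0.67 | 2.35 | 0.16 | 0.51 | 0.11 | 5.03 | 0.12 | 0.65 |
| CU | 0.60 | 0.46 | 0.61 | 2.51 | 0.22 | 0.51 | 0.11 | 5.04 | 0.16 | 0.63 |

Validation with test dataset

| Model | SOM R² | SOM RMSE | CEC R² | CEC RMSE | K R² | K RMSE | Mg R² | Mg RMSE | pH R² | pH RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| LM | 0.56 | 0.61 | 0.61 | 3.08 | 0.12 | 0.55 | 0.27 | 5.16 | 0.13 | 0.62 |
| RF | 0.56 | 0.53 | 0.63 | 3.02 | 0.15 | 0.53 | 0.01 | 5.97 | 0.13 | 0.62 |
| SVML | 0.49 | 0.63 | 0.60 | 3.16 | 0.08 | 0.56 | 0.23 | 5.21 | 0.09 | 0.64 |
| SVMR | 0.44 | 0.55 | 0.57 | 3.19 | 0.11 | 0.55 | 0.13 | 5.57 | 0.05 | 0.65 |
| GBM | 0.50 | 0.63 | 0.62 | 3.09 | 0.21 | 0.51 | 0.09 | 5.67 | 0.07 | 0.63 |
| NN | 0.55 | 0.53 | 0.53 | 3.10 | 0.11 | 0.53 | 0.02 | 6.10 | 0.12 | 0.62 |
| CU | 0.51 | 0.57 | 0.60 | 3.15 | 0.08 | 0.58 | 0.04 | 6.01 | 0.08 | 0.68 |

Overall dataset

| Model | SOM R² | SOM RMSE | CEC R² | CEC RMSE | K R² | K RMSE | Mg R² | Mg RMSE | pH R² | pH RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| LM | 0.55 | 0.50 | 0.64 | 2.54 | 0.21 | 0.55 | 0.22 | 4.78 | 0.13 | 0.63 |
| RF | 0.56 | 0.48 | 0.62 | 2.62 | 0.18 | 0.51 | 0.08 | 5.16 | 0.13 | 0.63 |
| SVML | 0.53 | 0.51 | 0.64 | 2.55 | 0.16 | 0.51 | 0.22 | 4.70 | 0.13 | 0.63 |
| SVMR | 0.54 | 0.49 | 0.61 | 2.64 | 0.19 | 0.50 | 0.11 | 4.99 | 0.09 | 0.64 |
| GBM | 0.56 | 0.49 | 0.63 | 2.58 | 0.18 | 0.50 | 0.08 | 5.13 | 0.13 | 0.62 |
| NN | 0.60 | 0.46 | 0.64 | 2.50 | 0.15 | 0.51 | 0.09 | 5.24 | 0.12 | 0.64 |
| CU | 0.58 | 0.48 | 0.61 | 2.64 | 0.19 | 0.52 | 0.10 | 5.23 | 0.14 | 0.64 |

Note: LM, Linear Model; RF, Random Forest; SVML, Support Vector Machine with linear kernel; SVMR, Support Vector Machine with radial kernel; GBM, Gradient Boosting Model; NN, Neural Network; CU, Cubist Model. Bold texts indicate the model with the least RMSE and the highest R² for each soil parameter.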

Computers and Electronics in Agriculture 153 (2018) 213–225

S. Khanal et al.

Fig. 3. Plots of predicted versus observed soil organic matter (% SOM), cation exchange capacity (CEC meq/100 g), magnesium (Mg), potassium (K), and pH for the training and test datasets. Predictions for SOM and CEC were based on neural network; K, Mg and pH were based on support vector machine with radial function, support vector machines with linear function, and gradient boosting model, respectively; and predicted corn yield was based on random forest algorithm.

3. Results

3.1. Relationship between soil properties, yield and remote sensing data

Soil properties were more highly correlated with the individual wavebands (Red, Green and Blue) and with the soil and vegetation indices than with the terrain properties of the fields. Terrain characteristics were correlated with bare soil imagery, but not as much as the soil properties. Among seven terrain properties, elevation was found to have the highest correlation with individual wavebands such as green and blue, and with RI (Fig. 2). Yield of field 12D was found to have a higher correlation with soil indices such as BI, HI and CI, followed by Mg, CEC, SOM and elevation of the field (results not shown).

3.2. Model performance

Assessment of the models used for prediction of five soil properties for all seven fields suggested that high resolution remotely sensed data can predict CEC with relatively higher accuracy, followed by SOM, Mg, K, and pH (Table 6). During cross-validation of the models, average R2 values ranged from 0.61 to 0.67 for CEC, 0.54 to 0.63 for SOM, 0.16 to 0.23 for K, 0.09 to 0.121 for Mg, and 0.09 to 0.16 for pH (Table 6). Variability in R2 values during cross-validation of the models is provided in the supporting document (Figs. S1 and S2). Except for a few models for CEC, K, and Mg, the performance of the majority of the models was found to be poor for the test sets (Table 6). With the test sets, R2 values were relatively lower, and RMSE values were relatively higher. This could be attributed to a larger number of samples allocated for model development and fewer for model validation.

While evaluating the models that provided the highest accuracy for prediction of soil properties at a field level, it was found that although the overall performance of the models developed by integrating the information of seven fields together was low, the models could predict soil properties of some fields with higher accuracy than others (Table 8, Fig. 3). For example, for field 1B, the NN model predicted SOM with R2 = 0.85, but for PENIN, R2 = 0.21. Similarly, the overall performance of models for K, Mg and pH was low (R2 = 0.19, 0.22 and 0.13 for K, Mg and pH, respectively), but the models predicted these variables with higher R2 values for some fields. Mg was predicted with R2 = 0.55 for 1C; K was predicted with R2 = 0.56 for 1B; and pH was


Table 7 Model performance for estimation of corn yield based on remotely sensed image-derived variables.

Model   Cross-Validation with    Validation with    Overall Dataset
        Training Dataset         Test Dataset
        R2    RMSE               R2    RMSE         R2    RMSE
LM      0.34  1.14               0.35  1.17         0.34  1.15
RF      0.52  0.97               0.56  0.97         0.53  0.97
SVML    0.33  1.15               0.35  1.18         0.33  1.16
SVMR    0.44  1.05               0.48  1.05         0.45  1.05
GBM     0.40  1.08               0.43  1.10         0.41  1.08
NN      0.37  1.11               0.39  1.14         0.37  1.12
CU      0.51  0.98               0.55  0.98         0.52  0.98

Note: Bold text indicates the model with the least RMSE and the highest R2.

predicted with R2 = 0.73 for 1C.

Table 7 shows the average R2 and RMSE of the seven models for both the cross-validation and validation stages of corn yield prediction using remotely sensed data-derived variables. R2 values ranged from 0.32 to 0.51 during model cross-validation, and from 0.30 to 0.51 during the validation phase.
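The two accuracy measures reported in Tables 6 and 7 can be illustrated with a short Python sketch. The sample values below are hypothetical, and this section does not state which definition of R2 was used, so both common forms are shown as assumptions: the squared Pearson correlation between observed and predicted values, and the explained-variance form 1 − SSres/SStot.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r2_squared_correlation(observed, predicted):
    """R2 as the squared Pearson correlation (insensitive to constant bias)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_op ** 2 / (s_oo * s_pp)


def r2_explained_variance(observed, predicted):
    """R2 as 1 - SS_res / SS_tot (penalises bias as well as scatter)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: predictions are offset from observations by a constant 0.5.
observed = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 3.5, 4.5, 5.5]

print(rmse(observed, predicted))                    # 0.5
print(r2_squared_correlation(observed, predicted))  # 1.0 (up to rounding)
print(r2_explained_variance(observed, predicted))   # 0.8
```

The example shows why R2 and RMSE are best read together: a model with a constant bias can have a perfect squared correlation while still having a non-zero RMSE.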

3.3. Comparison of model performance

For the five soil properties, no model was found to have consistently superior performance during both the cross-validation and validation phases (Table 6). However, most of the time, machine learning models performed better than LM. For instance, during model development, NN performed better in prediction of SOM and CEC, with a higher R2 and lower RMSE. SVM with linear and radial kernel functions performed better in prediction of Mg and K, respectively. GBM performed marginally better in prediction of pH. However, during model validation with the test dataset, RF performed better than NN for both SOM and CEC. GBM performed better in K prediction, and LM performed marginally better for pH.

For the two approaches to corn yield prediction (the first based only on remotely sensed data-derived variables, and the second based only on soil variables), RF and CU models consistently performed better (i.e., higher R2 and lower RMSE) than other models during both the cross-validation and validation phases. Although RF and CU models had the same R2, RF performed marginally better, as indicated by its lower RMSE (Table 7). The superiority of machine learning models over the LM model could be attributed to the existence of non-linear relationships between the response and predictor variables, which machine learning algorithms can capture during model development. Differences in accuracy of corn yield prediction models developed using only remotely sensed data-

Table 8 Model performance for prediction of soil properties at field level.

Field    Dataset  SOM           CEC           K             Mg            pH
                  R2    RMSE    R2    RMSE    R2    RMSE    R2    RMSE    R2    RMSE
Overall  All      0.60  0.46    0.64  2.50    0.19  0.50    0.22  4.70    0.13  0.62
         Train    0.61  0.44    0.67  2.35    0.11  0.55    0.22  4.57    0.15  0.62
         Test     0.55  0.53    0.53  3.10    0.19  0.50    0.23  5.21    0.13  0.62
1B       All      0.85  0.43    0.78  2.31    0.56  0.16    0.20  4.78    0.27  0.63
         Train    0.91  0.43    0.78  2.13    0.60  0.16    0.36  4.12    0.35  0.58
         Test     0.41  0.45    0.77  2.64    0.45  0.17    0.09  6.98    0.05  0.82
1C       All      0.75  0.42    0.71  2.27    0.15  0.44    0.55  3.25    0.73  0.45
         Train    0.66  0.44    0.73  2.18    0.23  0.41    0.55  3.19    0.85  0.44
         Test     1.00  0.26    1.00  2.87    1.00  0.60    1.00  3.65    0.77  0.48
1D       All      0.53  0.30    0.47  2.32    0.05  0.50    0.04  4.28    0.41  0.49
         Train    0.62  0.30    0.52  2.29    0.08  0.51    0.09  4.03    0.38  0.49
         Test     0.79  0.35    0.00  2.46    0.97  0.44    0.32  5.46    0.99  0.46
9A       All      0.76  0.49    0.67  2.55    0.07  0.42    0.18  5.37    0.42  0.66
         Train    0.76  0.40    0.67  2.31    0.07  0.46    0.18  4.98    0.42  0.56
         Test     0.76  0.68    0.92  3.15    0.51  0.30    0.23  6.37    0.05  0.90
PENIN    All      0.20  0.33    0.17  2.31    0.00  0.62    0.00  4.10    0.01  0.49
         Train    0.07  0.34    0.26  2.81    0.00  0.73    0.00  4.25    0.07  0.39
         Test     0.91  0.33    0.31  0.49    0.01  0.27    0.07  3.79    0.02  0.63
MISD     All      0.68  0.45    0.57  2.40    0.33  0.41    0.33  5.21    0.24  0.65
         Train    0.69  0.44    0.55  2.50    0.33  0.45    0.31  5.38    0.25  0.64
         Test     0.69  0.50    0.77  1.92    0.79  0.12    0.62  4.48    0.35  0.71
12D      All      0.32  0.48    0.67  2.07    0.37  0.68    0.22  3.82    0.36  0.38
         Train    0.31  0.46    0.72  2.03    0.37  0.70    0.17  4.02    0.33  0.39
         Test     0.43  0.59    0.20  2.27    0.64  0.50    0.69  2.29    0.52  0.31

*Text that is bold and highlighted for the overall dataset indicates R2 > 0.50.
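A field-level breakdown like the one in Table 8 can be sketched by grouping predictions by field before computing R2. The field names and values below are hypothetical, and R2 is assumed here to be the squared Pearson correlation. The sketch also illustrates a general property: with only two samples, the squared correlation is always 1 (whenever it is defined), so very high R2 values on small per-field test sets should be read with care. Whether this explains any particular entry in Table 8 is not stated in the source.

```python
import math
from collections import defaultdict


def squared_correlation(observed, predicted):
    """Squared Pearson correlation; NaN if either series has no variance."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    if s_oo == 0 or s_pp == 0:
        return math.nan
    return s_op ** 2 / (s_oo * s_pp)


def r2_by_field(records):
    """Group (field, observed, predicted) records and compute R2 per field."""
    grouped = defaultdict(lambda: ([], []))
    for field, obs, pred in records:
        grouped[field][0].append(obs)
        grouped[field][1].append(pred)
    return {field: squared_correlation(o, p) for field, (o, p) in grouped.items()}


# Hypothetical records: field "A" has four test samples, field "B" only two.
records = [
    ("A", 2.0, 2.2), ("A", 3.0, 2.8), ("A", 4.0, 4.4), ("A", 5.0, 4.6),
    ("B", 3.0, 3.9), ("B", 4.0, 4.0),
]

r2 = r2_by_field(records)
print(r2)  # field "B" comes out at (numerically) 1 despite a poor first prediction
```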


derived variables (Table 7) and using only soil parameters (Table S2; see supplementary document) suggested that remotely sensed data-derived variables have high potential to provide better estimates of corn yield than soil properties.

3.4. Variable importance in model development

Of the 18 variables considered in the model development, only a few were found to have significant influence on the prediction of soil properties (Table 5). While only six variables were found to have

Fig. 4. A comparison of importance scores for selected variables used in six statistical models for: (a) Soil Organic Matter, (b) Cation Exchange Capacity, (c) Potassium, (d) Magnesium, (e) pH, and (f) corn yield. Importance scores for variables were scaled between 0 and 100. Importance scores of variables in SVM models with radial and linear kernel functions were the same.
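The 0 to 100 scaling of importance scores mentioned in the caption of Fig. 4 can be sketched as follows. This is a minimal illustration assuming min-max scaling, in which the least important variable maps to 0 and the most important to 100; the source does not state the exact formula, and the raw scores below are hypothetical.

```python
def scale_importance(raw_scores):
    """Min-max scale a dict of raw importance scores to the range 0-100.

    Assumed scaling rule; dividing by the maximum alone is another
    common choice and would not force the smallest score to 0.
    """
    lo = min(raw_scores.values())
    hi = max(raw_scores.values())
    if hi == lo:
        # All variables equally important: no spread to scale.
        return {name: 0.0 for name in raw_scores}
    return {name: 100.0 * (value - lo) / (hi - lo)
            for name, value in raw_scores.items()}


# Hypothetical raw importance scores for three predictors.
raw = {"NIR": 0.42, "Red": 0.30, "Slope": 0.06}
scaled = scale_importance(raw)
print(scaled)
```

Because each model's scores are rescaled separately, the scaled values allow the ranking of variables to be compared across models, but not the absolute size of their effects.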


significant influence on the prediction of CEC, ten variables were found to be significant for prediction of SOM and Mg. For CEC, K and pH, 6, 12 and 9 variables, respectively, had significant influence. For corn yield prediction, 15 variables were found to have significant influence.

Based on the analyses of importance scores of selected variables for prediction of soil properties and corn yield (Table 5), the influence of variables was found to vary with the model. For instance, the variable Group contributed the most to the prediction accuracy of LM- and NN-based models for SOM, suggesting the importance of soil type in SOM estimation. However, the variable NIR contributed the most for the RF, SVM, GBM and CU models for SOM. In general, the predictors that contributed the most to the accuracy of the majority of models for SOM (Fig. 4a) were NIR, Red, and SI. Similarly, Red, Green and BI were the top three predictors for the majority of the models for CEC. Unlike SOM and CEC, importance scores of selected variables for Mg, K, pH and corn yield prediction were more evenly distributed. The top three variables were CI, HI, and BI for Mg; NIR, SI, and CI for K; TRI, Rough, and Slope for pH; and SI, CI, and HI for corn yield. Except for pH, spectral bands and indices were consistently identified as important predictors for soil properties. For corn yield prediction, variables such as FlowDir, SI, and NDVI were found to contribute the most to the RF model, which had superior performance compared to other models (Fig. 4f).

Fig. 5. Maps showing (a) visual image of bare soil, and predicted (b) SOM (%), (c) CEC, (d) K, (e) Mg, and (f) pH in the study region with observed values at sampling locations overlaid. Note: Predicted maps for SOM and CEC were based on the NN model. SVM with radial and linear kernel functions was used for K and Mg, respectively; and GBM was used for pH prediction.


Fig. 6. Maps of (left) observed, and (right) predicted corn yield in t/ha. Note: predicted yield map was based on RF model.
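One way to summarise the agreement between observed and predicted yield maps is the mean and standard deviation of the per-location percentage difference. The sketch below is illustrative only: the yield values are hypothetical, and both the sign convention (predicted minus observed, relative to observed) and the use of the sample standard deviation are assumptions not confirmed by the source.

```python
import statistics


def percent_differences(observed, predicted):
    """Per-location difference, in percent of the observed value."""
    return [100.0 * (p - o) / o for o, p in zip(observed, predicted)]


# Hypothetical yields in t/ha at four locations.
observed_yield = [10.0, 12.0, 8.0, 16.0]
predicted_yield = [11.0, 12.0, 9.0, 14.0]

diffs = percent_differences(observed_yield, predicted_yield)
mean_diff = statistics.mean(diffs)   # average percentage difference
sd_diff = statistics.stdev(diffs)    # sample standard deviation

print(diffs)      # [10.0, 0.0, 12.5, -12.5]
print(mean_diff)  # 2.5
print(sd_diff)
```

A small mean with a large standard deviation, as in this toy example, indicates that over- and under-predictions largely cancel on average while individual locations can still be off by a wide margin.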

3.5. Mapping the spatial distribution of soil properties and corn yield

The model with the highest R2 and the lowest RMSE (Table 6) was selected to create high resolution maps for each soil property and for corn yield (Figs. 5–6). The geographical distribution of predicted soil properties was found to be similar to that of observed soil properties for most of the fields. For fields 1B, 1C, 1D, 9A, and MISD, the geographical distributions of observed and predicted SOM and CEC were very similar. When the geographical distribution of soil properties was examined against three soil color classes, it was found that both SOM and CEC were highly correlated with soil color (results not shown). Dark colored soil corresponded to areas with higher SOM and CEC, and lower K and pH, and vice versa (Fig. 5a–c). This congruence of soil properties with soil color suggested that in-field variability of some soil properties can be estimated based on the color of bare soil images.

The model predicted corn yield reasonably well, with an average difference of 1.48% (± 8.85% standard deviation) between predicted and observed corn yield. The observed corn yield ranged from 6.1 to 17.9 t⋅ha−1, and predicted corn yield ranged from 9.5 to 15.24 t⋅ha−1 (Fig. 6). Except for a few locations in the center and west parts of the field, the geographical distribution of predicted corn yield was found to be similar to the pattern of observed corn yield, suggesting that the model could capture the spatial variability for most of the observed low and high spots in the field.

4. Discussion

4.1. Models for prediction of soil properties and corn yield

In this study, remotely sensed image-derived variables were integrated with field collected data to develop models for predicting soil properties and corn yield. Because the fields were heterogeneous (i.e., different in terms of agricultural practices and soil properties; Table 1) and models were developed for all seven fields combined, instead of for each field, the overall accuracy of the models was low. When the models' performances were evaluated for individual fields, however, the accuracies were higher for some fields (Table 8). This suggests that a model developed at a plot or field level performs better than a model developed at a larger geographic scale. Studies (Barnes et al., 2000; Stevens et al., 2013) have also noted that the use of multispectral data for predicting the spatial distribution of soil properties can achieve optimal results when the study is conducted at a plot level or in an area with uniform soil surface characteristics. Nonetheless, the range of R2 values obtained in this study is better than or comparable to other studies conducted at local (Morellos et al., 2016; Thomasson et al., 2001) or regional scales (Forkuor et al., 2017; Peng et al., 2015) that considered only spectral data or a combination of spectral, climate and terrain variables.

The accuracy of the models in this study might have been influenced by several factors, including the difference in timing between soil sample collection and image acquisition, and the use of a limited set of machine learning algorithms. For instance, the bare soil imagery was acquired in May and the soil samples were collected in October. Thus, a model developed from data collected around the same time might improve the accuracy of the prediction.

In the study, we observed improvement in prediction of soil properties and corn yield with the use of machine learning algorithms over the linear regression algorithm. For example, the NN model, closely followed by CU, produced the most accurate prediction for SOM. Similarly, CU and RF models produced the most accurate prediction of CEC and corn yield, respectively. These findings are similar to prior studies that have found models based on machine learning algorithms to be superior to ones using linear regression (Hahn and Gloaguen, 2008; Minasny and McBratney, 2008; Peng et al., 2015). The increase in model accuracy from using machine learning algorithms is due to the ability of these algorithms to handle non-linear relationships, which are typically observed between crop, soil, environmental and topographical variables.

This study suggested that no single machine learning algorithm is best for evaluating all soil parameters and crop yield at all locations, and that multiple models should be evaluated to enhance the accuracy of prediction estimates. Similar observations were made in prior studies as well. Ließ et al. (2016) reported that GBM performed better than NN, RF, and SVM in prediction of soil organic carbon in a complex tropical mountain landscape in Ecuador. However, Were et al. (2015) found SVM to be the best method to predict SOC stocks in the Afromontane Forest in Eastern Africa. Rossel and Behrens (2010) reported that the smallest RMSE values were found with the SVM approach used for prediction of three soil properties, including SOC, clay content, and pH. Jeong et al. (2016) found RF to be a more effective machine learning method for crop yield predictions at regional and global scales compared to LM. Uno et al. (2005) reported that NN provided better corn yield prediction than the LM approach. In this study, we evaluated the performance of seven of the most popular models. There are, however, other machine learning algorithms, such as multivariate adaptive regression splines, K Nearest Neighbor, and various types of neural network (e.g., convolutional, recursive, recurrent,


feedforward). The use of these machine learning algorithms may help improve the accuracy of the models beyond that of the algorithms examined in this study. Thus, we suggest examining the performance of these models in future work.

Another approach to improving model accuracy might be the use of advanced algorithms for variable selection, such as genetic algorithms. In this study, variables for the linear regression models were selected using the commonly used stepwise AIC approach. Genetic algorithms, inspired by the laws of genetics, try to find optimal solutions to complex problems, which are common in the context of agriculture. It is thus useful to explore the role of genetic algorithms in future studies related to soil properties and yield estimation.

4.2. Important variables for modeling of soil properties and corn yield

This study demonstrated that the information derived from multispectral images contributes more to the prediction of soil properties than terrain information derived from the DEM. This is consistent with the findings of Dobos et al. (2001), who combined coarse resolution AVHRR satellite data and DEM-derived terrain variables to characterize the soil-forming environment. Among the variables derived from the spectral information of multispectral images, it was interesting to note that NDVI of bare soil imagery was found to have significant influence on prediction of the majority of the soil properties evaluated in this study, including SOM, CEC, K and pH, although it is a commonly used index for representing vegetation growth. This finding is consistent with previous studies (Escadafal and Huete, 1993; Huete and Tucker, 1991) that also found NDVI to be sensitive to mineral constituents of soil.

This study also showed that bare soil imagery can be a good indicator of potential corn yield pattern in a field. Understanding the potential spatial variability in corn yield patterns based on soil spectral information and topographic conditions prior to planting might give farmers enough time to take preventive actions, such as levelling of high elevation areas and fertilization of areas with poor fertility, to maintain crop quality and yield.

4.3. Applicability of the models

The statistical models for prediction of soil properties in this study were calibrated and tested with data collected from seven fields in one year. The corn yield was predicted based on one field with one year of data. Thus, the models developed in this study cannot be generalized for the prediction of the same soil parameters and corn yield in other soil types and geographic regions. To reinforce the findings of this study as well as to strengthen the models' predictive capability over a wide range of soil properties and field management practices, further studies should be carried out with more data from multiple years, and from other fields with varying management practices and soil types. Nevertheless, the analyses presented in this study demonstrated that remotely sensed data and machine learning approaches could be adopted for cost-effective prediction of soil properties and crop yield at high spatial resolution.

5. Conclusions

High spatial resolution mapping of soil properties and crop yield is required for proper management of crop and soil health, which is needed for improving crop productivity and lowering the negative environmental footprint of agriculture. This study demonstrated that the use of high spatial resolution (< 1 m) multispectral bare soil imagery and terrain data can capture in-field variability of soil properties, including SOM, CEC, K, Mg, and pH, and corn yield. The performance of seven statistical models, including LM, RF, SVM with linear and radial kernel functions, GBM, NN, and CU, was compared for their ability to predict soil properties and corn yield, and the machine learning algorithms were found to outperform the LM algorithm most of the time. However, no model was found to consistently outperform other models for prediction of soil properties. NN performed better in prediction of SOM and CEC with a higher R2 and lower RMSE, while SVM models with linear and radial kernel functions performed better for prediction of Mg and K, respectively. For pH and corn yield prediction, GBM and RF models, respectively, performed better than other models. For seven fields, models for SOM, CEC, Mg, K, and pH showed R2 in the range of 0.2–0.85, 0.17–0.78, 0.0–0.55, 0.0–0.56, and 0.0–0.73, respectively. For corn yield, RF consistently outperformed other models and provided R2 = 0.53. These findings suggest that remotely sensed data can serve as a surrogate for more intensive soil sampling and costly yield monitoring systems.

Variables based on multispectral bare soil images were found to be the most important predictors for enhancing the models' accuracy for the spatial prediction of soil properties, including SOM, CEC, K and Mg. Topographic variables were found to have more influence in the prediction of pH. For corn yield, both spectral and topographic information were important. Despite the high variability in topography and farm management practices of the seven fields, the accuracy obtained in prediction of soil properties and corn yield in this study is promising for high resolution mapping of soil properties and corn yield at a local scale. High resolution maps of soil properties and crop yield help farmers to identify areas of potential concern prior to planting and manage them for improved crop productivity.

Acknowledgements

This work was supported in part by funds from programs at the Ohio State University: the Field to Faucet program (Grant No. F2F000004) and the Ohio Agricultural Research and Development Center (OARDC) (SEEDS: the OARDC Research Enhancement Competitive Grants Program).

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.compag.2018.07.016.

References

Allen, D.E., Pringle, M.J., Bray, S., Hall, T.J., O’Reagain, P.O., Phelps, D., Cobon, D.H., Bloesch, P.M., Dalal, R.C., 2014. What determines soil organic carbon stocks in the grazing lands of north-eastern Australia? Soil Res. 51, 695–706.
Barnes, E.M., Baker, M.G., et al., 2000. Multispectral data for mapping soil texture: possibilities and limitations. Appl. Eng. Agric. 16, 731–746.
Blasch, G., Spengler, D., Itzerott, S., Wessolek, G., 2015. Organic matter modeling at the landscape scale based on multitemporal soil pattern analysis using rapideye data. Remote Sens. 7, 11125–11150. https://doi.org/10.3390/rs70911125.
Davy, M.C., Koen, T.B., 2014. Variations in soil organic carbon for two soil types and six land uses in the Murray Catchment, New South Wales, Australia. Soil Res. 51, 631–644.
Dobos, E., Montanarella, L., Nègre, T., Micheli, E., 2001. A regional scale soil mapping approach using integrated AVHRR and DEM data. Int. J. Appl. Earth Obs. Geoinf. 3, 30–42.
Escadafal, R., Huete, A.R., 1993. Soil optical properties and environmental applications of remote sensing. Int. Arch. Photogramm. Remote Sens. 29, 709–715.
Forkuor, G., Hounkpatin, O.K.L., Welp, G., Thiel, M., 2017. High resolution mapping of soil properties using Remote Sensing variables in south-western Burkina Faso: a comparison of machine learning and multiple linear regression models. PLoS One 12, 1–21. https://doi.org/10.1371/journal.pone.0170478.
Geipel, J., Link, J., Claupein, W., 2014. Combined spectral and spatial modeling of corn yield based on aerial images and crop surface models acquired with an unmanned aircraft system. Remote Sens. 6, 10335–10355.
Hahn, C., Gloaguen, R., 2008. Estimation of soil types by non linear analysis of remote sensing data. Nonlinear Process. Geophys. 15, 115–126.
Hively, W.D., McCarty, G.W., Reeves, J.B., Lang, M.W., Oesterling, R.A., Delwiche, S.R., 2011. Use of airborne hyperspectral imagery to map soil properties in tilled agricultural fields. Appl. Environ. Soil Sci.
Huete, A.R., Tucker, C.J., 1991. Investigation of soil influences in AVHRR red and near-infrared vegetation index imagery. Int. J. Remote Sens. 12, 1223–1242.
Jeong, J.H., Resop, J.P., Mueller, N.D., Fleisher, D.H., Yun, K., Butler, E.E., Timlin, D.J., Shim, K.M., Gerber, J.S., Reddy, V.R., Kim, S.H., 2016. Random forests for global and regional crop yield predictions. PLoS One 11, 1–15. https://doi.org/10.1371/journal.pone.0156571.
Kitchingman, A., Lai, S., 2004. Inferences on potential seamount locations from mid-resolution bathymetric data. Focus (Madison). 32, 128.
Kuhn, M., 2017. CARET: Classification and Regression Training [WWW Document]. URL <https://github.com/topepo/caret/> (accessed 10.1.17).
Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling. https://doi.org/10.1007/978-1-4614-6849-3.
Ließ, M., Schmidt, J., Glaser, B., 2016. Improving the spatial prediction of soil organic carbon stocks in a complex tropical mountain landscape by methodological specifications in machine learning approaches. PLoS One 11, e0153673.
Lobell, D.B., Thau, D., Seifert, C., Engle, E., Little, B., 2015. A scalable satellite-based crop yield mapper. Remote Sens. Environ. 164, 324–333. https://doi.org/10.1016/j.rse.2015.04.021.
Lyle, G., Bryan, B.A., Ostendorf, B., 2014. Post-processing methods to eliminate erroneous grain yield measurements: review and directions for future development. Precis. Agric. 15, 377–402.
Minasny, B., McBratney, A.B., 2008. Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemom. Intell. Lab. Syst. 94, 72–79. https://doi.org/10.1016/j.chemolab.2008.06.003.
Morellos, A., Pantazi, X.-E., Moshou, D., Alexandridis, T., Whetton, R., Tziotzios, G., Wiebensohn, J., Bill, R., Mouazen, A.M., 2016. Machine learning based prediction of soil total nitrogen, organic carbon and moisture content by using VIS-NIR spectroscopy. Biosyst. Eng. 152, 104–116. https://doi.org/10.1016/j.biosystemseng.2016.04.018.
Mulder, V.L., De Bruin, S., Schaepman, M.E., Mayr, T.R., 2011. The use of remote sensing in soil and terrain mapping-a review. Geoderma 162, 1–19.
Peng, Y., Xiong, X., Adhikari, K., Knadel, M., Grunwald, S., Greve, M.H., 2015. Modeling soil organic carbon at regional scale by combining multi-spectral images with laboratory spectra. PLoS One 10. https://doi.org/10.1371/journal.pone.0142295.
Ray, S.S., Singh, J.P., Das, G., Panigrahy, S., 2004. Use of high resolution remote sensing data for generating site-specific soil management plan. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 35, 127–132.
Riedmiller, M., Braun, H., 1993. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: Neural Networks, International Conference on. pp. 586–591.
Riley, S.J., 1999. Index that quantifies topographic heterogeneity. Intermt. J. Sci. 5, 23–27.
Rossel, R.A.V., Behrens, T., 2010. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma 158, 46–54. https://doi.org/10.1016/j.geoderma.2009.12.025.
Scudiero, E., Skaggs, T.H., Corwin, D.L., 2014. Regional scale soil salinity evaluation using Landsat 7, western San Joaquin Valley, California, USA. Geoderma Reg. 2, 82–90.
Shi, Y., Thomasson, J.A., Murray, S.C., Pugh, N.A., Rooney, W.L., Shafian, S., Rajan, N., Rouze, G., Morgan, C.L.S., Neely, H.L., et al., 2016. Unmanned aerial vehicles for high-throughput phenotyping and agronomic research. PLoS One 11, e0159781.
Souza, E.G., Bazzi, C.L., Khosla, R., Uribe-Opazo, M.A., Reich, R.M., 2016. Interpolation type and data computation of crop yield maps is important for precision crop production. J. Plant Nutr. 39, 531–538. https://doi.org/10.1080/01904167.2015.1124893.
Spectrum Analytic, 2017. Analysis Services [WWW Document]. URL <https://www.spectrumanalytic.com/services/analysis/agsoil.html>.
Stevens, A., Nocita, M., Tóth, G., Montanarella, L., van Wesemael, B., 2013. Prediction of soil organic carbon at the European scale by visible and near infrared reflectance spectroscopy. PLoS One 8, e66409.
Sudduth, Kenneth, A., Drumm, 2007. Yield editor: software for removing errors from crop yield maps. Agron. J. 99, 1471–1482. https://doi.org/10.2134/agronj2006.0326.
Thomasson, J.A., Sui, R., Cox, M.S., Al–Rajehy, A., 2001. Soil reflectance sensing for determining soil properties in precision agriculture. Trans. ASAE 44, 1445–1453. https://doi.org/10.13031/2013.7002.
Uno, Y., Prasher, S.O., Lacroix, R., Goel, P.K., Karimi, Y., Viau, A., Patel, R.M., 2005. Artificial neural networks to predict corn yield from Compact Airborne Spectrographic Imager data. Comput. Electron. Agric. 47, 149–161. https://doi.org/10.1016/j.compag.2004.11.014.
Were, K., Bui, D.T., Dick, Ø.B., Singh, B.R., 2015. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 52, 394–403.
Wilson, M.F.J., O’Connell, B., Brown, C., Guinan, J.C., Grehan, A.J., 2007. Multiscale terrain analysis of multibeam bathymetry data for habitat mapping on the continental slope. Mar. Geod. 30, 3–35.
Yang, C., Westbrook, J.K., Suh, C.P.-C., Martin, D.E., Hoffmann, W.C., Lan, Y., Fritz, B.K., Goolsby, J.A., 2014. An airborne multispectral imaging system based on two consumer-grade cameras for agricultural remote sensing. Remote Sens. 6, 5257–5278.
Yao, R.J., Yang, J.S., Wu, D.H., Xie, W.P., Gao, P., Wang, X.P., 2016. Characterizing spatial-temporal changes of soil and crop parameters for precision management in a coastal rainfed agroecosystem. Agron. J. 108, 2462–2477.