Spatial prediction of soil organic carbon using machine learning techniques in western Iran

Spatial prediction of soil organic carbon using machine learning techniques in western Iran

Journal Pre-proof Spatial prediction of soil organic carbon using machine learning techniques in western Iran Hamid Mahmoudzadeh, Hamid Reza Matinfar...

2MB Sizes 2 Downloads 85 Views

Journal Pre-proof Spatial prediction of soil organic carbon using machine learning techniques in western Iran

Hamid Mahmoudzadeh, Hamid Reza Matinfar, Ruhollah Taghizadeh-Mehrjardi, Ruth Kerry PII:

S2352-0094(20)30009-2

DOI:

https://doi.org/10.1016/j.geodrs.2020.e00260

Reference:

GEODRS 260

To appear in:

Geoderma Regional

Received date:

2 July 2019

Revised date:

6 February 2020

Accepted date:

7 February 2020

Please cite this article as: H. Mahmoudzadeh, H.R. Matinfar, R. Taghizadeh-Mehrjardi, et al., Spatial prediction of soil organic carbon using machine learning techniques in western Iran, Geoderma Regional(2020), https://doi.org/10.1016/j.geodrs.2020.e00260

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier.

Journal Pre-proof

Spatial prediction of soil organic carbon using machine learning techniques in western Iran Hamid Mahmoudzadeh a, Hamid Reza Matinfar a, Ruhollah Taghizadeh-Mehrjardi

b,c

,

Ruth Kerryd a

Department of Soil Science, College of Agriculture, Lorestan University, Khorramabad,

Department of Geosciences, Soil Science and Geomorphology, University of Tübingen,

ro

b

of

Iran ([email protected])

Tübingen, Germany (ruhollah.taghizadeh- [email protected])

-p

Faculty of Agriculture and Natural Resources, Ardakan University, Ardakan, Iran

([email protected]) Department

of

Geography,

Brigham

Young

University,

Provo,

UT,

USA

lP

d

re

c

ur

na

([email protected])

Jo

Corresponding author: Hamid Reza Matinfar

Department of Soil Science College of Agriculture Lorestan University Khorramabad, Iran

Email: [email protected]

Spatial prediction of soil organic carbon using machine learning techniques in western Iran

Journal Pre-proof

Abstract Estimation of soil organic carbon (SOC) is very useful for accurate monitoring of carbon sequestration. However, there are still significant gaps in the knowledge of SOC reserves in many parts of the world, including western Iran. To partially fill the gap, 865 soil samples were used with 101 auxiliary variables and 5 machine learning (ML) algorithms

of

to digitally map SOC for the plough layer (0-30 cm) at a 90-m resolution in Kurdistan

ro

province. Results indicated that the most important auxiliary variables were rainfall

-p

(27.09 %), valley depth (26.66 %), terrain surface texture (23.42 %), air temperature

re

(20.18 %), channel network base level (16.61 %) and terrain vector roughness (14.47 %). Results also showed that Random Forests (RF) performed best in predicting the spatial

lP

distribution of SOC (RMSE=0.35% and R2 =0.60), compared to the other ML algorithms

na

(i.e. Cubist: CU, k-Nearest Neighbor: kNN, Extreme Gradient Boosting: XGBoost and Support Vector Machines: SVM). Furthermore, results estimated the total SOC stocks

ur

(SOCS) for the whole study area (~15208 Tg) and amounts under different land uses.

Jo

These were bareland (~6 Tg), orchard (~356 Tg), irrigated farming (~782 Tg), forest (~1773 Tg), grassland (~5991 Tg) and dry farming (~6297 Tg). As expected, the SOCS were highest in forest soils (652 g m-2 ) and lowest in bareland (437 g m-2 ). This result suggests that the conversion of native land (e.g. Forest) to cultivated land (e.g. Irrigated farming) could lead to significant loss of SOCS and appropriate management of land use could increase SOCS.

Keywords: Soil organic carbon stocks, machine learning algorithms, spatial prediction, random forests, land use, Aridisols 1.

Introduction

Journal Pre-proof Spatial information

about the

soil is

needed

increasingly

for making

informed

management decisions about the environment, particularly with regard to developing different types of agriculture, however, such information is often unavailable especially in western Iran (Nabiollahi et al., 2019). Soil is the largest terrestrial ecosystems sink of organic carbon and changes with atmospheric composition, climate, and land cover transformation. Soil organic carbon (SOC) is a key indicator of soil fertility and plays a

of

vital role in the sequestration of CO2 and other greenhouse gases associated with global

ro

warming. Therefore, there is increasing demand for SOC information around the world

-p

(Schillaci et al., 2019; Yigini and Panagos, 2016). Evaluating SOC provides information

re

about soil fertility, levels of sequestration of carbon, recovery of degraded soil, and the

lP

impact of land use change (Bonfatti et al., 2016).

Traditionally, soil property maps (including SOC maps) were created as polygon maps,

na

with properties from representative soil profiles reported as the soil mapping units (Brus et al., 2017). These traditional SOC maps are costly and time-consuming to create and

ur

difficult to update. Also, traditional description, sampling and mapping of soil properties

Jo

is unlikely to provide sufficient spatial resolution for the current uses that are being made of maps for ecosystem management. The variability of SOC is described by its relationships with environmental conditions (such as soil topography, climate, land use, vegetation, and soil type) or soil forming factors, that directly or indirectly affect the processes controlling the spatial distribution of SOC. Jenny(1994) explain soil development as a function of climate (c), organisms (o), relief (r), parent material (p), and time (a). This function was expanded by McBratney et al., (2003) to become the SCORPAN function. As a result, these environmental factors create a set of conditions that results in an increase or decrease in the amount of SOC.

Journal Pre-proof Thus SCORPAN factors are effective in predicting the spatial distribution of SOC in the Digital Soil Mapping (DSM). DSM has been developed to map spatial variability of soil properties with greater accuracy, economic efficiency and spatial resolution (Arrouays et al., 2014; McBratney et al. 2003). The relationships between soil and visible land surface properties, such as shape, position, and reflectance are used in DSM and traditional soil mapping, however,

of

both methods take different approaches to all aspects of soil mapping, including project

ro

planning and preparation, sampling design, field operation, soil measurement, predictive

-p

modeling, and geographic representation (Roecker et al., 2010).

re

A DSM is based on a statistical sampling in the field followed by laboratory analysis and

lP

mathematical prediction algorithms to create a spatial database of soil properties throughout the landscape, even areas that have not been sampled (Sanchez et al., 2009).

na

DSM is spatially flexible for different soil properties and modelling can be done differently depending on project aims and available input data (Brus et al., 2017). DSMs

ur

are pixel-based and can be more easily shown at higher resolutions compared to

Jo

traditional maps (Sanchez et al., 2009). DSM uses statistical and mathematical algorithms to predict and map soil attributes from easily obtained auxiliary variables (McBratney et al., 2003). Several machine learning (ML) algorithms have been used in DSM, including artificial neural networks (Li et al., 2013; Malone et al., 2009), support vector machines (Viscarra Rossel and Behrens, 2010), boosted regression trees (Martin et al., 2010), and random forests (Grimm et al., 2008; Vågen and Winowiecki, 2013; Wiesmeier et al., 2012). ML algorithms solve the problems of spatial autocorrelation, non-linearity in parametric and non-parametric statistical algorithms (Drake et al., 2006; Kuhn and Johnson, 2013; McRoberts, 2012; Padarian et al., 2019).

Journal Pre-proof In recent years, some researchers compared the performance of different machine learning algorithms for mapping SOC. For example, Somarathna et al., (2016) compared multiple linear regression, regression tree models, and support vector regression (SVR) for SOC mapping in Australia. Were et al., (2015) compared the performance of SVR, artificial neural network, and RF algorithms in Kenya. Due to the high variability in climate, soil properties, and land use management around the world, different algorithms have been

of

used for DSM for different locations and variables (Kumar and Lal, 2011). Algorithms

ro

such as random forests (Hengl et al., 2018; Nabiollahi et al., 2019; Reza Pahlavan Rad et

-p

al., 2014), support vector regression (Ottoy et al., 2017; Were et al., 2015), XGBoost (Fan

re

et al., 2018; Hengl et al., 2017; Meier et al., 2018) k-nearest neighbor (Mansuy et al.,

lP

2014; McRoberts, 2012), Cubist (Kidd et al., 2014; Taghizadeh-Mehrjardi et al., 2014; Viscarra Rossel et al., 2014) have all been used for DSM. In most SOC modeling studies,

na

the RF and SVM models performed better.

Kurdistan province has extensive forests and pastures and can play a vital role in the

ur

discussions about carbon sequestration, but the distribution of SOC has not been

Jo

investigated in Kurdistan using machine learning techniques. As there is a lack of information about the spatial variability of SOC in Iran, especially western Iran (Nabiollahi et al., 2019), the present study aims to evaluate the spatial distribution of SOC in Kurdistan province (29043 km2 ) located in western Iran using 865 soil samples, 101 covariates and 5 ML algorithms. In our study, we used a methodological structure to improve the prediction of SOC for the whole of Kurdistan province and determine how the environmental heterogeneity of Kurdistan influences the SOC distribution. The objectives of this study are to: (i) digitally map SOC with different ML algorithms, (ii) evaluate the effectiveness of models, (iii) identify the most important auxiliary variables controlling the spatial

Journal Pre-proof distribution of SOC in a semi-arid region and (iv) determine the spatial distribution of SOC and SOC stocks (SOCS) under different land uses. 2.

Materials and methods

2.1. Study area and soil data The study was conducted in Kurdistan province between latitudes of 34° 40ʹ and 36° 30ʹ North and longitudes of 45° 30ʹ and 48° 10ʹ 146.06° East (Figure 1a). This mountainous

of

region of western Iran covers an area of 29043 km2 with elevations ranging from 720 to

ro

3240 m above sea level (Figure 2a). Its geographic position and climatic variability results

-p

in a wide range of biodiversity, with six main water sources (main rivers: Sirvan, Ghezel

re

Ozan, Zarrinehroud, Siminrood, Razavar, Zab) and five main land uses (Figure 2b). Based

lP

on the Köppen method, the climate varies from dry to cool dry with a mean rainfall of 250 mm in the east to semi-humid with a mean rainfall of about 900 mm in the west of the

na

province (Mozafari and Safarpour, 2013). The eastern and central regions of the province are affected by dry and warm weather with steppe vegetation cover and the western

ur

regions are affected by rainy weather and covered with Mediterranean, oak forests. The

Jo

climate is classified as Csa (warm temperate- dry-hot summer), BSk (arid-steppe-cold arid) and Dsa (snow- dry- hot summer) in the Köppen-Geiger climate classification (http://koeppen-geiger.vu-wien.ac.at/present.htm) with a mean annual rainfall of 521 mm and mean annual temperature 12.38 °C. We used 865 soil data that were mainly produced by a Soil and Water Research Institute Project from 2000 to 2015. The plough layer of the soil was sampled (0-30 cm). The amount of SOC was determined by means of the wet oxidation with dichromate (Walkley and Black 1934).

Journal Pre-proof 2.2. Auxiliary variables Based on variables that can represent SCORPAN factors of soil development (McBratney et al., 2003) and review of literature (e.g. Bonfatti et al., 2016; Kumar and Lal, 2011; Kumar et al., 2012; Li et al., 2013; Liu et al., 2006; Schillaci et al., 2017; Shelukindo et al., 2014; Taghizadeh-Mehrjardi et al., 2016; Vasques et al., 2010), 101 auxiliary variables (predictors such as: remotely sensed attributes like Normalized Difference

of

Vegetation Index, Soil Adjusted Vegetation Index etc.; terrain attributes like DEM,

ro

aspect, slope etc. and climate attributes like rain, air and soil temperature) were selected to

-p

explain the spatial variability of SOC. Three source databases were used: 1) the median

re

values of atmospheric spectral bands of Landsat 5 TM obtained from 2000 to 2015, 2) a

lP

Digital elevation model (DEM) with a 30 m resolution and 3) climatic maps. Fifty-one auxiliary vegetation indices (see Appendix 1 for detailed information) were extracted

na

from various band ratios (Mulder et al., 2011). Other band ratios were also calculated to represent parent material and soil factors in the study area (clay minerals, iron oxide,

ur

salinity index etc.). The terrain description of the area was performed using a freely

Jo

downloadable DEM data (30 m resolution) obtained from http://srtm.csi.cgiar.org/. The pre-processed DEM was used to derive 47 terrain attributes using SAGA GIS (Conrad et al., 2015) and a brief description and reference to algorithms are given in Appendix 2. Climate data of annual average values of 30-year (1987–2017), i.e., mean annual temperature (AT, °C), mean annual rainfall (mm), and mean annual soil temperature (ST, °C)

were

recorded

at

10

weather

stations

in

Kurdistan

province

(http://kurdistanmet.telepol.ir) (Appendix 3); then to match them with the other raster grids they were interpolated in ARC GIS using the nearest neighbor method. All auxiliary variables obtained from the three source databases were aggregated (average resampling) or disaggregated (bilinear resampling) to a common grid of 90 × 90 m and

Journal Pre-proof then used as auxiliary variables for the development of quantitative spatial models. All of the data were in raster format and coordinates were converted to UTM WGS84 Zone 39 S. 2.3. Spatial prediction models Five supervised ML algorithms, Cubist: CU, k-Nearest Neighbor: kNN, Extreme Gradient Boosting: XGBoost, Support Vector Machines: SVM and Random Forest: RF, were used

of

for SOC prediction. These algorithms are capable of modelling complex non-linear

ro

correlations between SOC and auxiliary variables. Further information and mathematical

-p

equations for all ML algorithms has been published by (Kuhn and Johnson, 2013).

re

2.4.1 k-Nearest Neighbor

lP

The kNN- algorithm is a non-parametric approach to either univariate or multivariate prediction and classification (McRoberts, 2012) that is used to digitally map soil

na

properties using environmental factors (Mansuy et al., 2014). The algorithms require no specific assumptions about the distribution of predictor variables, can use external

ur

information to the geographic area, and are suitable for map creation, small area

Jo

estimation, and inference (McRoberts, 2012). The kNN construction is exclusively based on the samples from the training data, that are used to predict a new sample for regression, it classifies the kNNs samples in the predictor space using the mean of the K neighbor’s responses (Kuhn and Johnson, 2013). 2.4.2. Random forest RFs are a group of algorithms that have developed as an extension of Classification and Regression Tree analysis (CART) to enhance the prediction performance of the model (Breiman, 2001). RFs have been mainly used for classification problems (Boulesteix et al., 2012; Breiman, 2001; Olson et al., 2017). RFs are a data-driven statistical approach that has recently been employed in digital soil mapping studies (Akpa et al., 2016; Grimm

Journal Pre-proof et al., 2008; Hengl et al., 2018). Given that it shows good accuracy, is fast, simple to use, has useful internal estimation of error, and correlation and calculation of variable importance, this algorithm has been used to predict the spatial distribution of SOC recently (Grimm et al., 2008; Vågen et al., 2013; Wiesmeier et al., 2012). 2.4.3 Extreme Gradient Boosting The XGBoost algorithm is a simple tree-based algorithm which has recently become of

of

interest to DSM researchers (Fan et al., 2018). For instance, the XGBoost algorithm has

ro

been used to map soil nutrients (Hengl et al., 2017) and other soil properties (Meier et al.,

-p

2018; Ramcharan et al., 2018). Throughout the training phase in XGBoost, parallel

re

calculations are automatically performed for the functions. The learning processes in the

lP

XGBoost are incremental, so that the first learner is linked to the input data space, and then the second model is applied to focus the weak learner's error. This correction process

na

is repeated several times until the error criterion is eliminated. The final prediction of the model is obtained with the total prediction of each learner (Fan et al., 2018). It generates a

ur

weak learner at each step and accumulates it into the total model. If the weak learner for

Jo

each step is based on the gradient direction of the loss function, it can be called Gradient Boosting Machines (GBM) (Friedman, 2002). Gradient tree boosting is one algorithm that has been used in many applications; and can give advanced results compared to many standard classification approaches (Fan et al., 2018). 2.4.4 Support vector machine The SVM is a class of effective, very flexible modeling algorithms. The theory behind it was initially

developed in the framework of classification models (Kuhn and Johnson,

2013). SVM analysis proposed by Cortes and Vapnik, (1995) is one of the general supervised ML tools for classification and regression. There are many planes in SVM to divide input data into different classes (Wang et al., 2018). Previous studies showed a

Journal Pre-proof better learning ability and lower prediction errors compared with many other algorithms (Maynard and Levi, 2017; Were et al., 2015). 2.4.5 Cubist The CU method is a combination of several algorithms and improves rule-based predictive models that balance the need for accurate prediction versus the requirements of intelligibility (Kuhn and Johnson, 2013). CU models are more commonly recognizable

of

than neural networks and provide better results than those produced by simple algorithms

ro

such as multivariate linear regression (Kuhn et al., 2013; Viscarra Rossel et al., 2014). CU

-p

models have been designed to analyze big databases including hundreds of thousands to

re

millions of records and tens to thousands of numeric or nominal fields (Kuhn and

lP

Johnson, 2013). The CU model prediction is based on linear regression models instead of separate values (Bui et al., 2009; Taghizadeh-Mehrjardi et al., 2014). Recently, the CU

na

model has been used effectively in DSM in various countries (Kidd et al., 2014; Minasny et al., 2008; Viscarra Rossel et al., 2014).

ur

2.5 Validation of statistical algorithms

Jo

We used 10 fold cross-validation with 100 replications to estimate the prediction performance of five models (Hengl et al., 2015). This algorithm provides a structure for creating numerous train/test splits in the dataset and certifying that each data point is in the test set at least once. It performs reliably and is unbiased for smaller data sets (Zeraatpisheh et al., 2019). To compare model performance, we use R2 , RMSE and MAE. (Gomez et al., 2008; Yang et al., 2015) described these measures of model performance as follows:

Journal Pre-proof

where Pi and Oi are the predicted and observed SOC; n is the number of samples; 𝑃 and Ō are the means for the predicted and observed SOC. Models with the highest R2 and lowest RMSE and MAE were deemed the best models. The model calibration and validation were repeated using different auxiliary variables.

of

2.6 SOCS measurement

which

is

a

derived

from

the

world

soil

database

-p

density

ro

The SOCS were calculated using predicted SOC, sampling depth (30 cm) and bulk

(ftp://ftp.soilgrids.org/data/recent/.) and calculated with the following equation.

re

SOCS = [SOC × Ds × D]/100

lP

Where SOCS is soil organic carbon stocks (g m2 ), SOC is soil organic carbon concentrations (g kg), Ds is soil bulk density (Kg m3 ), and D is soil depth (m). Results and Discussion

ur

3.1 Descriptive statistics

na

3.

Jo

The descriptive statistics for SOC are given in Table 1. The results show that the mean SOC in the study area is 1.06 %. According to this table, the average SOC ranged from 0.04 to 4.44 % in the different types of land use. The difference between the minimum and maximum SOC is very high (4.4 %), this difference showed that the top soils are more easily affected by management, environmental variables and disturbance. The coefficient of variation (CV) for SOC was high (more than 51.04 %), which shows a wide range of values across the study area; it implies that the SOC has strong spatial dependence. Wilding (1985) classified CV values into 3 classes with high (CV > 35%), moderate (15%
Journal Pre-proof associated with the difference in land use, types of farming and management practices and topographic position in the study area. Falahatkar et al., (2016) also found this to be the case. Han et al., (2010) showed that soil properties such as SOC and pH are not independent but instead are stochastic variables promoted by the continuity of the soil forming processes and the change in the climate. Consequently, it is necessary to survey the spatial variability in SOC using modern statistical algorithms.

of

Linear correlations between SOC and predictor variables are shown in Table 2. SOC at

ro

the soil surface is positively correlated with catchment slope (r=0.43 and negatively

-p

correlated with topographic wetness index (TWI) (r=-0.29) respectively. The correlation

re

between climatic and topographic variables and the quantity of SOC has been documented

lP

previously (Garcia-Pausas et al., 2007). Climatic variables influence soil properties, vegetation cover, water retention, soil erosion and SOC decomposition. The effect of the

na

rain (r=0.35) as a predictor is important. Aspect did not show a high correlation coefficient (r=0.04) with SOC in our study area. Falahatkar et al., (2016) and Zhao and

ur

Chen, (2005) also found this to be the case in their studies. The positive correlation of

Jo

SOC with slope (r=0.35) in our study has been observed by Zhao and Chen, (2005) previously. Our results also indicated a negative correlation between SOC and the TWI, similar to the findings of Falahatkar et al., (2016) but contrary to the findings of Zhao and Chen, (2005) and Schwanghart and Jarmer, (2011). This negative result might have related to the great vegetation density with a small WI in the high elevation locations with less human activity (Falahatkar et al., 2016). The significant correlations of terrain variables with SOC (elevation r=- 0.46, slope r=- 0.43, and WI r=0.20 have been shown in the research of Li, (2010). The data were pre-processed to identify outliers before being used for SOC mapping. Outliers are data that have markedly higher or lower values than others in the dataset, and

Journal Pre-proof they can alter significantly the results of the statistical analysis. No such distributional outliers were found in the SOC data, and the relative normality of the data was confirmed (Figure 3). 3.2. Relative importance of covariates The relative importance of different variables for SOC prediction obtained by sensitivity analysis is shown in Figure 4. The most considerable positive contribution of any variable

of

to SOC prediction was for rain (27.09 %). The importance of rain and other attributes

ro

could be related to the fact that SOC is highly affected by climate and topography in this

-p

study area. Western parts of this area have high precipitation and vegetative cover, while

re

eastern parts have low precipitation and low vegetative cover. The RS indices were not

lP

recognized as the factors explaining a significant part of SOC variability in the study area. Similar to our results, Gomez et al., (2008) showed that the SOC content was not related

na

to the RS indices and the SOC predictions were not influenced by the vegetation cover. All models showed rainfall as the most important variable for explaining the spatial

ur

variations of SOC. This was expected for statistical and theoretical reasons. Rainfall was

Jo

categorized as the most significant variable that influenced the spatial distribution of SOC for a RF and boosted regression tree model at 0–30 cm (Wang et al., 2018). Similar to the results of our study, Davy and Koen, (2014) observed rainfall as the most important variable influencing SOCS in eastern Australia and Wang et al., (2018) showed that the relationship between climate variables (precipitation and temperature) and soil moisture, is a main driver of plant growth and net primary productivity, and therefore SOC dynamics. The effect of valley depth, terrain surface texture and catchment slope are complicated and maybe indirect. For example, there was an association between these attributes and climatic parameters such as precipitation and temperature, soil erosion and solar radiation.

Journal Pre-proof Previous results show that 71% of the SOC variability was described by terrain attributes (slope, aspect, slope gradient, curvature, WI, and proximity to the nearest stream) Thompson and Kolka, (2005) and others have documented that climate, land use, terrain and soil management control the spatial distribution of SOC (Adhikari et al., 2014; Guo et al., 2019; Minasny et al., 2013). The main terrain attributes which are usually used in SOC prediction and mapping across different scales are: elevation, slope, aspect,

ro

factor (Adhikari et al., 2014; Minasny et al., 2013).

of

curvatures, wetness index, relative height and slope positions, MRVBF, MRRTF and LS

-p

For the study area, the results of sensitivity analysis indicated that elevation made a larger

re

contribution to SOC prediction than the slope. The elevation might cause local variations

lP

in erosion, vegetation, management practice and sunlight which along with the chemical and physical composition of the organic matter are the major controllers of the

na

decomposition and weathering of soil organic matter. Several studies have found different results with larger effects of aspect and slope than elevation (Falahatkar et al., 2016;

ur

Mendoza Vega, 2002; Yimer et al., 2006). Ayoubi et al., (2012) showed that SOC was

Jo

lowest on steep slopes (more than 30%) that were affected by the highest soil erosion rates in western Iran. Falahatkar et al., (2016) showed that erosion and soil destruction decreased the SOC, yet several other researchers have shown a large correlation (r=0.74) between slope and SOCS (Liu et al., 2006; Tan et al., 2004; Yun-Qiang et al., 2009). 3.3. Performance of machine learning algorithms to predict SOC The results of the 10-fold cross-validation procedure used to compare the performance of kNN, RF, CU, XGBoost and SVM are summarized in Table 3. Based on the error indices (R2 , RMSE and ME), all models performed well, but RF was best. We found that there were slight differences between these models, suggesting that these six models were relatively stable in their predictive ability. The results showed that all models predicted

Journal Pre-proof SOC acceptably well based on the range of average values of R2 from 0.50 to 0.60; RMSEs between 0.35 and 0.42 and MAEs from 0.30 to 0.35. Results showed that the kNN algorithm could predict SOC with an R2 , RMSE and MAE of 0.50, 0.42 and 0.35, respectively. CU and SVM showed similar performance in predicting SOC, while XGBoost showed lower efficiency than the CU and SVM models in terms of R2 and prediction errors. The

of

RF model showed that climate, vegetation indices and terrain attributes are important

ro

auxiliary variables for SOC prediction. Selecting appropriate values for SVR parameters

-p

is a main step in SVR analysis which has a great effect on its performance and the

re

prediction accuracy for SOC (Hsu et al., 2007). Results show the performance of the SVR model for predicting SOC in the study area resulting from simulated algorithm analysis.

lP

The R2 , RMSE and MAE values for SOC predictions were 0.58, 0.36 and 0.3,

na

respectively. These results are consistent with those of (Were et al., 2015). The RF model had the lowest RMSE (0.35) and MAE values, as well as the highest R2 (0.6) value.

ur

However, SVR and CU prediction had similar results with RMSE, MAE, and R2 values of

Jo

0.36, 0.30, and 0.58, respectively. This showed quite different prediction accuracies for RF, CU and SVR.

Generally in DSM, model uncertainty may be due to differences in the amount of observed SOC or low predictor accuracy and modelling errors (Goidts et al., 2009; Yang et al., 2016). The increases in RMSEs for the RF, SVR, CU, XGBoost and kNN models can be seen in Table 3. The best model performances were selected for each model from ten-fold cross validation. Our results confirm that the RF and CU models are powerful ML algorithms for prediction, as reported by Akpa et al., (2016), Nawar and Mouazen, (2017), Wang et al., (2018), Zeraatpisheh et al., (2019) and Gomes et al., (2019).

Journal Pre-proof Prediction errors for SOC ranged from -0.24 to 0.19 (Figure 5) with areas where SOC was overestimated and underestimated being somewhat close together. The larger prediction errors were distributed throughout the study area and it can be concluded that the prediction error was related to the topography and vegetation of the area. Despite our expectation, the prediction error in the area with the lowest sample density is no more than area with the highest sample density.

of

3.4. Spatial prediction of SOC

ro

Figure 6 shows the map of SOC obtained by the best model. The spatial patterns of SOC

-p

in all models are sensible with large values in the western part of the study area, which is

re

dominated by forest and covered by dense vegetation, and small values in eastern areas

lP

with its calcareous soils and dry farming cultivation. The maps for all prediction models (not shown) showed both sharp and gradual changes across the study area. The highest

na

SOC values occur in western Kurdistan province which is located in a semi-humid climate with a high mean rainfall, whereas the lowest SOC occurred in eastern areas

ur

where the arid-steppe-cold arid climate dominates. SOC concentrations are the result of

Jo

the balance between inputs and outputs of carbon in soils; numerous factors influence this balance, according to environmental conditions (Davidson and Janssens, 2006) and local characteristics, such as vegetation (Jobbágy and Jackson, 2000) and topography (Tang et al., 2017). In our study precipitation and topography were the most important covariates driving the distribution of SOC, as shown by Gomes et al., (2019). These values are approximately twice the values of those for the dry biome (eastern part of the study area where the climate is drier and seasonal), where there is little accumulation of SOC. In the central area there is a greater relative accumulation of carbon due to high altitude, but less than under forest and more than eastern dry farming. 3.5. Spatial prediction of SOC stocks

Journal Pre-proof The spatial patterns of SOCS predicted by the RF model are shown in Figure 7. SOCS of the surface layer ranged from 187 to 1456 g m-2 . SOCS increased from east to west, with the highest stocks in the western, north and south-western areas. This showed the land cover effect because these were areas covered by grassland and forest. Furthermore, due to large above ground biomass, high rainfall, low temperatures, and higher altitudes, these areas provided favorable conditions for build-up of SOCS due to low SOC turnover at

of

these locations. This has also been observed by Were et al., (2015), Wiesmeier et al.,

ro

(2012) and Smith, (2008). The lowest stocks were found in the eastern areas which

-p

included dry farming locations and grassland that had been converted to croplands. This

erosive processes (non-protective plowing,

non-rotational

lP

removal after harvesting,

re

area had also experienced extensive erosion, and the low SOCS were due to biomass

farming and frequent tillage etc.), which breakdown the soil aggregates, changes aeration

na

and increase the speed of microbial decomposition and oxidation of SOM to release CO2 (Dorji et al., 2014; Eclesia et al., 2012; Were et al., 2015). Dorji et al., (2014) found very

ur

similar variation in SOCS to those in this study area.

Jo

3.6. SOC and SOCS under different land use Dorji et al., (2014) found, the spatial distribution of predicted SOCS under different land use types was quite similar to that of SOC. The highest and lowest SOCS were in the forest and bareland, respectively. The mean SOC and SOCS varied in soils under different land use types, and are shown in Table 4. The predicted SOCS for 90 × 90 m grids for the different land use types ranged from forest (651.83 g m-2 ) to bareland (437.16 31 g m-2 ) (Table 4). Taghizadeh-Mehrjardi et al., (2017), Were et al., (2015), Dorji et al., (2014) all found the same results. As expected, the spatial distribution of the predicted SOCS under different land use was quite similar to that of SOC. The highest SOC was found in forest (1.49 %) whereas bareland (0.95 %) and dry farming (1.09 %) soils had the lowest SOC

Journal Pre-proof (Table 4, Figure 8). Due to the high input of organic matter in forest and grassland, SOC was higher for native land than cropland. As expected, the variation in SOCS was very similar to SOC under all land uses. The mean SOC and SOCS showed an increasing trend from dry farming, irrigated farming, grassland, orchard and forest, respectively. This resulted from the management and related to disturbance activities in cultivated areas. We also estimated the total SOCS for the top 30 cm under different land use types. The

of

total SOCS in soils in different land use types increased as follows: bareland (6.35 Tg; 1

ro

Tg= 1012 g), orchard (356.45 Tg), irrigated farming (782.38 Tg), forest (1773.79 Tg),

-p

grassland (5991.61 Tg) and dry land (6297.91 Tg) (Table 4).

re

The total SOCS of the study area for the top 30 cm depth were 15209 Tg C (1 Tg=1012 g),

lP

indicating high SOC storage in the region, of which 41.4%, 39.4%, 11.7 %, 5.1%, 2.3% and 0.04% are in dry farming, grassland, forest, irrigated farming, orchard and bareland

na

areas, respectively (Table 4). The difference in total SOCS between land use types is

4.

Conclusions

ur

generally due to the difference in area coverage in addition to their difference in SOC.

Jo

This study shows that ML algorithms can be successfully used for mapping SOC and their uncertainty at a large scale in western Iran where there is a wide range in climate, land use and terrain attributes. Data quality can play important roles in the process. The procedural structure permits enhancement of the digital soil mapping process without loss of performance, selecting only the most important variables and the best model, and predicting the soil properties with the related uncertainty. The RF model showed the best performance in predicting SOC at the regional level, achieving the largest accuracy compared to the other ML algorithms tested. However, due to the similar performance of the RF, SVR, CU and XGBoost models, we suggest that all four models should be calibrated, and then the best results applied for spatial prediction of target soil attributes in

Journal Pre-proof other geo-graphical settings.

The most important covariates that influence SOC

distribution in the Kurdistan region are rainfall, valley depth, terrain surface texture, protection index and air temperature. Grassland and forest areas of the western Kurdistan region have the greatest SOC and SOCS, showing the highest potential for carbon sequestration compared to cropland. Sustainable land management strategies and climate change mitigation in the area will need an understanding of the spatial distribution of

of

SOCS; therefore, the SOCS map created here can identify areas for differential allocation

ro

of resources for carbon sequestration and fertility management. This type of information

-p

will be useful for evaluating the effect of land use change on total carbon cycling.

re

5. Acknowledgments

lP

Financial support for this work was provided by Lorestn University. We thank the Kurdistan agricultural and natural resources research and education center and Dr. Sadri

na

and Mr. Bostani for their intelligence and spiritual support. The work of Ruhollah Taghizadeh-Mehrjardi has been supported by the Alexander von Humboldt Foundation

Jo

ur

under the grant number: Ref 3.4 - 1164573 - IRN – GFHERMES-P.

Journal Pre-proof

Reference

Jo

ur

na

lP

re

-p

ro

of

Adhikari, K., Hartemink, A. E., Minasny, B., Kheir, R. B., Greve, M. B., and Greve, M. H., 2014, Digital mapping of soil organic carbon contents and stocks in Denmark: PloS one, v. 9, no. 8, p. e105519. Akpa, S. I., Odeh, I. O., Bishop, T. F., Hartemink, A. E., and Amapu, I. Y., 2016, Total soil organic carbon and carbon sequestration potential in Nigeria: Geoderma, v. 271, p. 202-215. Arrouays, D., Grundy, M. G., Hartemink, A. E., Hempel, J. W., Heuvelink, G. B., Hong, S. Y., Lagacherie, P., Lelyk, G., McBratney, A. B., and McKenzie, N. J., 2014, GlobalSoilMap: Toward a fine-resolution global grid of soil properties, Advances in agronomy, Volume 125, Elsevier, p. 93-134. Ayoubi, S., Karchegani, P. M., Mosaddeghi, M. R., and Honarjoo, N., 2012, Soil aggregation and organic carbon as affected by topography and land use change in western Iran: Soil and Tillage Research, v. 121, p. 18-26. Bonfatti, B. R., Hartemink, A. E., Giasson, E., Tornquist, C. G., and Adhikari, K., 2016, Digital mapping of soil carbon in a viticultural region of Southern Brazil: Geoderma, v. 261, p. 204-221. Boulesteix, A. L., Janitza, S., Kruppa, J., and König, I. R., 2012, Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, v. 2, no. 6, p. 493-507. Breiman, L., 2001, Random forests: Machine learning, v. 45, no. 1, p. 5-32. Brus, D., Hengl, T., Heuvelink, G., Kempen, B., Mulder, T., Olmedo, G., Poggio, L., Ribeiro, E., Thine Omuto, C., and Yigini, Y., 2017, Soil organic carbon mapping: GSOC map cookbook manual. Bui, E., Henderson, B., and Viergever, K., 2009, Using knowledge discovery with data mining from the Australian Soil Resource Information System database to inform soil carbon mapping in Australia: Global biogeochemical cycles, v. 23, no. 4. Conrad, O., Bechtel, B., Bock, M., Dietrich, H., Fischer, E., Gerlitz, L., Wehberg, J., Wichmann, V., and Böhner, J., 2015, System for automated geoscientific analyses (SAGA) v. 2.1. 4: Geoscientific Model Development, v. 8, no. 7, p. 1991-2007. Cortes, C., and Vapnik, V., 1995, Support-vector networks: Machine learning, v. 20, no. 3, p. 273297. Davidson, E. A., and Janssens, I. A., 2006, Temperature sensitivity of soil carbon decomposition and feedbacks to climate change: Nature, v. 440, no. 7081, p. 165. Davy, M., and Koen, T., 2014, Variations in soil organic carbon for two soil types and six land uses in the Murray Catchment, New South Wales, Australia: Soil Research, v. 51, no. 8, p. 631644. Dorji, T., Odeh, I. O., Field, D. J., and Baillie, I. C., 2014, Digital soil mapping of soil organic carbon stocks under different land use and land cover types in montane ecosystems, Eastern Himalayas: Forest Ecology and Management, v. 318, p. 91-102. Drake, J., Swisdak, M., Schoeffler, K., Rogers, B., and Kobayashi, S., 2006, Formation of secondary islands during magnetic reconnection: Geophysical research letters, v. 33, no. 13. Eclesia, R. P., Jobbagy, E. G., Jackson, R. B., Biganzoli, F., and Piñeiro, G., 2012, Shifts in soil organic carbon for plantation and pasture establishment in native forests and grasslands of South America: Global Change Biology, v. 18, no. 10, p. 3237-3251. Falahatkar, S., Hosseini, S. M., Ayoubi, S., and Salmanmahiny, A., 2016, Predicting soil organic carbon density using auxiliary environmental variables in northern Iran: Archives of Agronomy and Soil Science, v. 62, no. 3, p. 375-393.

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

Fan, J., Wang, X., Wu, L., Zhou, H., Zhang, F., Yu, X., Lu, X., and Xiang, Y., 2018, Comparison of Support Vector Machine and Extreme Gradient Boosting for predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: A case study in China: Energy conversion and management, v. 164, p. 102-111. Friedman, J. H., 2002, Stochastic gradient boosting: Computational statistics & data analysis, v. 38, no. 4, p. 367-378. Garcia-Pausas, J., Casals, P., Camarero, L., Huguet, C., Sebastia, M.-T., Thompson, R., and Romanya, J., 2007, Soil organic carbon storage in mountain grasslands of the Pyrenees: effects of climate and topography: Biogeochemistry, v. 82, no. 3, p. 279-289. Goidts, E., Van Wesemael, B., and Crucifix, M., 2009, Magnitude and sources of uncertainties in soil organic carbon (SOC) stock assessments at various scales: European Journal of Soil Science, v. 60, no. 5, p. 723-739. Gomes, L. C., Faria, R. M., de Souza, E., Veloso, G. V., Schaefer, C. E. G., and Fernandes Filho, E. I., 2019, Modelling and mapping soil organic carbon stocks in Brazil: Geoderma, v. 340, p. 337-350. Gomez, C., Rossel, R. A. V., and McBratney, A. B., 2008, Soil organic carbon prediction by hyperspectral remote sensing and field vis-NIR spectroscopy: An Australian case study: Geoderma, v. 146, no. 3-4, p. 403-411. Grimm, R., Behrens, T., Märker, M., and Elsenbeer, H., 2008, Soil organic carbon concentrations and stocks on Barro Colorado Island—Digital soil mapping using Random Forests analysis: Geoderma, v. 146, no. 1-2, p. 102-113. Guo, Z., Adhikari, K., Chellasamy, M., Greve, M. B., Owens, P. R., and Greve, M. H., 2019, Selection of terrain attributes and its scale dependency on soil organic carbon prediction: Geoderma, v. 340, p. 303-312. Han, F., Hu, W., Zheng, J., Du, F., and Zhang, X., 2010, Spatial variability of soil organic carbon in a catchment of the Loess Plateau: Acta Agriculturae Scandinavica Section B–Soil and Plant Science, v. 60, no. 2, p. 136-143. Hengl, T., de Jesus, J. M., Heuvelink, G. B., Gonzalez, M. R., Kilibarda, M., Blagotić, A., Shangguan, W., Wright, M. N., Geng, X., and Bauer-Marschallinger, B., 2017, SoilGrids250m: Global gridded soil information based on machine learning: PLoS one, v. 12, no. 2, p. e0169748. Hengl, T., Heuvelink, G. B., Kempen, B., Leenaars, J. G., Walsh, M. G., Shepherd, K. D., Sila, A., MacMillan, R. A., de Jesus, J. M., and Tamene, L., 2015, Mapping soil properties of Africa at 250 m resolution: Random forests significantly improve current predictions: PloS one, v. 10, no. 6, p. e0125814. Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B., and Gräler, B., 2018, Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables: PeerJ, v. 6, p. e5518. Hsu, M., Yu, E. Y., Singh, S. M., and Lue, N. F., 2007, Mutual dependence of Candida albicans Est1p and Est3p in telomerase assembly and activation: Eukaryotic cell, v. 6, no. 8, p. 1330-1338. Jenny, H., 1994, Factors of soil formation: a system of quantitative pedology, Courier Corporation. Jobbágy, E. G., and Jackson, R. B., 2000, The vertical distribution of soil organic carbon and its relation to climate and vegetation: Ecological applications, v. 10, no. 2, p. 423-436. Kidd, D., Malone, B., McBratney, A., Minasny, B., and Webb, M., 2014, Digital mapping of a soil drainage index for irrigated enterprise suitability in Tasmania, Australia: Soil Research, v. 52, no. 2, p. 107-119. Kuhn, M., and Johnson, K., 2013, Applied predictive modeling, Springer. Kuhn, M., Weston, S., Keefer, C., Coulter, N., and Quinlan, R., 2013, Cubist: Rule-and InstanceBased Regression Modeling. R package version 0.0. 15. Kumar, S., and Lal, R., 2011, Mapping the organic carbon stocks of surface soils using local spatial interpolator: Journal of Environmental Monitoring, v. 13, no. 11, p. 3128-3135. Kumar, S., Lal, R., and Liu, D., 2012, A geographically weighted regression kriging approach for mapping soil organic carbon stock: Geoderma, v. 189, p. 627-634.

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

Li, Q.-q., Yue, T.-x., Wang, C.-q., Zhang, W.-j., Yu, Y., Li, B., Yang, J., and Bai, G.-c., 2013, Spatially distributed modeling of soil organic matter across China: An application of artificial neural network approach: Catena, v. 104, p. 210-218. Li, Y., 2010, Can the spatial prediction of soil organic matter contents at various sampling scales be improved by using regression kriging with auxiliary information?: Geoderma, v. 159, no. 1-2, p. 63-75. Liu, D., Wang, Z., Zhang, B., Song, K., Li, X., Li, J., Li, F., and Duan, H., 2006, Spatial distribution of soil organic carbon and analysis of related factors in croplands of the black soil region, Northeast China: Agriculture, Ecosystems & Environment, v. 113, no. 1-4, p. 73-81. Malone, B. P., McBratney, A., Minasny, B., and Laslett, G., 2009, Mapping continuous depth functions of soil carbon storage and available water capacity: Geoderma, v. 154, no. 1-2, p. 138152. Mansuy, N., Thiffault, E., Paré, D., Bernier, P., Guindon, L., Villemaire, P., Poirier, V., and Beaudoin, A., 2014, Digital mapping of soil properties in Canadian managed forests at 250 m of resolution using the k-nearest neighbor method: Geoderma, v. 235, p. 59-73. Martin, M., Wattenbach, M., Smith, P., Meersmans, J., Jolivet, C., Boulonne, L., and Arrouays, D., 2010, Spatial distribution of soil organic carbon stocks in France: Discussion paper: Biogeosciences Discussions. Maynard, J. J., and Levi, M. R., 2017, Hyper-temporal remote sensing for digital soil mapping: Characterizing soil-vegetation response to climatic variability: Geoderma, v. 285, p. 94-109. McBratney, A. B., Santos, M. M., and Minasny, B., 2003, On digital soil mapping: Geoderma, v. 117, no. 1-2, p. 3-52. McRoberts, R. E., 2012, Estimating forest attribute parameters for small areas using nearest neighbors techniques: Forest Ecology and Management, v. 272, p. 3-12. Meier, M., Souza, E. d., Francelino, M. R., Fernandes Filho, E. I., and Schaefer, C. E. G. R., 2018, Digital Soil Mapping Using Machine Learning Algorithms in a Tropical Mountainous Area: Revista Brasileira de Ciência do Solo, v. 42. Mendoza Vega, J., 2002, Influences of land use/land cover and soil type on amounts of soil organic carbon and soil characteristics, v. CH/631.47097275 M4. Minasny, B., McBratney, A. B., Malone, B. P., and Wheeler, I., 2013, Digital mapping of soil carbon, Advances in Agronomy, Volume 118, Elsevier, p. 1-47. Minasny, B., McBratney, A. B., and Salvador-Blanes, S., 2008, Quantitative models for pedogenesis—A review: Geoderma, v. 144, no. 1-2, p. 140-157. Mozafari, G.-A., and Safarpour, F., 2013, Ecological model of zoning of pastures in Kurdistan province with emphasis on climatic elements of temperature and precipitation: Geography and environmental sustainability, v. 6. Mulder, V., De Bruin, S., Schaepman, M., and Mayr, T., 2011, The use of remote sensing in soil and terrain mapping—A review: Geoderma, v. 162, no. 1-2, p. 1-19. Nabiollahi, K., Eskandari, S., Taghizadeh-Mehrjardi, R., Kerry, R., and Triantafalis, J., 2019, Assessing soil organic carbon stocks under land-use change scenarios using random forest models: Carbon Management, p. 1-15. Nawar, S., and Mouazen, A., 2017, Comparison between random forests, artificial neural networks and gradient boosted machines methods of on-line Vis-NIR spectroscopy measurements of soil total nitrogen and total carbon: Sensors, v. 17, no. 10, p. 2428. Nelson, D., and Sommers, L. E., 1982, Total carbon, organic carbon, and organic matter 1: Methods of soil analysis. Part 2. Chemical and microbiological properties, no. methodsofsoilan2, p. 539-579. Olson, R. S., La Cava, W., Mustahsan, Z., Varik, A., and Moore, J. H., 2017, Data-driven advice for applying machine learning to bioinformatics problems: arXiv preprint arXiv:1708.05070. Ottoy, S., De Vos, B., Sindayihebura, A., Hermy, M., and Van Orshoven, J., 2017, Assessing soil organic carbon stocks under current and potential forest cover using digital soil mapping and spatial generalisation: Ecological indicators, v. 77, p. 139-150. Padarian, J., Minasny, B., and McBratney, A. B., 2019, Machine learning and soil sciences: A review aided by machine learning tools: SOIL Discuss, v. 2019, p. 1-29.

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

Ramcharan, A., Hengl, T., Nauman, T., Brungard, C., Waltman, S., Wills, S., and Thompson, J., 2018, Soil property and class maps of the conterminous United States at 100-meter spatial resolution: Soil Science Society of America Journal, v. 82, no. 1, p. 186-201. Reza Pahlavan Rad, M., Toomanian, N., Khormali, F., Brungard, C. W., Bayram Komaki, C., and Bogaert, P., 2014, Updating soil survey maps using random forest and conditioned Latin hypercube sampling in the loess derived soils of northern Iran: Geoderma, v. 232, no. 97, p. 232. Roecker, S., Howell, D., Haydu-Houdeshell, C., and Blinn, C., 2010, A qualitative comparison of conventional soil survey and digital soil mapping approaches, Digital Soil Mapping, Springer, p. 369-384. Sanchez, P. A., Ahamed, S., Carré, F., Hartemink, A. E., Hempel, J., Huising, J., Lagacherie, P., McBratney, A. B., McKenzie, N. J., and de Lourdes Mendonça-Santos, M., 2009, Digital soil map of the world: Science, v. 325, no. 5941, p. 680-681. Schillaci, C., Acutis, M., Lombardo, L., Lipani, A., Fantappie, M., Märker, M., and Saia, S., 2017, Spatio-temporal topsoil organic carbon mapping of a semi-arid Mediterranean region: The role of land use, soil texture, topographic indices and the influence of remote sensing data to modelling: Science of the total environment, v. 601, p. 821-832. Schillaci, C., Acutis, M., Vesely, F., and Saia, S., 2019, A simple pipeline for the assessment of legacy soil datasets: An example and test with soil organic carbon from a highly variable area: Catena, v. 175, p. 110-122. Schwanghart, W., and Jarmer, T., 2011, Linking spatial patterns of soil organic carbon to topography—A case study from south-eastern Spain: Geomorphology, v. 126, no. 1-2, p. 252-263. Shelukindo, H. B., Semu, E., Singh, B., and Munishi, P., 2014, Predictor variables for soil organic carbon contents in the Miombo woodlands ecosystem of Kitonga forest reserve, Tanzania. Smith, P., 2008, Land use change and soil organic carbon dynamics: Nutrient Cycling in Agroecosystems, v. 81, no. 2, p. 169-178. Somarathna, P., Malone, B., and Minasny, B., 2016, Mapping soil organic carbon content over New South Wales, Australia using local regression kriging: Geoderma regional, v. 7, no. 1, p. 3848. Taghizadeh-Mehrjardi, R., Minasny, B., Sarmadian, F., and Malone, B., 2014, Digital mapping of soil salinity in Ardakan region, central Iran: Geoderma, v. 213, p. 15-28. Taghizadeh-Mehrjardi, R., Neupane, R., Sood, K., and Kumar, S., 2017, Artificial bee colony feature selection algorithm combined with machine learning algorithms to predict vertical and lateral distribution of soil organic matter in South Dakota, USA: Carbon Management, v. 8, no. 3, p. 277-291. Taghizadeh‐mehrjardi, R., Toomanian, N., Khavaninzadeh, A., Jafari, A., and Triantafilis, J., 2016, Predicting and mapping of soil particle‐size fractions with adaptive neuro‐fuzzy inference and ant colony optimization in central I ran: European Journal of Soil Science, v. 67, no. 6, p. 707725. Tan, Z., Lal, R., Smeck, N., and Calhoun, F., 2004, Relationships between surface soil organic carbon pool and site variables: Geoderma, v. 121, no. 3-4, p. 187-195. Tang, J., Cheng, H., and Fang, C., 2017, The temperature sensitivity of soil organic carbon decomposition is not related to labile and recalcitrant carbon: PloS one, v. 12, no. 11, p. e0186675. Thompson, J. A., and Kolka, R. K., 2005, Soil carbon storage estimation in a forested watershed using quantitative soil-landscape modeling: Soil Science Society of America Journal, v. 69, no. 4, p. 1086-1093. Vågen, T.-G., and Winowiecki, L. A., 2013, Mapping of soil organic carbon stocks for spatially explicit assessments of climate change mitigation potential: Environmental Research Letters, v. 8, no. 1, p. 015011. Vågen, T.-G., Winowiecki, L. A., Abegaz, A., and Hadgu, K. M., 2013, Landsat-based approaches for mapping of land degradation prevalence and soil functional properties in Ethiopia: Remote Sensing of Environment, v. 134, p. 266-275. Vasques, G., Grunwald, S., Comerford, N., and Sickman, J., 2010, Regional modelling of soil carbon at multiple depths within a subtropical watershed: Geoderma, v. 156, no. 3-4, p. 326-336.

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

Viscarra Rossel, R. A., and Behrens, T., 2010, Using data mining to model and interpret soil diffuse reflectance spectra: Geoderma, v. 158, no. 1-2, p. 46-54. Viscarra Rossel, R. A., Webster, R., Bui, E. N., and Baldock, J. A., 2014, Baseline map of organic carbon in Australian soil to support national carbon accounting and monitoring under climate change: Global Change Biology, v. 20, no. 9, p. 2953-2970. Walkley, A., and Black, I. A., 1934, An examination of the Degtjareff method for determining soil organic matter, and a proposed modification of the chromic acid titration method: Soil science, v. 37, no. 1, p. 29-38. Wang, B., Waters, C., Orgill, S., Cowie, A., Clark, A., Li Liu, D., Simpson, M., McGowen, I., and Sides, T., 2018, Estimating soil organic carbon stocks using different modelling techniques in the semi-arid rangelands of eastern Australia: Ecological indicators, v. 88, p. 425-438. Were, K., Bui, D. T., Dick, Ø. B., and Singh, B. R., 2015, A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape: Ecological Indicators, v. 52, p. 394-403. Wiesmeier, M., Spörlein, P., Geuß, U., Hangen, E., Haug, S., Reischl, A., Schilling, B., von Lützow, M., and Kögel‐Knabner, I., 2012, Soil organic carbon stocks in southeast Germany (Bavaria) as affected by land use, soil type and sampling depth: Global Change Biology, v. 18, no. 7, p. 2233-2245. Wilding, L., 1985. Spatial variability: its documentation, accommodation and implication to soil surveys, in Proceedings Soil spatial variability. Workshop, p. 166-194. Yang, R.-M., Zhang, G.-L., Liu, F., Lu, Y.-Y., Yang, F., Yang, F., Yang, M., Zhao, Y.-G., and Li, D.-C., 2016, Comparison of boosted regression tree and random forest models for mapping topsoil organic carbon concentration in an alpine ecosystem: Ecological Indicators, v. 60, p. 870-878. Yang, R., Rossiter, D. G., Liu, F., Lu, Y., Yang, F., Yang, F., Zhao, Y., Li, D., and Zhang, G., 2015, Predictive mapping of topsoil organic carbon in an alpine environment aided by Landsat TM: PloS one, v. 10, no. 10, p. e0139042. Yigini, Y., and Panagos, P., 2016, Assessment of soil organic carbon stocks under future climate and land cover changes in Europe: Science of the Total Environment, v. 557, p. 838-850. Yimer, F., Ledin, S., and Abdelkadir, A., 2006, Soil property variations in relation to topographic aspect and vegetation community in the south-eastern highlands of Ethiopia: Forest Ecology and Management, v. 232, no. 1-3, p. 90-99. Yun-Qiang, W., Zhang, X.-C., Zhang, J.-L., and Shun-Ji, L., 2009, Spatial variability of soil organic carbon in a watershed on the Loess Plateau: Pedosphere, v. 19, no. 4, p. 486-495. Zeraatpisheh, M., Ayoubi, S., Jafari, A., Tajik, S., and Finke, P., 2019, Digital mapping of soil properties using multiple machine learning in a semi-arid region, central Iran: Geoderma, v. 338, p. 445-452. Zhao, H., and Chen, X., Use of normalized difference bareness index in quickly mapping bare areas from TM/ETM+, in Proceedings International geoscience and remote sensing symposium2005, Volume 3, p. 1666.

Jo

ur

na

lP

re

-p

ro

of

Journal Pre-proof

Figure 1. Location of Kurdistan province (study area) in Iran (a) and spatial distribution of soil

samples

(b).

Journal Pre-proof

re

-p

ro

of

(a)

Jo

ur

na

lP

(b)

Figure 2. A digital elevation model (a) and Land use map of the study area (b)

lP

re

-p

ro

of

Journal Pre-proof

Jo

ur

na

Figure 3. Histogram of SOC (%)

re

-p

ro

of

Journal Pre-proof

Figure 4. Relative importance of environmental covariates (%) using the best prediction

Jo

ur

na

lP

model (RF)

Jo

ur

na

lP

re

-p

ro

of

Journal Pre-proof

Figure 5. Error map of predicted SOC (%) using RF model

na

lP

re

-p

ro

of

Journal Pre-proof

Jo

ur

Figure 6. Spatial distribution of SOC (%) using the RF model

na

lP

re

-p

ro

of

Journal Pre-proof

Jo

ur

Figure 7. Spatial distribution of total SOC stocks (g m-2 ) using RF model.

re

-p

ro

of

Journal Pre-proof

Jo

ur

na

lP

Figure 8. SOC (%) in different land uses

Journal Pre-proof Table 1. Descriptive statistics of SOC (%) in the study area. No

Min

Me

Max

Med

SD

V

SE

CV

Quartile 25%

Quartile 75%

Kurtosis

Skewness

0.54

0.29

0.02

51.04

0.7

1.28

4.07

1.56

% 865

0.04

1.06

4.44

0.95

Jo

ur

na

lP

re

-p

ro

of

Notes: No= Number of samples; Min= Minimum; Me= Mean; Max= Maximum; Med= Median; SD=Standard deviation; V= Variance; SE= Standard error; CV= Coefficient of variation.

Journal Pre-proof Table 2. Pearson correlation coefficients between SOC and environmental covariates. Parameter

SOC

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

SOC

1.00

0.43

0.42

0.41

0.37

0.35

0.33

0.32

0.30

0.28

0.10

-0.12

-0.18

-0.18

-0.18

-0.18

-0.20

-0.22

Catchment Slope

0.43

1.00

0.89

0.74

0.66

0.56

0.80

0.81

0.70

0.45

0.35

-0.34

-0.27

-0.28

-0.41

-0.34

-0.21

Protection Index

0.42

0.89

1.00

0.71

0.74

0.47

0.82

0.77

0.67

0.52

0.27

-0.22

-0.30

-0.31

-0.41

-0.37

Valley Depth

0.41

0.74

0.71

1.00

0.54

0.59

0.51

0.48

0.46

0.36

0.43

-0.56

-0.22

-0.34

T errain surface texture

0.37

0.66

0.74

0.54

1.00

0.38

0.60

0.61

0.33

0.61

0.11

-0.09

-0.32

-0.31

Rainfall

0.35

0.56

0.47

0.59

0.38

1.00

0.39

0.41

0.33

0.31

0.30

-0.51

-0.22

LS Factor

0.33

0.80

0.82

0.51

0.60

0.39

1.00

0.91

0.66

0.44

0.24

-0.12

f o

-0.24

Slope

0.32

0.81

0.77

0.48

0.61

0.41

0.91

1.00

0.68

0.38

0.22

19

X

-0.25

-0.29

-0.47

-0.48

-0.52

-0.57

-0.61

-0.27

-0.70

-0.51

-0.54

-0.56

-0.31

-0.25

-0.39

-0.35

-0.33

-0.63

-0.29

-0.38

-0.34

-0.44

-0.67

-0.73

-0.58

-0.21

-0.16

-0.21

-0.18

-0.20

-0.26

-0.32

-0.68

-0.27

-0.28

-0.31

-0.34

-0.19

-0.49

-0.52

-0.59

-0.43

-0.10

-0.25

-0.25

-0.29

-0.31

-0.18

-0.31

-0.59

-0.67

-0.47

Melton Ruggedness Number

0.30

0.70

0.67

0.46

0.33

0.33

0.66

0.68

1.00

0.20

-0.13

-0.18

-0.17

-0.30

-0.18

-0.13

-0.44

-0.30

-0.33

-0.37

Vector T errain Ruggedness (VRM)

0.28

0.45

0.52

0.36

0.61

0.31

0.44

0.38

0.16

0.21

-0.04

-0.22

-0.22

-0.14

-0.28

-0.19

-0.28

-0.35

-0.44

-0.33

Soil temp (20 cm)

0.10

0.35

0.27

0.43

0.11

0.30

0.24

0.22

0.20

0.21

1.00

-0.24

-0.11

-0.12

-0.05

-0.17

0.16

-0.06

-0.03

-0.06

-0.15

Dem

-0.12

-0.34

-0.22

-0.56

-0.09

-0.51

-0.12

-0.10

-0.13

-0.04

-0.24

1.00

-0.01

0.02

0.19

0.09

0.16

0.12

-0.01

-0.01

0.43

Band 1

-0.18

-0.27

-0.30

-0.22

-0.32

-0.22

-0.27

-0.25

-0.18

-0.22

-0.11

-0.01

1.00

0.95

0.10

0.85

0.13

0.16

0.25

0.28

0.29

Band 2

-0.18

-0.28

-0.31

-0.24

-0.31

-0.21

MRRT F

-0.18

-0.41

-0.41

-0.34

-0.29

Brightness

-0.18

-0.34

-0.37

-0.31

-0.38

Flow Path Length (downstream)

-0.20

-0.21

-0.27

-0.25

-0.34

rn

-0.28

-0.25

-0.17

-0.22

-0.12

0.02

0.95

1.00

0.13

0.93

0.13

0.17

0.23

0.27

0.27

-0.31

-0.29

-0.30

-0.14

-0.05

0.19

0.10

0.13

1.00

0.15

0.10

0.31

0.09

0.17

0.23

-0.21

-0.34

-0.31

-0.18

-0.28

-0.17

0.09

0.85

0.93

0.15

1.00

0.14

0.18

0.29

0.34

0.31

-0.18

-0.19

-0.18

-0.13

-0.19

0.16

0.16

0.13

0.13

0.10

0.14

1.00

0.18

0.23

0.25

0.29

Downslope Curvature

-0.22

-0.48

-0.70

-0.39

-0.44

-0.20

-0.49

-0.31

-0.44

-0.28

-0.06

0.12

0.16

0.17

0.31

0.18

0.18

1.00

0.17

0.12

0.28

MRVBF

-0.25

-0.52

-0.51

-0.35

-0.67

-0.26

-0.52

-0.59

-0.30

-0.35

-0.03

-0.01

0.25

0.23

0.09

0.29

0.23

0.17

1.00

0.82

0.43

T opographic Wetness Index

-0.29

-0.57

-0.54

-0.33

-0.73

-0.32

-0.59

-0.67

-0.33

-0.44

-0.06

-0.01

0.28

0.27

0.17

0.34

0.25

0.12

0.82

1.00

0.49

u o

J

-0.16

l a

r P

e

o r p

0.16 1.00

18

Journal Pre-proof Table 3. Performance of five different modeling methods to predict SOC R2 0.50 0.60 0.58 0.57 0.58

RMSE (%) 0.42 0.35 0.36 0.37 0.36

MAE (%) 0.35 0.30 0.30 0.31 0.30

Jo

ur

na

lP

re

-p

ro

of

Model kNN RF Cu XGBoost SVR

Journal Pre-proof Table 4. Statistical description of SOC and SOCS in each land use type.

of ro -p re lP na ur

Forest Orchard Grassland Irrigated farming Dry farming Bareland

Min 0.64 0.45 0.53 0.42 0.41 0.66

SOCS (g m-2 ) Max Mean 1455.76 651.83 1123.00 554.31 1123.13 544.53 1019.18 499.00 1254.65 486.91 718.90 437.16

SOC (%) Max Mean Stdev Min 3.27 1.49 0.17 400.80 2.65 1.27 0.35 243.81 2.62 1.23 0.26 248.45 2.67 1.12 0.31 192.99 2.81 1.09 0.28 186.66 1.66 0.95 0.24 321.30

Jo

Land use

Stdev 76.434 139.90 106.08 134.63 116.69 104.84

total SOCS Tg 1773.79 356.46 5991.61 782.38 6297.91 6.35

Journal Pre-proof Appendix 1. Remotely sensed attributes used in the prediction of SOC and SOCS in the study area. Reference Taghizade et al (2016a) Taghizade et al (2016a) Taghizade et al (2016a) Taghizade et al (2016a) Taghizade et al (2016a) Taghizade et al (2016a) Taghizade et al (2016a) Taghizade et al (2016a) Welikhe et al (2017) P ettorelli et al (2005) P ettorelli et al (2005) Arzani and King (2008) P ettorelli et al (2005) Ghaemi et al (2013) Arzani and King (2008) Arzani and King (2008) Arzani and King (2008) Arzani and King (2008) Ghaemi et al (2013) Ghaemi et al (2013) Kure ar al (2015) Foody et al (2001) Foody et al (2001) Foody et al (2001) Taghizade et al (2016a)

(Band4- Band2/ Band4+ Band2)

Gitelson et al (1996)

(Band4/ Band3+ Band4) (Band4- Band5/ Band4+ Band5)

Crippen (1990) Sriwongsitanon et al (2016)

re

-p

ro

of

Formula Reflectance value of band 1 (blue visible) (band width=0.45 –0.515 μm) Reflectance value of band 2 (green visible) (band width=0.525 –0.605 μm) Reflectance value of band 3 (red visible) (band width=0.63 –0.69 μm) Reflectance value of band 4 (near infrared) (band wi dth=0.75–0.90 μm) Reflectance value of band 5 (shortwave IR-1) (band width=1.55–1.75 μm) Reflectance value of band 6 (thermal band) (band width=10.4 –12.5 μm) Reflectance value of band 7 (shortwave IR-2) (band width=2.09–2.35 μm) (Band4-Band3/Band4+Band3) (Band5/Band4) (Band5/Band7) (Band5- Band3/ Band5+ Band3) (Band4/( Band3+ Band5)) (Band4/ Band3) (Band7- Band5/ Band7+ Band5) (Band3- Band1) (Band3- Band2) ((Band3- Band2)/( Band3+ Band2)) ((Band3- Band1)/(Band3+ Band1)) ((Band7- Band3/(Band7+ Band3)) (Band4- Band3) (Band3- Band4/ Band3+ Band4) (Band4/ Band1+ Band2) (Band1* Band2)/ Band3 ((Band4-( Band1+ Band2)/( Band4+( Band1+ Band2))) (Band4/ Band3)

Lymburner et al (2000)

(Band5- Band2) (Band5- Band2/ Band5+ Band2) (Band4- Band3)/((Band4+ Band3)^1/2) (0.29 Band2-0.56 Band3+0.6 Band5+0.49 Band2) (2.5*( Band4- Band3)/( Band4+6* Band3-7.7* Band1+1) ((Band1^2+ Band2^2+ Band3^2)/3)^0.5 (Band3- Band1/ Band3+ Band1) (2*( Band3- Band2- Band1))/( Band2- Band1) (Band3- Band2/ Band3+ Band2) (Band3^2/ Band1* Band2) (Band5/ Band7) (Band4- Band5/ Band4+ Band5) (Band5/ Band7) (Band3/ Band1) (Band5/ Band4) (Band5- Band6/ Band3+ Band6) (Band5- Band7/ Band5+ Band7 (Band7- Band6/ Band7+ Band6) (Band5- Band6/ Band5+ Band6) (Band3- Band6/ Band3+ Band6) ((Band4-(1.2* Band3))/ Band4+ Band3

Vescovo and Gianelle (2008) Vescovo and Gianelle (2008) Roujean and Breon (1995) Ghaemi et al (2013) Wang et al (2018) Escadafal et al (1994) Escadafal and Huete (1991) Escadafal and Huete (1991) Madeira at al (1997) Madeira at al (1997) Goodwin et al (2008) Goodwin et al (2008) Dogan (2009) Dogan (2009) Dogan (2009)

(Band4- Band3/ Band4+ Band3+0.5)*(1+0.5))

Taghizade et al (2016a)

(Band4- Band3/ Band4+ Band3+0.16)

Nikolakopoulos )2003)

(Band4- Band5/ Band4+ Band5) (0.3037* Band1)+(0.2793* Band2)+(0.4343* Band3)+ (0.5585* Band4)+(0.5082* Band5)+(0.1863* Band7) (-0.2848* Band1)+(-0.2435* Band2)+ (-0.5436* Band3)+(0.7243* Band4)+(0.084* Band5)+(-0.18* Band7) (-0.1509* Band1)+(-0.1793* Band2)+(-0.3299* Band3)+ (0.3406* Band4)+(-0.7112* Band5)+(-0.4572* Band7)

Wang et al (2018)

na

lP

(Band4/ Band3+ Band5)

Jo

ur

P arameters Band 1 Band 2 Band 3 Band 4 Band 5 Band 6 Band 7 Normalized Difference Vegetation Index (NDVI) Moisture Stress Index (MSI) Leaf Water Content (LWC) Transformed Vegetation Index (TVI) reflectance absorption index (RAI) Near Infrared Ratio (NIR) Vegetation Index (VI) P D311 P D321 P D322 P D312 MIRV1 Difference Vegetation Index (DVI) Normalized Difference Snow Index (NDSI) Ratio-based (RB) Stress Related (SR) Normalized based (NB) Ratio Vegetation Index (RVI) Green Normalized Difference Vegetation Index (GNDVI) Infrared P ercentage Vegetation Index (IP VI) Normalized Difference Infrared Index (NDII) Specific Leaf Area Vegetation Index (SLAVI) Canopy Index (CI) Normalized Canopy Index (NCI ) Renormalized difference vegetation index (RDVI ) Green vegetation index (GVI) Enhanced vegetation index (EVI) Brightness Index (BI) Saturation Index (SI) Hue Index (HI) Coloration Index (CoI) Redness Index (RE) Moisture Index (MI) Normalized Difference Water Index (NDWI) Clay Minerals (CM) Iron Oxide (IO) Ferrous Minerals (FM) Chemical soil composition (CSC) salinity index (S-I) Normalized Difference Bareness Index (NDBA11) Normalized Difference Bareness Index (NDBA12) Normalized Difference Bareness Index (NDBA13) Modified Normalized Difference (MND) Soil Adjusted Vegetation Index (SAVI) Optimized Soil Adjusted Vegetation Index (OSAVI) Normalized difference moisture index (NDMI) Tasseled cap coefficient (brightness) Tasseled cap coefficient (greenness) Tasseled cap coefficient (wetness)

Taghizade et al (2016a) Zhao and Chen (2005) Zhao and Chen (2005) Zhao and Chen (2005) P ettorelli et al (2005)

Huang et al. (2002) Huang et al. (2002) Huang et al. (2002)

Journal Pre-proof Appendix 2. Terrain attributes used in the prediction of SOC and SOCS in the study area

T opographic position index Normalized height Standardized height Mid-slope position Landform

T errain Surface Convexity T errain Surface T exture Local curvature Upslope curvature Local Upslope curvature Downslope curvature Local Downslope curvature Wind effect

T aghizade et al (2016a)

of

Bonfatti et al (2016) Bonfatti et al (2016) Bonfatti et al (2016) Guo et al (2019) Guo et al (2019) T aghizade et al (2016a) Wilson and Gallant (2000) T aghizade et al (2016d)

T aghizade et al (2016a)

Balance between soil mass deposited and eroded Measures terrain ruggedness Angle between the surface and the incoming light beams Calculates the vertical distance to a channel network base level

Bonfatti et Bonfatti et Bonfatti et Bonfatti et

Simple flow accumulation related index, calculated as difference between maximum and minimum elevation in catchment area divided by square root of catchment area size. T he calculation is performed for each grid cell, therefore minimum elevation is same as elevation at cell's position

Melton (1965)

al (2016) al (2016) al (2016) al (2016)

ur Jo

Flow path lenght Upslope length factor Effective flow length Sediment balance Curvature classification Slope height Morphometric features Protection index

Identifies valley bottoms based on their topographic signature as flat low-lying areas, Measure of flatness and up-ness Measure of flatness and low-ness, Identifies high flat areas

T aghizade et al (2016a)

na

Catchment Area Catchment Slope Modified Catchment Area Slope length Multi-resolution valley bottom flatness index Multi-resolution ridge-top flatness index Mass Balance Index Vector T errain Ruggedness Analytical Hillshading Vertical Distance to Channel Network Maximum height Melton ruggedness number

Guo et al (2019)

ro

Relative Slope Position

Represents the agreement of the aspect direction of surrounding cells with the theoretical matrix direction. It is calculated as the mean aspect difference between the real aspect and the theoretical maximum divergent direction matrix which is the ideal peak minus 90°. Calculates accumulated flow as the accumulated weight of all cells flowing into each downslope cell in the output raster T opographic wetness index Slope length factor Relative position of the valley Defined as the elevation difference between the cell and the closest channel network defined as the position of one point relative to the ridge and valley of a slope, with a value of 0 for the bottom of the valley and 1 for the top of the ridge Calculated flow accumulation and related parameters Average slope over the catchment Calculated the flow accumulation

T aghizade et al (2016a) T aghizade et al (2016a) Schillaci et al (2017)

-p

T opographicWetness Index Ls Factor Valley Depth Chanell network base level

Reference

re

Flow Accumulation

Data descriptions Elevation above mean sea level Average gradient above flow path Compass direction of the maximum rate of change

lP

Variables DEM- 30 m Slope Aspect Cross-Sectional Curvature Longitudainal Curvature Convergence Index

T his algorithm analyses the immediate surroundings of each cell up to an given distance and evaluates how the relief protects it T his implementation calculates the TPI for different scales and integrates these into one single grid Normalized height of elevation is defined by slope height and valley depth Standard height is the product of normalized height multiplied by absolute height

Guisan et al (1999) Guo et al (2019) Guo et al (2019)

T he obtained landform classification map successfully shows the distribution of plains, peaks and steep slopes. Attributes used for the classification were elevation, slope, plan curvature, profile curvature, tangential curvature and both maximal and minimal curvature. T he convexity is calculated as the ratio of the number of cells having positive curvature to the number of all valid cells within the specified search radius. T errain surface texture calculates the nested-means terrain classification Calculates the local curvature of a cell as sum of the gradients to its neighbour cells. Upslope curvature is the distance weighted average local curvature in a cell's upslope contributing area based on multiple flow direction after Freeman 1994.

Guisan et al (1999)

T he Wind Effect is a dimensionless index. Values below 1 indicate wind shadowed areas whereas values above 1 indicate areas exposed to wind

Boehner and Antonic (2009)

Iwahashi and Pike (2007) Freeman (1991)

Journal Pre-proof

Appendix 3. Climate attributes used in the prediction of SOC and SOCS in the study area. Description Mean annual temperature Mean annual soil temperature in 20-cm Mean annual rainfall

Reference Wang et al (2018) Wang et al (2018)

Jo

ur

na

lP

re

-p

ro

of

P arameters Air temperature (AT) Soil temperature (ST) Rainfall

Journal Pre-proof

Highlights

ur

na

lP

re

-p

ro

of

Spatial distribution of SOC in western Iran at a fine resolution was predicted. Different machine learning algorithms were compared. SOC stocks was assessed in different land uses.

Jo

  