Journal Pre-proof Modeling arsenic content in Brazilian soils: What is relevant?
Michele Duarte de Menezes, Fábio Henrique Alves Bispo, Wilson Missina Faria, Mariana Gabriele Marcolino Gonçalves, Nilton Curi, Luiz Roberto Guimarães Guilherme PII:
S0048-9697(20)30020-6
DOI:
https://doi.org/10.1016/j.scitotenv.2020.136511
Reference:
STOTEN 136511
To appear in:
Science of the Total Environment
Received date:
24 April 2019
Revised date:
30 December 2019
Accepted date:
2 January 2020
Please cite this article as: M.D. de Menezes, F.H.A. Bispo, W.M. Faria, et al., Modeling arsenic content in Brazilian soils: What is relevant?, Science of the Total Environment (2020), https://doi.org/10.1016/j.scitotenv.2020.136511
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2020 Published by Elsevier.
Journal Pre-proof Modeling arsenic content in Brazilian soils: what is relevant? Michele Duarte de Menezesa, Fábio Henrique Alves Bispoa, Wilson Missina Fariaa, Mariana Gabriele Marcolino Gonçalvesa, Nilton Curia, Luiz Roberto Guimarães Guilhermea* a
Department of Soil Science, Federal University of Lavras, Lavras - MG, Brazil. * Corresponding author. Tel.: +55 35 3829 1259 Email address:
[email protected] (Luiz R. G. Guilherme).
Jo ur
na
lP
re
-p
ro
of
Abstract Arsenic accumulation in the environment poses ecological and human health risks. A greater knowledge about soil total As content variability and its main drivers is strategic for maintaining soil security, helping public policies and environmental surveys. Considering the poor history of As studies in Brazil at the country‟s geographical scale, this work aimed to generate predictive models of topsoil As content using machine learning (ML) algorithms based on several environmental covariables representing soil forming factors, ranking their importance as explanatory covariables and for feeding group analysis. An unprecedented databank based on laboratory analyses (including rare earth elements), proximal and remote sensing, geographical information system operations, and pedological information were surveyed. The median soil As content ranged from 0.14 to 41.1 mg kg-1 in reference soils, and 0.28 to 58.3 mg kg-1 in agricultural soils. Recursive Feature Elimination Random Forest outperformed other ML algorithms, ranking as most important environmental covariables: temperature, soil organic carbon (SOC), clay, sand, and TiO2. Four natural groups were statistically suggested (As content ± standard error in mg kg-1): G1) with coarser texture, lower SOC, higher temperatures, and the lowest TiO2 contents, has the lowest As content (2.24±0.50), accomplishing different environmental conditions; G2) organic soils located in floodplains, medium TiO2 and temperature, whose As content (3.78±2.05) is slightly higher than G1, but lower than G3 and G4; G3) medium contents of As (7.14±1.30), texture, SOC, TiO2, and temperature, representing the largest number of points widespread throughout Brazil; G4) the largest contents of As (11.97±1.62), SOC, and TiO2, and the lowest sand content, with points located mainly across Southeastern Brazil with milder temperature. In the absence of soil As content, a common scenario in Abbreviations: As – Arsenic; ML – machine learning; ICP-MS - Inductively coupled plasma mass spectrometry; SOC – soil organic carbon; GPS – global positioning system; GIS – geographic information system; pXRF – portable X-ray fluorescence; REE – rare earth elements; MARS – multivariate adaptive regression splines; XGBoost – extreme gradient boosted tree; RFE – recursive feature elimination; DEM – digital elevation model; NIST – national institute of standards and technology; DL – detection limit; QL – quantification limit; EF – enrichment factor; SMP – Shoemaker-McLeanPratt buffer solution; CEC – cation exchange capacity; BS – base saturation; LREE – light rare earth elements; HREE – heavy rare earth elements; DNIT – national department of transport infrastructure; SRTM – shuttle radar topographic mission; WRB/FAO – world reference base; GLM – generalized linear model; CART – classification and regression tree; RPART – recursive partitioning and regression trees; RMSE – root mean square of error; MAE – mean absolute error.
Journal Pre-proof
2
Brazil and in many Latin American countries, such natural groups could work as environmental indicators. Keywords: cluster analysis; environmental covariates; grouping analysis; machine learning; variable importance. 1. Introduction Arsenic is an element of ever-increasing environmental concern worldwide. The anthropogenic influenced accumulation of As –in soils, ground water, waterway
of
sediments, and drinking water– through diverse pathways threatens plants, wildlife, and
ro
human beings (Figueiredo et al., 2007; Goh and Lim, 2004; Venteris et al., 2014). The
-p
global concern surrounding soil security and its conservation, as means to improve food and fiber production for an ever-growing population, together with the need to keep
re
ecosystems safe (McBratney et al., 2014), requires us to investigate As content
lP
variability and its main drivers in Brazilian soils. This may contribute and shed light for policy makers to better address the environmental challenges As accumulation poses on
na
soil and water conservation.
Jo ur
Brazil has a poor record of As geochemistry research (Figueiredo et al., 2007). Even areas severely affected by As pollution, lack of reliable information, i.e. scientific publications, or recent maps (Mandal and Suzuki, 2002; Smedley and Kinniburgh, 2002). This work seeks to fill this gap by publishing a novel As databank; we included a wide diversity of environments within Brazil –a country with continental dimension–, comprising a large variability of climate (as surrogate cl) (Alvares et al., 2013), parent materials (p) that are influenced by a huge geologic background and presence of unconsolidated sediments (Alkmim, 2015), as well as relief or landform variations (r) (Vieira et al., 2015). Such soil forming factors affect also the occurrence and abundance of living organisms (o) that make up the natural boundaries of Brazilian biomes. The pedogenesis processes are linked to time (t), which in turn, depends basically of erosion
Journal Pre-proof
3
rates in tropical environment (Resende et al., 2014). As long as such environmental characteristics govern soil properties, the majority of As variability can be predicted by means of those soil-forming factors or state factors through a well-known equation of Jenny (1941): s = f (cl, o, r, p, t). More recently, driven by the increasing use of computational processing power, the advent of GPS, remote and proximal sensors, and increasing „socialization‟ of GIS, an expanded version of Jenny‟s equation, was proposed by McBratney et al. (2003): S =
of
f (s,c,o,r,p,a,n) + ε. This version accounts for an error component (ε), geographical
ro
location of the area or point of interest (n), and inherent soil information (s) as a factor
-p
to differentiate soils. With that, the scorpan environmental covariables are integrated to
re
build predictive models –using robust mathematical and statistical methods– for a more quantitative approach (Rossiter, 2018). The use of such equation for the prediction of
lP
soil As content may provide us with a model to advance the understanding of As
na
variability in Brazilian soils, as well as to identify and rank the environmental covariables for soil As contents.
Jo ur
Considering the dynamics of As in the environment and the environmental covariates that could aid in the predictive models, the rationale for using soil (s) is related to several soil properties that may explain As variability by affecting its adsorption, such as SOM, Fe2O3, Al2O3, and clay contents (Chakraborty et al., 2017; Horta et al., 2015; Venteris et al., 2014), or by affecting leaching, such as pH (Shan et al., 2013). Furthermore, concentrations of heavy metals were found to be intercorrelated (Cai et al., 2015; Chakraborty et al., 2017; Hou et al., 2017). Under these circumstances, although pXRF spectroscopy presents some limitation for detecting trace elements, such as As (concentration <100 mg kg-1) (Parsons et al., 2013; Peinado et al., 2010), pXRF data may very well provide a great deal of multi-elemental information
Journal Pre-proof
4
(environmental covariates) (Silva et al., 2018), aiding geochemical interpretation. Rare earth elements (REE), in turn, could act as potential tracers of As due to their coherent and predictable behavior (Tang and Johannesson, 2006), along with their sensitivity to changes in pH, Eh, and adsorption/desorption reactions (Guo et al., 2010). REE contents present significant correlation mainly with organic matter, as well as with Fe and Ti oxides in Brazilian soils (de Sá Paye et al., 2016), which in turn, could correlate with As contents. Soil types have been related to soil heavy metals contents, as information
of
regarding grain size (Davis et al., 2009), SOC, and weathering-leaching stage, allow us
ro
to infer pedogenetic processes, mineralogy, and soil drainage.
-p
In acidic soils, As occurs associated with Al and Fe-arsenates (AlAsO4,
re
FeAsO4), while Ca-arsenate, Ca3(AsO4)2 is the dominant species in alkaline soils (Matschullat, 2000; Fergusson, 1990). Under aerobic conditions, As occurs mainly in
lP
the form of arsenate bound to Fe- and Mn- oxi/hydroxides (Smedley et al., 2002), clay
na
minerals, and organic compounds. Hence, the weathering of rocks converts arsenic sulfides to arsenic trioxide (As2O3), which enters the arsenic cycle as dust (wind
Jo ur
exposition), or by dissolution in rain (precipitation) (Mandal and Suzuki, 2002), justifying the use of climate in As predictive models (Mandal and Suzuki, 2002; Tóth et al., 2016). Concerning organisms and the characteristics of the soil databank of this study, different soil land use (reference and agriculture soil), and vehicle emissions (Chen et al., 2005) –basically anthropogenic activities– were considered as environmental covariates. Furthermore, As enrichment in agricultural soils are mainly ascribed to long-term application of phosphate fertilizers or other agrochemical inputs; e.g. pesticides and manures (Campos, 2002; Corguinha et al., 2015; Lado et al., 2008; Matschullat, 2000; Rodríguez et al., 2008; Zhou et al., 2018).
Journal Pre-proof
5
Highly concentrated heavy metals can be transported form one area to another by surface runoff (Herngren et al., 2005), groundwater flow (Mandal and Suzuki, 2002), or river deposition (Nriagu, 1989). Thus, arsenic exportation may also occur by soil erosion, mostly driven by relief. Parent material is one of the most important soil forming factors influencing As distribution, as it depends on geological characteristics and mineral composition (Alloway, 2013; Hou et al., 2017; Lado et al., 2008; Mandal and Suzuki, 2002; Tóth et al., 2016). Soil age (a) or time factor could be inferred by soil
of
types or properties, a relevant factor in humid tropics influenced by climate and
ro
orography. Although the soil samples were georeferenced, allowing the extraction of
-p
environmental covariables information from each point, location was not considered in
re
this study as the basis of spatial prediction, since points are far apart making unfeasible the analysis related to spatial dependence (e.g., geostatistic).
lP
Clorpt-like models bring state factors, which are observable/measurable, but
na
they were not meant to explain how a given particular condition influences soil properties (Schaetzl and Anderson, 2005). For the discovery of relationships between
Jo ur
predictor and environmental covariables, the use of an emerging computer-based bigdata analysis has emerged the machine learning (Bishop, 2006). From a set of models, it is possible to maximize the amount of useful knowledge and information extracted from the available high-dimensional databank (Kuwatani et al., 2014). One example of machine learning (ML) is the regression-based algorithm, which provides an output variable (prediction), according to the input variables (known). The better-known ML algorithms are the linear or generalized models along with stepwise. More complex ML algorithms may refer to ridge (Bruce and Bruce, 2017), or MARS regression (Friedman, 1991). Decision tree-based algorithms are more robust. Random Forest, Cubist, and the XGBoost have been applied in recent years for earth and environmental sciences
Journal Pre-proof
6
research. These last type of ML algorithms produce non-parametric models capable to generate accurate predictions based on several environmental covariates, and identify their level of importance. This has been considered an important insight on the understanding the main drivers of various soil attributes and their variability (Chackraborty et al., 2017; Gomes et al., 2019; Hengl et al., 2015). A critical review of Hou et al. (2017) pointed out that little attention has been paid to assess different types of more sophisticated or complex multivariate statistical analyses in geochemical
of
studies, which are more advantageous than univariate ones and could improve accuracy
ro
of analysis.
-p
Hence, the objective of this study is the development of predictive models for
re
topsoil As content using a novel databank from a wide range of environmental conditions in Brazil, and validating the models using laboratory-based ICP-MS
lP
analyses. To achieve our goals, environmental covariates in an adapted scorpan
na
framework were measured using traditional laboratory analyses, proximal and remote sensors; these data together with relevant pedological information was compiled by
Jo ur
GIS. The performance of different ML algorithms –GLM, Ridge regression, MARS, Bagged Tree, XGBoost, Cubist, and RFE Random Forest– for predicting topsoil As contents was evaluated. Our hypothesis is that environmental covariables together with complex multivariate analyses could accurately predict As content in soils. Although higher concentration of As is expected in soils from certain types of parent materials such as arsenopyrite and other sulfide and sulfoarsenide compounds (Tarvainen et al., 2013), other environmental covariates may also drive As contents (e.g. clay minerals, organic matter, and sesquioxides of Fe). This is because the arsenate reacts with Fe and Al oxyhydroxides to form stable complexes in the soil (Ciminelli et al., 2018).
Journal Pre-proof
7
2. Materials and methods 2.1. Databank characteristics A total of 159 topsoil (0-20 cm) sample points were collected and analyzed. Although the sampling density is low, such benchmark soils were chosen in order to cover different environmental conditions, accomplishing a wide variation of biomes, geology, soil types, and climate (Figure 1). In addition, samples were taken in cultivated areas (hereafter agriculture soils, n = 97) with application of agricultural inputs, mainly
of
phosphate/phosphogypsum fertilizers, and natural areas (hereafter reference soils, n =
ro
62), collected under native vegetation with minimum anthropogenic disturbance.
-p
Besides As content analysis, the databank was harmonized and environmental
re
covariables (auxiliary information) that could explain As variability were surveyed. Databank consists of soil laboratory analyses (fertility, SOC, texture, REE), proximal
lP
sensing analyses (pXRF), remote sensing (mean annual temperature, accumulated
na
rainfall, and DEM), pedological analysis based on in-field observations (soil drainage and local relief), and GIS analysis (digital terrain models calculated from DEM or
Jo ur
roads). Although As analysis (response variable) corresponds to the total content, the environmental covariables used (explanatory variables) are not necessarily in the same form. This is due to the rationale of this work, supported by scorpan predictive models framework (McBratney et al., 2003): as long as As variability is a multivariate phenomenon, powerful algorithms could take advantage of different amount of As variability explanation provided by environmental covariates. In addition, in this study, the most important environmental covariates were quantitatively selected in order to improve the consistency of the predictive models. In addition, the quality and accuracy of environmental covariates chemical analysis were assessed and presented in Supplementary Section. Additional details are provided next.
Journal Pre-proof
8
2.2. Arsenic analysis The soil samples were air-dried, homogenized, ground and 0.05-mm sieved prior to total digestion (Stoeppler, 1997). Total concentrations of As were performed using inductively coupled plasma mass spectrometry (Quadrupole ICP-MS Elan DRCII, Perkin Elmer Sciex; nebulizer: micro flow nebulizer PFA; nebulizer gas flow: 0.88 l.min-1; gas: Argon), Nebulizer Chamber (glass, cyclone) after multi-acid full digestion
of
(HF-HNO3-HClO4 mixture) carried out in the Analytical and Environmental
ro
Geochemistry Lab from Technische Universität Bergakademie Freiberg (Freiberg,
-p
Sachsen - Germany).
re
For the analytical procedure, 0.100 g of soil samples were added in polytetrafluoroethylene vessel plus 3.5 mL of nitric acid (HNO3 p.a. 65% bi-distilled)
lP
under room temperature. The vials were then capped and heated in a digester block,
na
using a system programmed to attain a temperature of 120°C in about 60 min and holding it constant for 5 hours. After this, 2 mL of hydrofluoric acid (HF 40%
Jo ur
suprapure) and 3 mL of perchloric acid (HClO4 70% suprapure) were also added at room temperature to the digestion vial, followed by automated heating up to a range of 150 to 160°C. The digestion residue was determined by adding 2 mL of HNO3 and 10 mL of deionized water. The vial was again heated at 120°C, with ramp time of 60 min, keeping temperature constant for 60 min. Unwanted residue was dissolved utilizing droplets of HCl and H2SO4. The solution contained in each vial was evaporated without boiling in order to eliminate HF excess in solution. Finally, the liquid extracts were transferred into volumetric flasks. The accuracy of the analytical procedure was verified using certified reference materials (NIST 2710 Montana I Soil - Highly elevated trace element concentrations
Journal Pre-proof
9
and internal BraSol-2010 project standard - tropical soil from Campinas, São Paulo state), achieved following QA/QC protocols, adding blank samples in each batch of the analytical method. Hence, the detection and quantification limits were established using 17 blank extracts for soil samples following the overall procedure. The DL was calculated according to the following formula: DL X t ( n1,1 ) .SD where X is mean of 17 blank extracts, t is Student distribution, depending on the sample size and degree
of
of confidence (t = 2.583, 1% unicaudal), and SD is the standard deviation of blank samples. In addition, the quantification limit (QL) was calculated according to the
ro
following equation: QL = X +10.SD, where X is mean of blank values and SD is
-p
standard deviation of blanks. The DL and QL found were 0.025 mg kg-1 and 0.075 mg
re
kg-1 respectively. The certified, determined and recovery rates were 1540 mg kg-1, 1321 mg kg-1 and 86% for 2710A Montana I Soil certified material, and 4.0 mg kg-1, 2.92 mg
lP
kg-1 and 73% for BraSol-2010 certified material (Recovery = (determined/certified
na
values)*100; Relative Standard Deviation for 2710A Montana I Soil (22%) and for BraSol-2010 (24%)). The aforementioned accuracy assessment indexes point out the
Jo ur
suitability of As determination.
2.3. Arsenic enrichment factor
The enrichment factor (EF) was calculated in order to assess the risk of As enrichment due to anthropogenic activities. Scandium (Sc) was used as reference, according to the following equation (Ding and Ji, 2010): EF = where As and Sc subindex sample refers to sample concentrations, meanwhile subindex crustal refers to upper continental crust concentration of these elements (Rudnick and Gao, 2014). Thus, EF<2 goes from depletion to minimal enrichment; EF in the range of
Journal Pre-proof
10
2-5 means moderate enrichment; EF in the range of 5-20 means significant enrichment; EF in the range of 20-40 means very high enrichment, and EF>40 means extremely high enrichment (Loska and Wiechula, 2003; Sutherland, 2000).
2.4. Mann-Whitney rank-sum test The Mann-Whitney rank-sum test was performed to check if there were any statistical differences between As content among geological types. This non-parametric
of
test of the null hypothesis was applied since the data is not normally distributed. This
ro
test considers that it is equally likely for a randomly selected value, from one sample, to
lP
2.5. Traditional laboratory analyses
re
-p
be less than or greater than a randomly selected value from a second sample.
Sand fraction was separated using a 0.05-mm sieve; silt and clay fractions were
na
separated from each other after sedimentation of the silt fraction, by pipetting a volume of the clay fraction suspension. This was then oven-dryed and weighted (clay fraction).
Jo ur
The silt fraction was computed as the weight difference (Santos et al., 2009). The chemical analyses included: soil pH (potentiometer, water, at 1:2.5 ratio); exchangeable Ca2+, Mg2+ and Al3+ extracted with 1 mol L-1 KCl (Mclean et al., 1958) and determined by Perkin Elmer atomic absorption spectrometer A Analyst 400; available K and P extracted with Mehlich-l solution (Mehlich, 1953), with K being determined by flame photometer and P by colorimetry; H+ + Al3+ determined by SMP extractor (Shoemaker et al., 1961); SOC was determined by wet oxidation with potassium dichromate in sulfuric acid medium (Walkley and Black, 1934); and available Fe, Cu, Zn, and Mn were extracted by Mehlich-1 solution (Mehlich, 1953), followed by quantification via
Journal Pre-proof
11
atomic absorption spectrometer. CEC and BS were calculated using the aforementioned measurements.
2.6. Rare earth element (REE) analysis The pseudo-total concentrations of lanthanides (La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, and Lu) and Y were determined by ICP-MS (Perkin Elmer NexIon 300D) followed by modified aqua regia digestion (USEPA, 2007). Soil samples
of
(fraction < 0.05 mm) and certified reference material (CRM) were digested into a
ro
programmable microwave digester CEM-Mars 6 (Mars Xpress, CEM Corporation,
-p
Matthews, NC, U.S.A), using Teflon Xpress vials. The conditions within each vial were
re
monitored and controlled by means of an infrared sensor. After acid digestion, the extracts from each soil sample and CRM were taken into centrifuge tubes and diluted
lP
with ultrapure water to 20 mL. The solutions were further diluted to obtain REE
na
contents suitable for analysis. Indium was added (1 μg g-1) to the final dilution as an internal standard (Spex Certiprep, USA) to minimize equipment fluctuations and matrix
Jo ur
effects. The determination was made by means of external calibration, and the analytical curve was constructed with 0.1, 0.5, 2, 10, and 20 g kg-1, in 2% nitric acid solution, from Spex standard solutions, containing 1000 μg kg-1 (Spex CertiPrep, USA) of individual REE: 89Y, 166
139
La,
140
Ce, 141Pr, 142Nd, 152Sm, 153Eu, 158Gd, 159Tb, 164Dy,
165
Ho,
Er, 169Tm, 174Yb, and 175Lu. The validation of the methodology was carried out by means of a CRM
(Calcareous Soil ERM-CC690®, Institute for Reference Materials and Measurements IRMM, Geel, Belgium) in quadruplicate. The QLs were obtained by analyses of pentaplicate blank samples added to the analytical procedure. The operating conditions of the ICP-MS apparatus and the measurement parameters are presented in
Journal Pre-proof
12
Supplementary Section (Table S1), as well as REE determinations of quality control samples (Table S2). Since REE values were obtained by acid digestion, these correspond to pseudo-total contents, whereas certified concentrations are “total contents”. REE recovery values are presented as additional information, as the determination of real REE contents is beyond the scope of our study. Although all REE present similar chemical characteristics, e.g. electronegativity and ionization potential, they are commonly divided into light (LREE - La, Ce, Pr, Nd,
of
Sm, and Eu) and heavy (HREE - Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu), based on their
ro
physical and chemical properties (Tyler, 2004). Thus, from the perspective of
-p
environmental covariates, each As sampling point has a correspondent concentration of
re
LREE (sum of the concentration of light REE), and HREE (sum of the concentration of
lP
heavy REE).
na
2.7. Proximal sensor: pXRF analysis
Soil samples were air-dried, homogenized, and passed through a 2-mm sieve. In
Jo ur
order to obtain the total elemental contents in soil, the samples were scanned with an S1 Titan LE (Bruker Nano Analytics, Kennewick, WA, USA) by 60 s in triplicate, in Trace (dual soil) mode. This equipment contains a Rh X-ray tube of 4 W, 15 keV in phase 1 (30 s, Ti Al filter) and − 50 keV in phase 2 (30 s, no filter)-, 5 – 100-μA, silicon drift detector with a resolution of < 145 eV (based on pXRF manufacturer calibration). The fluorescence detected was analyzed by GeoChem software. pXRF provides selective detection of various chemical elements, ranging from Mg to U. Only elements found above the limit of detection for all of the soil sampling points were used in the predictive models. The accuracy of the equipment was assessed, using a sample provided by the pXRF manufacturer (check sample), as well as two samples certified by
Journal Pre-proof
13
the National Institute of Standards and Technology (NIST): 2710A and 2711A. The recovery values are presented in the Supplementary Section (Table S3), calculated on basis of the known content of each certified reference material.
2.8. Remote sensing and GIS-based environmental covariates Annual mean temperature (oC) and accumulated rainfall (mm) were obtained from WorldClim dataset (http://worldclim.org/version2, accessed in 01/04/2019).
of
Additional details can be found in Hijmans et al. (2005). Road density was calculated
ro
by means of Kernel Density algorithm, implemented on ArcGIS 10.1 (ESRI),
-p
generating a 90 m raster grid of resolution, where the density of roads in a
re
neighborhood is provided pixel by pixel. More details about kernel function are discussed in Silverman (1986). The base information for the density calculus was the
lP
federal and state roads of DNIT (http://servicos.dnit.gov.br/vgeo/, accessed in
na
12/19/2018). Rural roads were not considered here, due to the fact that their density is much lower (Hou et al., 2017), decreasing the potential of As contamination.
Jo ur
Furthermore, rural roads are not yet georeferenced by Brazilian government agencies. The SRTM-DEM represents a terrain surface by its elevation (Miranda, 2019). Consisting of a raster grid with a horizontal resolution of 3” (~90 m near the Equator), this DEM is the finest resolution and most accurate topographic data in global scale. From DEM, other digital terrain maps were calculated using SAGA GIS version 6.4.0 (Conrad et al., 2015): slope (correspond to the change in elevation); channel network base level (gives the altitude above the channel network in the same units as the elevation data, Conrad et al., 2015); terrain surface texture (represents the spatial frequency of peaks and pits, Iwahashi and Pike, 2007); valley depth (calculated by the difference between the elevation and an interpolated ridge level); and wind exposition
Journal Pre-proof
14
(calculates the average Wind Effect Index for all directions using an angular step; values below 1 indicate wind shadowed areas whereas values above 1 indicate areas exposed to wind). The geological polygon-based map was obtained from IBGE (2005), with adaptations taking into account As prediction relevance. The biome polygon-based map was also obtained from IBGE (2005). The map of temperature according to Köppen‟s
of
has 1 hectare of resolution and was obtained from Alvares et al. (2013).
ro
2.9. Pedological information
-p
Soils were classified according to WRB/FAO (IUSS Working Group WRB,
re
2014). Soil drainage relates to the frequency and duration of wet periods. This consequently takes into account soil development conditions, e.g. soil oxidizing or
lP
reducing conditions. Regarding the drainage of soil samples, the observed drainage
na
sequence goes from somewhat excessively drained → well drained → moderately well drained → poorly drained. Following this path, reducing conditions increase, as
Jo ur
oxidizing conditions decrease. Local relief is defined by a pedologist indicating the slope of the landscape where the soil was sampled (Santos et al., 2013). The summary statistics (mean and range) of numerical environmental covariables, as well as their Scorpan model representations are presented in Table 1, and the categorical ones are presented in Table 2. Although all soil-forming factors were used and presented, multicollinearity is expected, i.e. inter-correlations among soil forming factors. Examples of such interactions in this study are: a) wind exposition was generated from DEM, consequently it relates to relief, and climate; b) some chemical elements related to soil could also act as parent material tracers or age/weathering indicators; c) soil type could indicate the age or weathering status; d) it is well known
Journal Pre-proof
15
that temperature regime exerts influence on SOC contents. This type of interactions is better addressed by multivariate statistical analysis. Additional details of some environmental covariates are provided further.
2.10. Statistical analyses 2.10.1. Predictive models
of
Arsenic (As) concentrations were modeled against all environmental covariates by means of predictive models. Since non-normal distribution is common for
ro
geochemical data, As values were log-transformed (Venteris et al., 2014), and the
-p
numerical environmental covariates were normalized by a scale function (value –
re
mean/standard deviation), in accordance with Dudka et al. (1995) and Mancini et al.
lP
(2019).
We chose statistical methods and ML algorithms capable of dealing with
na
complex and multiple explanatory covariables, as well as ranking the most important ones. Training, tuning and validation of the predictive models were performed under the
Jo ur
R software environment (R Development Core Team, 2018) using the package Caret (Classification and Regression Training) (Kuhn, 2018). The multiple linear regression was implemented by the GLM, along with stepwise for environmental covariate selection, where the best model was chosen by Akaike information criterion (Venables and Ripley, 2002). For that, glmStepAIC function was used on Caret package. GLM is a mathematical extension of linear models that do not force data into unnatural scales, and it is based on an assumed relationship between the mean of the response variables and the linear combination of explanatory variables.
Journal Pre-proof
16
Standard linear models often perform poorly in cases with large multivariate datasets. There are also several natural intercorrelations among environmental covariates that could explain As content variability; thus, multicollinearity is unavoidable. Ridge regression is considered a straightforward alternative that suppresses the aforementioned negative effects (Anderson and Basta, 2009). A model was fit containing all p predictors using a technique that constrains the coefficient estimates towards zero. More accurate regression models may be generated including all
of
possible environmental covariates that explain As variability, without misinterpreting
ro
the influence of multicollinearity. The constrains or shrinkage is based on the tuning
-p
parameter lambda, which determines the amount of shrinkage (Bruce and Bruce, 2017).
re
MARS consists of a nonparametric regression algorithm proposed by Friedman (1991), which combines classical linear regression, mathematical construction of splines
lP
and binary recursive partitioning to produce a local model. A pre-processing algorithm
na
of the explanatory variables uses the basis functions max (0,X – c) and max (0,c – X) to transform the environmental variables into a new set of variables. MARS produces basis
Jo ur
functions by examining them in a stepwise method. It is a flexible regression model proper for high dimensional data (large number of explanatory environmental covariables), dealing with linear or nonlinear relationships between response and explanatory variables since the splines are smoothly connected together. The piecewise curves (polynomials), also known as basis functions, result in a flexible model (Zhang and Go, 2016). One appropriate c value should be found to approximate any functional shape (Briand et al., 2004). Then, MARS performs successive approximations of the system using different intervals of the transformed variables ranges, by a series of linear regressions. It has more power and flexibility to model relationships that are nearly
Journal Pre-proof
17
additive or involve interactions in at most a few variables. The earth package was used for calculating MARS models. The tree-based models applied in this study are considered ensemble models, in which multiple individual models are combined in order to deliver superior prediction power. Two of the most popular techniques for constructing ensembles are: a) bagging (bootstrap aggregation): many bootstrap samples are drawn from the dataset, some prediction methods are applied to each bootstrap sample, and then the results are
of
combined, by averaging for regression and simple voting for classification to obtain the
ro
overall prediction (Breiman, 1996); or b) boosting, where a weighted average of results
-p
obtained from applying a prediction method to various samples is used (Freund and
re
Schapire, 1996), which differs from bagging that applies a simple average of results to obtain an overall prediction.
lP
Decision tree method, also known as CART, was developed by Breiman et al.
na
(1984). The R software implementation of such algorithm is called RPART (Recursive Partitioning and Regression Trees) along with treebag, since the bagging method was
Jo ur
applied (Bagged Tree). The algorithm of decision tree model works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final subspace are as homogeneous as possible. Rules are created by repeatedly splitting the predictor variables, starting with the variable that has the highest association with the response variable (As). This approach is technically called recursive partitioning. The process continues until some predetermined stopping criteria are met. The resulting tree is composed of decision nodes, branches and leaf nodes. Even though ensembles of decision trees are effective as predictive models, one limitation of this model might be its trend to overfit training datasets. Friedman (2001) proposed the XGBoost in order to overcome such limitation. XGBoost is based on
Journal Pre-proof
18
ensembles of CART, which are obtained by recursively partitioning input data and fitting a real-valued prediction model within each partition. Thus, successive trees are fit on the residuals of the previous trees, and ensemble predictions are obtained by taking the sum of the weighted scores predicted by each tree in the model. A scalable end-toend tree boosting algorithm system is implemented in the package called xgbTree (Chen et al., 2016). Cubist is another ensemble tree-based model, where a series of trees are created
of
with adjusted weights (committees), and the terminal leaves contain linear regression
ro
models for prediction. There are intermediate linear models at each step of the tree, and
-p
all models are based on the variables used in previous splits. The tree is reduced to a set
re
of rules, which initially are paths from the top to the bottom of the tree. The final prediction consists of an average of predictions from all committee members. The
lP
algorithm was developed from an earlier version of C4.5 and the M5 model tree
development.
na
(Quinlan, 1992). The cubist package was used for Cubist Regression models
Jo ur
Random Forest is the last ensemble model used that fits multiple CART models to random subsets of input data, and combines trees for prediction (Breiman, 2001). A double random in each tree node is split using a random subset of predictors and observations at each node, and this process is repeated hundreds of times (as specified by the user). The number of predictors to be used in each tree-building process (mtry) and the number of trees to be built in the forest (ntree) can vary depending on the dataset. Each tree is built from a bootstrap sample of the original dataset, allowing robust error estimation with the remaining test set, the so-called Out-Of-Bag sample. Based on Random Forest, RFE algorithm was tested in order to reduce the dimensionality and find a small set of optimal environmental covariables (Kuhn, 2012).
Journal Pre-proof
19
In the caret package, the backward selection is performed by using all the predictors while ranking the environmental covariables importance. The algorithm builds several calibration models that use the pi most important predictors, where pi is an element of a predetermined sequence (p1, p2,…,pn) of possible numbers of predictors. The set of predictors‟ pi producing the best model amongst the candidate models is retained (Stevens et al., 2013), avoiding refitting many models at each step of the search (Kuhn
ro
2.10.2. Evaluation of As content predictive models
of
and Johnson, 2013).
-p
The complete framework of As content predictive models assessment until the
re
grouping analysis is presented in Figure 2. In the first step, the full dataset was split randomly into training or internal validation (n = 111 points) and external validation
lP
dataset (n = 48 points). Such splitting procedure has been used as a common standard
na
approach for soil modeling (Pahlavan Rad et al., 2014, Brungard et al., 2015). In the second step, the effect of the model tuning on the model performance was assessed. In
Jo ur
other words, the caret package tests different tuning parameters specific to each model requirement, and returns as output the best model. In the third step, the best models were validated again by an external dataset not used for modelling (external validation). Three standard statistical performances of accuracy evaluation for both datasets (training and validation) were calculated: R2, RMSE, and MAE, according to the following equations: ∑ ∑
√ ∑
̅̅̅̅
Journal Pre-proof
20
|
∑|
where n is the number of observations, mi is the correspondent measured value, ei is the estimated As values, and ̅̅̅̅ is the average of the measured value. MAE is referred as the bias of prediction, which provides information about the average absolute error, which is considered more rigorous than the common used mean error (Tóth et al., 2016). RMSE quantifies the spread of the errors distribution (Isaaks and Srivastava,
ro
of
1989). The lower the bias and spread, the better the predictive models.
-p
2.11 Grouping analysis
re
In order to better understand As content distribution, in the fourth step the tool Grouping Analysis (ArcGIS 10.1 from ESRI) was performed. This unsupervised MLT
lP
provides a proper number of natural groups based on a set of the main environmental
na
covariates discriminated by previous multivariate statistical analysis (2nd and 3rd steps). Thus, an optimal number of natural groups were provided, where the environmental
Jo ur
covariates similarity within group and the difference among groups is higher than possible. K means algorithm was used. R2 values were provided by each environmental covariate: the higher the value, the more effective is that variable at discriminating among features. The grouping effectiveness is measured using the Calinski-Harabasz pseudo F-statistics, which is the ratio that reflects within-group similarity and betweengroup difference. Thus, the most important environmental covariables ranked from the best predictive model (higher accuracy) were applied in group analysis algorithm.
3. Results and discussion
Journal Pre-proof
21
3.1. As concentration at different soil parent materials and land use Considering the natural concentrations of As in agricultural soils, geological composition is one of the main drivers of variability (Lu et al., 2012; Micó et al., 2006; Tarvainen et al., 2013). Nevertheless, As enrichment would be expected due to common agricultural practices (Cai et al., 2010; Cai et al., 2015; Nicholson et al., 2003). Table 3 shows As concentrations in different geologies according to land use (agriculture and reference soils), as well as EF. Although topsoil As contents are usually low (Kabata-
of
Pendias, 2011), a wide range of As content was found, where the coefficient of variation
ro
was elevated for all parent material/land use conditions. The median As values varied
-p
from 0.14 to 4.41 mg kg-1 in reference soils and 0.28 to 58.3 mg kg-1 in agricultural
re
soils. The median was used to avoid data distortion that generally affects the arithmetic mean values (Reimann et al., 2012; Yamasaki et al., 2001). Soil As concentrations (in
lP
mg kg-1) followed the descending order in reference soils: sedimentary rocks (4.46) >
na
basic igneous rocks (3.06) > metamorphic rocks (1.80) > unconsolidated sediments (1.47) > acid igneous rocks (0.69); and in cultivated soils: sedimentary rocks (7.64) >
Jo ur
basic igneous rocks (2.76) > acid igneous rocks (2.72) > metamorphic rocks (2.31) > unconsolidated sediments (1.56).
Low EF values were found for most of the geological types, independently of land use: EF from 0.3 to 1.6 were found in reference soils, meanwhile EF ranged from 0.4 to 2.5 for cultivated soils. Enrichment was solely verified in unconsolidated sediments (EF: 2.5) for agricultural soils. Such values suggest non-antropogenic As supply, most likely coming from parent materials as reported by Tarvainen et al. (2013). Intense weathering-leaching conditions imposed by tropical climate (precipitation > evapotranspiration), along with the oxidative state of As in acid soils, favor As leaching to deep soil layers (Zhang et al., 2006).
Journal Pre-proof
22
Since EF did not suggest agriculture influence, the Mann-Whitney pairwise was performed in order to check if there was statistical difference between As concentrations in soils from different geological types. The null hypothesis was that groups of parent materials (acid and basic igneous rocks, metamorphic rocks, sedimentary rocks, and unconsolidated sediments) are identical (p < 0.05). Meanwhile, results proved there is statistical difference among those parent material groups under different land use: p = 0.0160 (reference soil), and p = 0.0019 (agriculture soil). Statistical differences were
of
found for soils originated from sedimentary rocks and unconsolidated sediments in both
ro
land uses, which corresponded to the highest and the lowest mean As contents,
-p
respectively. The results are in accordance with Kabata-Pendias (2011) and Reimann et
re
al. (2003), who attributed higher As content to argillaceous sedimentary rocks (specially claystones, shale, and phyllite). Conversely, unconsolidated sediments, which group
lP
soils derived from sandy, clay-sand, fluvial, alluvial, and colluvial sediments, as well as
na
detrital-laterite covers, are not considered rocks, being composed by a wide variation of materials. According to Smedley and Kinniburgh (2002), As concentrations depends on
Jo ur
the texture and mineralogy of the sediments. Sediments with elevated As concentrations have been associated to large amounts of sulfide minerals (Smedley and Kinniburgh, 2002), which was not the nature of the analyzed sediments of our study.
3.2. Predictive model tuning Besides streamlining the workflow for fitting models, the caret package itself includes a set of semi-automated approaches for optimizing the values of the tuning parameters (Kuhn, 2012). Parametrization was not necessary for GLM and Bagged Tree. Based on a range of parameters tested using the training dataset, the optimal parameters were found and the best models are listed in Table 4, with their respective
Journal Pre-proof
23
performances (model performance information columns – internal validation). All of the models presented high scores denoting model consistency, except for XGBoost, which R2 was very low (0.28), with large values of RMSE (0.2894) and MAE (0.2285).
3.3. Accuracy assesment The accuracy assessment based on external validation is also presented in Table 4. In general, superior performance of the tree-based models (Cubist, XGBoost, Bagged
of
Tree, and RFE Random Forest) over the regression ones (GLM, Ridge regression, and
ro
MARS) was found, except for XGBoost. Tree-based models are capable to represent
-p
non-linear and non-smooth relationships between response variables and environmental
re
covariates (Heung et al., 2016), being flexible in handling with numerical and categorical predictors (Hastie et al., 2009). Thus, it was possible to accurately predict
lP
As content by means of Scorpan model.
na
Regarding regression-based predictive model performance, the accuracy increased following the sequence: GLM < Ridge regression < MARS. GLM –a quite
Jo ur
simple method– often has low bias, but large variance. Prediction accuracy was improved by shrinking and setting some coefficients to zero, which is the case of Ridge and MARS regressions. According to Hastie et al. (2009), the bias is shunned to reduce the variance of the predicted values, and hence the overall prediction accuracy is improved. Among the tree-based models, RFE Random Forest outperformed the others (R2 = 0.68, RMSE = 0.2198, MAE =0.1699). Random Forest is conceptually similar to treebased learners and shares the same advantages; nevertheless, it is little sensitive to noninformative predictors (Kuhn and Johnson, 2013), usually performing better than CART. Random Forest is considered a prominent ML algorithm, capable to
Journal Pre-proof
24
successfully predict soil attributes (Towett et al., 2015), as it has been reported to outperform other algorithms (Brungard et al., 2015; Hengl et al., 2015; Mancini et al., 2019), especially those based on linear regression analyses (Ahmad et al., 2010; Hengl et al. 2015). Furthermore, there were slight differences when contrasting accuracy indexes of Bagged Tree versus RFE Random Forest: R2 (0.67 and 0.68, respectively), RMSE (0.2263 and 0.2198, respectively) and MAE (0.1786 and 0.1699, respectively). In the bagging procedure of CART, the instability of a single-tree model is minimized
of
through the aggregation of multiple models, this may result in improved consistency
-p
values of RMSE (0.3203) and MAE (0.2538).
ro
(Breiman, 1996). Conversely, XGBoost performed poorly: low R2 (0.47), and high
re
The development of predictive models for soil attributes has as a core the environmental covariates contribution to the learning process of each algorithm (Duda
lP
et al. 2017; Mancini et al., 2019). Based on model reliability and keeping in mind our
na
goal to identify the main drivers of soil As contents, only the rankings of variable importance for the best models (Bagged Tree, and RFE Random Forest) are presented.
Jo ur
The importance of environmental covariables in Bagged Tree is ranked by the total reduction in sum of squares achieved by all tree splits. Since more than one tree is created, the average is done to derive the overall measure of the environmental covariables importance (Prasad et al., 2006), as listed in Figure 3. Twenty out of 65 environmental covariates were deemed relevant. Climate and soil were the main drivers of As content variability, where the algorithm weighted heavily on temperature, SOC, clay, and pXRF data (total contents of Fe2O3, and TiO2). Parent material (geology) and relief (valley depth, surface texture) also exerted importance, configuring the multivariate influence as expected.
Journal Pre-proof
25
As already mentioned, by bootstrap aggregation of original data in each tree, Random Forest creates robust error estimation. From this, mean decrease accuracy is calculated and the importance of an environmental covariables is determined by measuring the change in prediction accuracy, when the values of the variable are randomly permuted compared with the original observations (Figure 4a). Another calculated index is the mean decrease in Gini, which is the sum of all decreases in Gini impurity due to a given variable (when this variable is used to form a split in the
of
Random Forest), normalized by the number of trees (Figure 4b). So, both indexes are
ro
attributes assigned by RFE Random Forest that can be used for ranking environmental
-p
covariates importance. Important covariates presented large increase in out-of-bag error.
re
Forty-nine out of 65 environmental covariates were deemed as important, and 7 of them presented negative values of mean decrease accuracy (Figure 4a), pointing out that they
lP
are not predictive enough. The most explanatory environmental covariables were quite
na
similar to Bagged Tree: temperature, organic carbon, texture, and pXRF data (total contents of V, Fe2O3, and TiO2).
Jo ur
It is expected that anthropogenic As sources exceed natural sources in the environment by 3:1 (Mandal and Suzuki, 2002; Woolson et al., 1983). Interestingly enough, agriculture as a means to increase As content in soils did not qualify as an explanatory covariable. In fact, it showed a very low importance by the most accurate model (Random Forest), meanwhile it was not even listed by overall Bagged Tree. Considering the common practices of Brazilian agriculture, one of the main hypotheses of this work was that agriculture use would increase As contents, since there is a large input of phosphate fertilizers in agriculture as means to increase crop production in tropical soils. This may present a threat to the environment and a source of trace elements (as arsenic) in soils (Corguinha et al., 2012; Corguinha et al., 2015).
Journal Pre-proof
26
Nevertheless, in accordance with Corguinha et al. (2015), agriculture and reference soils did not present statistically significant differences in this study. The use of pXRF spectrometry played an important role in providing multielemental contents to evaluate their importance as explanatory covariables for soil As content and its prediction. Recently, such equipment has been successfully used to predict several soil physico-chemical properties in tropical conditions (Pelegrino et al., 2018; Ribeiro et al., 2017; Silva et al., 2017; Silva et al., 2016). In this study, Fe2O3,
of
TiO2, and V were consistently ranked for As prediction, but V presented low recovery
ro
percentage (Supplementary Section, Table S3). The first two chemical components are
-p
important weathering tracers for tropical soils, since: a) they are considered very
re
resistant (immobile), and thus, by remaining in soils, their residual concentration increase indicates advanced weathering-leaching conditions (Marques et al., 2004;
lP
Taylor and Eggleton, 2001); b) the concentration of these oxides in the soils varies
na
according to pedogenic processes, and are relative to the type of parent material (Stockmann et al., 2016). Thus, as long as As and its compounds are mobile in soils
Jo ur
(Mandal and Suzuki, 2002), and can be distributed by weathering (Hou et al., 2017) or influenced by characteristics that weathering-leaching print on soils, Fe and Ti contents could be important tracers of soil As content. In addition, studies have reported positive correlation between As and Fe2O3 contents (Campos, 2002; Chakraborty et al., 2017), since arsenate is highly adsorbed on oxides, especially in acid (McBride, 1994) and well-drained soils. Regarding pXRF data usage, recently greater focus has been given to the accuracy of final predictions rather than recovery rate (Zhu et al., 2011; Sharma et al., 2015; Wang et al., 2015; Weindorf et al., 2016; Xu et al., 2019), since several robust algorithms have been used as a fine-tune for improving the accuracy of predictions of
Journal Pre-proof
27
various soil attributes (Andrade et al., 2020). However, keeping in mind the low recovery rate of V that might decrease model reliability, it is important to highlight that the Pedometric models used in this study case are repeatable and updatable as long as new and more accurate data come in line (Bui, 2006). Based on RFE of Random Forest, the top five most important environmental covariables were temperature, clay, sand, organic carbon, and TiO2. This was in most part in accordance with Bagging Tree environmental covariates selection. Soil organic
of
carbon aside, those explanatory covariables are very stable attributes, and they suggest
ro
soil As contents are related to environmental conditions rather than anthropogenic
-p
sources. It is also noteworthy to mention the role of covariable soil (s) as on As content
re
prediction. The fact that the soil forming factors are related to the concentration of the naturally occurring elements in the soil gives the impression they could be predicted
lP
from the soils' elemental composition (Towett et al., 2015).
na
Stevens et al. (2013) obtained accuracy improvement with RFE when compared with other models in soil properties predictive models. The authors attributed such
Jo ur
accuracy increment to the fact that by reducing the dimensionality, keeping only the relevant information on prediction, they allowed the increase of relative weight of the environmental covariables in the models. Furthermore, randomized variable selection minimizes the potential model bias, which occurs when a few predictors are used more often than others to generate the node-splitting rules (Breiman, 2001). Such environmental covariates will compose the grouping analysis, and its influence in soils As content.
Journal Pre-proof
28
3.4. Grouping analysis The Grouping analysis reports Calinski-Harabasz pseudo F-statistics for each suggested number of groups, used as the base for establishing the optimal number of natural groups or clusters (the complete median pseudo F-statistic for each he number of groups is presented in Supplementary Section, Figure S1). Using temperature, clay, sand, SOC, and TiO2, a total of 4 natural groups better identified the similarities and differences of such environmental covariables selected, reaching the highest index of
of
112.62. Furthermore, another metric of grouping is the R2 of each environmental
ro
covariable: the higher the R2, the more significant is a given environment covariate in
-p
splitting groups. Thus, SOC was the most important (R2 = 0.84), followed by clay (R2 =
re
0.77), sand (R2 = 0.77), temperature (R2 = 0.60), and TiO2 (R2 = 0.44). The spatial distribution of groups, as well as the range of As content variation of each group is
lP
presented in Figure 5. The mean ± standard deviation of As content and the most
na
important environmental covariates of each group are presented in Table 5. Group 1 presented the lowest As content among groups, with coarser texture,
Jo ur
lower organic carbon, higher temperatures, and the lower TiO2 contents. Considering the geographic distribution, such group is widespread throughout the Brazilian territory, accomplishing different environmental conditions. According to Mandal and Suzuki (2002), uncontaminated soils tend to range from 1 to 40 g kg-1 of As, and those with large sand contents tend to have the lowest concentration, due to low surface area and higher permeability (Hou et al., 2017). The effect of temperature itself in As dynamics is not documented. However, it is well-know the effect of temperature on SOC dynamics in broad scales (Gomes et al., 2019; Hengl et al., 2015; Minasny et al., 2013), which in turn affect As content (Chakraborty et al., 2017). In the case of the Group 1, higher temperatures might decrease organic carbon content, decreasing As adsorption,
Journal Pre-proof
29
corroborating with findings of Horta et al. (2015) and Hou et al. (2017). In recent study of soil carbon stock for the whole Brazilian territory (Gomes et al. 2019), temperature was ranked as one of the most important environmental covariable is SOC stock predictive models, and areas with higher temperatures with drier and seasonal periods presented the lowest contents. Furthermore, the lowest TiO2 content indicates low weathering-leaching, since it is immobile as it forms part of various minerals such as rutile, ilmenite, micas and pyroxenes (Birkeland, 1999; Maynard, 1992; Schaetzel and
of
Anderson, 2005), beyond Fe-oxide minerals.
ro
Group 2 comprises organic soils (Histosols, organic carbon ≥ 80 mg kg-1),
-p
located in the floodplain of the landscape, where the soil is saturated with water, being
re
virtually free of gaseous oxygen, causing organic carbon accumulation. When compared with other groups, TiO2 and temperature are ranked as “medium”; i.e. in a relative scale
lP
of low-medium-high values. Since the mineral particle contribution is minimal, most of
na
Histosols in Brazilian databanks do not present soil texture analysis, as is the case of this study. The As contents for this group are slightly higher than those of Group 1, but
Jo ur
lower than those of Groups 3 and 4. The organic carbon content in this group is the largest one among all groups, which could lead to higher soil As concentrations (Mastchullat, 2000). Yet, As behavior is dependent on the soils oxidation state. In this case, it seems that As in its reduced state was readily leached since the potential of reduction is higher (Kabata-Pendias, 2011, Masscheleyn et al., 1991). When compared with the others, Group 3 presented medium As content, texture, organic carbon, TiO2, and temperature, presenting the largest number of points widespread throughout the Brazilian territory. The Group 4 presented the largest As, clay, organic carbon, and TiO2 contents, and the lowest sand content. The finer texture may contribute to As adsorption due to
Journal Pre-proof
30
the larger surface area (Ladeira et al., 2002). This group presented the largest organic carbon content among groups with mineral soils (excluding Group 2), with points located mainly across Southeastern Brazil, where the temperature is milder. Such region tends to accumulate SOC, suggesting a long-term pattern of stabilization (Gomes et al., 2019). Arsenates, which is a stable species under oxidizing conditions (soils are mainly well-drained) is strongly sorbed onto organic matter (Mandal and Suzuki, 2002), probably due to its cation exchange capacity (Hou et al., 2015; Romic and Romic,
of
2003), as opposed to Group 1. Lastly, as already discussed, higher TiO2 contents
ro
suggest soils highly weathered-leached, where iron and aluminum oxides become
-p
residually concentrated. In most Brazilian soil conditions, desilication and subsequent
re
Fe- and Al-minerals enrichment are very common phenomena. Those minerals that remain in soils present high capacity of As adsorption (Campos et al., 2013). Moreover,
lP
As adsorption by SOM is generally increased by the bridging action of Fe- and Al-
na
compounds (Campos et al. 2007). In this sense, the combination of environmental covariates characteristics of this group together contributes positively to As content
Jo ur
increment in soils.
4. Conclusions
The environmental covariables along with RFE Random Forest and Bagging Tree models predicted soil As content accurately. By the framework stablished in this work, the main elements contributing to As variability were clearly and accurately identified. By the analysis of environmental covariables importance, it was clear that soil and climate factors were deemed as main drivers of As content variability rather than land use. In addition, the grouping analysis provided a better insight into how the most important environmental covariables are numerically distributed, instead of only
Journal Pre-proof
31
ranking their importance. It was also possible to identify As content ranges among such uniform groups created by means of rigorous statistical analysis, bringing geographical expression of dataset analysis. Such natural groups or clusters could work as environmental indicator in the absence of soil As content analysis, which is a common scenario in Brazil and in many Latin American countries.
Declaration of interest statement
of
The authors declare that they have no known competing financial interests or personal
-p
ro
relationships that could have appeared to influence the work reported in this paper.
re
Acknowledgements
The authors acknowledge CAPES (Brazilian Coordination for the Improvement
lP
of Higher Education Personnel), FAPEMIG (Minas Gerais State Research Support
na
Foundation), and CNPq (Brazilian Council of Scientific and Technological Development) for financial support and scholarships provided to the authors. The
Jo ur
authors also express their gratitude for the critical review made by anonymous reviewers and the help of Dr. Salvador F. Acuña-Guzman, which improved the clarity and quality of our manuscript.
References Ahmad, S., Kalra, A., Stephen, H., 2010. Estimating soil moisture using remote sensing data: A machine learning approach. Adv. Water Resour. 33, 69-80. https://doi.org/10.1016/j.advwatres.2009.10.008. Aide, M., 2005. Geochemical assessment of Iron and Vanadium relationships in oxic soil
environments.
Soil
&
Sediment
https://doi.org/10.1080/15320380500180382.
Contamination
14,
403–416.
Journal Pre-proof
32
Alkmim, F.F., 2015. Geological Background: A Tectonic Panorama of Brazil, in: Hermelin, M. (Ed.),
Landscapes
and Landforms of Colombia, World
Geomorphological Landscapes. Springer International Publishing, Cham, pp. 9–17. doi:10.1007/978-94-017-8023-0_2. Alloway B.J., 2013. Sources of heavy metal and metalloids in soils. In: Alloway B.J., (Ed), Heavy metals in soils. Trace metal and metalloids in soils and their availability, third ed. Springer, Dordrecht. Alvares, C.A., Stape, J.L., Sentelhas, P.C., de Moraes Gonçalves, J.L., Sparovek, G., 2013. Meteorologische Zeitschrift 22, 711-728. https://dx.doi.org/10.1127/0941-
of
2948/2013/0507.
Anderson, R.H., Basta, N.T., 2009. Application of ridge regression to quantify marginal
ro
effects of collinear soil properties on phytoaccumulation of arsenic, cadmium,
-p
lead, and zinc. Environ. Toxicol. Chem. 28, 619-628. https://doi.org/10.1897/08186.1.
re
Andrade, R.; Silva, S.H,G.; Weindorf, D.C.; Chakraborty, S.; Faria, W.M.; Mesquita, L.F.; Guilherme, L.R.G.; Curi, N. 2020. Assessing models for prediction of some
data
in
Brazilian
lP
soil chemical properties from portable X-ray fluorescence (pXRF) spectrometry Coastal
Plains.
Geoderma,
357,
113957.
na
https://doi.org/10.1016/j.geoderma.2019.113957. Birkeland, P.W., 1999. Soils and Geomorphology, third ed. Oxford Univ. Press, New
Jo ur
York. 430 pp.
Bishop, C. M. 2006. Pattern recognition and machine learning (Springer, New York). Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Chapman & Hall/CRC. Briand, L.C., Freimut, B., Vollei, F. 2004. Using multiple adaptive regression splines to support decision making in code inspections. The Journal of Systems and Software. 73, 205–217. https://doi.org/10.1016/j.jss.2004.01.015. Bruce, P. C., Bruce, A., 2017. Practical Statistics for Data Scientists, first ed. O‟Reilly Media. Brungard, C.W., Boettinger, J.L., Duniway, M.C., Wills, S.A., Edwards, T.C., 2015. Machine learning for predicting soil classes in three semi-arid landscapes. Geoderma 240, 68–83. https://doi.org/10.1016/j.geoderma.2014.09.019.
Journal Pre-proof
33
E. Bui, 2006. A Review of Digital Soil Mapping in Australia, in: Lagacherie, P.; McBratney, A.B.; Voltz, M. (Eds.), Developments in Soil Science. Elsevier, pp 2537. https://doi.org/10.1016/S0166-2481(06)31002-1. Cai, L., Xu, Z., Bao, P., He, M., Dou, L., Chen, L., Zhou, Y., Zhu, Y.G., 2015. Multivariate and geostatistical analyses of the spatial distribution and source of arsenic and heavy metals in the agricultural soils in Shunde, Southeast China. J. Geochemical
Explor.
148,
189–195.
https://doi.org/10.1016/j.gexplo.2014.09.010. Cai, L.M., Huang, L.C., Zhou, Y.Z., Xu, Z.C., Peng, X.C., Yao, L.A., Zhou, Y., Peng,
of
P.A., 2010. Heavy metal concentrations of agricultural soils and vegetables from
ro
Dongguan, Guangdong. J. Geogr. Sci. 20, 121–134. doi: 10.1007/s11442-0100121-1.
-p
Campos, M.L., Borges, Luiz Roberto Guimarães Guilherme, K.S.C., Antunes, A.S., 2013. Teor de arsênio e adsorção competitiva arsênio/fosfato e arsênio/sulfato em
re
solos de Minas Gerais, Brasil. Ciência Rural. St. Maria 43, 985–991. Campos, M.L., Guilherme, L.R.G., Lopes, R.S., Antunes, A.S., Marques, J.J.G. de S.,
lP
M., Curi, N., 2007. Teor e capacidade máxima de adsorção e arsênio em Latossolos brasileiros. Rev. Bras. Cienc. do Solo 31, 1311–1318.
na
Campos, V., 2002. Arsenic in groundwater affected by phosphate fertilizers at São
0.
Jo ur
Paulo, Brazil. Environ. Geol. 42, 83–87. https://doi.org/10.1007/s00254-002-0540-
Chakraborty, S., Man, T., Paulette, L., Deb, S., Li, B., Weindorf, D.C., Frazier, M., 2017. Rapid assessment of smelter/mining soil contamination via portable X-ray fluorescence spectrometry and indicator kriging. Geoderma 306, 108–119. https://doi.org/10.1016/j.geoderma.2017.07.003. Chen, T., Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System, in: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794. https://doi.org/10.1145/2939672.2939785. Chen, T.-B., Zheng, Y.-M., Lei, M., Huang, Z.-C., Wu, H.-T., Chen, H., Fan, K.-K., Yu, K., Wu, X., Tian, Q.-Z., 2005. Assessment of heavy metal pollution in surface soils
of urban parks
in
Beijing, China.
Chemosphere 60,
542-551.
https://doi.org/10.1016/j.chemosphere.2004.12.072. Ciminelli, V.S.T., Antônio, D.C., Caldeira, C.L., Freitas, E.T.F., Delbem, I.D., Fernandes, M.M., Gasparon, M., Ng, J.C., 2018. Low arsenic bioaccessibility by
Journal Pre-proof
34
fixation in nanostructured iron (Hydr)oxides: Quantitative identification of Asbearing phases. J. Hazard. Mater. 353, 261–270. doi:10.1016/j.jhazmat.2018.03.037 Conrad, O., Bechtel, B., Bock, M., Dietrich, H., Fischer, E., Gerlitz, L., Wehberg, J., Wichmann, V., Böhner, J., 2015. System for Automated Geoscientific Analyses (SAGA)
v.
2.1.4.
Geosci.
Model
Dev.
8,
1991–2007.
https
://doi.org/10.5194/gmd-8-1991-2015. Corguinha, A.P.B., Gonçalves, V.C., Souza, G.A., Lima, W.E.A., Penido, E.S., Pinto, C.A.B.P., Francisco, E.A.B., Guilherme, L.R.G., 2012. Cadmium in potato and
of
Food
Composition
https://doi.org/10.1016/j.jfca.2012.05.001.
and
Analysis
27,
32–37.
ro
Journal
of
soybeans: do phosphate fertilization and soil management systems play a role?
-p
Corguinha, A.P.B., Souza, G.A. de, Gonçalves, V.C., Carvalho, C. de A., Lima, W.E.A. de, Martins, F.A.D., Yamanaka, C.H., Francisco, E.A.B., Guilherme, L.R.G., 2015.
safety
purposes.
re
Assessing arsenic, cadmium, and lead contents in major crops in Brazil for food J.
Food
Compos.
Anal.
37,
143–150.
lP
https://doi.org/10.1016/j.jfca.2014.08.004.
Davis, H.T., Aelion, C.M., McDermott, S., Lawson, A.B., 2009. Identifying natural and
PCA,
and
na
anthropogenic sources of metals in urban and rural soils using GIS-based data, spatial
interpolation.
Environ.
Pollut.
157,
2378-2385.
Jo ur
doi:10.1016/j.envpol.2009.03.021. de Sá Paye, H., de Mello, J.W.V., de Magalhães Mascarenhas, G.R.L., Gasparon, M., 2016. Distribution and fractionation of the rare earth elements in Brazilian soils. J. Geochemical Explor. 161, 27–41. doi:10.1016/j.gexplo.2015.09.003 Ding, H. J., Ji, H. B., 2010. Application of chemometric methods to analyze the distribution and chemical fraction patterns of metals in sediment from a metropolitan
river.
Environ.
Earth
Sci.
61,
641-657.
https://doi.org/10.1007/s12665-009-0379-8. Duda, B.M., Weindorf, D.C., Chakraborty, S., Li, B., Man, T., Paulette, L., Deb, S., 2017. Soil characterization across catenas via advanced proximal sensors. Geoderma 298, 78–91. https://doi.org/10.1016/j.geoderma.2017.03.017 Dudka, S., Ponce-Hernandez, R., Hutchinson, T., 1995. Current level of total element concentrations in the surface layer of Sudbury‟s soils. Sci. Total Environ. 162, 161-171. https://doi.org/10.1016/0048-9697(95)04447-9.
Journal Pre-proof
35
Fergusson, J.E., 1990. The heavy elements: chemistry, environmental impact and health effects, Oxford: Pergamon Press, 614. Figueiredo, B.R., Borba, R.P., Angélica, R.S., 2007. Arsenic occurrence in Brazil and human
exposure.
Environ.
Geochem.
Health
29,
109–118.
https://doi.org/10.1007/s10653-006-9074-9. Freund, Y., Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceeding ICML'96 Proceedings of the Thirteenth International Conference on International Conference on Machine Learning. 148-156. Friedman, J. H., 2001. Greedy function approximation: A gradient boosting machine.
of
Ann. Statist. 29, 1189-1232. https://doi:10.1214/aos/1013203451. Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Statist. 19, 1-67.
ro
https://doi:10.1214/aos/1176347963.
-p
Goh, K., Lim, T., 2004. Geochemistry of inorganic arsenic and selenium in a tropical soil: effect of reaction time, pH, and competitive anions on arsenic and selenium Chemosphere
re
adsorption.
55,
849-859.
https://doi.org/10.1016/j.chemosphere.2003.11.041.
lP
Gomes, L.C., Faria, R.M., Souza, E., Veloso, G.V., Schaefer, C.R.G.R., Fernandes Filho, E.I., 2019. Modelling and mapping soil organic carbon stocks in Brazil.
na
Geoderma 340, 337-350. https://doi.org/10.1016/j.geoderma.2019.01.007. Guo, H., Zhang, B., Wang, G., Shen, Z., 2010. Geochemical controls on arsenic and
Jo ur
rare earth elements approximately along a groundwater flow path in the shallow aquifer of the Hetao Basin, Inner Mongolia. Chemical Geology 270, 117-125. https://doi.org/10.1016/j.chemgeo.2009.11.010. Hastie, T.; Tibshirani, R.; Friedman, J.H. 2009. The elements of statistical learning: data mining, Inference and Prediction, 2nd ed. Springer, New York, USA p. 533. Hengl, T., Heuvelink, G.B.M., Kempen, B., Leenaars, J.G.B., Walsh, M.G., Shepherd, K.D., Sila, A., MacMillan, R.A., Mendes de Jesus, J., Tamene, L., Tondoh, J.E., 2015. Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions. Plos One 10(6): e0125814. https://doi.org/10.1371/journal.pone.0125814. Herngren, L., Goonetilleke, A., Ayoko, G.A., 2005. Understanding heavy metal and suspended solids relationships in urban stormwater using simulated rainfall. J. Environ. Manag. 76, 149-158. https://doi.org/10.1016/j.jenvman.2005.01.013. Herrmann, A.G., Heinrichs, H., 1990. Praktikum der Analytischen Geochemie,
Journal Pre-proof
36
Springer-Lehrbuch. Springer Berlin Heidelberg, Berlin, Heidelberg. doi:10.1007/978-3-642-61286-2 Heung, B., Ho, H.C., Zhang, J., Knudby, A., Bulmer, C.E., Schmidt, M.G., 2016. An overview and comparison of machine-learning techniques for classification purposes
in
digital
soil
mapping.
Geoderma
265,
62–77.
https://doi.org/10.1016/j.geoderma.2015.11.014. Hijmans, R.J, Cameron, S.E., Parra, J.L., Jones, P.G., Jarvis, A., 2005. Very high resolution interpolated climate surfaces for global land areas. Int. J. Climatol. 25, 1965–1978. https://10.1002/joc.1276.
of
Horta, A.; Malone, B.; Stockmann, U.; Minasny, B.; Bishop, T.F.A.; McBratney, A.B.; Palasser, R.; Pozza, L. 2015. Potential of integrated field spectroscopy and spatial
ro
analysis for enhance assessment of soil contamination: a prospective review.
-p
Geoderma 241-242, 180-209. https://doi.org/10.1016/j.geoderma.2014.11.024. Hou, D.; O'Connor, D.; Nathanail, P.; Tian, L.; Ma, Y. 2017. Integrated GIS and
re
multivariate statistical analysis for regional scale assessment of heavy metal soil contamination: A critical review. Environmental Pollution 231, 1188-1200.
lP
https://doi.org/10.1016/j.envpol.2017.07.021. IBGE, 2015. Instituto Brasileiro de Geografia e Estatística. Mapa de Biomas e de Geociências.
na
Vegetação.
https://ww2.ibge.gov.br/home/presidencia/noticias/21052004biomashtml.shtm/
Jo ur
(accessed 21 April 2018).
Isaaks, E.H., Srivastava, R.M., 1989. An Introduction to Applied Geostatistics, 1st ed. Oxford University Press, New York. IUSS Working Group WRB, 2014. World reference base for soil resources 2014. International soil classification system for naming soils and creating legends for soil
maps,
World
Soil
Resources
Reports
No.
106,
FAO,
Rome.
doi:10.1017/S0014479706394902 Iwahashi, J., Pike, R.J., 2007. Automated classifications of topography from DEMs by an unsupervised nested-means algorithm and a three-part geometric signature. Geomorphology 86, 409-440. https: //doi.org/10.1016/j.geomorph.2006.09.012. Kabata-Pendias, A., 2011. Trace elements in soils and plants, Trace Elements in Soils and Plants, fourth ed. CRC Press, Boca Raton, FL. https://doi.org/10.1201/b1015825.
Journal Pre-proof
37
Kuhn, M., 2018. Caret: classification and regression training. R package version 6.0-81. Available: https://cran.r-project.org/web/packages/caret/index.html (accessed 06 January 2019). Kuhn, M., 2012. Variable selection using the caret package. Available: http://r-forge.rproject.org/scm/viewvc.php/*checkout*/pkg/caret/inst/doc/caretSelection.pdf?revi sion=77&root=caret&pathrev=90. (accessed 19 January 2019). Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling, first ed. Springer, New York. Kuwatani, T., Nagata, K., Okada, M., Watanabe, T., Ogawa, Y., Komai, T., Tsuchiya,
Tohoku
tsunami
deposits. Scientific
reports, 4
(7077),
1-6.
ro
HTTP://doi.org/10.1038/srep07077.
of
N., 2014. Machine-learning techniques for geochemical discrimination of 2011
imobilização
de
arsênio.
-p
Ladeira, A.C.Q., Ciminelli, V.S.T., Nepomuceno, A.L., 2002. Seleção de solos para R.
Esc.
Minas
55:
215-221.
re
http://dx.doi.org/10.1590/S0370-44672002000300009. Lado, L.R., Hengl, T., Reuter, H.I., 2008. Heavy metals in European soils: A
lP
geostatistical analysis of the FOREGS Geochemical database. Geoderma 148, 189–199. https://doi.org/10.1016/j.geoderma.2008.09.020.
na
Loska, K., Wiechula, D., 2003. Application of principal component analysis for the estimation of source of heavy metal contamination in surface sediments from the
Jo ur
Rybnik Reservoir. Chemosphere 51, 723–733. https://doi.org/10.1016/S00456535(03)00187-5.
Lu, A.X., Wang, J.H., Qin, X.Y., Wang, K.Y., Han, P., Zhang, S.Z., 2012. Multivariate and geostatistical analyses of the spatial distribution and origin of heavy metals in the agricultural soils in Shunyi, Beijing, China. Sci. Total Environ. 425, 66–74. https://doi.org/10.1016/j.scitotenv.2012.03.003. Mancini, M., Weindorf, D.C., Chakraborty, S., Silva, S.H.G., Teixeira, A.F.S., Guilherme, L.R.G., Curi, N., 2019. Tracing tropical soil parent material analysis via portable X-ray fluorescence (pXRF) spectrometry in Brazilian Cerrado. Geoderma 337, 718–728. https://doi.org/10.1016/j.geoderma.2018.10.026. Mandal, B.K., Suzuki, K.T., 2002. Arsenic round the world: a review. Talanta 58, 201235. https://doi.org/10.1016/S0039-9140(02)00268-0. Marques, J.J., Schulze, D. G., Curi, N., Mertzman, S.A., 2004. Trace element geochemistry in Brazilian Cerrado soils. Geoderma 121, 31–43.
Journal Pre-proof
38
Martin, H. W.; Kaplan, D. I., 1998. Temporal changes in Cadmium, Thallium, and Vanadium mobility in soil and phytoavailability under field conditions. Water Air Soil Pollut. 101, 399–410. Masscheleyn, P.H., Delaune, R.D., Patrick, Jr., W.H., 1991. Effect of redox potential and pH on arsenic speciation and solubility in a contaminated soil. Environ. Sci. Technol. 25, 1414–1419. doi:10.1021/es00020a008. Matschullat, J. Arsenic in the geosphere - a review. Science of The Total Environment 249, 297-312. https://doi.org/10.1016/S0048-9697(99)00524-0. Maynard, J.B., 1992. Chemistry of modern soils as a guide to interpreting precambrian
of
paleosols. J. Geol. 100, 279–289. doi: 10.1086/629632.
McBratney, A.B.; Field, D.J.; Koch, A., 2014. The dimensions of soil security.
ro
Geoderma 213, 203–213. https://doi.org/10.1016/j.geoderma.2013.08.013.
-p
McBratney, A.B., Mendonça Santos, M.L., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3-52. https://doi.org/10.1016/S0016-7061(03)00223-4.
re
McBride, M.B., 1994. Environmental chemistry of soils. Oxford University, New York. 406p.
lP
McLean, E.O., Hedleson, M.R., Bartlett, R.J., Holowaychuk, D., 1958. Aluminium in soils: I. Extraction methods and magnitude clays in Ohio soils. Soil Sci. Soc. Am.
na
J. 22, 382–387. doi:10.2136/sssaj1958.03615995002200050005x. Mehlich, A., 1953. Determination of P, Ca, Mg, K, Na and NH4. In North Carolina Soil
Jo ur
Testing Division; University of North Carolina: Raleigh, NC, USA. p. 195. Micó, C., Recatalá, L., Peris, M., Sánchez, J., 2006. Assessing heavy metal sources in agricultural soils of an European Mediterranean area by multivariate analysis. Chemosphere 65, 863–872. https://doi.org/10.1016/j.chemosphere.2006.03.016. Minasny, B., McBratney, A.B., Malone, B.P., Wheeler, I., 2013. Digital Mapping of Soil
Carbon,
Advances
in
Agronomy.
Elsevier,
Sydney.
https://doi.org/10.1016/B978-0-12-405942-9.00001-3 Miranda, E. E. de (Coord.). Brasil em Relevo. Campinas: Embrapa monitoramento por satélite, 2005. Disponível em:
. (accessed 1 July 2019). Nicholson, F.A., Smith, S.R., Alloway, B.J., Carlton-Smith, C., Chambers, B.J., 2003. An inventory of heavy metals inputs to agricultural soils in England and Wales. Sci. Total Environ. 311, 205–219. https://doi.org/10.1016/S0048-9697(03)001396.
Journal Pre-proof
39
Nriagu, J.O., 1989. A global assessment of natural sources of atmospheric trace metals. Nature 338, 47-49. Pahlavan Rad, M.R., Toomanian, N., Khormali, F., Brungard, C.W., Komaki, C.B., Bogaert, P., 2014. Updating soil survey maps using random forest and conditioned Latin hyper- cube sampling in the loess derived soils of northern Iran. Geoderma 232–234, 97–106. http://dx.doi.org/10.1016/j.geoderma.2014.04.036. Parsons C., Margui Grabulosa E., Pili E., Floor G.H., Roman-Ross G., Charlet L., 2013. Quantification of trace arsenic in soils by field-portable X-ray fluorescence Considerations
conditions. J.
for
sample
Hazard.
preparation
and
Mater.
measurement
262, 1213–1222.
of
spectrometry:
doi:http://dx.doi.org/10.1016/j.jhazmat.2012.07.001.
ro
Peinado, F.M., Ruano, S.M., González, M.G.B., Molina, C.E., 2010. A rapid field
fluorescence
(PXRF).
-p
procedure for screening trace elements in polluted soil using portable X-ray Geoderma
159,
76–82.
re
https://doi.org/10.1016/j.geoderma.2010.06.019.
Pelegrino, M.H.P., Weindorf, D.C., Silva, S.H.G., Menezes, M.D., Poggere, G.C.,
lP
Guilherme, L.R.G., Curi, N., 2018. Synthesis of proximal sensing, terrain analysis, and parent material information for available micronutrient prediction in
na
tropical soils. Precis. Agric. 19, 1-22. http://doi.org/10.1007/s11119-018-9608-z. Prasad, A.M., Iverson, L.R., Liaw, A., 2006. Newer Classification and Regression Tree
Jo ur
Techniques: Bagging and Random Forests for Ecological Prediction. Ecosystems 9, 181–199. DOI: 10.1007/s10021-005-0054-1. Quinlan, J.R., 1992. Learning with continuous classes. In: AI ‟92: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence (editors A. Adams, L. Sterling) pp. 343–348. World Scientific, Singapore. https://doi.org/10.1.1.34.885. Ribeiro, B.T.; Silva, S.H.G.; Silva, E.A.; Guilherme, L.R.G. 2017. Portable X-ray fluorescence (pXRF) applications in tropical Soil Science. Ciência e Agrotecnologia
41,
245-254.
http://dx.doi.org/10.1590/1413-
70542017413000117. Reimann, C., Filzmoser, P., Fabian, K., Hron, K., Birke, M., Demetriades, A., Dinelli, E., Ladenberger, A., 2012. The concept of compositional data analysis in practice Total major element concentrations in agricultural and grazing land soils of Europe. Sci. Total Environ. 426, 196–210. doi:10.1016/j.scitotenv.2012.02.032 Reimann C., Siewers U., Tarvainen T., Bityukova L., Eriksson J., Gilucis A.,
Journal Pre-proof
40
Gregorauskiene V., Lukashev V.K., Matinian N.N., Pasieczna A., 2003. Agricultural soils in northern Europe: a geochemical atlas. Geologisches Jahrbuch, Sonderhefte, Reihe D, Heft SD 5, Schweizerbart'sche Verlagsbuchhandlung, Stuttgart. Resende, M., Curi, N., Rezende, S.B., Corrêa, G.F., Ker, J.C., 2014. Pedologia: Base Para a Distinção de Ambientes. Editora UFLA, Lavras. Rodríguez, J.A., Nanos, N., Grau, J.M., Gil, L., López-Arias, M., 2008. Multiscale analysis of heavy metal contents in Spanish agricultural topsoils. Chemosphere 70, 1085–1096. https://doi.org/10.1016/j.chemosphere.2007.07.056.
of
Romic, M., Romic, D., 2003. Heavy metals distribution in agricultural topsoils in urban area. Environ. Geol. 43, 795–805. https://doi.org/10.1007/s00254-002-0694-9.
ro
Rossiter, D.G. 2018. Past, present & future of information technology in pedometrics.
-p
Geoderma 324,131-137. https://doi.org/10.1016/j.geoderma.2018.03.009. Rudnick, R.L., Gao, S., 2014. Composition of the continental crust, in: Holland, H.D.,
re
Turekian, K.K. (Eds.), Treatise on Geochemistry. Elsevier Ltd. pp. 1-51. R Core Team, 2018. R: a language and environment for statistical R Foundation for
(verified 23 Aug. 2018).
lP
Statistical Computing, Vienna, Austria. Available: https://www.R-project. org/
na
Santos, A.D. dos, Coscione, A.R., Vitti, A.C., Boaretto, A.E., Coelho, A.M., Raij, B. van, et al., 2009. Manual de análises químicas de solos , plantas e fertilizantes, 2nd
Jo ur
ed. Embrapa, Brasilia (in Portuguese). Santos, R.D., Lemos, R.C., Santos, H.G., Ker, J.C., Anjos, L.H.C., Shimizu, S.H., 2013. Manual de descrição e coleta de solo no campo, 6ª ed. Sociedade Brasileira de Ciência do Solo, Viçosa, MG. Schaetzl, R., Anderson, S., 2005. Soils Genesis and Geomorphology, 1st ed. Cambridge University Press, New York. Shan, Y., Tysklind, M., Hao, F., Ouyang, W., Chen, S., Lin, C., 2013. Identification of sources of heavy metals in agricultural soils using multivariate analysis and GIS. J. Soils Sediments 13, 720-729. doi: 10.1007/s11368-012-0637-3. Shoemaker, H.E., McLean, E.O., Pratt, P.F., 1961. Buffer methods for determining the lime requirement of soils with appreciable amounts of extractable aluminum. Soil Sci.
Soc.
Am.
Proc.
doi:10.2136/sssaj1961.03615995002500040014x.
25,
274–277.
Journal Pre-proof
41
Silva, S.H.G.; Silva, E.A.; Poggere, G.C.; Guilherme, L.R.G.; Curi, N. 2018. Tropical soils characterization at low cost and time using portable X-ray fluorescence spectrometrer (pXRF): effects of different sample preparation methods. Ciência e Agrotecnologia 42, 80-92. http://dx.doi.org/10.1590/1413-70542018421009117. Silva, S.H.G., Teixeira, A.F.S., Menezes, M.D., Guilherme, L.R.G., Moreira, F.M.S., Curi, N., 2017. Multiple linear regression and random forest to predict and map soil properties using data from portable x-ray fluorescence spectrometer (pXRF). Ciênc.
Agrotec.
41,
648-664.
https://dx.doi.org/10.1590/1413-
70542017416010317.
of
Silva, S.H.G., Poggere, G.C., Menezes, M.D., Carvalho, G.S., Guilherme, L.R.G., Curi, N., 2016. Proximal Sensing and Digital Terrain Models Applied to Digital Soil
ro
Mapping and Modeling of Brazilian Latosols (Oxisols). Remote Sens. 8, 614.
-p
https://doi.org/10.3390/rs8080614.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. New York:
re
Chapman and Hall.
Smedley, P.L., Kinniburgh, D.G. 2002. A review of the source, behavior and
lP
distribution of arsenic in natural Waters. Applied Geochemistry 17 (5), 517-568. https://doi.org/10.1016/S0883-2927(02)00018-5.
na
Stevens, A., Nocita, M., Tóth, G., Montanarella, L., van Wesemael, B., 2013. Prediction of Soil Organic Carbon at the European Scale by Visible and Near InfraRed Spectroscopy.
Jo ur
Reflectance
Plos
one.
8
(6):
e66409.
https://doi.org/10.1371/journal.pone.0066409. Stockmann, U., Cattle, S.R., Minasny, B., McBratney, A. B., 2016. Utilizing portable X-ray fluorescence spectrometry for in-field investigation of pedogenesis. Catena 139, 220–231. https://doi.org/10.1016/j.catena.2016.01.007. Stoeppler, M (ed.), 1997. Sampling and sample preparation: practical guide for analytical chemists. Springer-Verlag Berlin Heidelberg, Berlin, Heidelberg. 10.1007/978-3-642-60632-8 Sutherland, R. A., 2000. Bed sediment-associated trace metals in an urban stream, Oahu Hawaii. Environ. Geol. 39, 611–627. DOI: 10.1007/s002540050473. Tang, J., Johannesson, K.H., 2006. Controls on the geochemistry of rare earth elements along a groundwater flow path in the Carrizo Sand aquifer, Texas, USA. Chem. Geol. 225, 156–171. https://doi.org/10.1016/j.chemgeo.2005.09.007. Tarvainen, T., Albanese, S., Birke, M., Poňavič, M., Reimann, C., 2013. Arsenic in
Journal Pre-proof
42
agricultural and grazing land soils of Europe. Appl. Geochemistry 28, 2–10. doi:10.1016/j.apgeochem.2012.10.005 Taylor, G., Eggleton, R.A., 2001. Regolith, Geology and Geomorphology. JohnWiley & Sons, Ltd., Chichester, England. Tóth, G., Hermann, T., Szatmári, G., Pásztor, L., 2016. Maps of heavy metals in the soils of the European Union and proposed priority areas for detailed assessment. Sci.
Total
Environ.
565,
1054–1062.
https://doi.org/10.1016/j.scitotenv.2016.05.115. Towett, E.K., Shepherd, K.D, Tondoh, J.E., Winowieckil L.A,, Lulseged, T.,
of
Nyambura, M, Sila, A., Vågen, T., Cadisch, G., 2015. Total elemental composition of soils in Sub-Saharan Africa and relationship with soil forming Geoderma
Regional
ro
factors.
5,
157–168.
-p
https://doi.org/10.1016/j.geodrs.2015.06.002.
Tyler, G., 2004. Rare earth elements in soil and plant systems - A review. Plant Soil
re
267, 191–206. https://doi:10.1007/s11104-005-4888-2. United States Environmental Protection Agency (USEPA), 2007. Method 3051a -
lP
Microwave assisted acid digestion of sediments, sludges, soils, and oils. Test Methods Eval Solid Waste. pp. 1–30.
na
Venables, W. N., Ripley, B. D., 2002. Modern applied statistics with S, fourth ed. Springer-Verlag, New York.
Jo ur
Venteris, E.R, Basta, N.T., Bigham, J.M., Rea, R., 2014. Modeling Spatial Patterns in Soil Arsenic to Estimate Natural Baseline Concentrations. J. Environ. Qual. 43, 936–946. doi:10.2134/jeq2013.11.0459. Vieira, B.C., Salgado, A.A.R., Santos, L.J.C., 2015. Brazil: A Land of Beautiful and Undiscovered Landscapes. Springer, Dordrecht, pp. 3–7. doi:10.1007/978-94-0178023-0_1. Xu, D.; Zhao, S.; Li, S.; Chen, Q.; Jiang, J.; Zhou, L.; Shi, Z., 2019. Multi-sensor fusion for the determination of several soil properties in the Yangtze River Delta, China. European Journal of Soil Science, 70, 162–173. https://doi.org/10.1111/ejss.12729. Walkley, A., Black, I.A., 1934. An examination of the Degtjareff method for determining soil organic matter and a proposed modification of the chromic acid titration method. Soil Sci. 37, 29-38. Wang, D.; Chakraborty, S.; Weindorf, D.C.; Li, B.; Sharma, A.; Paul, S.; Ali, M.S., 2015. Synthesized use of VisNIR DRS and PXRF for soil characterization:Total
Journal Pre-proof carbon
and
total
nitrogen.
Geoderma,
43 243–244,
157–167.
http://dx.doi.org/10.1016/j.geoderma.2014.12.011. Weindorf, D.C.; Chakrabort, S.; Herrero, J.; Li, B.; Castaneda, C.; Choudhur, A., 2016. Simultaneous assessment of key properties of arid soil by combined PXRF and Vis–NIR
data.
European
Journal
of
Soil
Science,
67,
173–183.
https://doi.org/10.1111/ejss.12320. Woolson, E.A., 1983. Man‟s perturbation on the arsenic cycle, in: Lederer, W.H., Fensterheim R.J. (Eds.), Arsenic: Industrial, Biomedical and Environmental Perspective. Van Nostrand Reinhold Company, New York, p. 393. (conferir se o
of
título está correto).
Yamasaki, S., Takeda, A., Nanzyo, M., Taniyama, I., Nakai, M., 2001. Background
ro
levels of trace and ultra-trace elements in soils of Japan. Soil Sci. Plant Nutr. 47,
-p
755–765. doi:10.1080/00380768.2001.10408440.
Zhang, H.H.; Yuan, H.X.; Hu, Y.G.; Wu, Z.F.; Zhu, L.A.; Zhu, L.; Li, F.B.; LI, D.Q.
re
2006. Spatial distribution and vertical variation of arsenic in Guangdong soil profiles, China. Environmental Pollution 144, 492-499.
lP
Zhang, W.; Goh, A.T.C. Multivariate adaptive regression splines and neural network models for prediction of pile drivability. Geoscience Frontiers 7 (2016) 45-52.
na
https://doi.org/10.1016/j.gsf.2014.10.003. Zhou, Y., Niu, L., Liu, K., Yin, S., Liu, W., 2018. Arsenic in agricultural soils across
Jo ur
China: Distribution pattern, accumulation trend, influencing factors, and risk assessment.
Sci.
Total
Environ.
616–617,
156–163.
doi:10.1016/j.scitotenv.2017.10.232 Zhu, Y.; Weindorf, D.C.; Zhang, W., 2011. Characterizing soils using a portable X-ray fluorescence spectrometer: 1. Soil texture. Geoderma, 167–168, 167–177. doi:10.1016/j.geoderma.2011.08.010
Jo ur
na
lP
re
-p
ro
of
Journal Pre-proof
Figure 1. Location of soil samples in different Brazilian biomes.
44
lP
re
-p
ro
of
Journal Pre-proof
Jo ur
na
Figure 2. Framework of accuracy assessment of As content inference models.
45
Jo ur
na
lP
re
-p
ro
of
Journal Pre-proof
Figure 3. Importance of environmental covariables according to the overall values of Bagged Tree.
46
Jo ur
na
lP
re
-p
ro
of
Journal Pre-proof
Figure 4. Importance of environmental covariables according to Random Forest: a) mean decrease accuracy, and b) mean decrease in Gini.
47
48
lP
re
-p
ro
of
Journal Pre-proof
Figure 5. Spatial distribution of the four natural groups (G) based on the most important
Jo ur
groups.
na
environmental covariates and soil As content variation (mean ± standard errors) within
Journal Pre-proof
49
Table 1. Numerical environmental covariables used in As content predictive models. Scorpan soil Environmental covariate
Mean (range)
forming
Additional details
factor -1
464 (0–870)
Soil
Sand (g kg-1)
464 (0–958)
Soil
Organic carbon (cmolc kg-1)
18.8 (1.8–181.4)
Soil
Ca2+ + Mg2+ (g kg-1)
4.71 (0.02-36.00)
Soil
K+ (cmolc kg-1)
0.21 (0.01-0.98)
Soil
Al3+ (cmolc kg-1)
1.02 (0-11.20)
P (mg kg-1)
8.03 (0-121.00)
Soil
42 (2-100)
Soil
ro
Soil
-p
BS (%)
of
Clay (g kg )
11.6 (1.4-66.9)
Soil
Zn (mg kg-1)
0.82 (0.01-8.42) 0.14 (0.00-1.05)
Soil
-1
0.21 (0.00-7.16)
Soil
Cu (mg kg-1)
0.55 (0.00-10.40)
Soil
Al2O3 (mg kg-1)
28376 (771-15908)
Soil
45183 (175-
Soil
Jo ur
Fe2O3 (mg kg-1)
na
Ni (mg kg-1)
TiO2 (mg kg-1)
Soil
lP
Cr (mg kg )
re
CEC (cmolc kg-1)
Traditional lab analyses
275530)
8109 (494-47901)
Soil
1870 (56-36912)
Soil
3352 (17-37312)
Soil
P2O5 (mg kg-1)
4547 (156-22062)
Soil
MnO (mg kg-1)
472 (8-3283)
Soil
Proximal sensor:
V (mg kg-1)
88 (0-2623)
Soil
pXRF analysis
Cr (mg kg-1)
74 (0-1783)
Soil
Ni (mg kg-1)
15 (0-25)
Soil
Cu (mg kg )
25 (0-193)
Soil
Zn (mg kg-1)
48 (12-138)
Soil
Rb (mg kg-1)
37 (0-359)
Soil
Sr (mg kg-1)
31 (0-913)
Soil
CaO (mg kg-1) K2O (mg kg-1)
-1
Journal Pre-proof
50
180 (0-1931)
Soil
Hf (mg kg-1)
4 (0-22)
Soil
Y (mg kg-1)
12.81(0.11-102.73)
Soil
-1
La (mg kg )
29.88 (1.32-109.54)
Soil
Ce (mg kg-1)
47.12 (0.08-214.79)
Soil
Pr (mg kg-1)
5.87 (0.14-26.16)
Soil
Nd (mg kg-1)
37.66 (0.88-199.93)
Soil
Sm (mg kg-1)
3.56 (0.01-19.94)
Soil
Eu (mg kg-1)
0.76 (0.02-5.12)
Soil
Gd (mg kg-1)
2.49 (0.06-10.23)
Soil
Tb (mg kg-1)
0.39 (0.00-2.22)
Dy (mg kg-1)
2.82 (0.01-16.08)
Soil
Rare earth
Ho (mg kg-1)
0.28 (0.01-2.11)
Soil
elements
Er (mg kg-1)
0.83 (0.01-6.44)
Soil
Tm (mg kg-1)
0.16 (0.01-0.80)
Yb (mg kg-1)
0.90 (0.00-7.37)
Soil
of
Ba (mg kg-1)
0.18 (0.00-0.71)
Soil
124.97 (3.71-
Soil
lP
Jo ur
HREE (mg kg-1)
Soil
na
Lu (mg kg-1) LREE (mg kg-1)
re
-p
ro
Soil
445.01)
8.04 (8.038)
Soil
Annual Mean temperature (˚C)
Accumulated rainfall (mm)
Climate
Raster format in 30 arc s of
Climate
resolution (1 x 1 km at the Equator) Calculalted in
Road density
Organisms
GIS, 90m of resolution SRTM-DEM, 90m
DEM
Relief
Slope
Relief
Channel network base level
Relief
Calculated from
Terrain surface
Relief
DEM, 90m of
of resolution
Journal Pre-proof
51 Relief
Valley depth
resolution
Climate, Relief
Wind Exposition
Table 2. Summary information of categorical environmental covariables and their classes of occurence. Environmental Classes of ocurrence Scorpan soil covariate
forming factor
Geology
Acid igneous rocks (granite, gneiss, dacite) and plinthic sediments; basic igneous rocks (basalt,
of
tuffite, melaphore, andesite, gabbro, noritic, and
ro
trachy); metamorphic rocks (anatexite, gneiss
biotite, gneiss, metasiltite, metashale, slate, shale,
-p
metabasite, migmatite, micaschist, schist,
Parent material
micaceous gneiss, quartzite, calciferous quartzite,
re
phyllite); sedimentary rocks (arcose, politic rocks,
lP
claystone, conglomerate, limestone, sandstone, siltstone, sandy and clayey sediments;
na
unconsolidated sediments; organic sediments Soil types
Acrisols, Arenosols/ Leptosols/ Regosols,
Jo ur
Cambisols, Ferralsols, Fluvisols, Gleysols, Histosols, Luvisols, Nitisols, Planosols,
Soil
Plinthosols, Podzols, Vertisols
Land use Biomes
Agriculture and reference soils Amazon Rainforest, Atlantic Forest, Caatinga, Cerrado, Pantanal
Local relief
Plain (0-3%), gently undulated (3-8%), undulated (8-20%), strongly undulated (20-45%)
Soil dranaige
Somewhat excessively drained, well drained, moderately well drained, and poorly drained
Temperature
Af (Tropical without dry season), Am (Tropical
according to
monsoon), As (Tropical with dry summer), Aw
Köppen
(Tropical with dry winter), BSh (Dry semi-arid
classification
low latitude and altitude), Cfa (Humid subtropical
Organism Organism
Relief
Soil
Climate
Journal Pre-proof oceanic climate, without dry season with hot summer), Cfb (Humid subtropical oceanic climate, without dry season with temperate summer), Csa (Humid subtropical with hot-dry summer), Csb (Humid subtropical with dry summer and temperate summer), Cwa (Humid subtropical with dry winter and hot summer), Cwb (Humid subtropical with dry winter and temperate summer), and Cwc (Humid subtropical
Jo ur
na
lP
re
-p
ro
of
with dry winter and short and cool summer)
52
Journal Pre-proof
53
Table 3. Arsenic concentrations (mg kg-1) of agriculture and reference topsoil soils of different geologies. Acid igneous rocks Basic igneous rocks Metamorphic rocks Sedimentary rocks Unconsolidated sediments reference agriculture reference agriculture reference agriculture reference agriculture reference agriculture n 6 8 5 11 12 17 17 31 22 30 min 0.14 0.31 0.57 0.95 0.53 0.38 1.27 0.71 0.20 0.28 Q25 0.22 1.23 0.80 1.92 0.95 1.27 1.93 3.40 0.60 0.86 Gm 0.73 2.93 2.27 3.08 2.21 2.50 4.85 8.07 1.58 2.35 mean 2.29 5.52 3.30 3.72 4.43 4.02 7.98 13.92 3.24 5.36 median 0.69 2.72 3.06 2.76 1.80 2.31 4.46 7.64 1.47 1.56 Q75 3.51 12.00 5.91 5.30 4.14 5.90 9.85 21.4 4.14 7.07 max 11 15 7.12 9.23 24 13.1 41.1 58.3 16.1 25.7 SD 4.28 5.82 2.70 2.42 6.79 3.82 9.91 14.37 4.19 7.13 CV 186.5 105.5 81.9 65.0 153.1 94.9 124.2 103.2 129.1 132.9 variance 18.3 33.9 7.3 5.8 46.0 14.6 98.3 206.4 17.5 50.9 skewness 2.41 1.03 0.57 1.22 2.62 1.28 2.67 1.50 1.95 1.71 kurtosis 5.86 -0.69 -0.97 1.48 7.07 0.73 8.00 1.99 3.38 1.97 EF 0.7±0.9 1.0±1.3 0.3±0.3 0.4±0.3 0.8±1.1 0.7±0.6 1.6±1.7 2.5±2.7 1.1±4.3 0.6±1.3 n.: number of samples, min: minimum concentration, Q25: 25th percentile of the data set, Gm: geometric mean, Q75: 75th percentile of the data
f o
l a
set,
max:
maximum
concentration,
o J
n r u
SD:
standard
(Assample/Ascrustal)/(Scsample/Sccrustal) with standard deviation.
deviation,
o r p
e
r P CV:
coefficient
of
variation,
and
EF:
Enrichment
factor
=
Journal Pre-proof
54
Table 4. Best parameter optimization values of soil As content predictive models and their respective accuracy. Models GLM Ridge regression
R2
Optimized parameter
RMSE
MAE
R2
RMSE
MAE
Internal validation
External validation
No tuning
0.76 0.1549 0.1200
0.45 0.2839 0.2248
lambda = 0.001575072
0.78 0.1465 0.1140
0.47 0.2703 0.2143
0.76 0.1538 0.1228
0.52 0.2628 0.1875
0.81 0.1550 0.1118
0.67 0.2263 0.1786
f o
o r p
MARS
n prune = 13; degree = 1
Bagged Tree
No tuning
Cubist
Committees = 5; neighbor = 3 nrounds = 938; max_depth = 3; eta = 0.1372147; gamma = 2.385319;
0.99 0.0357 0.0272
0.61 0.2283 0.1806
XGBoost
colsample_bytree = 0.3090273; min_child_weight = 19; subsample =
0.28 0.2894 0.2285
0.47 0.3203 0.2538
0.95 0.0954 0.0682
0.68 0.2198 0.1699
n r u
0.4185117 RFE Random Forest
l a
mtry = 19; ƞtree = 1000
o J
r P
e
GLM - Generalized Linear Model with Stepwise Feature Selection; MARS – Multivariate Adaptive Regression Spline; XGBoost – Extreme Gradient Boosted Tree; RFE - Recursive Feature Elimination.
Journal Pre-proof
55
Table 5. Mean and standard error of As content and selected environmental covariates within natural groups. Group 1
Group 2
Group 3
Group 4
covariate
(n=56)
(n=3)
(n=63)
(n=37)
As (mg kg-1)
2.24 ± 0.50
3.87 ± 2.05
7.14 ± 1.30
11.97 ± 1.62
Sand (g kg-1)
793 ± 16
-
374 ± 22
170 ± 19
Clay (g kg-1)
104 ± 9
-
366 ± 17
632 ± 20
SOC
7.9 ± 0.65
166.2 ± 20.08
17.6 ± 1.32
24.8 ± 1.63
TiO2 (mg kg )
3790 ± 425
6300 ± 1598
6735 ± 376
17054 ± 1759
Temperature (°C)
25 ± 0.24
21 ± 0.67
24 ± 0.27
18 ± 0.43
-1
of
Environmental
ro
Values are the mean ± standard error. Column colors refer to natural group‟s legend in
Jo ur
na
lP
re
-p
the Figure 5.
-p
ro
of
Journal Pre-proof
Jo ur
na
lP
re
Graphical abstract
56
Journal Pre-proof
57
Highlights
Jo ur
na
lP
re
-p
ro
of
Arsenic concentrations were quantified in topsoil from Brazilian soils As was accurately predicted by Random Forest and environmental covariables Agricultural soils do not promoted relevant As enrichment as expected Temperature, organic carbon, clay, sand, and TiO2 were variables ranked in As modeling Four natural groups were suggested, working as environmental indicators of As content