Australia-wide predictions of soil properties using decision trees

Australia-wide predictions of soil properties using decision trees

Geoderma 124 (2005) 383 – 398 www.elsevier.com/locate/geoderma Australia-wide predictions of soil properties using decision trees B.L. Hendersona, E...

766KB Sizes 0 Downloads 47 Views

Geoderma 124 (2005) 383 – 398 www.elsevier.com/locate/geoderma

Australia-wide predictions of soil properties using decision trees B.L. Hendersona, E.N. Buib,*, C.J. Moranb, D.A.P. Simonb a

CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia b CSIRO Land and Water, GPO Box 1666, Canberra, ACT 2601, Australia Received 9 September 2003; received in revised form 4 June 2004; accepted 4 June 2004 Available online 11 August 2004

Abstract This paper describes the construction of Australia-wide soil property predictions from a compiled national soils point database. Those properties considered include pH, organic carbon, total phosphorus, total nitrogen, thickness, texture, and clay content. Many of these soil properties are used directly in environmental process modelling including global climate change models. Models are constructed at the 250-m resolution using decision trees. These relate the soil property to the environment through a suite of environmental predictors at the locations where measurements are observed. These models are then used to extend predictions to the continental extent by applying the rules derived to the exhaustively available environmental predictors. The methodology and performance is described in detail for pH and summarized for other properties. Environmental variables are found to be important predictors, even at the 250-m resolution at which they are available here as they can describe the broad changes in soil property. D 2004 Elsevier B.V. All rights reserved. Keywords: Digital soil mapping; Soil pH; Spatial data mining; Spatial modelling

1. Introduction Data on soil water and organic matter content are critical inputs to continental and global simulations of climate, biogeochemical cycles, and vegetation and their response to change. Soil is the largest terrestrial store for C (Post et al., 1990; Batjes, 1996); initialization of soil moisture is critical to global climate model results (Walker and Houser, 2001;

* Corresponding author. E-mail address: [email protected] (E.N. Bui). 0016-7061/$ - see front matter D 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.geoderma.2004.06.007

Leese et al., 2001). Typically, soil depth, texture, bulk density, available water capacity, organic matter content, and pH are used to characterize the land surface; the soil property maps are derived by linking soil mapping units with look-up tables (Bliss, 1990; Kern, 1994; Kittel et al., 1995; Batjes, 1997; Reynolds et al., 2000). The Australian Soil Resources Information System (ASRIS) project was commissioned as part of an Australian government-funded initiative (the National Land and Water Resources Audit, http://audit.ea.gov. au/ANRA/) to assess the baseline state of the environment. It produced a national soils database of existing

384

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

data relating to soil and land resources from the State and Territory natural resource management agencies and the Commonwealth Scientific and Industrial Research Organisation (CSIRO). Additionally, it created Australia-wide estimates of soil properties such as pH, organic carbon, total nitrogen, total phosphorus, texture, thickness, % clay, at both surface and sub-surface level, for the more intensively used agricultural regions. This extent is illustrated by the shading in Fig. 1. The ASRIS extent is large. At the 250-m resolution at which soil properties were modelled predictions were required at over 43 million pixels. These predictions were, however, later aggregated to approximately a 1.1-km (0.018) resolution for presentation. Johnston et al. (2003) provide a detailed description of the construction of the ASRIS database and the soil attributes. This paper documents the approach taken to the spatial modelling, and describes both the successes and many of the issues encountered in modelling at a 250-m resolution over such a large extent. The predictive model for layer 1 (topsoil) pH is presented as an indicative example.

2. Methods 2.1. Model strategy The soil factor equation (Jenny, 1941) has widespread acceptance amongst the soil science community. It asserts that soil characteristics are determined by five soil-forming factors, namely: parent material, climate, organisms, relief and time. That is, soil = f ( pm, climate, organisms, relief, time) These five soil-forming factors might be collectively termed the environment for greater brevity. Hudson (1992) states that this soil factor equation, bis simply a general statement implying that soils are natural bodies that are distributed in a predictable way in response to a systematic interaction of environmental factorsQ. If the nature of the relationship between the soil and these environmental factors can be established, then that relationship provides a vehicle to infer soil properties from knowledge of the local environment.

Fig. 1. Locations of point observations in the ASRIS database. The shaded area corresponds to the ASRIS extent.

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

This provides a natural complement to extensive survey work, which can be a costly exercise for large areas. These relationships are investigated by relating existing point observations to environmental predictor variables at those sites. A statistical model that captures the essence of the soil factor equation can then be constructed to relate the property of interest to (some of) the environmental variables available. If these environmental variables are exhaustively available throughout the region of interest the model can then be used to extend predictions of the soil property from the existing point data set to the entire region. The success of this approach relies on the strength of the relationship between the property and environmental variables available. Where these relationships are strong, the model will also be strong and the associated region-wise predictions given more confidence. Moreover, the benefit relies on these environmental variables being less costly to obtain than commissioning the collection of soil profile data and using some surface interpolation or smoothing procedure. This type of approach has been well investigated in the literature. Much of the early development of the soil factor equation can be attributed to Jenny and Leonard (1934) and Jenny (1941) who sought to validate the soil factor equation by relating changes in soil properties to changes in climate across a West– East transect in the United States. Since then there has been a large body of literature that relates soil properties to landscape characteristics (e.g., Walker et al., 1968; Kreznor et al., 1989; Bell et al., 1992; Odeh et al., 1994). Computer technology has made environmental variables more readily available and facilitated a greater pursuit of quantitative relationships with soil properties. For example, Moore et al. (1993) investigates quantitative terrain attributes for soil property modelling, and Gessler et al. (1995) and McKenzie and Ryan (1999) both investigate statistical models for soil properties such as organic carbon and solum depth using digital environmental predictors. McBratney et al. (2003) thoroughly review these approaches to soil mapping. There are, however, several main distinctions with taking this approach with the ASRIS database. Firstly,

385

the 250-m resolution is notably coarser than the resolution at which others have worked. Some environmental variables that are useful predictors at less than 25 m may not be so at 250 m. Moreover, at 250 m, the within-pixel variation for some properties may constitute a considerable proportion of the total variation, given expectations of local soil variability. The second distinction is that the size/extent of the problem is large. There are over 43 million pixels at the 250-m resolution. Typically, this approach has been employed in small catchments, which also have the virtue of being considerably more homogeneous. It is well known that both the pattern and process of variation will vary with different physiographic domains, and indeed that soil properties themselves will exhibit different ranges of variation. Our focus is, however, necessarily on the long-range variation given our goal is to produce Australia-wide soil property predictions. Finally, the ASRIS data derives from many sources and are collected with a range of underlying intents, as compared to what is typically a single survey in similar studies. There is also a disparity in the sample sizes available. Gessler et al. (1995) and McKenzie and Ryan (1999) base soil property models on at most a few hundred points, whereas there are at least several thousand points available from the ASRIS database for this modelling. That said, the sampling intensity relative to the extent is in all likelihood worse for the ASRIS data than in other applications of this methodology. Moreover, the ASRIS data is subject to a greater variation in this intensity, with some areas only sparsely sampled, and others intensively surveyed. 2.2. Data The database compiled consisted of 135 490 point observations of varying completeness across the properties of interest. The representation by State was as follows: New South Wales, 17 885; Queensland, 48 998; South Australia, 22 085; Tasmania, 3754; Victoria, 2674; Western Australia, 38 701; and CSIRO, 2023. Considerable time was invested in ensuring all sources had similar data standards and in cleaning the data after delivery. Some conversions between States and Territories were required to enable Australia-wide comparisons and modelling. The

386

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

CSIRO data were collected throughout Australia, though it had the strongest representation for New South Wales, the Northern Territory, and to a lesser extent southeastern Victoria. The topsoil and subsoil are termed layers 1 and 2, respectively. More specifically, layer 1 was considered the first A horizon, and layer 2 the first B horizon, recorded. If no horizon designators were available, the first 30 cm was deemed the first A horizon, and thus layer 1. The date at which the samples were taken or analysed was not considered due to the relative incompleteness of those fields in the database. The locations of sampled points are given in Fig. 1. The point density makes it clear that there are both areas of high density where intensive local surveys have been undertaken and sparse areas where only exploratory surveys have been commissioned. 2.3. Environmental predictors A large number of environmental variables were considered to offer a useful quantification of the environment and be potential predictors of soil properties. A suite of terrain variables were derived from a continental 9-in. digital elevation model (DEM) and the drainage network to capture the landscape features and function. These included: elevation, deposition path length, erosion path length, relative elevation, relief, slope percent, hillslope length, slope position, river distance, ridge distance, contributing area, inverse contributing area, transport power in, and transport power out. Nineteen climatic surfaces were derived from the Queensland Department of Natural Resources and Mines (QNR&M) climate data, following the terminology and methods of Houlder et al. (1999). They were nominally available at a resolution of 5 km, though they were bi-linearly interpolated and resampled to a 250-m resolution. These included: mean diurnal change, isothermality, temperature seasonality, maximum temperature of warmest period, minimum temperature of coldest period, temperature annual range, annual precipitation, precipitation of wettest period, precipitation of driest period, precipitation seasonality, annual mean radiation, highest period radiation, lowest period radiation, radiation

seasonality, annual mean moisture index, highest period moisture index, lowest period moisture index, and moisture index seasonality. Johnston et al. (2003) gives a more detailed description of these terrain and climatic attributes. The current climate pattern has been in place over the last 9000 years (Kershaw, 1978) but past climate patterns may have influenced some current soil properties. Soil properties related to organic matter content respond rapidly—on the order of decades to centuries—to changing climate (Hugget, 1995; Batjes, 1996) and have likely adjusted to current climate. A number of other environmental variables were also available. These included: LANDSAT multispectral scanner (MSS) data (three bands in the visible electromagnetic spectrum and one infrared band); two lithology coverages, namely a 1:250 000 and a 1:2 million scale from the Atlas of Australian Soils (Northcote et al., 1968; Bureau of Rural Sciences, 2000); a 1:1 million scale land use coverage; and a 1:2 million scale digitized version of the Australian Soil Classification (ASC) from the Atlas of Australian Soils (Northcote et al., 1968; Bureau of Rural Sciences, 2000). Other predicted soil properties were used in some models. These were typically polygon surfaces, created by associating a soil taxonomic class to a soil property look-up table (McKenzie et al., 2000) with a polygonal soil map (Carlile et al., 2001). Some prediction surfaces from other point models were also used for related properties, e.g., a layer 1 organic carbon was used as a predictive surface in the layer 2 organic carbon model. Some of these predictor variables were initially available at different scales. For example, MSS was originally available at a nominal 80-m resolution, the climate surfaces at a 5-km resolution and the digitized ASC on a 1:2 million scale. This was not seen as a problem given soil-landscape processes operate over a range of scales, and that consequently, variation in soil properties may be explained by different factors at short and long ranges. 2.4. Statistical modelling There were a number of potential modelling techniques available. Moore et al. (1993) used multiple linear regression models to predict a number of

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

soil properties (A-horizon depth, organic matter content, extractable phosphorus, pH, % sand, % silt). Gessler et al. (1995) used the more flexible framework of generalized linear models to develop models for horizon thickness, solum depth and the presence/absence of E horizon. Gessler (1996) extended this flexibility by using generalized additive models which allow smooth functions of predictors to be included in the linear component of the generalized linear model. McKenzie and Ryan (1999) used tree-based methods to model soil depth and total soil carbon. Voltz et al. (1997) compared a number of geostatistical techniques including kriging combined with soil mapping as support to map soil water content at wilting point. A decision tree-based methodology, through the modelling tool Cubist was adopted for the ASRIS point modelling of continuous soil properties. Cubist has origins in the machine learning literature as a predictive modelling tool. It is the commercial successor to a technique known as M5 (Quinlan, 1992, 1993b) which builds model trees using a decision tree methodology. It is available from http://www.rulequest.com/. The approach taken is similar to regression trees in CART (Breiman et al., 1984) but builds linear models rather than values on the leaves. The regression models it constructs are thus piecewise linear models. The Cubist model tree is, however, converted to a series of rules, each with an associated linear model, to facilitate ease of interpretation. This involves some simplification of the set of rules derived from following the path from the root node to each leaf. Each Cubist rule is then of the form if fconditionsg then linear model For example, if slopeba annual mean temperatureNb lithology=( p, q) then Property=c 1elevation+c2mss2+. . . where a and b are numerical values and p and q denote categories of lithology and c 1 and c 2 are coefficients of the linear model.

387

If the predictor variables associated with an observation satisfy the set of conditions, the linear model is used to predict the response. It is possible that any one observation and its associated predictor variables may satisfy more than one rule set in which case the average of the predictions is taken as the overall prediction. Cubist uses a recursive partitioning of the predictor variable space in a similar way to the regression tree methodology of CART (Breiman et al., 1984). Both methods take a divide-and-conquer strategy and seek to minimise the intra-subset variation at each node. The criteria used to split a node is, however, slightly different because where CART uses the variance, Cubist uses the standard deviation as a measure of error. The reduction in error as a result of splitting a node is given by X jTi j  sdðTi Þ Derror ¼ sdðT Þ  jT j i where T denotes the training cases available at that node, Ti represents the subset of those cases with the ith outcome following a given split, and | | denotes the count. The standard deviation of the response values is calculated for T and each subset Ti . The Derror then represents the expected reduction in error as a result of that split. Cubist chooses the split so as to maximise the expected reduction in error rate across all potential splits. The advantage of the condition set in each rule is that they enable interactions to be handled automatically by allowing different linear models to capture the local linearity in different parts of the predictor variable space. This can often lead to smaller trees and better prediction accuracy than regression trees (Quinlan, 1992; Uysal and Gu¨venir, 1999). The size of the decision tree can also be controlled by two parameters in Cubist, an optional constraint on the minimum number of observations upon which to base a rule and a brevity factor, which uses heuristics to control the complexity of the model. In general, other advantages of tree-based regression are that they can handle missing values, can use continuous and categorical predictors, are robust to predictor specification, and make very limited assumptions about the form of the regression model. Moreover, for large data sets, a strength is that they do have the potential to uncover relatively complex structure

388

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

that may be difficult to detect with more conventional regression tools. In recent times, the development of the btree-averagingQ techniques like boosting and bagging (Friedman et al., 2000; Breiman, 1996) which fit multiple trees in making any predictions or classifications have offered potential improvements to the predictive power of decision trees. Possible disadvantages that need to be considered are: that there is no preference for low-order interactions and that they often do not make efficient use of the data and as such may not be useful for small data sets. They are non-backtracking, in that once a partition is made, there is no consideration of alternative choices in the sub-tree (the implication of this is that each split is optimal and not the entire tree). Tree-based regression models are also not as well developed in terms of statistical inference as generalized additive models for instance. Computationally, Cubist was found to be very efficient at deriving the predictive models and extending the predictions to the entire ASRIS extent. This was seen as a particularly important quality given the magnitude of the prediction problem. 2.5. Variable selection The environmental variables used in Cubist were selected after considering utility across a number of investigative measures. The individual predictive power of the environmental variables was investigated through examining correlations with the property of interest, and by considering the reduction in deviance from fitting additive models with a smooth function of the each environmental variable as a predictor in turn. Those variables with largest reduction in deviances were deemed most useful. A forward stepwise linear regression procedure was implemented to consider multivariate relationships and the most important predictors identified. A regression tree model was fitted in S-PLUS using the RPART (Therneau and Atkinson, 1997) routine in an attempt to allow non-linearity in these multivariate relationships. The most important predictors to the RPART model were similarly identified. For categorical predictors, an ANOVA using the predictor as the sole explanatory variable was used to assess variable importance.

Cubist does not directly provide a measure of predictor importance. The set of rules can, however, be examined and the frequency of the individual predictors in the conditioning and the linear model used as a measure of the utility of the predictor. Strong correlations between predictor variables over the 135 490 points in ASRIS database were identified as possible sources of collinearity. This was of some concern in the linear model at each leaf in Cubist but not of concern elsewhere because collinearity is only a source of non-robustness in that it can generate alternative splits that perform almost as well. The final decision on what environmental variables to include as predictors in the Cubist models was made by subjectively pooling all assessment measures considered. 2.6. Model validation and assessment Models were considered with a 70:30 training to test data split. Seventy percent of the observations were used to construct the model in the model development stage. Thirty percent were held back in order to assess the performance of each model. In view of the unequal representation from the States/ CSIRO this 70:30 split was maintained within each state by treating each State and CSIRO as a stratum. Once the strongest possible model according to performance on the test data was identified, it was refitted using all the data to maximize the use of this sparse data set with the same model form and options. This model was then used to make predictions for the entire extent and thus create a map of the predicted soil property. The performance of the model on the full data set was assessed by 10-fold cross validation. This involved randomly dividing the data into 10 partitions or folds. At each step, nine of these partitions were used to fit the model and the performance assessed on the remaining partition held back as the test data. This procedure was repeated for each partition sequentially. The performance, averaged over all 10 partitions held back, delivers the crossvalidated performance assessment. The performance of models was assessed in terms of a number of key indicators. These included the number of points used in the model, the R 2, the (rank) correlation and the RMSE (root mean square error), which gives an estimate of the standard deviation of

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

the errors. A lower RMSE is associated with greater predictive ability. RMSE values cannot, however, be compared between different properties because they depend critically on the scale used. For a validation data set (test data) of size m, the RMSE is defined as vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi uX  2  u m RMSE ¼ t yj  yˆ j m j¼1

where yˆj denotes the predicted value. Two additional measures are used. Firstly, the average error which gives the average absolute difference between the observed and predicted values, i.e., average error ¼

m 1 X jyj  yˆ j j: m j¼1

Lower average errors tell us that the predicted values are closer to the observed values more often. The magnitude of the absolute error does, however, depend critically on the scale of the units used and so cannot generally be compared between models. The average error is also known as the mean absolute deviation. Secondly, the relative error, which is defined as the ratio of the average absolute error magnitude to the average error magnitude that would result from predicting the mean value, i.e., 1 m

relative error ¼

1 m

m P j¼1 m P

jyj  yˆ j j jyj  y¯ j

j¼1

is used. If there is little improvement on the mean, the environmental variables have little predictive capacity and the relative error is close to 1. Generally, the smaller the relative error, the more useful the model. 2.7. Sampling issues and model biases The ASRIS database is not a homogeneous quantity. There are many sources of variation in the database. Differences due to sampling regimes, laboratories, agencies, methods and techniques, the level of accuracy recorded, quality of the sample, measurement error, handling of non-response, and the date at which the sample are taken all contribute to the variation in the properties we see.

389

There is a potential for some of these factors to introduce biases into the soil property models because differences in some of these factors may be confounded with changes in property values. Corrections can be made for some factors to reduce this possibility, such as method calibrations (e.g., pH in water and CaCl2) (Henderson and Bui, 2002). For the most part, however, the database is a compilation of existing data, and there is limited scope to isolate or negate the effect of these factors on the derived soil property models. One of the larger expected sources of model error here is due to the heterogeneity in the sampling regimes. De Gruitjer (2000) provides a good review of statistical methods in sampling for spatial inventory and monitoring of natural resources. de Gruijter makes the distinction between three modes of point sampling: convenience, where those points selected are governed by being more readily accessible or measured, e.g., near road sides; purposive, where the points selected are chosen so as to address a particular question, e.g., monitor acidification; and probability, where the points are allocated at random locations. Under probability sampling, the selected points are necessarily representative because each point in the region has an equal chance of occurring (under simple random sampling). The danger of the first two modes, which make up the great majority of the ASRIS database, is that they introduce potentially sizable biases into the modelling because the data do not adequately represent all physical or environmental domains into which we are making predictions. Moreover, there is possibly even a preference for atypical or unrepresentative points in some instances. This might occur for example if an area is largely considered bknownQ and there appears greater value in selecting points so as to characterise those parts of the area which are more unusual. The derived point models are internally validated here, either via a training/test split or via crossvalidation. This provides an assessment of performance against the data we have observed but does not infer how well the models might perform across the whole extent because the data is necessarily a biased sample. This type of assessment can only come from a representative validation data set. Historically, field observations are located according the surveyors best judgement. Unfortunately, the

390

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

precise locations or the schemes used are not always identifiable. While this has improved to some extent with the more recent use of model-based and designed-based studies (e.g., Gessler et al., 1995; McKenzie and Ryan, 1999), it needs to be adopted more pervasively if we are to be more assured of the quality of our inference in such soil property models in the future. 2.8. Spatial dependence No direct account was made of the spatial structure of the point locations: It was assumed that the spatial dependence was largely implicitly accounted for by the environmental predictors. Residual spatial dependence was investigated through variogram analysis Cressie (1993). Variograms provide a popular measure of spatial dependence, illustrating how the observed data are related as a function of distance. If there is spatial dependence, observations that are closer together are typically more alike. The empirical variogram plots half the average squared difference between observations (the semi-variance) for observations that are a given distance apart. When there is spatial dependence, these squared differences are generally smaller. The semi-variance consequently usually demonstrates an increasing phase, before levelling off at a distance when there is no longer spatial dependence. 2.9. Spatial implementation Cubist models can be described in terms of a series of rules. These rules were implemented through a combination of purpose written C++ code and GIS software to produce the digital maps. Cubist incorporates facilities with which to make predictions for unknown observations, but it was more efficient to take the approach adopted for this spatial context.

3. Results 3.1. Example: pH in layer 1 Laboratory assessed soil pH was one of the most well-populated soil property fields in the database. Given measured pH values were known to be

dependent on the assessment method and the soil to solution ratio, only methods 4A1 (pH in water, soil/ solution ratio 1:5), 4B1, and 4B2 (pH in 0.01 M CaCl2, soil/solution ratio 1:5) were considered in the point modelling. For a full description of these methods, see Rayment and Higginson (1992). Measurements in water were converted to equivalent recordings in CaCl2 according to the calibration equation detailed in Henderson and Bui (2002). This was necessary because pH assessed in CaCl2 is known to be up to a unit lower than pH in water, though this magnitude differs with the pH level. There were 25 915 layer 1 pH measurements available in the ASRIS database. Table 1 breaks these down over the States/CSIRO and methods. The column labels of min and max refer to the minimum and the maximum respectively, while q10, q25, q50, q75, and q90 denote the 10th, 25th, 50th, 75th, and 90th quantiles in turn. The histograms of layer 1 pH by State are given in Fig. 2. The most obvious features are the large number of counts for Western Australia, their relative acidity, and the relative alkalinity of South Australia. The locations of the layer 1 pH observations used in the modelling are given in Fig. 3. There is a good dispersion of points across the states, though clearly a large degree of variation in the sampling intensity. A Cubist piecewise linear model was fitted to these data. The variables were selected subjectively after appealing to individual correlations with pH, individual variable GAMs, multiple linear regression and RPART regression tree analysis. After these prelimiTable 1 Distribution of layer 1 pH by State/method min

q10

q25

q50

q75

q90

Max

N

State NSW QLD SA TAS VIC WA CSIRO

2.8 3.3 3.7 3.4 3.5 3.2 3.2

4.0 4.6 5.5 4.0 4.4 4.1 4.1

4.3 5.1 6.5 4.4 4.9 4.4 4.5

4.8 5.8 7.7 4.9 6.5 4.7 4.9

5.5 6.5 8.4 5.4 7.7 5.1 5.9

6.4 7.3 8.8 5.9 7.9 5.7 6.8

8.7 8.6 10.2 8.2 8.7 8.5 8.0

2591 4642 3930 656 308 11 808 980

Method 4A1 4B1 4B2

2.8 2.8 3.2

4.3 4.0 5.3

4.8 4.4 6.5

5.5 4.7 7.8

6.6 5.0 8.5

7.6 5.7 8.8

8.8 9.3 10.2

10 848 11 978 3089

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

Fig. 2. Histogram of layer 1 pH by State/CSIRO.

Fig. 3. Locations of layer 1 pH observations used in the modelling. The shaded area corresponds to the ASRIS extent.

391

392

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

nary analyses, those variables expected to be the most important were MSS band 2, several of the moisturerelated climate variables (mean moisture index, precipitation), and the temperature range variables. Thirty variables were used: 11 climatic, 3 MSS bands, 14 terrain, lithology, and land use. A total of 24 319 points were available inside the ASRIS extent and with all environmental predictors. The initial model was constructed with a train/test split of 70:30. A parsimonious model was sought by requiring that any rule be based on at least 2% of the cases (approximately 250 for the training data). Some control over the simplicity of the model was also governed by the dbrevity factorT in Cubist, which uses heuristics to control the complexity of the final model. Care was taken not to overfit the model by identifying models with performance on the test data that closely paralleled the performance on the training data under repeated application. The premise for this is that nonparsimonious models often perform well on the training data but notably worse on the test data because some of the rules are typically responding to random variation rather than actual structure. The performance of the chosen model on the test data is given in Table 2. The Cubist implementation of the btree averagingQ methodology was tried as a possible improvement to the existing model. The model diagnostics were, however, only found to change negligibly for a notably greater computational load in making predictions. It was not pursued further. The model was then refitted using the same model options and variables on all 24 319 observations. Twenty-seven rules were used in the model. Tenfold cross-validation was performed on this model to judge Table 2 Performance of pH in layer 1 model by State/CSIRO State/ CSIRO

N

Rank correlation

Relative error

Average error

Approximate RMSE

NSW QLD SA TAS VIC WA CSIRO

2465 4329 3830 522 1283 10 935 955

0.63 0.66 0.45 0.51 0.81 0.55 0.74

0.72 0.70 0.87 0.85 0.46 0.85 0.62

0.55 0.57 0.90 0.45 0.59 0.43 0.52

0.72 0.73 1.12 0.63 0.75 0.62 0.70

The rank correlation, relative error, average error, and approximate RMSE are described in Section 2.6.

performance (average error 0.56; relative error 0.51; correlation 0.82). The relative performance of the fitted model for all the layer 1 pH data across the individual States can be assessed in Table 2. Note that all measures are calculated within the State with the predicted values given by the Australia-wide model. For example, the sample mean in the relative error refers to the average of the measurements within the appropriate State. The performance of this model on the test data is, however, summarized better graphically in Fig. 4. A maximum of 2500 points are plotted per panel for clarity. If there were more, a random sample of 2500 points was taken. While there remains considerable unexplained variation, the model clearly demonstrates valuable predictive ability. The most notable features in these plots are the clustering of predictions at low pH and some observed low pH values that the model notably over-estimates. The latter is especially true of South Australia where the model does poorly at predicting values with lower pH (See Fig. 4, SA panel plot). A more spatially distributed estimate of the predictive uncertainty is discussed in Henderson et al. (2001) which details how these relative errors, taken over smaller regions than states, can be combined with information on the point density and the degree to which the local environment is represented in the database. An examination of the 27 Cubist rules for pH layer 1 found that the most important variables to the conditioning were the minimum period precipitation, the mean moisture index, the mean moisture index seasonality, elevation, and the lithology coverage used. The climate variables featured prominently in the linear models as well. The most frequent predictors were the mean diurnal temperature range, mean moisture index, temperature of the minimum period, and the mean moisture index seasonality. Of the non-climatic variables, the strongest predictors of layer 1 soil pH were the elevation, distance to river, relative elevation, hillslope length, and the MSS bands 1–3. While the predictor variables listed are the most important, other correlated predictor variables may also be good predictors in their own right. Moreover, interpreting the importance of a predictor from a set of Cubist rules is complicated by the fact that combinations of variables may actually be used as surrogates

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

393

Fig. 4. Layer 1 pH: Observed versus predicted plots by State/CSIRO. A maximum of 2500 observations are plotted per panel.

for other effects in particular geographical ones. The role that these predictor variables may have in pedogenesis will be discussed in a forthcoming paper (Bui et al., in preparation). The model residuals were examined for spatial structure. The State-based empirical variograms are given in Fig. 5, where gamma is the semi-variance (variogram) as a function of distance. There is residual spatial structure as there is clearly an initial increasing phase in some of the variograms. The ranges are, however, short with little structure left outside a distance of 0.05 degrees (approximately 5 km). This suggests that the Cubist model has not managed to explain all the short range variation but has captured the longer range signal. An explanation for this may be that the climatic variables proved to be good predictors but only have a resolution of 5 km; thus, they have no ability to explain shorter range variation. The large variation in the sill should be noted, e.g., compare South Australia and Western Australia. This reflects the general poorer performance in South Australia and subsequently greater variation in model residuals. The rules derived from the final Cubist model were applied to the ASRIS extent to generate a map of predicted layer 1 pH. This surface is illustrated in Fig.

6. Compared to a map generated by linking a chloropleth map to a soil database such as the one presented in Ahern et al. (1994) for surface pH in Queensland, the ASRIS map clearly represents much more local variability. The digital maps for all soil property predictions are available at http://audit.ea. gov.au/ANRA. 3.2. Model performance Models for all continuous soil properties were developed analogously to layer 1 pH. Models for some properties performed well, while others performed relatively poorly. In general, topsoil models were stronger than subsoil models. There was, however, a large amount of unexplained variation in all models. This was largely expected given the numerous sources of variation and quality in the data used to construct the soil property models. Table 3 presents summaries of continuous soil property model performance. RMSE and relative error are both unit specific and cannot really be compared across properties, though they might be used to compare the performance for different layers of the same property. In making such comparisons, it

394

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

Fig. 5. Residual variograms by State/CSIRO (distance in degrees).

is evident that the layer 1 models are stronger across all the diagnostics used. Note that the model performance for total nitrogen is not given here because it is not modelled directly, but rather through its relationship with organic carbon (Henderson et al., 2001). The thickness of layer 1 and layer 2 were also investigated as continuous soil properties. It was, however, found difficult to establish reliable predictive models, and was subsequently decided to reduce thickness to two classes (thick/thin) and model them as categorical variables. The thick/thin boundaries used were 0.25 and 0.60 m for layer 1 and layer 2, respectively. Binary decision tree models were then constructed using C5.0 (Quinlan, 1993a,b; www. rulequest.com). C5.0 is a classification tool that adopts a decision tree methodology to construct a classifier. It uses a recursive partitioning strategy, analogous to CART (Breiman et al., 1984), to make splits on the predictor variables so as to create a progressively more homogeneous data set. This approach was found to be more successful. There were approximately 106 144 and 96 384 thickness

observations for layers 1 and 2, respectively. On the 30% of the data held back as test data, both decision trees delivered a classification rate of approximately 67% for the 53 and 49 C5.0 rules for layers 1 and 2, respectively. The average classification rates by State/ CSIRO are given in Table 4. The performance of the subsoil model is slightly stronger in that the specific classification rates for the thick and thin categories is less even in some states. For example, in Tasmania, thin soils are predicted very well, but thick soils are not as well identified. The performance of all soil property models varied considerably across the ASRIS extent. Some areas were, however, poorly represented in the database (e.g., Northern Territory, far north Queensland, western South Australia, and northern Western Australia) and as such have predictions which must necessarily be treated with more caution. Some residual spatial structure was found to exist, typically at distances less than 5 km. This suggests that while the models may capture the long-range variation, the short range variation is not always adequately explained. An examination of more local

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

395

Fig. 6. Australia-wide prediction of layer 1 pH.

models may help improve this, though some preliminary investigations in the Burdekin and Katanning areas found difficulties in obtaining adequately predictive models at the 250-m resolution. There is some scope to make a more direct account for the spatial dependence in the future for this type of modelling. In principle, it would be possible to adopt

a regression-kriging strategy (Odeh et al., 1995) which involves kriging the residuals and adding these to the predictions from the Cubist model so as to obtain a better prediction. Unfortunately, the size of the data sets, the large variations in point intensity and the need to make predictions in areas where there are few or no local points makes this an extremely

Table 3 Model performance summary Property

pH pH Organic carbon Organic carbon Total phosphorus Extractable phosphorus Clay content Clay content

Layer

1 2 1 2 1 1 1 2

Unit

log log log log sqrt sqrt

N (total)

24319 12193 11483 5100 7377 2124 9750 7050

Performance on test data set R2

RMSE

Average error

Relative error

Correlation

Rules

0.67 0.54 0.41 0.24 0.62 0.35 0.44 0.22

0.77 0.96 0.57 0.77 0.92 0.85 1.36 1.60

0.56 0.72 0.40 0.59 0.68 0.67 1.05 1.21

0.51 0.59 0.68 0.84 0.54 0.78 0.70 0.86

0.82 0.74 0.64 0.50 0.79 0.59 0.67 0.47

27 27 29 19 18 12 32 21

The R 2, RMSE, average error, relative error, and rank correlation are described in Section 2.6.

396

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

Table 4 Binary soil thickness model: average classification rates by State/ CSIRO for layer 1 and layer 2 State/ CSIRO

Topsoil (layer 1)

Subsoil (layer 2)

N

Classification rate

N

Classification rate

NSW QLD SA TAS VIC WA CSIRO

17 276 42 982 19 455 3535 2213 22 774 1704

0.60 0.72 0.69 0.62 0.62 0.64 0.67

15 326 42 405 17 295 3344 1790 17 892 1573

0.60 0.68 0.70 0.71 0.58 0.72 0.59

difficult task. Indeed, these are the same reasons that a kriging methodology could not be adopted for this data in the first instance.

4. Discussion The soil property models derived here are very encouraging. They show that sparse data taken from different sources can be fused together to make sensible predictions for pH, organic C, total P, and clay content, at least for the topsoil. Useful predictive models were obtained for topsoil and subsoil horizon thickness when they were reduced to the categories thin and thick. The production of maps of soil physical properties such as bulk density, available water capacity, and heat capacity still remains problematic due to the small number of measurements of these properties and continues to rely on the development and implementation of appropriate pedotransfer functions (Minasny et al., 1999; Wosten et al., 2001; Minasny and McBratney, 2002), although remote sensing provides alternatives (Satalino et al., 2002; Schmugge et al., 2002). Modelling provides an alternative to soil property maps derived by linking map unit polygons with lookup tables (Kern, 1994; Kittel et al., 1995; Batjes, 1997). This is especially relevant to soil property modelling over a large extent because, as Oldeman and Van Engelen (1993) discuss, such an approach is typically complicated by maps with different legend structures and different soil taxonomic systems. The modelling approach presented herein obviates the need to update soil maps worldwide before better soil properties

estimates can be made. The resulting maps incorporate more natural spatial variability than can be represented by exploratory or even reconnaissance scale soil maps linked to look-up tables of median values for soil properties; this is an important advantage as difference in the aggregation level of data can have a large impact on estimates of soil C and N stores (Batjes, 2000). Environmental variables are valuable predictors of soil properties even at the fairly coarse resolution of 250 m needed to build Australia-wide predictions. Worldwide coverage of most of the environmental predictors used here are available today, therefore, with a global database of point measurements, a worldwide modelling approach may be feasible. This suggests that building up such a global database of point observations should be a higher priority than generating a more detailed soil map of the world with a unified legend—the current FAO–UNESCO Soil Map of the World is at 1:5 million scale. Given the availability of environmental predictor surfaces with global coverage, modelling should be favoured. Acknowledgments This research was funded as part of the National Land and Water Resources Audit of Australia. It was a collaborative effort between CSIRO Land and Water, the Bureau of Rural Sciences, and the Queensland Department of Natural Resources and Mines, the New South Wales Department of Land and Water Conservation, the West Australian Department of Agriculture, Primary Industries and Resources South Australia, the Victorian Department of Natural Resources and Environment, the Tasmanian Department of Primary Industries, Water and Environment, and the Northern Territory Department of Lands, Planning and Environment. The authors are grateful to John Gallant (CSIRO Land and Water, Canberra) for writing the code to implement the rules developed in Cubist and C5.0 in the GIS. References Ahern, C.R., Weinand, M.M.G., Isbell, R.F., 1994. Surface soil-pH map of Queensland. Australian Journal of Soil Research 32, 213 – 227.

B.L. Henderson et al. / Geoderma 124 (2005) 383–398 Batjes, N.H., 1996. Total carbon and nitrogen in the soils of the world. European Journal of Soil Science 47 (2), 151 – 163. Batjes, N.H., 1997. A world dataset of derived soil properties by FAO–UNESCO soil unit for global modelling. Soil Use and Management 13, 9 – 16. Batjes, N.H., 2000. Effect of mapped variation to soil conditions on estimation of soil carbon and nitrogen stocks for South America. Geoderma 97, 135 – 144. Bell, J.C., Cunningham, R.L., Havens, M.W., 1992. Calibration and validation of a soil-landscape model for predicting soil drainage class. Soil Science Society of America Journal 56, 1860 – 1866. Bliss, N.A., 1990. Hierarchy of soil databases for calibrating models of global climate change. In: Bouwman, A.P. (Ed.), Soils and the Greenhouse Effect. Wiley, New York, pp. 529 – 533. Breiman, L., 1996. Begging predictors. Machine Learning 26 (2), 123 – 140. Breiman, L., Friedman, J.H., Olhsen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Monterrey, CA. Bui, E.N., Henderson, B.L., Viergecer, K., manuscript in preparation. Knowledge discovery from models of soil properties developed through data mining. Bureau of Rural Sciences, 2000. Digital Atlas of Australian Soils, http://www.affa.gov.au/content/output.cfm?ObjectID= D2C48F86-BA1A-11A1-A2200060B0A05758. Carlile, P., Bui, E. Moran, C., Simon, D, Henderson B., 2001. Method used to generate soil attribute surfaces for ASRIS using soil maps and look up tables. Technical Report 24/01, CSIRO Land and Water, Canberra. Cressie, N.A.C., 1993. Statistics for Spatial Data. Wiley, New York. De Gruitjer, J.J., 2000. Sampling for spatial inventory and monitoring of natural resources. Technical report, Alterrrarapport 070, Wageningen, Alterra, Green World Research. Friedman, J.H., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28 (2), 337 – 374. Gessler, P.E., 1996. Statistical soil-landscape modelling for environmental management. PhD thesis, Australian National University, Canberra. Gessler, P.E., Moore, I.D., McKenzie, N.J., Ryan, P.J., 1995. Soillandscape modelling and spatial predictions of soil attributes. International Journal of Geographical Information Systems 4, 421 – 432. Henderson, B.L., Bui, E.N., 2002. An improved calibration curve between soil pH assessed in water and CaCl2. Australian Journal of Soil Research 40, 1399 – 1405. Henderson, B.L., Bui, E.N., Moran, C.J., Simon, D.A.P., Carlile, P., 2001. Technical Report 28/01, CSIRO Land and Water, Canberra. Houlder, D., Hutchinson, M., Nix, H., McMahon, J., 1999. ANUCLIM Version 5.0. Centre for Resource and Environmental Studies, The Australian National University, Canberra. Hudson, B.D., 1992. The soil survey as paradigm-based science. Soil Science Society of America Journal 56, 836 – 841. Hugget, R.J., 1995. Geoecology. Routledge, London. Jenny, H., 1941. Factors of Soil Formation. McGraw-Hill, New York.

397

Jenny, H., Leonard, C.D., 1934. Functional relationships between soil properties and rainfall. Soil Science 38, 363 – 381. Johnston, R.M., Barry, S.J., Bleys, E., Bui, E.N., Moran, C.J., Simon, D.A.P., Carlile, P., McKenzie, N.J., Henderson, B., Chapman, G., Imhoff, M., Maschmedt, D., Howe, D., Grose, C., Schokneckt, N., Powell, B., Grundy, M., 2003. ASRIS: the database. Australian Journal of Soil Research 41, 1021 – 1036. Kern, J.S., 1994. Spatial patterns of soil organic carbon in the contiguous United States. Soil Science of America Journal 58, 439 – 455. Kershaw, A.P., 1978. Record of last interglacial cycle from northeastern Queensland. Nature 272, 159 – 162. Kittel, T.G.F., Rosenbloom, N.A., Painter, T.H., Schimel, D.S., VEMAP modelling Participants, 1995. The VEMAP integrated database for modelling United States ecosystem/vegetation sensitivity to climate change. Journal of Biogeography 22, 857 – 862. Kreznor, W.R., Olson, K.R., Banwart, W.L., Johnson, D.L., 1989. Soil, landscape and erosion relationships in a Northwest Illinois watershed. Soil Science Society of America Journal 53, 1763 – 1771. Leese, J., Jackson, T., Pitman, A., Dirmeyer, P., 2001. Meeting summary: GEWEX/BAHC International Workshop on Soil Moisture Monitoring, Analysis, and Prediction for Hydrometoerological and Hydroclimatological Applications. Bulletin of the American Meteorological Society 82, 1423 – 1430. McBratney, A.B., Mendonc¸a Santos, M.L., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3 – 52. McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil attributes using terrain analysis. Geoderma 89, 67 – 94. McKenzie, N.J., Jacquier, D., Ashton, L.J., Cresswell, H.P., 2000. Estimating soil properties using the Atlas of Australian Soils. Technical Report 11/00, CSIRO Land and Water, Canberra. Minasny, B., McBratney, A.B., 2002. The efficiency of various approaches to obtaining estimates of soil hydraulic properties. Geoderma, 55 – 70. Minasny, B., McBratney, A.B., Bristow, K.L., 1999. Comparison of different approaches to the development of pedotransfer functions for water-retention curves. Geoderma 93, 225 – 253. Moore, I.D., Gessler, P.E., Nielsen, G.A., Peterson, G.A., 1993. Soil attribute prediction using terrain analysis. Soil Science Society of America Journal 57, 443 – 452. Northcote, K.H., Beckermann, G.G., Bettenay, E., Churchward, H.M., Dijk, D.C.V., Dimmock, G.M., Hubble, G.D., Isbell, R.F., McArthur, W.M., Murtha, G.G., Nicolls, K.D., Paton, R., Thompson, C.H., Webb, A.A., Wright, M.J., 1960–1968. Atlas of Australian Soils, Sheets 1–10, with explanatory booklets. CSIRO and Melbourne University Press, Melbourne, Australia. Odeh, I.O., McBratney, A.B., Chittleborough, D.J., 1994. Spatial prediction from landform attributes derived from a digital elevation model. Geoderma 63, 197 – 214. Odeh, I.O., McBratney, A.B., Chittleborough, D.J., 1995. Further results on prediction of soil properties from terrain attributes: heterotopic cokriging and regression-kriging. Geoderma 67, 215 – 226.

398

B.L. Henderson et al. / Geoderma 124 (2005) 383–398

Oldeman, L.R., Van Engelen, V.W.P., 1993. A world soils and terrain digital database (SOTER)—an improved assessment of land resources. Geoderma 60, 309 – 325. Post, W.M., Peng, T.H., Emanuel, W.R., King, A.W., Dale, V.H., DeAngelis, D.L., 1990. The global carbon cycle. American Scientist 78, 310 – 326. Satalino, G., Mattia, F., Davidson, M.W.J., Le Toan, T., Pasquariello, G., Borgeaud, M., 2002. On current limits of soil moisture retrieval from ERS-SAR data. IEEE Transactions on Geoscience and Remote Sensing 40, 2438 – 2447. Schmugge, T.J., Kustas, W.P., Ritchie, J.C., Jackson, T.J., Rango, A., 2002. Remote sensing in hydrology. Advances in Water Resources 25, 1367 – 1385. Quinlan, J.R., 1992. Learning with continuous classes. Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. World Scientific, Singapore, pp. 343 – 348. Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan-Kaufman, San Mateo, CA. Quinlan, J.R., 1993. Combining instance-based and model-based learning. Proceedings of the 10th International Conference on Machine Learning. Morgan Kaufmann, San Mateo, California, pp. 236 – 243. Rayment, G.E., Higginson, F.R., 1992. Australian Laboratory Handbook of Soil and Water Chemical Methods. Inkata Press, Melbourne.

Reynolds, C.A., Jackson, T.J., Rawls, W.J., 2000. Estimating soil water-holding capacities by linking the Food and Agriculture Organization soil map of the world with global pedon databases and continuous pedotransfer functions. Water Resources Research 36, 3653 – 3662. Therneau, T.M., Atkinson, E.J., 1997. An introduction to recursive partitioning using the rpart routine. Technical Report 61, Mayo Clinic, Section of Statistics. Uysal, I., Gqvenir, H.A., 1999. An overview of regression techniques for knowledge discovery. Knowledge Engineering Review 14, 319 – 340. Voltz, M., Lagacherie, P., Louchart, X., 1997. Predicting soil properties over a region using sample information from a mapped reference area. European Journal of Soil Science 48, 19 – 30. Walker, P.H., Hall, G.F., Protz, R., 1968. Relationship between landform parameters and soil properties. Soil Science Journal of America Proceedings 32, 101 – 104. Walker, J.P., Houser, P.R., 2001. A methodology for initializing soil moisture in a global climate model: assimilation of near-surface soil moisture observations. Journal of Geophysical ResearchAtmospheres 106, 11761 – 11774. Wosten, J.H.M., Pachepsky, Y.A., Rawls, W.J., 2001. Pedotransfer functions: bridging the gap between available basic soil data and missing soil hydraulic characteristics. Journal of Hydrology 251, 123 – 150.