Geoderma 142 (2007) 285 – 293 www.elsevier.com/locate/geoderma
Incorporating taxonomic distance into spatial prediction and digital mapping of soil classes
Budiman Minasny ⁎, Alex B. McBratney
Faculty of Agriculture, Food & Natural Resources, The University of Sydney, NSW 2006, Australia
Received 9 January 2007; received in revised form 23 July 2007; accepted 22 August 2007. Available online 27 September 2007
Abstract Mapping soil classes digitally generally starts with soil profile description with observed soil classes at a taxonomic level in a particular classification system. At each soil observation location there is a set of co-located environmental variables, and the challenge is to correlate the soil class with the environmental variables. The current methodology treats soil classes as ‘labels’ and their prediction only considers the minimisation of the misclassification error. Soil classes at any taxonomic level have taxonomic relationships between each other, and in some instances the errors in prediction of certain classes are more serious than the others. No statistical procedure so far has been utilised to account for these relationships. This paper shows that in digital mapping of soil classes, we can incorporate the taxonomic distance between soil classes in a supervised classification routine. Using classification trees, we can specify an algorithm that minimises the taxonomic distance rather than misclassification error. Two examples are given in this paper for mapping soil orders in the Australian soil classification system. A site in the Edgeroi area showed the advantage of using the method that minimises the taxonomic distance. Meanwhile a site in the Hunter Valley showed minimising the misclassification error performed similarly to minimising taxonomic distance. The advantages and challenges of using soil taxonomic distance are discussed. © 2007 Elsevier B.V. All rights reserved. Keywords: Soil classification; Spatial prediction; Decision tree; Supervised classification; Classification and regression tree
⁎ Corresponding author. Tel.: +61 2 9036 9043; fax: +61 2 9351 3706. E-mail address: [email protected] (B. Minasny).
0016-7061/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.geoderma.2007.08.022

1. Introduction

Mapping soil classes is one of the techniques that has been used to characterise soil spatial variation (Lagacherie, 2005). Soil classes can be derived from local, national or global soil classification systems, e.g. numerical classification (Odeh et al., 1992; Bragato, 2004), the Australian Soil Classification System (Isbell, 1996), or the World Reference Base (IUSS Working Group WRB, 2006). Some soil scientists prefer to map soil classes as they are familiar with the concept of classes and can recognise and infer their properties. Some also argue that the soil class implies the properties of a soil profile (soil properties in horizons and at depths). In addition, the soil class can be used to estimate the likely soil properties in the absence of laboratory-measured soil physical and chemical properties. If well-constructed, the soil class is an information carrier, allowing inferences of behaviour, e.g. response in yield trials.

Mapping soil classes generally starts with soil profile descriptions with observed soil classes at a taxonomic level in a particular classification system. In digital soil mapping, at each soil observation location there is a set of co-located environmental variables. The challenge is to correlate the soil class with the environmental variables. Hengl et al. (2007) discussed some methods for spatial interpolation of soil classes. The model for mapping soil classes can be written as follows (McBratney et al., 2003):

$$S_c = f(s, c, o, r, p, a, n)$$

where $S_c$ is soil class, $s$ refers to soil information either from a prior map or database, or from expert knowledge, $c$ refers to climate, $o$ refers to organisms including human activity, $r$ refers to relief or topography, $p$ refers to parent material, $a$ refers to age, and $n$ refers to spatial position. Each factor is represented by a set of one or more continuous or categorical
B. Minasny, A.B. McBratney / Geoderma 142 (2007) 285–293
variables, e.g., $r$ by the elevation and slope, $o$ by land-use. The empirical function $f$ represents a supervised classification or supervised learning problem. The supervised learning rules are fitted using the training data, and then used to predict the classes at other locations where only environmental variables are present. The form of $f$ for predicting soil class can be logistic regression (Campling et al., 2002), linear discriminant analysis (Bell et al., 1992), neural networks (Zhu, 2000), support vector machines and learning vector quantisation (Behrens and Scholten, 2007), Bayesian belief networks (Mayr and Palmer, 2007), or classification/decision trees (Lagacherie and Holmes, 1997).

In this paper we focus on classification or decision trees (Lagacherie and Holmes, 1997; Moran and Bui, 2002; Bui and Moran, 2003). A tree structure is generated by partitioning the data recursively into a number of groups, each division being chosen so as to minimise some error measure in the response variable within the resulting groups. However, all of the statistical and data-mining models (Behrens and Scholten, 2007) treat the soil class as a ‘label’, and their prediction only considers the minimisation of the misclassification error. This implies that the taxonomic distances between all classes are equal. Soil classes at any taxonomic level have taxonomic relationships with each other, and no statistical procedure so far has been utilised to account for such relationships.
As an example, consider the Australian Soil Classification System (Isbell, 1996) (Fig. 1). Using the general classification error criterion, classifying an observed Chromosol (a texture-contrast soil with pH of the B horizon > 5.5) as a Sodosol (same as a Chromosol, except that it has a sodic subsoil) will have the same error as classifying it as an Organosol (soils dominated by organic materials) or a Hydrosol (soil with prolonged saturation). Such errors are not equally serious for all soil classes: Chromosols, Sodosols, and Kurosols are more closely related to each other than to Organosols and Hydrosols. These differences should be incorporated into the supervised classification procedures.

One solution is provided by Carré and Girard (2002), who proposed mapping soil classes based on their taxonomic distances. For each location, the taxonomic distance to each of the defined centroids of the soil classes is predicted from the environmental variables using regression kriging. Areas showing the least taxonomic distance to a soil type indicate that the particular soil type is dominant or most similar.

This paper considers an alternative method which incorporates the soil taxonomic distance as a criterion in supervised classification for prediction of soil classes. We present the theory, then examples of its application. We use decision trees and the Australian soil classification system in this paper; however, the concept is universal and can be applied to other data-mining and prediction tools such as neural networks, and to other soil classification systems.
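The Chromosol example can be made concrete with a small sketch. It contrasts a label-only (0/1) error with a distance-based error, using the distances from an observed Chromosol that are reported in Table 2 of this paper. The function and variable names are illustrative assumptions, not part of the original study.

```python
# Taxonomic distances from an observed Chromosol, copied from Table 2.
DISTANCE_FROM_CHROMOSOL = {
    "Chromosol": 0.0,
    "Sodosol": 1.0,
    "Kurosol": 1.4,
    "Organosol": 2.2,
    "Hydrosol": 2.8,
}


def zero_one_loss(observed, predicted):
    """Label-only error: every wrong class costs the same."""
    return 0.0 if observed == predicted else 1.0


def taxonomic_loss(predicted):
    """Distance-based error for an observed Chromosol."""
    return DISTANCE_FROM_CHROMOSOL[predicted]


for predicted in DISTANCE_FROM_CHROMOSOL:
    print(predicted,
          zero_one_loss("Chromosol", predicted),
          taxonomic_loss(predicted))
```

Under the 0/1 error, predicting a Sodosol or a Hydrosol is penalised identically, whereas the distance-based error penalises the Hydrosol prediction more heavily.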
Fig. 1. A summary of the Australian soil classification system (after Isbell et al., 1997).
2. Theory

2.1. Defining the loss function

Training in supervised classification involves minimising some error measure (Hastie et al., 2001). First we define $\hat{p}_k$ as the proportion of observations correctly classified as class $k$:

$$\hat{p}_k = \frac{1}{n}\sum_{i=1}^{n} I(y_i = k), \tag{1}$$

where $i = 1, 2, \ldots, n$ indexes the observations, $k = 1, 2, \ldots, c$ indexes the classes, and $I(y_i = k)$ is an indicator that the observed class $y_i$ is equal to class $k$. The error measures commonly used are:

• The misclassification error:

$$E = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} I(y_i \neq k) = 1 - \sum_{k=1}^{c}\hat{p}_k. \tag{2}$$

• The Gini index:

$$E = \sum_{j \neq k}\hat{p}_j\hat{p}_k = \sum_{k=1}^{c}\hat{p}_k\,(1 - \hat{p}_k). \tag{3}$$

• The cross-entropy or deviance:

$$E = -\sum_{k=1}^{c}\hat{p}_k\log(\hat{p}_k). \tag{4}$$
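The three error measures can be sketched for a single tree node as follows. This is a minimal illustration that assumes the node-level form given by Hastie et al. (2001), where $\hat{p}_k$ is the proportion of class $k$ in the node and the node predicts its majority class. The toy labels and function names are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among the observations in one node."""
    counts = Counter(labels)
    n = len(labels)
    return {k: v / n for k, v in counts.items()}


def misclassification_error(p):
    # The node predicts its majority class, so the error is 1 - max proportion.
    return 1.0 - max(p.values())


def gini_index(p):
    # Sum over classes of p_k * (1 - p_k), as in Eq. (3).
    return sum(pk * (1.0 - pk) for pk in p.values())


def cross_entropy(p):
    # -sum of p_k * log(p_k), as in Eq. (4); empty classes contribute zero.
    return -sum(pk * math.log(pk) for pk in p.values() if pk > 0)


# Hypothetical node with ten observations.
node = ["Vertosol"] * 6 + ["Dermosol"] * 3 + ["Chromosol"] * 1
p = class_proportions(node)
print(misclassification_error(p), gini_index(p), cross_entropy(p))
```

All three measures are zero for a pure node and grow as the node becomes more mixed, but none of them distinguishes which classes are being confused.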
Other classification error measures include the Kappa coefficient, which is a chance-corrected index of the agreement between the observed and predicted soil class (for example Moran and Bui, 2002).

Using the above error criteria assumes that the errors for all classes are equally important. However, this is not true for soil classes, and it does not allow for the situation where some errors are more serious than others. The relative importance between the classes can be represented as a loss or cost matrix (Hastie et al., 2001). This is designed as a matrix $L$ with non-negative values and dimensions $c \times c$, where $c$ is the number of soil classes. The elements of $L$, $L_{jk}$, represent the “loss” or “cost” when classifying an observation of class $j$ as class $k$.

The loss matrix in predicting soil classes can be represented as the taxonomic distance between soil classes. The taxonomic distance refers to the distances between soil classes in multidimensional space, with the axes defined by the characters or attributes of soil profiles or classes. This is based on a taxonomic principle from Adanson (Moore and Russell, 1966): classifications should be based on many characters, and each should have equal weight. Properties of $L$, the taxonomic distance matrix, are: symmetry $L_{jk} = L_{kj}$, non-negativity $L_{jk} \geq 0$, zero self-difference $L_{jj} = 0$, and triangle inequality $L_{jk} \leq L_{ji} + L_{ik}$ (Gower and Legendre, 1986). When all the classes are equidistant (classes are equally dissimilar, or the taxonomic differences are ignored), the off-diagonal elements of $L$ are unity and the diagonal elements are equal to zero. In this study, the off-diagonal elements of $L$, $L_{jk}$ with $j \neq k$, represent the taxonomic distance between classes $j$ and $k$.

We can now introduce the error criterion that needs to be minimised in a supervised classification algorithm. A general one is the average misclassification cost (Pazzani et al., 1994):

$$E_a = \frac{1}{n}\sum_{i=1}^{n} L(C_i, \hat{C}_i) \tag{5}$$
where $E_a$ is the average cost and $L(C, \hat{C})$ is the cost of predicting class $C$ as $\hat{C}$, i.e. the taxonomic distance between observed class $C$ and predicted class $\hat{C}$. In other words, we want to minimise the taxonomic distance error between the observed and predicted soil class. Classifying an observed soil class as a class that is taxonomically distant will result in an increase in the error measure. We call Eq. (5) the average taxonomic distance error.

Hastie et al. (2001) suggested another measure, which incorporates the loss function into the Gini index:

$$E = \sum_{j \neq k}\hat{p}_j L_{jk}\,\hat{p}_k \tag{6}$$

where $L_{jk}$ is the cost for classifying observed class $j$ as class $k$. This expression is more general than Eq. (5) as it deals with probabilistic rather than hard classes. This expression is equivalent to the quadratic entropy or the mean taxonomic distance used in measuring pedodiversity (McBratney and Minasny, 2007). Breiman et al. (1984) proposed modifying the probability term:

$$\hat{p}'_k = \frac{\hat{p}_k t_k}{\sum_{j=1}^{c}\hat{p}_j t_j} \tag{7}$$

where $t_k = \sum_{j=1}^{c} L_{jk}$ is an element of a vector of the sums of the taxonomic distances for class $k$. The advantage of Eq. (7) is that it forms a generalised misclassification measure which can be substituted into Eqs. (2)–(4).

2.2. Defining taxonomic distance between soil classes

When soil classes are generated using numerical methods, such as fuzzy k-means (Odeh et al., 1992), it is easy to define the taxonomic distance between the classes, as it is the same criterion used to define the classification. Problems arise when using existing global or national classification systems, as the classes are defined based on their morphology and environments. They have no formal quantitative concept of taxonomic similarity. As seen in Fig. 1 for the Australian soil classification system, we cannot simply define the similarity or the distance between soil classes based on their description.

One way of defining the taxonomic distance for an existing soil classification system is based on a central concept, e.g. to define a modal soil profile for each soil class. Mazaheri et al.
Fig. 2. Distribution of soil orders in the Pokolbin and Edgeroi areas according to the Australian soil classification system.
(1995) derived the centroids for the Australian Great Soil Groups based on a list of soil properties. The distance between the class centroids (thus the taxonomic distance) is calculated using the Mahalanobis distance:

$$d_{jk} = \sqrt{(x_j - x_k)^{T}\, V^{-1}\, (x_j - x_k)} \tag{8}$$

where $x_j$ and $x_k$ are the rows of $X$ for classes $j$ and $k$, $X$ is a $c$ classes × $m$ variables matrix of soil properties for the centroids, and $V$ is the variance–covariance matrix of $X$.

Another way of calculating the taxonomic distance between soil classes is by ordination (Hole and Hironaka, 1960), when a soil database containing the morphological, physical, and chemical properties of the soil profiles is available. Discriminant analysis is used to predict the soil classes from observed soil physical and chemical properties. The centroid or mean properties for each class can then be obtained, and Eq. (8) can be applied. Both methods will be illustrated in the examples below.

3. Methods

Here we provide an example of the use of digital soil mapping to map soil classes in two areas in New South Wales, Australia. The first area is around Pokolbin in the Hunter Valley, situated between 32°49′S, 151°15′E and 32°45′S, 151°19′E. The soil dataset consists of 718 soil profiles with observed soil morphological properties. The environmental covariates available are: a map of soil-landscape units, a digital elevation model
(DEM) and derived terrain attributes, a Landsat 7 ETM image acquired in August 1999, and land-use obtained from supervised classification of an aerial photo of the area. All data layers were registered on a common grid of 25 m × 25 m. The dataset was randomly divided into two sets: 540 profiles for training and 178 profiles for validation. In addition, ten-fold cross validation was performed to estimate the standard deviation of the misclassification error.

The second area is around the township of Edgeroi, located between 29°55′S, 149°24′E and 30°25′S, 150°15′E. The soil dataset consists of 341 soil profiles with observed morphological properties and laboratory-determined physical and chemical properties. Details can be found in McGarry et al. (1989) and Minasny et al. (2006). Environmental covariates were compiled for the whole area on a grid of 25 m × 25 m, including a digital elevation model and its derivatives, Landsat 7 ETM images from 2003, and gamma radiometrics from an airborne survey. The environmental covariates used as predictors are: elevation, slope, aspect, topographic wetness index, Normalised Difference Vegetation Index calculated from Landsat ETM, soil enhancement ratio (band 5/band 7 of Landsat ETM), gamma radiometric K, and the sum of K, Th, and U. The dataset was randomly divided into two sets: 256 profiles for training and 85 profiles for validation. Ten-fold cross validation was also performed.

Soil profile descriptions in both areas were allocated to the classes of the Australian Soil Classification System (Isbell, 1996) by experts (Ms. Uta Stockmann for the Pokolbin area, Dr. Lou
Table 1. Key soil morphological attributes for characterising soil Order in the Australian Soil Classification System

| Soil order | Man-made | OM dominated | Bs/Bhs/Bh horizon | Clay > 35% throughout | Cracks and slickensides | Saturated | Texture contrast | Sodic subsoil | B2 pH < 5.5 | B2 pH > 5.5 | Calcareous | Lack texture-contrast | B2 Fe > 5% | Structured B2 | Massive B2 | Rudiment B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anthroposols | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Organosols | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Rudosols | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Tenosols | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| Podosols | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Vertosols | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| Hydrosols | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Kurosols | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Sodosols | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Chromosols | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Calcarosols | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| Ferrosols | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Dermosols | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| Kandosols | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |

1 represents possible presence of the properties, and 0 is absence of the defined properties. For definition of the properties, see Isbell et al. (1997).

Table 2. Taxonomic distance matrix between soil classes according to the Australian Soil Classification System

| | Anthroposols | Organosols | Rudosols | Tenosols | Podosols | Vertosols | Hydrosols | Kurosols | Sodosols | Chromosols | Calcarosols | Ferrosols | Dermosols | Kandosols |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anthroposols | 0.0 | 3.3 | 3.0 | 2.4 | 3.2 | 3.0 | 2.0 | 2.8 | 2.6 | 2.8 | 2.6 | 2.6 | 2.6 | 2.6 |
| Organosols | 3.3 | 0.0 | 1.4 | 2.2 | 1.7 | 2.4 | 3.3 | 2.2 | 2.4 | 2.2 | 2.0 | 2.0 | 2.4 | 2.0 |
| Rudosols | 3.0 | 1.4 | 0.0 | 2.2 | 1.7 | 2.4 | 3.0 | 2.2 | 2.4 | 2.2 | 2.0 | 2.0 | 2.4 | 2.0 |
| Tenosols | 2.4 | 2.2 | 2.2 | 0.0 | 2.0 | 1.7 | 2.4 | 2.0 | 2.2 | 2.0 | 1.7 | 1.7 | 1.7 | 1.0 |
| Podosols | 3.2 | 1.7 | 1.7 | 2.0 | 0.0 | 2.2 | 3.2 | 2.0 | 2.6 | 2.4 | 2.2 | 1.7 | 2.2 | 1.7 |
| Vertosols | 3.0 | 2.4 | 2.4 | 1.7 | 2.2 | 0.0 | 2.6 | 2.2 | 2.4 | 2.2 | 2.0 | 2.0 | 1.4 | 2.0 |
| Hydrosols | 2.0 | 3.3 | 3.0 | 2.4 | 3.2 | 2.6 | 0.0 | 2.8 | 2.6 | 2.8 | 3.0 | 2.6 | 2.6 | 2.6 |
| Kurosols | 2.8 | 2.2 | 2.2 | 2.0 | 2.0 | 2.2 | 2.8 | 0.0 | 1.7 | 1.4 | 2.2 | 1.7 | 2.2 | 2.2 |
| Sodosols | 2.6 | 2.4 | 2.4 | 2.2 | 2.6 | 2.4 | 2.6 | 1.7 | 0.0 | 1.0 | 2.0 | 2.4 | 2.4 | 2.4 |
| Chromosols | 2.8 | 2.2 | 2.2 | 2.0 | 2.4 | 2.2 | 2.8 | 1.4 | 1.0 | 0.0 | 1.7 | 2.2 | 2.2 | 2.2 |
| Calcarosols | 2.6 | 2.0 | 2.0 | 1.7 | 2.2 | 2.0 | 3.0 | 2.2 | 2.0 | 1.7 | 0.0 | 2.0 | 1.4 | 2.0 |
| Ferrosols | 2.6 | 2.0 | 2.0 | 1.7 | 1.7 | 2.0 | 2.6 | 1.7 | 2.4 | 2.2 | 2.0 | 0.0 | 2.0 | 2.0 |
| Dermosols | 2.6 | 2.4 | 2.4 | 1.7 | 2.2 | 1.4 | 2.6 | 2.2 | 2.4 | 2.2 | 1.4 | 2.0 | 0.0 | 2.0 |
| Kandosols | 2.6 | 2.0 | 2.0 | 1.0 | 1.7 | 2.0 | 2.6 | 2.2 | 2.4 | 2.2 | 2.0 | 2.0 | 2.0 | 0.0 |
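The link between Table 1 and Table 2 can be checked with a short sketch: each soil order is treated as a 16-element 0/1 vector copied from Table 1, and the Euclidean distance between every pair of vectors is computed. The string encoding and function names are illustrative assumptions; only the 0/1 values come from the paper.

```python
import math

# One 16-character 0/1 string per soil order, copied from Table 1.
# Attribute order: man-made, OM dominated, Bs/Bhs/Bh horizon, clay > 35%,
# cracks and slickensides, saturated, texture contrast, sodic subsoil,
# B2 pH < 5.5, B2 pH > 5.5, calcareous, lack texture-contrast, B2 Fe > 5%,
# structured B2, massive B2, rudiment B.
PROFILES = {
    "Anthroposols": "1000001111111111",
    "Organosols":   "0100000000010000",
    "Rudosols":     "0000000000010001",
    "Tenosols":     "0000000011010110",
    "Podosols":     "0010000010010000",
    "Vertosols":    "0001100011010100",
    "Hydrosols":    "0001011111011111",
    "Kurosols":     "0000001010000100",
    "Sodosols":     "0000001101000100",
    "Chromosols":   "0000001001000100",
    "Calcarosols":  "0000000001110100",
    "Ferrosols":    "0000000010011100",
    "Dermosols":    "0001000011110100",
    "Kandosols":    "0000000011010010",
}


def euclidean(a, b):
    """Euclidean distance between two 0/1 attribute strings."""
    return math.sqrt(sum((int(x) - int(y)) ** 2 for x, y in zip(a, b)))


def distance_matrix(profiles):
    """Nested dict L[j][k] holding the distance between orders j and k."""
    return {j: {k: euclidean(pj, pk) for k, pk in profiles.items()}
            for j, pj in profiles.items()}


L = distance_matrix(PROFILES)
print(round(L["Sodosols"]["Chromosols"], 1))      # texture-contrast pair
print(round(L["Anthroposols"]["Organosols"], 1))  # distant pair
```

Because the attributes are binary, each squared distance is simply the number of attributes on which two orders differ, so the closest pairs in Table 2 (distance 1.0) differ in a single attribute.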
Mendonça Santos and Ms. Charlotte Moore for the Edgeroi area). Fig. 2 shows the distribution of the soil orders in the two areas. The Pokolbin area has 8 soil orders (out of 14), and is dominated by Dermosols. The Edgeroi area has 9 soil orders and is dominated by Vertosols.

A decision-tree program, C5.0 (Quinlan, 1993; RuleQuest Research, 2006), was used to predict soil classes. C5.0 builds a tree by determining splits in the data set which minimise an error measure at a node. When the loss matrix is provided, the program will minimise the total cost of misclassified training cases, $E_a$ in Eq. (5).

4. Results

4.1. Derivation of taxonomic distance for the Australian soil classification system

We present an example of deriving a taxonomic distance matrix between the soil classes for an existing soil classification system. First we define the morphological properties that are used to distinguish the soil orders, such as the presence of texture contrast between top- and subsoil, pH in the subsoil (B2 horizon), presence of structure in the B2 horizon, etc. (Isbell et al., 1997). This forms the modal soil profile for each soil order (Table 1), with indices 0 and 1 indicating the absence and possible presence of the specific properties. The Euclidean distance of the morphological properties between each pair of soil orders was then calculated, and this forms the soil order taxonomic distance matrix (Table 2). Fig. 3 shows the bi-plot of the first two principal components of the soil orders (explaining 43% of the variation). Soil orders from more unusual environments such
as Hydrosols, Anthroposols, Podosols, and Organosols are well separated from the other orders. Soils exhibiting texture-contrast between top- and subsoil are close together (Kurosols, Sodosols, and Chromosols). Ferrosols, Kandosols, Tenosols and Dermosols appear to have close relationships.

4.2. Derivation of taxonomic distance using soil properties

Since laboratory-measured soil physical and chemical properties are available in the Edgeroi area, we used discriminant analysis for predicting soil order from soil properties. Topsoil and subsoil clay content, pH in CaCl2, electrical conductivity (1:5 soil to water ratio), Na+ content, and CaCO3 content were used to discriminate the soil orders. The soil properties can describe 65% of the observed soil classes. Fig. 4 shows the results of the discriminant analysis of soil classes, with the canonical plot of the data points and soil class means. Each soil class mean is a labelled circle with radius corresponding to a 95% confidence limit for the mean. Soil classes that are significantly different tend to have non-intersecting circles. Vertosols are well separated from other classes because of their unique properties and large number of observations. Calcarosols also appear to occupy a unique position in the discriminant space. Sodosols and Chromosols are closely related, which confirms their assumed relationships (Fig. 1) along with Kurosols. Meanwhile, Kandosols occupy the middle of the plot, which suggests a lack of analytical properties that define the massive B2 horizon. Rudosols and Dermosols plot in between Calcarosols and Chromosols. It should be noted that, because of the small sample sizes, the centroids of Rudosols, Tenosols, and Kurosols have a large uncertainty. Based on the centroids, the distances between the
Fig. 3. Bi-plot of the soil orders and discriminating properties for the Australian soil classification system.
Fig. 4. Canonical plots of the Edgeroi data, the mean of a soil class is a labelled circle with radius corresponding to a 95% confidence limit for the mean.
soil classes can be calculated using Eq. (8), and the distance matrix is given in Table 3.

4.3. Digital mapping of soil classes

Program C5.0 uses misclassification error, Eq. (2), as the criterion for minimising the error in predicting soil class. For the Pokolbin area, using misclassification error as the criterion, the tree produced a misclassification error of 21% (79% correctly classified soil classes) on the training data, and 48% on the validation set (Table 4). The percentage of correctly predicted soil classes is 50–80%, similar to the findings of Moran and Bui (2002) and Bui and Moran (2003). The main environmental variables selected by the program to predict soil classes are: soil-landscape units, Landsat band 5 (B5), land-use, profile curvature, compound topographic index (CTI), elevation, NDVI, Landsat B4, B6, and slope.
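To illustrate how a loss matrix can change a prediction, the sketch below compares the majority class of a single leaf with the class that minimises the expected taxonomic distance, and also evaluates the loss-weighted Gini index of Eq. (6). The distances are copied from Table 2; the leaf proportions are hypothetical, and this is not a description of how C5.0 works internally.

```python
# Taxonomic distances between three soil orders, copied from Table 2.
L = {
    "Vertosols":  {"Vertosols": 0.0, "Sodosols": 2.4, "Chromosols": 2.2},
    "Sodosols":   {"Vertosols": 2.4, "Sodosols": 0.0, "Chromosols": 1.0},
    "Chromosols": {"Vertosols": 2.2, "Sodosols": 1.0, "Chromosols": 0.0},
}

# Hypothetical class proportions observed in one leaf of a tree.
p = {"Vertosols": 0.40, "Sodosols": 0.35, "Chromosols": 0.25}


def expected_cost(p, L, predicted):
    """Expected taxonomic distance if every case in the leaf is
    assigned to `predicted`: sum over j of p_j * L[j][predicted]."""
    return sum(p[j] * L[j][predicted] for j in p)


def majority_class(p):
    """Class chosen when only the misclassification error matters."""
    return max(p, key=p.get)


def min_cost_class(p, L):
    """Class chosen when the expected taxonomic distance is minimised."""
    return min(p, key=lambda k: expected_cost(p, L, k))


def loss_weighted_gini(p, L):
    """Eq. (6): sum over j != k of p_j * L_jk * p_k."""
    return sum(p[j] * L[j][k] * p[k] for j in p for k in p if j != k)


print(majority_class(p), min_cost_class(p, L))
print(round(loss_weighted_gini(p, L), 3))
```

In this toy leaf the majority class is Vertosols, but Sodosols gives the lowest expected distance because the remaining cases are mostly the closely related Chromosols.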
We then used the taxonomic distance between soil orders (Table 2) as a loss matrix, using program C5.0 with Eq. (5) as the criterion to minimise. The resulting tree gives a similar misclassification error of 23% on the training set (Table 4). The average taxonomic distance error is $E_a$ = 0.50; without the loss matrix the distance error is similar (0.52). However, on the validation set, incorporating the taxonomic distance neither improves the prediction nor decreases the distance error. This is also confirmed in the ten-fold cross validation, where the misclassification errors for both methods are quite similar.

For the Edgeroi area, program C5.0 gives a misclassification error of 14% on the training set, and 47% on the validation set. The average distance error is 1.5 for training and 4.5 for validation. Incorporating the taxonomic distance (Table 3) as a loss matrix, the resulting classification tree gives a lower misclassification error of 36% on the validation set and a smaller
Table 3. Taxonomic distance matrix between soil classes according to the Australian Soil Classification System derived from the Edgeroi soil physical and chemical data

| | Rudosols | Tenosols | Vertosols | Kurosols | Sodosols | Chromosols | Calcarosols | Dermosols | Kandosols |
|---|---|---|---|---|---|---|---|---|---|
| Rudosols | 0.0 | 15.4 | 8.1 | 20.4 | 7.1 | 11.5 | 9.0 | 2.4 | 3.4 |
| Tenosols | 15.4 | 0.0 | 36.7 | 15.4 | 12.4 | 14.6 | 30.9 | 21.1 | 7.8 |
| Vertosols | 8.1 | 36.7 | 0.0 | 35.2 | 18.1 | 22.0 | 7.6 | 5.0 | 14.9 |
| Kurosols | 20.4 | 15.4 | 35.2 | 0.0 | 9.1 | 22.8 | 31.9 | 23.4 | 9.7 |
| Sodosols | 7.1 | 12.4 | 18.1 | 9.1 | 0.0 | 7.5 | 16.3 | 7.3 | 3.4 |
| Chromosols | 11.5 | 14.6 | 22.0 | 22.8 | 7.5 | 0.0 | 20.8 | 11.8 | 9.9 |
| Calcarosols | 9.0 | 30.9 | 7.6 | 31.9 | 16.3 | 20.8 | 0.0 | 5.4 | 12.7 |
| Dermosols | 2.4 | 21.1 | 5.0 | 23.4 | 7.3 | 11.8 | 5.4 | 0.0 | 5.1 |
| Kandosols | 3.4 | 7.8 | 14.9 | 9.7 | 3.4 | 9.9 | 12.7 | 5.1 | 0.0 |
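Distances of the kind listed in Table 3 come from applying the Mahalanobis distance of Eq. (8) to class centroids. The sketch below works through Eq. (8) for two variables only, so that the matrix inverse can be written out by hand. The centroids and the variance–covariance matrix are hypothetical values chosen for illustration and do not reproduce the Edgeroi data.

```python
import math


def mahalanobis_2d(xj, xk, V):
    """Eq. (8) for two variables; V is a 2x2 variance-covariance matrix."""
    (a, b), (c, d) = V
    det = a * d - b * c
    if det == 0:
        raise ValueError("variance-covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (xj[0] - xk[0], xj[1] - xk[1])
    # Quadratic form dx^T V^-1 dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)


# Hypothetical centroids: (clay content in %, pH).
centroid_a = (55.0, 7.5)
centroid_b = (25.0, 6.0)
# Hypothetical variance-covariance matrix of the two variables.
V = ((100.0, 5.0), (5.0, 1.0))

print(mahalanobis_2d(centroid_a, centroid_b, V))
```

With an identity matrix for `V` the function reduces to the ordinary Euclidean distance, which is the measure used for the binary attributes of Table 1.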
Table 4. Performance of soil class prediction in the Pokolbin and Edgeroi area using classification trees that minimise the misclassification error and the taxonomic distance

| Area | Data set | No. samples | Minimising misclassification error: misclassification error (%) | Minimising misclassification error: average taxonomic distance ($E_a$) | Minimising taxonomic distance: misclassification error (%) | Minimising taxonomic distance: average taxonomic distance ($E_a$) |
|---|---|---|---|---|---|---|
| Pokolbin | Training | 540 | 21.1 | 0.52 | 22.6 | 0.50 |
| Pokolbin | Validation | 178 | 48.3 | 0.81 | 52.2 | 1.16 |
| Pokolbin | Ten-fold cross validation | 72 | 53.1 ± 1.9 | 1.15 | 52.0 ± 1.6 | 1.14 ± 0.03 |
| Edgeroi | Training | 256 | 14.5 | 1.47 | 16.8 | 1.36 |
| Edgeroi | Validation | 85 | 47.1 | 4.47 | 36.5 | 3.08 |
| Edgeroi | Ten-fold cross validation | 34 | 46.1 ± 2.2 | 4.58 | 41.6 ± 2.1 | 3.6 ± 0.23 |
average distance error of $E_a$ = 3.1. This result is also observed in the ten-fold cross validation, showing a smaller misclassification error and smaller distance error. The main environmental variables used in the decision tree to predict soil classes in this area are: elevation, sum of the gamma radiometric elements, Landsat B5, B3, and soil enhancement ratio B5/B7.

5. Discussion

The results show some promise but also challenges in the digital soil mapping of soil classes. There are advantages in incorporating soil taxonomic distance as a loss function in the decision tree. It gives the classification process a clearer basis in soil science, so that it is not merely a statistical exercise. This is useful in areas where distinct soil classes are present and when the errors in predicting certain classes are more serious than others. The Edgeroi area shows that incorporating taxonomic distance lowers the misclassification error and reduces the soil class distance errors. There, the taxonomic distance is derived from soil physical and chemical properties.

Meanwhile, in the Pokolbin area incorporating taxonomic distance did not show benefits. The taxonomic distance (Table 2) is based on some general description of soil and environmental properties, which may not be sufficient to distinguish the soil classes. Pazzani et al. (1994) discussed the problems of minimising the misclassification cost; they suggested that the misclassification cost criterion only attempts to minimise the global cost or global taxonomic distance without dealing with local prediction. Another reason for the poor performance in the Pokolbin area may be the mapping of soil orders, where the environmental factors do not explain the broad soil classes well (misclassification error is 50%). Because the area in Pokolbin is dominated by Dermosols (57%), most of the predictions (70–80%) fall into this class; thus the distances to other classes do not have a strong influence on the global misclassification cost.
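The difference between the misclassification error and the average taxonomic distance error of Eq. (5) can be seen in a small worked sketch. Two hypothetical sets of predictions make the same number of mistakes, but one confuses closely related soil orders and the other confuses distant ones. The distances are copied from Table 2; the observations, predictions, and function names are invented for illustration.

```python
# Selected taxonomic distances from Table 2, keyed by unordered pair.
DIST = {
    frozenset(("Chromosols", "Sodosols")): 1.0,
    frozenset(("Chromosols", "Kurosols")): 1.4,
    frozenset(("Chromosols", "Hydrosols")): 2.8,
    frozenset(("Kurosols", "Hydrosols")): 2.8,
}


def loss(observed, predicted):
    """Taxonomic distance L(C, C_hat); zero when the classes agree."""
    if observed == predicted:
        return 0.0
    return DIST[frozenset((observed, predicted))]


def misclassification_error(observed, predicted):
    """Fraction of cases whose predicted class differs from the observed."""
    return sum(o != p for o, p in zip(observed, predicted)) / len(observed)


def average_taxonomic_distance(observed, predicted):
    """Eq. (5): mean loss over all observations."""
    return sum(loss(o, p) for o, p in zip(observed, predicted)) / len(observed)


observed = ["Chromosols", "Chromosols", "Sodosols", "Kurosols"]
pred_near = ["Sodosols", "Chromosols", "Sodosols", "Chromosols"]  # near misses
pred_far = ["Hydrosols", "Chromosols", "Sodosols", "Hydrosols"]   # distant misses

for pred in (pred_near, pred_far):
    print(misclassification_error(observed, pred),
          average_taxonomic_distance(observed, pred))
```

Both prediction sets have a misclassification error of 0.5, yet their average taxonomic distance errors differ by more than a factor of two, which is the distinction that the loss matrix is meant to capture.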
The authors feel that the Dermosol order does not represent well the large variation within it.

In this paper, we have made a few assumptions about the taxonomic distance and soil classification. We identify some shortcomings:

– Soil classes are usually defined by rules based on expert knowledge, and their allocation also depends on the expert's background and subjectivity.
– Taxonomic distance calculated from soil laboratory data considers the soil class as being relative to the data. However, for soil surveyors, the soil classes are based on the central concept (Bui, 2004). The central concept is not a point but may be better represented by a volume.
– The variables used to define the soil classes (Table 1) may not be enough to explain the soil orders completely.
– The distances between soil classes are assumed to be linear; this may not be true since the classification is based on rules and expert knowledge.

6. Conclusions

This paper shows that in digital mapping of soil classes, we can incorporate the taxonomic distance between soil classes in a supervised classification routine. This makes the procedure more meaningful, not merely treating the class as a label. The loss matrix also offers the possibility of incorporating the extragrade class (McBratney and de Gruijter, 1992). The main problem is in defining the taxonomic distance between soil classes in existing soil classification systems. Defining and calculating soil taxonomic distances is a challenge that must be taken up by the proponents of the various national and global soil classification systems. Taxonomic distance is essential in many applications, such as calculating pedodiversity (McBratney and Minasny, 2007).

Acknowledgements

This work is funded by an Australian Research Council Discovery project on Digital Soil Mapping. The authors would like to thank Dr. Damien Field for his pedological interpretation, Ms. Uta Stockmann for allocating the soil profiles from the Pokolbin area, and Dr. Lou Mendonça Santos and Ms. Charlotte Moore for allocating the profiles from the Edgeroi area to the Australian Soil Classification System. We also thank Drs. David Rossiter and Florence Carré for their useful comments.

References

Behrens, T., Scholten, T., 2007. A comparison of data-mining techniques in predictive soil mapping. In: Lagacherie, P., McBratney, A.B., Voltz, M. (Eds.), Digital Soil Mapping. An Introductory Perspective. Developments in Soil Science, vol. 31. Elsevier, Amsterdam, pp. 353–364.
Bell, J.C., Cunningham, R.L., Havens, M.W., 1992. Calibration and validation of a soil-landscape model for predicting soil drainage class. Soil Science Society of America Journal 56, 1860–1866.
Bragato, G., 2004. Fuzzy continuous classification and spatial interpolation in conventional soil survey for soil mapping of the lower Piave plain. Geoderma 118, 1–16.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Bui, E.N., 2004. Soil survey as a knowledge system. Geoderma 120, 17–26.
Bui, E.N., Moran, C.J., 2003. A strategy to fill gaps in soil survey over large spatial extents: an example from the Murray–Darling basin of Australia. Geoderma 111, 21–44.
Campling, P., Gobin, A., Feyen, J., 2002. Logistic modeling to spatially predict the probability of soil drainage classes. Soil Science Society of America Journal 66, 1390–1401.
Carré, F., Girard, M.C., 2002. Quantitative mapping of soil types based on regression kriging of taxonomic distances with landform and land cover attributes. Geoderma 110, 241–263.
Gower, J.C., Legendre, P., 1986. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification 3, 5–48.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics. Springer-Verlag, New York.
Hengl, T., Toomanian, N., Reuter, H.I., Malakouti, M.J., 2007. Methods to interpolate soil categorical variables from profile observations: lessons from Iran. Geoderma 140, 323–456.
Hole, F.D., Hironaka, M., 1960. An experiment in ordination of some soil profiles. Soil Science Society of America Proceedings 24, 309–312.
Isbell, R.F., 1996. The Australian Soil Classification. CSIRO Publishing, Collingwood, Victoria.
Isbell, R.F., McDonald, W.S., Ashton, L.J., 1997. Concepts and Rationale of the Australian Soil Classification. ACLEP, CSIRO Land and Water, Canberra.
IUSS Working Group WRB, 2006. World Reference Base for Soil Resources 2006, 2nd edition. World Soil Resources Reports, vol. 103. FAO, Rome.
Lagacherie, P., 2005. An algorithm for fuzzy pattern matching to allocate soil individuals to pre-existing soil classes. Geoderma 128, 274–288.
Lagacherie, P., Holmes, S., 1997. Addressing geographical data errors in a classification tree soil unit prediction. International Journal of Geographical Information Science 11, 183–198.
Mayr, T., Palmer, B., 2007. Digital soil mapping: an England and Wales perspective. In: Lagacherie, P., McBratney, A.B., Voltz, M. (Eds.), Digital Soil Mapping. An Introductory Perspective. Developments in Soil Science, vol. 31. Elsevier, Amsterdam, pp. 365–375.
Mazaheri, S.A., Koppi, A.J., McBratney, A.B., 1995. A fuzzy allocation scheme for the Australian Great Soil Groups Classification system. European Journal of Soil Science 46, 601–612.
McBratney, A.B., de Gruijter, J.J., 1992. A continuum approach to soil classification by modified fuzzy k-means with extragrades. Journal of Soil Science 43, 159–175.
McBratney, A.B., Minasny, B., 2007. On measuring pedodiversity. Geoderma 141, 149–154.
McBratney, A.B., Mendonça-Santos, M.L., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3–52.
McGarry, D., Ward, W.T., McBratney, A.B., 1989. Soil studies in the Lower Namoi Valley: methods and data. The Edgeroi Data Set. CSIRO Division of Soils, Adelaide. 2 vols.
Minasny, B., McBratney, A.B., Mendonça-Santos, M.L., Odeh, I.O.A., Guyon, B., 2006. Prediction and digital mapping of soil carbon storage in the Lower Namoi Valley. Australian Journal of Soil Research 44, 233–244.
Moore, A.W., Russell, J.S., 1966. Potential use of numerical analysis and Adansonian concepts in soil science. Australian Journal of Science 29, 141–142.
Moran, C.J., Bui, E., 2002. Spatial data mining for enhanced soil map modelling. International Journal of Geographical Information Science 16, 533–549.
Odeh, I.O.A., McBratney, A.B., Chittleborough, D., 1992. Fuzzy-c-means and kriging for mapping soil as a continuous system. Soil Science Society of America Journal 56, 1848–1854.
Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C., 1994. Reducing misclassification costs. Proceedings of the 11th International Conference of Machine Learning. Morgan Kaufmann, New Brunswick, pp. 217–225.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco.
RuleQuest Research, 2006. See5/C5.0 version 2.03. RuleQuest Research Pty Ltd., Sydney, Australia.
Zhu, A.X., 2000. Mapping soil landscape as spatial continua: the neural network approach. Water Resources Research 36, 663–677.