Sampling for regression-based digital soil mapping: Closing the gap between statistical desires and operational applicability

Sampling for regression-based digital soil mapping: Closing the gap between statistical desires and operational applicability

Spatial Statistics 13 (2015) 106–122 Contents lists available at ScienceDirect Spatial Statistics journal homepage: www.elsevier.com/locate/spasta ...

2MB Sizes 0 Downloads 42 Views

Spatial Statistics 13 (2015) 106–122

Contents lists available at ScienceDirect

Spatial Statistics journal homepage: www.elsevier.com/locate/spasta

Sampling for regression-based digital soil mapping: Closing the gap between statistical desires and operational applicability Mareike Ließ University of Bayreuth, Department of Geosciences/ Soil Physics Division, Universitaetsstrasse 30, 95447 Bayreuth, Germany

article

info

Article history: Received 24 February 2015 Accepted 10 June 2015 Available online 17 June 2015 Keywords: Sampling design Digital soil mapping Regression

abstract With respect to sampling for regression-based digital soil mapping (DSM), the above all aim is to ensure that the spatial variability of the soil is well-captured without introducing any bias, while the design remains feasible according to operational constraints such as accessibility, man power and cost. Representativeness of the sample concerning the population to be sampled needs to be guaranteed in any regression-based modelling approach. Four selected sampling designs were adapted to show that basically any design may be optimised to represent the n-dimensional predictor space of a particular area, while selecting points is only permitted from a small accessible sub-area or from outside the area. Sampling efficiency may be evaluated based on the representation of the predictor space. However, not only each predictor’s probability function but also the interaction between predictors may have to be considered, to select a representative sample. Instead of sampling a previously un-sampled area with limited accessibility, the four sampling designs may also be used to subsample an existing dataset and, thereby, optimise a suboptimal dataset based on the predictor space of the area which shall be mapped by DSM. © 2015 Elsevier B.V. All rights reserved.

E-mail address: [email protected]. http://dx.doi.org/10.1016/j.spasta.2015.06.002 2211-6753/© 2015 Elsevier B.V. All rights reserved.

M. Ließ / Spatial Statistics 13 (2015) 106–122

107

1. Introduction Within the field of digital soil mapping (DSM) special emphasis has to be given to the application of an appropriate sampling scheme in order to receive unbiased predictions by the later developed model. Representativeness of a sample concerning the population to be sampled is of primary importance. Simple random sampling is one of the most popular sampling techniques. It gives each sample the same probability of being selected and the selected samples should, therefore, wellrepresent the considered population. However, particularly with a large number of selectable points and a small sample size, this representativeness is left to chance. Furthermore, in large areas which are difficult to access this approach is not feasible. Accordingly, this causes the often observed discrepancy between statistical theory and the applicability to the soil scientist collecting data in the field. While with interpolation methods such as kriging, emphasis has to be given to a good spatial coverage and hence a uniform spreading of the observations in geographical space (Brus et al., 2007), which is also not feasible in inaccessible or arduous terrain, regression-based DSM does not need a good spatial coverage. There are two general approaches which are common to select sampling positions for model training in regression-based DSM: (1) Samples are taken based on stratified random sampling, i.e. the area is subdivided into a number of strata or subareas and then random samples are taken from each of these strata, giving either equal or different selection probabilities to the strata. The strata can be based on e.g. xy-coordinates, terrain parameters, land use classes, etc. (e.g. McKenzie and Ryan, 1999; Dobermann and Simbahan, 2007) or (2) samples are chosen according to the probability density function of each predictor within the research area as is done e.g. in Gessler et al. (1995) and Hengl et al. (2003) or particularly within conditioned latin hypercube sampling (clhs, Minasny and McBratney, 2006, 2007). So far there are only few proposals regarding sampling for DSM in inaccessible areas or areas which are difficult to access (DTA-areas). These include transect sampling along hill-slopes and thereby covering terrain units (Ließet al., 2009) and the approach by Cambule et al. (2013) to validate a model established for an accessible area with few samples from the inaccessible area. Clifford et al. (2014) made a very promising approach to introduce some flexibility into clhs, so that one can easily select an alternative sampling point in case a point occurs to be inaccessible in the field. However, in general the approaches still assume that any of the points within an area is accessible depending on efforts and cost. This is of course correct, but in reality nobody will walk for several days to take a single soil sample or even risk death in areas contaminated by landmines. Sometimes even other research interests prohibit access to the area of interest. In order to map DTA-areas, a general design is developed for the complete investigation area while samples according to the design are then only taken from the accessible part of the area. This means (1) randomly selected points are replaced by accessible points which represent similar landscape positions, i.e. similar combinations of terrain parameters, vegetation cover, etc. and (2) a landscape classification is applied to the whole area, while samples are then randomly selected only from those parts within each unit which are accessible. The basic assumption on which all strategies are based is of course the concept that similar landscape positions carry similar soils with similar soil properties, the basic concept of all DSM regression approaches, i.e. Jenny’s factor model of soil formation (Jenny, 1941). According to Brus and de Gruijter (1997) the search for cost-effective sampling strategies considering the optimal use of ancillary information has been a major task of sampling theorists since the 1940s. It still remains an up-to-date topic particularly while cost-effectiveness and feasibility are considered. Usually, we do not know the probability function or the spatial specification of a soil property in a particular area. We also do not know if the map created based on a sampled dataset is indeed representing the area’s spatial reality. Whereas, we may try as much as possible to select a sample which covers the spatial variability of the factors of soil formation and their interaction and thereby guarantee that we also capture the spatial variability of any soil property caused by these factors. Differences between the various sampling designs in accounting for this ‘‘representativeness’’ of the sample according to the predictor space, while feasibility of the design due to accessibility, man power and cost is given, shall be demonstrated in this study. It is a concern which has so far only been discussed to be relevant for DSM model validation. But then, who wants to build a biased model?

108

M. Ließ / Spatial Statistics 13 (2015) 106–122

Fig. 1. Position of investigation areas. (a) Ecuador in South America, (b) Quinuas and Laipuna within Ecuador, (c) Quinuas, (d) Laipuna (overlaid hill shading with light source from north). Topographical data use with permission from the Ecuadorian Geographical Institute (2013, national base, scale 1:50.000), further GIS data provided by NCI and ETAPA.

2. Material and methods 2.1. Investigation areas To account for different problems of accessibility, the chosen designs will be applied to two tropical mountain landscapes. The two investigation areas, which are different in ownership, size, climatic conditions, vegetation and soil-landscape and pose different constraints due to accessibility, are both situated in southern Ecuador (Figs. 1(a) and (b)). The natural reserve Laipuna (Fig. 1(d)) is owned by the international NGO ‘‘Nature and Culture International’’ (NCI). It comprises an area of approximately 16 km2 , which ranges between c 400 and 1500 m above sea level (a.s.l.). Mean annual temperature is around 23 °C, mean precipitation amounts for c. 600 mm at 600 m a.s.l. (measurements by T. Peters). The area is covered by dry forest vegetation and is disturbed in some parts by cattle grazing. Its footpath network is very limited, but may be extended in some parts by small transects. Thorny under storey vegetation reduces accessibility even further. Therefore, only a corridor of 30 m distance along the existing footpath network is assumed accessible. The area within 5 m distance from the footpaths shall not be sampled due to a possibly high disturbance. Areas close to the creeks (≤10 m) were also excluded due to extremely high stone content, which would not permit sampling in many cases. In Fig. 1(d) the accessible area is hatched in yellow. Please take a closer look since it is only 2.4% of the area which remain accessible. The Quinuas catchment (Fig. 1(c)) includes part of the Cajas National Park, which is administered by the ‘‘Empresa Pública Municipal de Telecomunicaciones, Agua Potable, Alcantarillado y Saneamiento’’ (ETAPA), a municipal public agency, as well as private areas or areas belonging to local associations. It comprises an area of approximately 93 km2 , elevation ranges between c 3000 and 4400 m a.s.l. Mean annual temperature is reported to be 13.2 °C and average annual precipitation c. 1072 mm (information by ETAPA). The area is mainly covered by páramo vegetation. It is accessible by an

M. Ließ / Spatial Statistics 13 (2015) 106–122

109

interprovincial road and several footpaths. However, in this case, the vegetation cover is allowing for comparatively easy access also beyond these paths. To still reduce time and efforts to a reasonable extent, only those parts which lay within 450 m distance to the road and footpaths shall be sampled. Footpaths may be entered for 3000 m from starting points along the road. For applicability this was restrained by 2000 m air-line distance from the road (Fig. 1(c)). The area of disturbance along the road and footpaths was increased to 50 m in this case. First of all road construction causes a lot more disturbance than unconsolidated paths. Furthermore, there is much higher tourist traffic in this area compared to Laipuna which is why the area along paths is assumed to be disturbed to a farther distance. Of course one could argue if 50 m would really be necessary for this buffer zone along paths, but this is of no importance to show the efficiency of the here proposed sampling strategies. 2.2. Predictors For both investigation areas, a shapefile with contour lines of 40 m distance was available. These were interpolated by the multilevel B-spline algorithm developed by Lee et al. (1997) which is implemented in the SAGA GIS software (SAGA User Group Association, 2014). Regarding Laipuna, a final raster resolution of 10 ∗ 10 m2 for the so obtained digital elevation model (DEM) was used. For the much larger area of Quinuas, a lower resolution of 20 ∗ 20 m2 was used to reduce computation time in the later applications. Selectable sampling points for reasons of simplicity were limited to the centres of all GIS raster cells of each research area. A list of all terrain parameters derived from the DEMs which may be used as predictors for DSM and, therefore, as parameters for stratification purpose in the sampling designs is displayed in Table 1. A search radius of 100–300 m was chosen to well represent the landscape structure by these parameters. The parameter ‘‘aspect’’ is a circular variable; it was, therefore, transformed into two separate parameters: northing and easting. 2.3. Sampling designs Four random sampling schemes shall be compared according to the representativeness of the selected samples and, hence, their suitability for model training and validation in regression-based DSM. All four designs will be adapted to guarantee suitability for any considered investigation area, while limited accessibility is taken into account. Two point sampling approaches are compared to two approaches, which sample landscape units. The former two approaches are termed accessible random point sampling (arPS), the latter two are termed accessible random landscape unit sampling (arLUS). Simple random sampling is compared to the well-known conditioned latin hypercube sampling (clhs, Minasny and McBratney, 2006). The landscape units are formed by cluster analysis and predictor quantile combinations, respectively. The four considered designs are named in the following way:

• • • •

S-arPS = Simple-accessible random Point Sampling. LH-arPS = Latin Hypercube based-accessible random Point Sampling. CA-arLUS = Cluster Analysis based-accessible random Landscape Unit Sampling. QC-arLUS = Quantile Combination based-accessible random Landscape Unit Sampling.

For the two point sampling approaches, S-arPS and LH-arPS, 100 sampling points were chosen. For the two algorithms which perform random sampling based on landscape units (CA-arLUS, QC-arLUS), a different number of points had to be chosen in order to guarantee the sampling based on each unit’s areal coverage. Finally, for each selected sampling point in any of the four designs, two alternative points were selected to account for flexibility. These additional points may be sampled in case the first option still falls in an inaccessible area, due to e.g. private property rights or a recently constructed house, which was formerly unknown. The four designs include three stratified designs (S-arPS, CA-arLUS, QC-arLUS). Of these three, two are designed to represent landscape structure, i.e. typical predictor combinations (CA-arLUS and QC-arLUS), one is designed to represent the quantiles of each predictor separately (LH-arPS). Independence of the used predictors is necessary or at least beneficial for the algorithms CA-arLUS, QC-arLUS and LH-arPS. For algorithms LH-arPS and CA-arLUS an additional reduction of the parameter space seems reasonable. Although algorithms S-arPS and CA-arLUS do not need any reduction, it

Terrain analysis—Preprocessing

Altitude

Topographic Wetness Index (TWI) LS Factor

Terrain Analysis—Hydrology

Terrain Analysis—Hydrology

Terrain Analysis—Lighting, Visibility

Positive openness Negative openness

Saga Wetness Index Catchment area Modified catchment area Catchment slope

Terrain Analysis—Morphometry

Convergence index

LS Factor

Topographic Wetness Index (TWI)

SAGA Wetness Index

Fill Sinks (Planchon/Darboux, 2001)

Topographic Openness

Convergence Index (Search Radius)

Beven and Kirkby, 1979; Böhner and Selige, 2006; Moore et al., 1991 Böhner and Selige, 2006; Desmet and Govers, 1996; Kinnell, 2005

Böhner et al., 2002

Planchon and Darboux, 2001

Anders et al., 2009; Prima et al., 2006; Yokoyama et al., 2002

Köthe and Lehmeier, unpublished

Iwahashi and Pike, 2007

Terrain Surface Texture

Terrain Analysis—Morphometry Terrain Analysis—Morphometry

Terrain Ruggedness Index (TRI) Texture



Riley et al., 1999

Relative Heights and Slope Positions

Terrain Analysis—Morphometry

Normalised height Valley depth Mid-slope position Slope height

Wood, 1996 and 2009

Reference

Terrain Ruggedness Index

Morphometric Features

Terrain Analysis—Morphometry

Module

Module library

Terrain parameter

Slope Aspect Profile curvature Plan curvature Maximum curvature

Table 1 Terrain parameters calculated by SAGA GIS.

Conrad/2003

Conrad/2003

Böhner & Conrad/2001

Wichmann/2003

Conrad/2012

Conrad/2003

Conrad/2012

Conrad/2010

Böhner & Conrad/2008

Conrad/2013

SAGA author/year

110 M. Ließ / Spatial Statistics 13 (2015) 106–122

M. Ließ / Spatial Statistics 13 (2015) 106–122

111

was, still, decided to reduce the n-dimensional parameter space for all four designs to allow for comparison. The application of an a-priori experimental factor analysis to the predictors will make sure no important structural landscape elements represented by certain predictors are lost. 2.3.1. Simple accessible random point sampling (S-arPS) As was mentioned before, simple random sampling is often applied, since it is assumed to select a representative sample by chance. Admittedly, the so selected points are likely to be inaccessible. Thus, the randomly selected points were replaced by accessible points which are close to the selected points within the n-dimensional predictor space, but not necessarily in geographic space. After selecting 100 random points, the n-dimensional distances of the chosen points to all points within the accessible area were calculated to form a distance matrix. The point with the smallest distance in parameter space was chosen to replace the originally selected point. In case the selected random point was already part of the accessible area, it was kept. Within R this was implemented in the following way: All points of the accessible area were assigned to the selected random point which is nearest in the n-dimensional predictor space (in this case after reduction by factor analysis) by nearest neighbour classification of the r-package ‘‘kknn’’ (Schliep and Hechenbichler, 2014). Then within the so formed groups the point with the smallest Manhattan distance to the selected point was chosen as accessible alternative point for sampling. The Manhattan distance betweentwo points X = (x1 , x2 , x3 . . . , xn ) and Y = (y1 , y2 , y3 , . . . , yn ) was n calculated according to D = i=1 |xi − yi |. 2.3.2. Latin hypercube based accessible random point sampling (LH-arPS) A latin hypercube (McKay et al., 1979) combines the various parameters of the n-dimensional predictor space in such a way, that after deciding on the amount of samples (k) to be taken, the k quantiles of all parameters shall be represented exactly by one sample each. The tricky part is that this optimisation of one sample per quantile has to be achieved for all n predictors simultaneously. The clhs algorithm (Minasny and McBratney, 2006) makes the method applicable for DSM, for it allows only those predictor combinations to be selected which exist within the reality of a particular landscape. clhs as implemented within the r-package clhs (Roudier, 2012) uses simulated annealing. Points are chosen randomly and are then repeatedly exchanged one by one to obtain a sample which optimally represents (1) the quantiles of the predictors, (2) the categories’ occurrence probabilities of any categorical predictors and (3) the predictors’ correlation intensity. The term simulated annealing originates from annealing in metallurgy, the heating and controlled slow cooling of materials to increase crystal size. Accordingly, a small temperature decrease rate (tdecrease) is used for optimisation in simulated annealing, i.e. a slow decrease in the probability of accepting worse solutions while exploring the parameter space to optimise the latin hypercube. And accepting worse solutions during the optimisation process is often necessary to find the overall optimum. Due to the random exchange of only single points, the number of iteration steps has to be set sufficiently high to guarantee that the objective function of the process reaches its minimum value zero. As in random sampling, walking distance and accessibility of the selected point is often a problem. Roudier and Hewitt (2012) proposed to address this problem by introducing a cost variable into the optimisation process of the clhs algorithm. Clifford et al. (2014) adapted the clhs algorithm to make the design more flexible for alternative point selection. But even when assigning rather high costs to unreachable sampling points, those points still remain selectable. However, with little changes to the implemented clhs package, quantile strata are calculated based on the complete investigation area, but selectable points to optimise the latin hypercube may only be chosen from an accessible subarea. This can be done by inserting the reduced dataset at the point selection stage into the clhs algorithm, and the algorithm for the LH-arPS design is created. 2.3.3. Cluster analysis based accessible random landscape unit sampling (CA-arLUS) Stratified sampling is often applied to improve the representativeness, i.e. reduce the sampling error. To do so, the area is stratified into several subareas, and then random sampling is applied within each subarea. The here presented CA-arLUS algorithm and the following QC-arLUS algorithm are based

112

M. Ließ / Spatial Statistics 13 (2015) 106–122

on random sampling of such strata, i.e. landscape units. By applying the CA-arLUS algorithm, these units are formed by cluster analysis. The term landscape unit is used, since predictors may not only include terrain, but also vegetation parameters, etc. Each of the units is then sampled randomly according to its spatial coverage within the complete research area, while selectable samples are restricted to the accessible area. There are quite a number of clustering algorithms which all have their advantages and disadvantages, please compare Leisch and Gruen (2013). In this case, clusters were formed by the algorithm ‘‘partitioning around medoids’’ (pam, Kaufman and Rousseeuw, 1990) which obtains a number of k clusters by selecting k representative objects from the dataset. The clusters are then formed by slowly assigning each of the remaining objects to the nearest representative object. Every time, a new sample is added to a particular cluster, a new representative object has to be calculated: the medoid. It lies within the centre of the cluster it defines, i.e. the average distance of the representative object to all other objects of the same cluster is minimised. For a large number of objects, the pam algorithm is implemented by function clara ‘‘clustering large applications’’ from r-package cluster (Maechler et al., 2013). The Dunn Index (Dunn, 1974), the ratio of the shortest distance between objects which are not within the same cluster to the largest intra-cluster distance, may be used to decide on the number of clusters. It is implemented within the R-package clValid (Brock et al., 2014). 2.3.4. Quantile combination based accessible random landscape unit sampling (QC-arLUS) Similar to LH-arPS, this sampling design is also making use of predictor quantiles. Though, no latin hypercube shall be optimised, but all occurring quantile combinations are used. The detail of parameter space representation is chosen through the number of parameters (n) and quantiles (k) each of these parameters is divided into, due to available time and cost. kn determines the number of landscape units. Five quantiles will be sufficient in most cases. However, the size and complexity of the area also has to be taken into account. An area might include more different structures which might require a higher number of sampling units. Again all of the so obtained landscape units are sampled by random sampling according to their spatial coverage. Selectable points are only those within the accessible area. 2.4. Dataset preparation by factor analysis The application of a factor analysis serves to condense the landscape pattern information given by a number of predictors and makes the so obtained factors independent from one another. All predictors were mean centred and divided by their standard deviation before they entered the factor analysis. The number of factors was determined by the acceleration factor of the scree plot (Cattell, 1966; Raiche et al., 2006) and the very simple structure criterion (Revelle and Rocklin, 1979) implemented in the functions ‘‘vss’’ of package ‘‘psych’’ (Revelle, 2014a,b) and ‘‘nScree’’ of package ‘‘nFactors’’ (Raiche and Magis, 2010) of the software package R. After a decision concerning the number of factors was reached, the factors were calculated by function ‘‘fa’’ of package ‘‘psych’’. An overview of the various criteria to determine the number of factors can be obtained from Fabrigar and Wegener (2012), Revelle (2014a,b) and Pearson et al. (2013). Categorical predictors may also be included by conducting a multiple factor analysis as implemented in the R package ’’FactoMineR’’ (Husson et al., 2014). 3. Application to the investigation areas 3.1. Factor analysis The application of an exploratory factor analysis resulted in a 2 factor solution for both areas, due to the acceleration factor of the scree plot (Figs. 2(a) and (b)) and the very simple structure criterion (Figs. 2(c) and (d)). The overall fit, in that effect, how well the factor model represents the correlation matrix, was 0.86 for Laipuna and 0.82 for Quinuas. The rotation method ‘‘varimax’’ was used and scores were calculated according to ‘‘tenBerge’’. Fig. 3 shows the so obtained 2 factors which represent the basic geomorphological pattern of the two research areas according to the factor analysis.

M. Ließ / Spatial Statistics 13 (2015) 106–122

a

b

c

d

113

Fig. 2. Number of factors: (1) according to the acceleration factor of the Scree plot, (a) Laipuna, (b) Quinuas, (2) according to the very simple structure criterion, (c) Laipuna, and (d) Quinuas.

3.2. Simple accessible random point sampling (S-arPS) Fig. 4 shows the Minkowski distance between the randomly selected points and their replacement for Laipuna and Quinuas. As expected, the smaller percentage of the accessible area in Laipuna – 2.4% compared to 25% for Quinuas – leads to higher distances in the n-dimensional predictor space which have to be accepted. As a result of the factor analysis on standardised predictors, both factors had similar ranges (−3.12 to 4.92 compared to −3.29 to 3.20), so that the usage of the Manhattan distance was not a problem. Otherwise, a different distance measure (e.g. Mahalanobis) might have to be considered. 3.3. Latin hypercube based accessible random point sampling (LH-arPS) A number of 6,000,000 iterations was tried to possibly reach the optimal value of zero for the objective function of the adapted clhs design, LH-arPS (Figs. 5(a) and (b)). The minimum value of the objective function for Laipuna, 2.0, was reached after 1,233,841 iterations; the minimum value for Quinuas, 2.41e−06 , was reached after 3,643,396 iterations. For Quinuas, the latin hypercube is optimally represented with exactly one sample per quantile in both factors. Most probably, in the case of Laipuna no further optimisation was possible due to too few selectable samples within the very limited accessible area. Here, quantiles 13 and 85 of factor 1 contain 0 and 2 samples, respectively, while factor 2 is optimally represented by 1 sample per quantile. To select the LH-arPS sample for

114

M. Ließ / Spatial Statistics 13 (2015) 106–122

Fig. 3. Spatial distribution of the two factors’ quantiles. (a) Factor 1 Laipuna, (b) factor 1 Quinuas, (c) factor 2 Laipuna, and (d) factor 2 Quinuas.

b 0.04 0.02

Minkowski distance

0.6 0.4 0.0

0.00

0.2

Minkowski distance

0.8

a

0

20

60 point

100

0

20

60 point

100

Fig. 4. Minkowski distance between random point and accessible alternative point. (a) Laipuna, and (b) Quinuas.

0 0e+00

Objective function 50 100 150 0

Objective function 50 100 150

d

Objective function 50 100 150

0e+00 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 Iterations

0e+00 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 Iterations

c

115

0

50

100

150

b

0

a

Objective function

M. Ließ / Spatial Statistics 13 (2015) 106–122

1e+05

2e+05 Iterations

3e+05

4e+05

0e+00

1e+05

2e+05 Iterations

3e+05

4e+05

Fig. 5. Development of the objective function of the simulated annealing, Laipuna (a), (c) and Quinuas (b), (d). Row 1 shows the complete iterative process, row 2 shows the early iteration steps in more detail.

Laipuna and Quinuas, tdecrease was set to 0.9999. The process of worse solutions (higher values of the objective function) being accepted during the optimisation process can be seen in Figs. 5(c) and (d). While the clhs algorithm guarantees a good representation of the probability density function of each predictor and the intensity of correlation between continuous predictors is also taken into account, the relation among predictors is not considered and it is, therefore, very likely that typical landscape positions remain unsampled. Fig. 6(a) shows a possible predictor relation. If we now assume that there are two typical landscape positions within the area, corresponding to the two stars in Fig. 6(a), which account for a spatial coverage of 20%, then this should result in the selection of the two quantile combinations marked as green in Fig. 6(b). According to the rules for optimising the latin hypercube this will be considered suboptimal, so that the search algorithm has to find another solution such as e.g. indicated by the circles in Fig. 6(b). As a consequence, the resulting optimal conditioned latin hypercube cannot represent all typical landscape positions of a particular area, possibly not even the most frequent ones. Furthermore, the relation among the two predictors is simplified which will cause biasedness in the dataset’s predictions. The clhs design is, therefore, unsuitable for model validation, but a second dataset with an unbiased mean needs to be collected. However, very often there is no time to do so and, therefore, parts of the same dataset need to be used for model development and model validation. There is still another issue which differentiates clhs from the original latin hypercube sampling (lhs). In lhs all boxes in Fig. 6(b) would have the same probability of being chosen, in clhs this is not the case. The number of sampling points corresponding to each of the boxes differs. Accordingly, the higher the number of the selected sampling points and the higher the number of the predictors, the more boxes will remain empty without any selectable sampling points and will therefore guide the optimisation process. As a consequence to this, the selected points are not independent from each other but the selection of each point has an impact on the further selection process. This effect could be reduced while the predictors are uncorrelated and their number is limited. Applying an a-priori factor analysis has two benefits: it (1) reduces the number of predictors and (2) decorrelates them. 3.4. Cluster analysis based accessible random landscape unit sampling (CA-arLUS) There are quite a number of algorithms available to classify terrain into strata: e.g. the Topographic Position Index based landform classification (Guisan et al., 1999; Weiss, 2000) implemented in the SAGA GIS software. Accordingly, Drăguţ and Eisank (2012) used SRTM data to obtain terrain classes

116

M. Ließ / Spatial Statistics 13 (2015) 106–122

Fig. 6. (a) Possible predictor relation, and (b) corresponding conditioned latin hypercube. (For interpretation of the references to colour in this figure, the reader is referred to the web version of this article.)

Fig. 7. Landscape units from clustering two factors. (a) Laipuna, and (b) Quinuas.

by using multi-resolution segmentation, Jasiewicz and Stepinski (2013) used a pattern recognition approach to define geomorphologic phonotypes. The CA-arLUS and QC-arLUS provide easily applicable algorithms to stratify the landscape into a number of subareas. Both algorithms are completely objective if the number of classes was set. No importance is given to the interpretation of these landscape units as geomorphological features, etc., they simply capture some sort of spatial landscape pattern. In comparison to other ready implemented terrain classification algorithms, they have the additional advantage that not only terrain parameters but any type of ancillary information can be included, e.g. vegetation indices derived from satellite images. The applied stratification within CA-arLUS makes the approach similar to the random point selection approach S-arPS. Though, instead of finding similar and accessible parameter combinations after randomly choosing sampling points, it simply first groups similar predictor combinations, i.e. landscape positions, by clustering and then choses from each of the so obtained landscape units random and accessible points according to the areal coverage of these units. Similarly, in a simple random sampling approach, landscape positions which are more typical for a particular area are likely to receive more samples. If the landscape has rather abrupt changes concerning some of the used descriptive parameters, an optimal number of clusters could be determined due to some validity index, e.g. the Dunn Index (Dunn, 1974). However, in the two here considered areas the terrain is rather smooth and, therefore, the validity indices show only little changes from one cluster solution to the other. Therefore, the number of clusters was set to 25, to compare the landscape units and results to the QC-arLUS. The spatial distribution of the so obtained landscape units is displayed in Fig. 7. Finally, sampling designs which are based on sampling units instead of singular points have the additional advantage of being more flexible.

M. Ließ / Spatial Statistics 13 (2015) 106–122

117

Fig. 8. Landscape units from the factors’ quantile combinations: (a) Laipuna, and (b) Quinuas.

3.5. Quantile combination based accessible random landscape unit sampling (QC-arLUS) QC-arLUS uses only few quantiles of a very reduced parameter space, while it considers all possible parameter quantile combinations and forms landscape units based on these, which are then sampled randomly and according to their spatial coverage. With the application of an a-priori factor analysis, that reduces the parameter space to only two factors, the choice of how detailed the parameter space needs to be represented, stays with the number of chosen quantiles. To decide about the number of quantiles for each factor and hence the number of sampling units (kn ), two things have to be taken into account: (1) More quantiles describe each of the factors in more detail, what means a better representation of the predictor space and hence the landscape’s structure, but will, naturally, lead to a higher number of sampling units. (2) If the area’s percentage which is covered by the landscape units varies in the order of magnitude from one unit to another, this will lead to a very high number of sampling points (soil profiles), since all landscape units are sampled according to their spatial coverage. There are two options to address the latter problem: (a) a reduction of quantiles might be sufficient, or (b) it might be necessary to exclude some units with a very low spatial coverage from sampling, considering that they are not representative for the area under study. Laipuna has a more uniform distribution of area per sampling unit (Fig. 9f1 ) considering 5 quantiles, with the smallest landscape unit’s area percentage (1.94%) being contained 4 times within the largest landscape unit (7.95%) which leads to 52 sampling points. Quinuas faces a rather uneven distribution (Fig. 10f1 ) considering 5 quantiles, with the smallest landscape unit’s areal percentage (0.16%) being contained 52 times within the largest landscape unit (8.05%), what would lead to 637 sampling points. Therefore, in the case of Quinuas, the landscape unit comprising only 0.16% of the area was excluded from sampling due to its low areal coverage (compare unit 1.1 in Figs. 10f1 and 10f5 ), reducing the number of sampling points to 108. Of course the number of sampling points can always be doubled, tripled, etc. depending on time and monetary constraints. As a rule of thumb the minimum number of sampling points should be at least 100. For Laipuna, setting two sampling points within the landscape unit of lowest areal coverage finally resulted in 102 sampling points. The spatial distribution of the so obtained landscape units for Laipuna and Quinuas is displayed in Fig. 8. 3.6. Comparison of sampling designs The four designs are compared according to their ability to well represent a DTA-area. But how well a sample represents the population it is selected from depends on the considered criteria: (1) Which predictors and how many are taken into account?

118

M. Ließ / Spatial Statistics 13 (2015) 106–122

(2) Are the probability density functions of the included predictors well represented? (3) Are the relationships among the predictors well represented, i.e. are all typical landscape positions sampled according to their spatial coverage? By applying an a-priori factor analysis, we can consider as many predictors as we want, so that criterion (1) is always fulfilled to its best. Criterion (2) is optimised and included within LH-arPS and QCarLUS, respectively. CA-arLUS and QC-arLUS account for criterion (3), the representation of landscape positions, i.e. predictor combinations, according to their spatial coverage. S-arPS does something similar, since more typical landscape positions are more likely to be selected by chance. Still, there is no guarantee. According to Jenny (1941) it is the interacting effect of the factors of soil formation which forms the soil in a particular landscape position. Therefore, it is not enough to guarantee that we are not systematically mis-representing the probability density function of any predictor, but in addition we also need to check if the relation between the predictors, i.e. the area’s typical landscape positions, which are here represented by the landscape units, are not mis-represented. Fig. 9 (Laipuna) and 10 (Quinuas) show a comparison of the parameter (predictor) space of the selected samples with the complete area’s parameter (predictor) space. The two figures indicate whether a sample selected by any of the four here considered sampling designs is representative for the particular area. Row one of the two figures gives an overview of the particular investigation area regarding the two factors’ Kernel densities (Figs. 9a1 , b1 and 10a1 , b1 ), the quantiles of the two factors (Figs. 9c1 , d1 and 10c1 , d1 ) and the landscape units formed by cluster analysis (Figs. 9e1 and 10e1 ) and the landscape units formed by predictor quantile combinations (Figs. 9f1 and 10f1 ). While the landscape units formed by cluster analysis are simply named from 1 to 25, the latter are named according to the combination of the quantiles from the two factors 1 and 2 (qf1 and qf2 ) as qf1 · qf2 and are, therefore, numbered from 1.1 to 5.5. Rows 2–5 show the same information for the samples selected according to the four sampling designs. For Laipuna the probability density functions of the two factors of the selected samples are similar to those of the complete area, for all four considered sampling designs (Figs. 9a1–5 , 9b1–5 ). For Quinuas this is also true with one exception, factor 1 of the S-arPS design (Fig. 10a2 ). The Kolmogorov–Smirnov test confirms this optical impression. The null hypothesis has to be rejected only for the S-arPS design regarding factor 1 in Quinuas with a p-value of 0.035. The successfully tested distribution accordance is indicated with green colour in Figs. 9a2–5 , 9b2–5 , 10a2–5 and 10b2–5 . As demanded by the definition of the sampling design, the five quantiles of the two factors are well represented by LH-arPS and QC-arLUS (Figs. 9c3,5 , 9d3,5 , 10c3,5 , 10d3,5 ) as indicated by green colours in columns 3 and 4 of Figs. 9 and 10. Due to the same reason, Figs. 9e4 , 10e4 , 9f5 and 10f5 are very similar to their corresponding Figs. 9e1 , 9f1 , 10e1 and 10f1 . They cannot be exactly the same since only integer numbers of sampling points were considered. Regression models depend on the used data, i.e. different data will result in a different model. In simple random sampling, the probability density function of the predictors describing a selected sample will only follow that of the population it is selected from, if the sample size is set sufficiently high. This necessary size might depend on the total amount of selectable samples. However, due to the smoothing of the GIS raster to 20 m resolution in Quinuas compared to 10 m resolution for Laipuna, the number of selectable points in the much bigger area Quinuas is not as high as it could be: 233,149 points in Quinuas compared to 163,080 in Laipuna. But still the sample size for Quinuas was not sufficient. The higher Manhattan distance which had to be accepted for Laipuna compared to Quinuas obviously did not have any negative impact. Finally, S-arPS is a good choice for Laipuna but not for Quinuas. Accordingly, the representation of the predictors’ probability density functions should be tested when using simple random sampling to establish regression models. The other three designs, LH-arPS, CA-arLUS and QC-arLUS did not show any problem in representing the predictors’ probability density functions. Due to the application of an a-priori factor analysis the here considered predictors (factors) are completely independent from one another and therefore predictor relations cannot introduce any bias while being neglected. Accordingly, all four designs are equally suited to sample while considering the above remark on sample size in simple random sampling which concerns S-arPS. Still, a particular landscape position is always represented by a number of predictors and therefore the simultaneous

M. Ließ / Spatial Statistics 13 (2015) 106–122

119

Fig. 9. Comparison of sampling designs for Laipuna. Column 1 (a1 –a5 ) Kernel density of factor 1, column 2 (b1 –b5 ) Kernel density of factor 2, column 3 (c1 –c5 ) data percentage per quantiles of factor 1 (qf1 ), column 4 (d1 –d5 ) data percentage per quantiles of factor 2 (qf2 ), column 5 (e1 –e5 ) data percentage per landscape units formed by clustering, column 6 (f1 –f5 ) data percentage per landscape unit formed by quantile combinations x.x = qf1 .qf2 . Row 1 (a1 –f1 ) complete research area, row 2 (a2 –f2 ) sampling points from S-arPS, row 3 (a3 –f3 ) sampling points from LH-arPS, row 4 (a4 –f4 ) sampling points from CA-arLUS, row 5 (a5 –f5 ) sampling points from QC-arLUS. (For interpretation of the references to colour in this figure, the reader is referred to the web version of this article.)

effect they cause on the considered soil property. Therefore, considering predictor combinations and hence landscape units (arLUS approaches) makes more sense than only optimising the probability density functions of each predictor separately as is done in clhs (LH-arPS). The four designs were compared according to their ability to well represent a DTA-area and not according to DSM model performance. It is often argued that a sampling design cannot be evaluated without data. However, spatial prediction performance cannot be used as the ultimate sampling efficiency criterion, either. Using part of the data obtained by a particular sampling design for its validation (e.g. Drăguţ and Dornik, 2013) is not a reasonable solution, because both, the training’s as well as the validation’s data mean might be biased according to the same design. However, in most cases no independent completely randomly selected dataset of sufficient size is available for validation, but part of the data is used for model development and part for model validation. Cross validation procedures provide a way to address this problem; five- and ten-fold cross-validation are usually recommended (e.g. Breiman and Spector, 1992; Kohavi, 1995). Still, emphasise has to be given to the design which is used to collect this single dataset as in most cases it includes both, the data used for model training and the data used for validation. Last but not least, often rather huge soil databases are available for the area which shall be mapped by DSM. Consequently, it might be difficult to justify the effort and cost for yet another surveying campaign complying with the discussed criteria of representativeness. Moreover, this is often unnecessary. The data may originate from different surveyors and different sampling purposes. Accordingly, the complete dataset may not be good in representing a particular area. Though, it is very likely that a subset of the data may do so. Please be aware, that the four here presented designs can be applied to subsample a big but inappropriate dataset to well represent the area it was collected from.

120

M. Ließ / Spatial Statistics 13 (2015) 106–122

Fig. 10. Comparison of sampling designs for Quinuas. Column 1 (a1 –a5 ) Kernel density of factor 1, column 2 (b1 –b5 ) Kernel density of factor 2, column 3 (c1 –c5 ) data percentage per quantiles of factor 1 (qf1 ), column 4 (d1 –d5 ) data percentage per quantiles of factor 2 (qf2 ), column 5 (e1 –e5 ) data percentage per landscape units formed by clustering, column 6 (f1 –f5 ) data percentage per landscape unit formed by quantile combinations x.x = qf1 .qf2 . Row 1 (a1 –f1 ) complete research area, row 2 (a2 –f2 ) sampling points from S-arPS, row 3 (a3 –f3 ) sampling points from LH-arPS, row 4 (a4 –f4 ) sampling points from CA-arLUS, row 5 (a5 –f5 ) sampling points from QC-arLUS. (For interpretation of the references to colour in this figure, the reader is referred to the web version of this article.)

4. Conclusions In sampling a subarea while guaranteeing representativeness regarding the predictor space of a much larger area, which shall be mapped by regression-based DSM, all four algorithms have shown their strength. The general problem of many statistically-based sampling designs that select points which are simply unreachable was solved and the applicability to the soil scientist collecting the data is guaranteed. In addition, instead of sampling a previously un-sampled area with limited accessibility, the four sampling designs may also be used to subsample an existing dataset and, thereby, optimise a suboptimal dataset based on the predictor space of the area which shall be mapped by DSM. Constraints have to be made, regarding the clhs algorithm concerning its applicability for the validation of DSM results. The inter point dependence during the optimisation process and the lack in representing the predictor relations may lead to the neglect of possibly important landscape positions during sampling. This makes it necessary to collect an additional dataset with an unbiased mean for model validation when using clhs for model training. For the simple random sampling approach, representativeness of the selected samples according to the predictors’ probability density functions needs to be tested, since representativeness in this approach is left to chance and might depend on the relation of the sample size to the number of selectable sites. It was shown that basically any sampling design which considers the predictor space for regression models could be adapted to sample in a small subarea or even outside the area of interest while the predictor space does not show any abrupt changes at its boundaries. This is true, at least if we truly trust in Jenny’s factor model of soil formation (Jenny, 1941), which basically states that similar landscape positions which are influenced by the same intensity and combination of soil forming factors carry similar soils. And if we do not trust it why would we then use environmental predictors in regression-based DSM?

M. Ließ / Spatial Statistics 13 (2015) 106–122

121

Sampling efficiency for regression-based DSM may be evaluated based on the representation of the predictor space. According to Jenny (1941) it is the interacting effect of the factors of soil formation which forms the soil in a particular landscape position. As a consequence, this interaction needs to be considered also within the sampling design. To truly represent an area, it is not enough to guarantee that we are not systematically mis-representing the density function of a particular predictor, but in addition we also need to check if all typical landscape positions of the area are well-represented. Only if this is the case, we may conclude that the spatial variability of the soil parameter under question is well-captured without introducing any bias. Furthermore, in considering the importance of the interaction of soil forming factors for any landscape position, it does not make sense to limit the ancillary information we have to only a few predictors, but all information that is available should be included. Of course, the landscape is much better represented if not only terrain parameters, but all other spatial factors of soil formation, i.e. vegetation parameters, land use, climate and lithology would also be included. It was demonstrated that the application of an a-priori factor analysis is a good way to densify the ancillary information to a few factors and thereby represent the spatial structure of a complex predictor space. In addition, receiving uncorrelated predictors is beneficial for many designs. All-in-one, this manuscript wants to provide a guideline on how to adapt any sampling design for regression-based DSM in case most or all of a particular area is inaccessible, and on how to test whether a selected sample provides a good representation of the area to be mapped. Acknowledgements This research was funded by the German Research Foundation (DFG) as part of the Platform for Biodiversity and Ecosystem Monitoring and Research in South Ecuador (PAK 825, LI 2360/1-1). Logistic support by the NGO Nature and Culture International (NCI) and the municipal public agency ETAPA is gratefully acknowledged. References Anders, N.S., Seijmonsbergen, A.C., Bouten, 2009. Multi-scale and object-oriented image analysis of high-res LiDAR data for geomorphological mapping in alpine mountains. In: Proceedings of Geomorphometry. 2009. Beven, K.J., Kirkby, M.J., 1979. A physically-based variable contributing area model of basin hydrology. Hydrol. Sci. Bull. 24 (1), 43–69. Böhner, J., Köthe, R., Conrad, O., Gross, J., Ringeler, A., Selige, T., 2002. Soil regionalisation by means of terrain analysis and process parameterisation. In: Micheli, E., Nachtergaele, F., Montanarella, L. (Eds.) (2002). Soil Classification 2001. European Bureau, Research Report No. 7, EUR 20398 EN, Luxembourg, pp. 213–222. Böhner, J., Selige, T., 2006. Spatial prediction of soil attributes using terrain analysis and climate regionalisation. In: Böhner, J., McCloy, K.R., Strobl, J. (Eds.), SAGA—Analysis and Modelling Applications. In: Goettinger Geographische Abhandlungen, vol. 115. pp. 13–27. Breiman, L., Spector, P., 1992. Submodel selection and evaluation in regression. The X-Random case. Internat. Statist. Rev. 60 (3), 291–319. Brock, G., Pihur, V., Datta, S., Datta, S., 2014. clValid: Validation of clustering results. http://cran.r-project.org/web/packages/ clValid/index.html (access: December 2013). Brus, D.J., de Gruijter, J.J., 1997. Random sampling or geostatistical modelling? Choosing between design-based and modelbased sampling strategies for soil (with Discussion). Geoderma 80, 1–44. Brus, D.J., de Gruijter, J.J., Groeningen, J.W., 2007. Designing spatial coverage samples using the K-Means clustering algorithm. In: Lagacherie, P., McBratney, A.B., Voltz, M. (Eds.), Digital Soil Mapping—An Introductory Perspective. In: Developments in Soil Science, vol. 31. Elsevier, Amsterdam. Cambule, A.H., Rossiter, D.G., Stoorvogel, J.J., 2013. A methodology for digital soil mapping in poorly-accessible areas. Geoderma 192, 341–353. Cattell, R.B., 1966. The Scree test for the number of factors. Multivariate Behav. Res. 1, 245–276. Clifford, D., Payne, J.E., Pringle, M.J., Searle, R., Butler, N., 2014. Pragmatic soil survey design using flexible latin hypercube sampling. Comput. Geosci. 67, 62–68. Desmet, P.J.J., Govers, G., 1996. A GIS procedure for automatically calculating the USLE LS Factor on topographically complex landscape units. J. Soil Water Conserv. 51 (5), 427–433. Dobermann, A., Simbahan, G.S., 2007. Methodology for using secondary information in sampling optimization for making fine-resolution maps of soil organic carbon. In: Lagacherie, P., McBratney, A.B., Voltz, M. (Eds.), Digital Soil Mapping—An Introductory Perspective. In: Developments in Soil Science, vol. 31. Elsevier, Amsterdam. Drăguţ, L., Dornik, A., 2013. Land-surface segmentation as sampling framework for soil mapping. In: Conference Paper, Geomorphometry Conference in Nanjing, China, 2013. Drăguţ, L., Eisank, C., 2012. Automated object-based classification of topography from SRTM data. Geomorphology 141–142, 21–33.

122

M. Ließ / Spatial Statistics 13 (2015) 106–122

Dunn, J.C., 1974. Well separated clusters and fuzzy partitions. J. Cybern. 4, 95–104. Fabrigar, L.R., Wegener, D.T., 2012. Understanding Statistics: Exploratory Factor Analysis. Oxford University Press, Oxford. Gessler, P.E., Moore, I.D., McKenzie, N.J., Ryan, P.J., 1995. Soil-landscape modelling and spatial prediction of soil attributes. Int. J. Geogr. Inf. Syst. 9, 421–432. Guisan, A., Weiss, S.B., Weiss, A.D., 1999. GLM versus CCA spatial modelling of plant species distribution. Plant Ecol. 143, 107–122. Hengl, T., Rossiter, D.G., Stein, A., 2003. Soil sampling strategies for spatial prediction by correlation with auxiliary maps. Aust. J. Soil Res. 41, 1403–1422. Husson, F., Josse, J., Le, S., Mazet, J., 2014. Package FactorMineR. http://factominer.free.fr. Iwahashi, J., Pike, R.J., 2007. Automated classification of topography from DEMs by an unsupervised nested-means algorithm and a three-part geometric signature. Geomorphology 86, 409–440. Jasiewicz, J., Stepinski, T.F., 2013. Geomorphons - a pattern recognition approach to classification and mapping of landforms. Geomorphology 182 147–156. Jenny, H., 1941. Factors of Soil Formation. A System of Quantitative Pedology. New York. Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data—An Introduction to Cluster Analysis. John Wiley & Sons, New York. Kinnell, P.I.A., 2005. Alternative approaches for determining the USLE-M slope length factor for grid cells. http://soilscijournals.org/cgi/content/full/69/3/674 (access: December 2013). Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI’95 Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2, pp. 1137–1143. Köthe, R., Lehmeier, F., 1996. SARA-System zur Automatischen Relief-Analyse. User Manual, second ed., Dept. of Geography, University of Goettingen (unpublished). Lee, S., Wolberg, G., Shin, S.Y., 1997. Scattered data interpolation with multilevel B-splines. IEEE Trans. Vis. Comput. Graph. 3 (3), 228–244. Leisch, F., Gruen, B., 2013. CRAN Task View: Cluster Analysis & Finite Mixture Models. http://cran.r-project.org/web/views/ Cluster.html (access: June 2014). Ließ, M., Glaser, B., Huwe, B., 2009. Digital soil mapping in southern Ecuador. Erdkunde 63, 309–319. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., Studer, M., Roudier, P., 2013. Package cluster—cluster analysis extended Rousseeuw et al. http://cran.r-project.org/web/packages/cluster/index.html (access: December 2013). McKay, M.D., Beckman, R.J., Conover, W.J., 1979. A comparision of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21 (2), 239–245. McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil properties using environmental correlation. Geoderma 89, 67–94. Minasny, B., McBratney, A.B., 2006. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Comput. Geosci. 32 (9), 1378–1388. Minasny, B., McBratney, A.B., 2007. Latin hypercube sampling as a tool for digital soil mapping. In: Lagacherie, P., McBratney, A.B., Voltz, M. (Eds.), Digital Soil Mapping—An Introductory Perspective. In: Developments in Soil Science, vol. 31. Elsevier, Amsterdam. Moore, I.D., Grayson, R.B., Ladson, A.R., 1991. Digital terrain modelling: a review of hydrological, geomorphological, and biological applications. Hydrol. Process. 5 (1). Pearson, R., Mundfrom, D., Piccone, A., 2013. A comparison of ten methods for determining the number of factors in explanatory factor analysis. Mult. Linear Regres. Viewp. 39 (1), 1–15. http://mlrv.ua.edu/2013/vol39_1/PeS-arPSon%20et%20al..pdf. Planchon, O., Darboux, F., 2001. A fast, simple and versatile algorithm to fill depressions of digital elevation models. Catena 46, 159–176. Prima, O.D.A., Echigo, A., Yokoyama, R., Yoshida, T., 2006. Supervised landform classification of Northeast Honshu from DEMderived thematic maps. Geomorphology 78, 373–386. Raiche, G., Magis, D., 2010. Package ‘nFactors’. http://cran.r-project.org/web/packages/nFactors/nFactors.pdf (access: February 2014). Raiche, G., Roipel, M., Blais, J.G., 2006. Non graphical solutions for the Cattell’s scree test. Presentation at the annual meeting of the Psychometric Society, Montréal, Canada. http://www.er.uqam.ca/nobel/r17165/RECHERCHE/COMMUNICATIONS/ (access: July 2014). Revelle, W., 2014a. Package ‘psych’. p. 332. http://personality-project.org/r/psych/ (access: 05.02.14.). Revelle, W., 2014b. Package ‘‘psych’’. http://cran.r-project.org/web/packages/psych/psych.pdf (access: February 2014). Revelle, W., Rocklin, T., 1979. Very simple structure: An alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behav. Res. 14, 403–414. Riley, S.J., De Gloria, S.D., Elliot, R., 1999. A terrain ruggedness index that quantifies topographic heterogeneity. Intermt. J. Sci. 5 (1–4), 23–27. Roudier, P., 2012. Conditioned latin hypercube sampling (r package clhs). http://cran.r-project.org/web/packages/clhs/clhs. pdf (access: July 2014). Roudier, P., Hewitt, A.E., 2012. A conditioned latin hypercube sampling algorithm incorporating operational constraints. In: Minasny, B., Malone, B.P., McBratney, A.B. (Eds.), Digital Soil Assessments and Beyond. Proceedings of the Fifth Global Workshop on Digital Soil Mapping, Sydney, Australia, 10th–13th April 2012. CRC Press, Boca Raton. SAGA User Group Association, 2014. SAGA—System for Automated Geoscientific Analysis. http://www.saga-gis.org (access: July 2014). Schliep, K., Hechenbichler, K., 2014. Weighted k-Nearest Neighbours. Package kknn. http://cran.at.r-project.org/web/packages/ kknn/kknn.pdf (access: April 2014). Weiss, A.D., 2000. Topographic position and landforms analysis. http://www.jennessent.com/downloads/tpi-poster-tnc_ 18x22.pdf (access: February 2014). Wood, J., 1996. The Geomorphological characterization of Digital Elevation Models. Diss., Department of Geography, University of Leicester, UK. http://www.soi.city.ac.uk/~jwo/phd/ (access: February 2014). Wood, J., 2009. Geomorphometry in LandSerf. In: Hengl, T., Reuter, H. (Eds.), Geomorphometry: Concepts, Software, Applications. In: Developments in Soil Science, vol. 33. Elsevier, pp. 333–349. Yokoyama, R., Shirasawa, M., Pike, R.J., 2002. Visualizing topography by openness: A new application of image processing to digital elevation models. Photogramm. Eng. Remote Sens. 68, 251–266.