Engineering Geology 111 (2010) 62–72

Techniques for evaluating the performance of landslide susceptibility models

Paolo Frattini a,*, Giovanni Crosta a, Alberto Carrara b

a Dipartimento di Scienze Geologiche e Geotecnologie, Università degli Studi di Milano Bicocca, Milano, Italy
b CNR-IEIIT, Bologna, Italy

Article history: Received 19 January 2009; Received in revised form 4 December 2009; Accepted 9 December 2009; Available online 16 December 2009.

Keywords: Landslide susceptibility; Model evaluation; ROC curve; Cost curve; Misclassification costs; Accuracy

Abstract

Evaluating the performance of landslide susceptibility models is needed to ensure their reliable application to risk management and land-use planning. When results from multiple models are available, a comparison of their performance is necessary to select the model which performs best. In this paper, different techniques to evaluate model performance are discussed and tested using shallow landslide/debris-flow susceptibility models recently presented in the literature (Carrara, A., Crosta, G.B., Frattini, P., 2008. Comparing models of debris-flow susceptibility in the alpine environment. Geomorphology 94 (3–4), 353–378). Moreover, an evaluation technique based on the minimization of the costs that may arise from the adoption of the model as a land management regulatory tool is presented. The results of the application show that simple statistics such as Accuracy, Threat score, Gilbert's skill score, Pierce's skill score, Heidke skill score, and Yule's Q are problematic, as they need to split the classified objects into two classes (e.g., stable/unstable) by defining an a-priori cutoff value of susceptibility, which is often not trivial. ROC curves and Success-Rate curves are cutoff-independent and can be used to efficiently visualize and compare the performance of models, but do not explicitly include classification costs. In addition, Success-Rate curves, under certain conditions, can be misleading when applied to grid-cell models. Cost curves include costs and a-priori probabilities, and are suitable for landslide susceptibility model performance evaluation from a practical point of view.

1. Introduction

In the last decades, many efforts have been made to assess landslide susceptibility at a regional scale. In spite of the huge number of models produced using various methods, little attention has been devoted to the problem of result evaluation. Model evaluation is a multi-criteria problem (Davis and Goodrich, 1990). The acceptance of a model needs to fulfil at least three criteria: its adequacy (conceptual and mathematical) in describing the system behaviour, its robustness to small changes of the input data (i.e. data sensitivity), and its accuracy in predicting the observed data.

With physically-based models, the first evaluation criterion is aimed at assessing whether the model provides a physically acceptable explanation of the cause–effect relationships or, alternatively, whether the adopted simplifications of the physical processes are justified. With statistical or empirical models, the first kind of evaluation focuses on how well the variables used by the models can describe the processes. Due to the complexity of natural systems, this kind of evaluation involves a large component of judgement by experts with a deep knowledge of landslide processes (Carrara et al., 2003).


The robustness of the model can be evaluated by systematically analyzing the variation of the model performance under small changes of the input parameters or their uncertainties. In the landslide susceptibility literature, only a few papers deal with robustness evaluation (Guzzetti et al., 2006; Melchiorre et al., 2006).

The most relevant criterion for quality evaluation is the assessment of model accuracy, which is performed by analyzing the agreement between the model results and the observed data. In the case of landslide susceptibility models, the observed data comprise the presence/absence of landslides within a certain terrain unit used for the analysis. In the pioneering susceptibility models produced beginning in the 1980s, accuracy was evaluated through visual comparison of actual landslides with the susceptibility classification (Brabb, 1984; Gökceoglu and Aksoy, 1996), or in terms of efficiency (or accuracy) (e.g., Carrara, 1983). In the last decade, different authors have proposed equivalent methods to evaluate the models in terms of landslide density within different susceptibility classes ("landslide density", Montgomery and Dietrich, 1994; Ercanoglu and Gokceoglu, 2002; Crosta and Frattini, 2003; "degree of fit", Irigaray et al., 1999; Baeza and Corominas, 2001; Fernández et al., 2003; "b/a ratio", Lee and Min, 2001; Lee et al., 2003). Other authors chose to represent the success of the model by comparing the landslide density with the area of the susceptible zone for different susceptibility levels (Zinck et al., 2001; "Success-Rate curves", Chung and Fabbri, 2003; Remondo et al., 2003; Zêzere et al., 2004; Lee, 2005; Guzzetti et al., 2006). More recently, ROC curves have been adopted for model evaluation and comparison in the landslide literature (Yesilnacar and Topal, 2005; Begueria, 2006; Gorsevski et al., 2006; Frattini et al., 2008; Nefeslioglu et al., 2008).

When a landslide susceptibility model is applied in practice, the classification of land according to susceptibility has economic consequences. For instance, terrain that is classified as stable can be used without restrictions, increasing its economic value, whereas unstable terrain is restricted in use, and is consequently reduced in value. The misclassification of terrain by a model also produces economic costs. Hence, the performance of the models can be evaluated by assessing these costs, in order to select the best model, or the one that minimizes the costs to society. This has typically been done in disciplines such as machine learning (Drummond and Holte, 2000; Provost and Fawcett, 2001) and biometrics (Pepe, 2003; Briggs and Ruppert, 2005). None of the techniques used in the literature to assess the accuracy of landslide susceptibility models accounts for misclassification costs. This limitation is significant for landslide susceptibility analysis, as the costs of misclassification are very different depending on the error type.

An Error Type II (false negative) means that a terrain unit with landslides is classified as stable, and consequently used without restrictions. The false negative misclassification cost, c(−|+), is equal to the loss of the elements at risk that can be impacted by landslides in these units. This cost depends on the economic value and the vulnerability of the elements at risk (e.g., lives, buildings and lifelines), and on the temporal probability and intensity of landslides. An Error Type I (false positive) means that a unit without landslides is classified as unstable, and is therefore limited in its use and economic development. Hence, the false positive misclassification cost, c(+|−), amounts to the loss of economic value of these terrain units. This cost is different for each terrain unit as a function of its environmental (slope gradient, altitude, aspect, distance from the main valley, etc.) and socio-economic (distance from an urban/industrial area, road, etc.) characteristics. With landslide susceptibility models, the costs related to Error Type II are normally much larger than those related to Error Type I. For example, siting a public facility, such as a school building, in a terrain unit that is incorrectly identified as stable (Type II error) could lead to very large social and economic costs in the event of a landslide.

Accounting for misclassification costs in the evaluation of model performance is possible with ROC curves by using an additional procedure (Provost and Fawcett, 1997), but the results are difficult to visualize and assess. In this paper, a simple technique (Cost curves; Drummond and Holte, 2000) is adopted to explicitly account for these costs. In the following, different techniques for the evaluation and comparison of landslide susceptibility model performance (accuracy statistics, ROC curves, Success-Rate curves, and Cost curves) are presented and tested on shallow landslide/debris-flow susceptibility models. The aim of the paper is to demonstrate the applicability and capability of these techniques, by discussing their advantages and disadvantages. In order to simplify the presentation and to keep the focus on model evaluation, landslide susceptibility models already presented in the literature (Carrara et al., 2008) are used.

1.1. Accuracy statistics

As previously mentioned, accuracy is assessed by analyzing the agreement between the model results and the observed data. Since the observed data comprise the presence/absence of landslides within a certain terrain unit, the simplest way to assess the accuracy is to compare these data with a binary classification of susceptibility into stable and unstable units. This classification requires a cutoff value of susceptibility that divides stable terrain (susceptibility less than the cutoff) from unstable terrain (susceptibility greater than the cutoff). The comparison of observed data and model results reclassified into two classes is represented through contingency tables (Table 1). Accuracy statistics assess the model performance by combining correctly and incorrectly classified positives (i.e., unstable areas) and negatives (i.e., stable areas).


Table 1. Contingency table used for landslide model evaluation.

Predicted \ Observed     Class 0 (−) stable                          Class 1 (+) unstable                         Total
Class 0 (−) stable       (−|−) true negative, tn                     (−|+) false negative, fn (Error Type II)     NP
Class 1 (+) unstable     (+|−) false positive, fp (Error Type I)     (+|+) true positive, tp                      PP
Total                    N                                           P                                            T
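To make the table concrete, the short sketch below (a minimal Python example; the label arrays are illustrative and do not come from the study) derives the four contingency counts from observed and predicted binary labels:

```python
# Minimal sketch of Table 1: deriving the four contingency counts from observed
# and predicted binary labels (True = unstable). The arrays are illustrative.
import numpy as np

observed = np.array([True, False, True, True, False, False])
predicted = np.array([True, False, False, True, True, False])

tp = int((predicted & observed).sum())    # (+|+) true positives
tn = int((~predicted & ~observed).sum())  # (-|-) true negatives
fp = int((predicted & ~observed).sum())   # (+|-) false positives, Error Type I
fn = int((~predicted & observed).sum())   # (-|+) false negatives, Error Type II

print(tp, tn, fp, fn)  # -> 2 2 1 1
```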

The first statistic, presented in the field of weather forecasting (Finley, 1884), is the Efficiency (Accuracy or Percent correct, Table 2), which measures the percentage of observations that are correctly classified by the model. However, Gilbert (1884) showed that the Efficiency statistic is unreliable because it is heavily influenced by the most common class, usually "stable slopes" in the case of landslide susceptibility models, and it is not equitable. A statistic is equitable if it gives the same score for different types of unskilled classifications. In other words, classification by random chance, "always positive", and "always negative" should produce the same (bad) score (Murphy, 1996).

The True Positive rate (TP) and the False Positive rate (FP) are insufficient performance statistics on their own, because they ignore false positives and false negatives, respectively. They are not equitable, and they are useful only when used in conjunction (e.g., ROC curves).

The Threat score (Gilbert, 1884) measures the fraction of observed and/or classified events that were correctly predicted. Because it penalizes both false negatives and false positives, it does not distinguish the source of classification error. Moreover, it depends on the frequency of events (poorer scores for rarer events), since some true positives can occur purely by random chance.

The Equitable threat score (Gilbert's skill score; Gilbert, 1884; Schaefer, 1990) measures the fraction of observed and/or classified events that were correctly predicted, adjusted for the true positives associated with random chance. As above, it does not distinguish the source of classification error.

Pierce's skill score (True skill statistic; Peirce, 1884; Hanssen and Kuipers, 1965) uses all the elements of the contingency table and does not depend on event frequency. This score may be more useful for more frequent events (Mason, 2003).

Heidke's skill score (Cohen's kappa; Heidke, 1926) measures the fraction of correct classifications after eliminating those classifications which would be correct purely by random chance.

Table 2. Commonly-used accuracy statistics. tp = true positives, tn = true negatives, fp = false positives (Error Type I), fn = false negatives (Error Type II). See also Table 1.

Efficiency (Accuracy or Percent correct): $\frac{tp + tn}{T}$

True positive rate (TP) = sensitivity: $\frac{tp}{tp + fn} = \frac{tp}{P} = 1 - FN$

False positive rate (FP) = 1 − specificity: $\frac{fp}{fp + tn} = \frac{fp}{N} = 1 - TN$

Threat score (critical success index): $\frac{tp}{tp + fn + fp}$

Equitable threat score (Gilbert skill score): $\frac{tp - tp_{random}}{tp + fn + fp - tp_{random}}$, where $tp_{random} = \frac{(tp + fn)(tp + fp)}{T}$

Pierce's skill score (True skill statistic): $\frac{tp}{tp + fn} - \frac{fp}{fp + tn} = TP - FP$

Heidke skill score (Cohen's kappa): $\frac{tp + tn - E}{T - E}$, where $E = \frac{1}{T}\left[(tp + fn)(tp + fp) + (tn + fn)(tn + fp)\right]$

Odds ratio: $\frac{tp \cdot tn}{fn \cdot fp}$

Odds ratio skill score (Yule's Q): $\frac{tp \cdot tn - fp \cdot fn}{tp \cdot tn + fp \cdot fn}$


The odds ratio (Stephenson, 2000) measures the ratio of the odds of a true prediction to the odds of a false prediction. This statistic takes prior probabilities into account and gives better scores for rare events, but cannot be used if any of the cells in the contingency table is equal to 0. Finally, the odds ratio skill score (Yule's Q; Yule, 1900) is closely related to the odds ratio, but conveniently ranges between −1 and 1.

As mentioned, accuracy statistics require splitting the classified objects into a few classes by defining specific values of the susceptibility index, called cutoff values. For statistical models, such as discriminant analysis (e.g., Carrara, 1983) or logistic regression analysis (e.g., Chung et al., 1995; Atkinson and Massari, 1998; Dai and Lee, 2002; Ohlmacher and Davis, 2003; Nefeslioglu et al., 2008), a statistically significant probability cutoff (pcutoff) exists, equal to 0.5. When the groups of stable and unstable terrain units are equal in size and their distributions are close to normal, this value maximizes the number of correctly predicted stable and unstable units. Under different conditions, or for other types of landslide susceptibility models, such as physically-based (Van Westen and Terlien, 1996; Gökceoglu and Aksoy, 1996; Crosta and Frattini, 2003; Frattini et al., 2004; Godt et al., 2008), heuristic (e.g., Barredo et al., 2000), artificial neural network (Lee et al., 2003; Ermini et al., 2005; Nefeslioglu et al., 2008), or fuzzy logic models (Binaghi et al., 1998; Ercanoglu and Gokceoglu, 2004), the choice of cutoff values to define susceptibility classes is arbitrary, unless a cost criterion is adopted (Provost and Fawcett, 1997). A first solution to this limitation consists in evaluating the performance of the models over a large range of cutoff values by using cutoff-independent performance criteria. Another solution consists in finding the optimal cutoff by minimizing the costs of the models.
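As a compact reference, the sketch below computes the statistics of Table 2 from the four contingency counts; the counts in the example call are illustrative, not taken from the case study:

```python
# Minimal sketch: the accuracy statistics of Table 2 computed from the four
# cells of a contingency table (tp, fp, fn, tn). Variable names follow the
# paper; the example counts are illustrative.

def accuracy_statistics(tp, fp, fn, tn):
    T = tp + fp + fn + tn                      # total number of terrain units
    tp_random = (tp + fn) * (tp + fp) / T      # true positives expected by chance
    E = ((tp + fn) * (tp + fp) + (tn + fn) * (tn + fp)) / T
    return {
        "efficiency": (tp + tn) / T,                       # Accuracy / Percent correct
        "TP_rate": tp / (tp + fn),                         # sensitivity
        "FP_rate": fp / (fp + tn),                         # 1 - specificity
        "threat_score": tp / (tp + fn + fp),               # critical success index
        "gilbert_skill": (tp - tp_random) / (tp + fn + fp - tp_random),
        "pierce_skill": tp / (tp + fn) - fp / (fp + tn),   # TP - FP
        "heidke_skill": (tp + tn - E) / (T - E),           # Cohen's kappa
        "odds_ratio": (tp * tn) / (fn * fp),               # undefined if any cell is 0
        "yules_q": (tp * tn - fp * fn) / (tp * tn + fp * fn),
    }

print(accuracy_statistics(tp=300, fp=150, fn=100, tn=450))
```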

1.2. Cutoff-independent performance criteria

The most commonly-used cutoff-independent performance techniques for landslide susceptibility models are the Receiver Operating Characteristic (ROC) curves and the Success-Rate curves. ROC analysis was developed during the Second World War to assess the performance of radar receivers in detecting targets. It has since been adopted in different scientific fields, such as medical diagnostic testing (Goodenough et al., 1974; Hanley and McNeil, 1982; Swets, 1988) and machine learning (Egan, 1975; Adams and Hand, 1999; Provost and Fawcett, 2001). The Area Under the ROC Curve (AUC) can be used as a metric to assess the overall quality of a model (Hanley and McNeil, 1982): the larger the area, the better the performance of the model over the whole range of possible cutoffs. The points on the ROC curve represent (FP, TP) pairs derived from different contingency tables created by applying different cutoffs. Points closer to the upper-right corner correspond to lower cutoff values (Fig. 1). An ROC curve is better than another if it lies closer to the upper-left corner. The range of values for which the ROC curve is better than a trivial model (i.e., a model which classifies objects by chance, represented in the ROC space by a straight line joining the lower-left and upper-right corners) is defined as the operating range.

Success-Rate curves (Zinck et al., 2001; Chung and Fabbri, 2003) represent the percentage of correctly classified objects (i.e., terrain units) on the y-axis, and the percentage of area classified as positive (i.e., unstable) on the x-axis. In the landslide literature, the y-axis is normally taken as the number of landslides, or the percentage of landslide area, correctly classified. In the case of grid-cell units, where landslides correspond to single grid cells and all the terrain units have the same area, the y-axis corresponds to TP, in analogy with the ROC space, and the x-axis corresponds to the number of units classified as positive.

1.3. Cost curves

The total cost of misclassification of a model depends on (Drummond and Holte, 2000): the percentage of terrain units that are incorrectly classified, the a-priori probability of having a landslide in the area, and the costs of misclassification of the different error types. In order to explicitly represent costs in the evaluation of model performance, Drummond and Holte (2006) proposed the Cost curve representation. The Cost curve represents the Normalized Expected cost as a function of a Probability-Cost function (Fig. 1). The Normalized Expected cost, NE(C), is calculated as:

$$NE(C) = \frac{(1 - TP)\cdot p(+)\cdot c(-|+) + FP\cdot p(-)\cdot c(+|-)}{p(+)\cdot c(-|+) + p(-)\cdot c(+|-)} \quad (5)$$

where the expected cost is normalized by the maximum expected cost, which occurs when all cases are incorrectly classified, i.e. when FP and FN are both equal to one. The maximum normalized cost is 1 and the minimum is 0. The Probability-Cost function, PC(+), is:

$$PC(+) = \frac{p(+)\cdot c(-|+)}{p(+)\cdot c(-|+) + p(-)\cdot c(+|-)} \quad (6)$$

which represents the normalized version of p(+)·c(−|+), so that PC(+) ranges from 0 to 1. When the misclassification costs are equal, PC(+) = p(+). In general, PC(+) = 0 occurs when the cost is only due to negative cases, i.e., positive cases never occur (p(+) = 0) or their misclassification cost, c(−|+), is null. PC(+) = 1 corresponds to the other extreme, i.e., p(−) = 0 or c(+|−) = 0. A single classification model, which would be a single point (FP, TP) in ROC space, is a straight line in the Cost curve representation (Fig. 1). A set of points in ROC space, the basis for an ROC curve, is a set of Cost lines, one for each ROC point.
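As a check on Eqs. (5) and (6), the short sketch below (with illustrative probabilities and costs; nothing here comes from the case study) evaluates PC(+) and NE(C) for a hypothetical classifier and verifies the straight-line form NE(C) = FP + (FN − FP)·PC(+):

```python
# Hedged sketch of Eqs. (5)-(6): the Normalized Expected cost NE(C) of a single
# binary classification, given its (FP, TP) rates, the a-priori probability
# p(+) and the misclassification costs c(-|+), c(+|-). Numbers are illustrative.

def probability_cost(p_pos, c_fn, c_fp):
    """PC(+), Eq. (6): normalized version of p(+) * c(-|+)."""
    p_neg = 1.0 - p_pos
    return p_pos * c_fn / (p_pos * c_fn + p_neg * c_fp)

def normalized_expected_cost(TP, FP, p_pos, c_fn, c_fp):
    """NE(C), Eq. (5): expected cost normalized by the cost of total misclassification."""
    p_neg = 1.0 - p_pos
    num = (1.0 - TP) * p_pos * c_fn + FP * p_neg * c_fp
    return num / (p_pos * c_fn + p_neg * c_fp)

# A classifier with TP = 0.8, FP = 0.3 under equal priors and a false-negative
# cost three times the false-positive cost:
pc = probability_cost(p_pos=0.5, c_fn=3.0, c_fp=1.0)                 # -> 0.75
ne = normalized_expected_cost(0.8, 0.3, p_pos=0.5, c_fn=3.0, c_fp=1.0)
print(pc, ne)  # ne = 0.225 = FP + (FN - FP) * PC(+), with FN = 1 - TP
```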

Fig. 1. ROC curve (a) and Cost curve (b) for a landslide susceptibility model (cSU_DIS). The two example points on the ROC curve result from binary classifications of the terrain units into stable/unstable, using two probability cutoffs (Pcutoff = 0.21 and 0.47). Each point in the ROC curve corresponds to a straight line in the Cost space.


2. Case study

To analyse the advantages and disadvantages of the different evaluation techniques, the debris-flow/shallow landslide susceptibility models recently presented in the literature (Carrara et al., 2008) are used. In the following, some fundamental data and information concerning the study area and the models are reported. A more detailed description can be found in Carrara et al. (2008).

2.1. Study area

The study area extends for about 300 km² in the Italian Dolomites (Val di Fassa, Northern Italy, Fig. 2). The bedrock geology is composed of different rocks (effusive, intrusive and sedimentary) that have been folded and faulted, resulting in steep slopes deeply dissected by an actively eroding drainage network. Pleistocene glaciations deeply sculpted the valley profiles, leaving widespread glacial deposits partially reworked by erosive processes and locally overlain by talus. The area is characterized by a subalpine and alpine climate, with dense coniferous forest up to 1800–1900 m of elevation; higher up, a transition to alpine herbaceous vegetation is observed. Major slope instabilities consist of debris flows, earth slides and flows, rock falls and rock avalanches. Geological, geomorphological, and other environmental data at the 1:10,000 scale were collected (Carrara et al., 2008), and a detailed landslide inventory was compiled using aerial photo interpretation, field surveying, and historical records.


Over 7000 debris flows were detected, mapped, and classified according to their degree of activity and the morphometric setting of the source area. Most of the debris flows originated from shallow landsliding on open slopes and within zero-order basins. More rarely, debris flows were also triggered by the mobilization of debris along channels or at the toe of rocky cliffs. Shallow landsliding prevails in thick Quaternary deposits, especially volcanic materials, whereas the other processes prevail in calcareous and intrusive massifs.

2.2. Debris-flow susceptibility models

Carrara et al. (2008) presented five debris-flow susceptibility models for this area (Table 3, Fig. 3). These models differ in method (statistical and physically-based) and in type of terrain unit (grid cells and slope units). For the statistical models, the independent variables were the geoenvironmental characteristics of the area, and the dependent variable was the presence/absence of debris-flow source areas.

In the grid-cell model, the "unstable" terrain units were 7103 grid cells with debris flows; an equal number of "stable" units (i.e., grid cells without debris-flow source areas) were randomly selected. Half of the unstable and stable grid cells were used for training, and the other half for validation. The size of the grid cell is 10 × 10 m. In the slope-unit models, both a fine partition (9198 terrain units with a mean area of 30,900 m², Fig. 3a) and a coarse partition (3015 terrain units with a mean area of 95,900 m², Fig. 3c) were analysed. As a result of the statistical models, a probability of instability was obtained for each terrain unit, and this was used as the susceptibility metric. Numerous models were also developed for cross-validation, by randomly selecting 65% of the slope units for training and the remaining 35% for validation.

Fig. 2. Upper Avisio river basin (Val di Fassa). The black highlighted square refers to the sub-area represented in Fig. 3. (Modified after Carrara et al., 2008).


Table 3. Models of debris-flow source areas generated by applying different terrain units (slope units and grid cells) and methods (statistical and physically-based) (after Carrara et al., 2008).

Terrain unit                                               Discriminant analysis   Logistic regression   Physically-based
Slope unit, fine (9198 units, mean area: 30,900 m²)        fSU_DIS                 fSU_LOG               –
Slope unit, coarse (3015 units, mean area: 95,900 m²)      cSU_DIS                 –                     –
Grid cell (2,900,485 units; area: 100 m²)                  PIX_DIS                 –                     PHY_BAS

For the physically-based approach, the Montgomery and Dietrich (1994) model, based on the coupling of a steady-state hydrological model and an infinite-slope stability analysis, was applied. To parameterize the model, the study area was subdivided into nine zones according to the typology of superficial deposits and the presence/absence of forest cover. Values for the soil properties were derived from the literature (Rawls et al., 1983; Phoon and Kulhawy, 1999) and successively calibrated to optimize the model results. The ratio (q/T) of steady-state recharge (q) to soil transmissivity (T) needed to trigger a shallow landslide was adopted as the susceptibility metric for this model: the smaller the ratio, the higher the susceptibility.

3. Methods

Here, the performance of the susceptibility models is evaluated by using accuracy statistics, ROC curves, Success-Rate curves, and Cost curves. Six commonly-used cutoff-sensitive accuracy statistics are applied (Table 2): Accuracy, Threat score, Gilbert's skill score, Pierce's skill score, Heidke's skill score, and Yule's Q. The following cutoffs are defined:

• statistical models: probability = 0.5;
• physically-based shallow landslide model: q/T ratio = 0.01.

In order to compare the performance of the models over the whole range of cutoffs, ROC curves and Success-Rate curves are used. The ROC curves are developed by using a statistical package (SPSS), which automatically extracts the True Positive (TP) and False Positive (FP) rates (Table 1) from the contingency tables associated with different cutoff values for each model. The Area Under the Curve (AUC) is also calculated.
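The same cutoff sweep can be reproduced without a statistical package. The sketch below, with synthetic scores and labels standing in for the real susceptibility values, derives (FP, TP) pairs over a range of cutoffs and approximates the AUC by the trapezoidal rule:

```python
# Hedged sketch of the cutoff sweep behind an ROC curve: for each cutoff,
# terrain units with susceptibility above the cutoff are classified unstable,
# a contingency table is formed against the observed labels, and the (FP, TP)
# pair is recorded. Scores and labels here are synthetic and illustrative.
import numpy as np

def roc_curve(scores, labels, n_cutoffs=100):
    cutoffs = np.linspace(scores.max(), scores.min(), n_cutoffs)  # high -> low
    fp_rates, tp_rates = [], []
    pos, neg = labels.sum(), (~labels).sum()
    for c in cutoffs:
        pred = scores >= c                       # unstable if susceptibility >= cutoff
        tp_rates.append((pred & labels).sum() / pos)
        fp_rates.append((pred & ~labels).sum() / neg)
    return np.array(fp_rates), np.array(tp_rates)

rng = np.random.default_rng(0)
labels = rng.random(1000) < 0.5                           # observed unstable (True) / stable
scores = np.clip(labels * 0.25 + rng.random(1000), 0, 1)  # synthetic susceptibility
fp, tp = roc_curve(scores, labels)
auc = np.trapz(tp, fp)                                    # area under the ROC curve
print(f"AUC = {auc:.3f}")
```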

Fig. 3. Examples of debris-flow susceptibility models for a sub-area (see Fig. 2 for location). The models are: a) cSU_DIS; b) fSU_DIS; c) PIX_DIS; d) PHY_BAS. The left legend refers to models a, b, c. The right legend refers to model d.


To calculate the Success-Rate curves, 15 different susceptibility maps are generated for each model by varying the cutoff values. For each map, the percentage of unstable area and the percentage of the landslide area within the unstable zone (i.e., correctly classified) are calculated.

Finally, to explicitly represent costs in the evaluation of model performance, Cost curves are applied. Considering that a single point (FP, TP) in ROC space results in a straight line in Cost space, with coordinates (0, FP) and (1, FN), the Cost curves are implemented through the following steps (a minimal implementation is sketched below):

1. a straight line is calculated for each point of the ROC curves, with function NE(C) = FP + (FN − FP)·PC(+), where NE(C) is the Normalized Expected Cost and PC(+) is the Probability-Cost function;
2. for small increments of PC(+), the values of the Normalized Expected Cost are calculated for all the linear functions;
3. for each increment, the minimum expected cost is selected;
4. the selected minima are joined to trace the Cost curve, which represents the lower envelope of the straight lines.

Once the Cost curves for the different models have been prepared, the quality of each model is assessed in terms of Normalized Expected Cost, given a specific value of the Probability-Cost function: the lower the Cost curve, the better the performance, and the difference between two models is simply the vertical distance between the curves. As mentioned, the value of the Probability-Cost function depends on both the a-priori probability and the misclassification costs. In this paper, given the uncertainty in the observed distribution of the population, a condition of equal probability is assumed. Regarding the misclassification costs, a simple analysis of costs based on land-use maps was performed, as described in the following. Once a value of the Probability-Cost function is assigned, the optimal cutoff is calculated for each model by selecting the lowest straight line in Cost space (Fig. 1). This straight line corresponds to the contingency table of the optimal cutoff, which minimizes the Normalized Expected Cost for that particular combination of a-priori probability and misclassification costs.
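A minimal implementation of these four steps might look as follows; the ROC points are illustrative, not those of the Val di Fassa models:

```python
# Minimal sketch of the four-step Cost-curve construction: each ROC point
# (FP, TP) becomes the line NE(C) = FP + (FN - FP) * PC(+), and the Cost curve
# is the lower envelope of all such lines. ROC points are illustrative.
import numpy as np

roc_points = [(0.0, 0.0), (0.1, 0.45), (0.3, 0.75), (0.6, 0.92), (1.0, 1.0)]

pc = np.linspace(0.0, 1.0, 201)            # step 2: small increments of PC(+)

# step 1: one straight line per ROC point, with FN = 1 - TP
lines = np.array([fp + ((1.0 - tp) - fp) * pc for fp, tp in roc_points])

# steps 3-4: minimum expected cost at each increment -> lower envelope
cost_curve = lines.min(axis=0)
best_line = lines.argmin(axis=0)           # index of the optimal cutoff at each PC(+)

# The trivial model corresponds to NE(C) = min(PC(+), 1 - PC(+)); the operating
# range is where the model beats it.
trivial = np.minimum(pc, 1.0 - pc)
operating = pc[cost_curve < trivial]
print(f"operating range: {operating.min():.2f} to {operating.max():.2f}")
print("optimal ROC point at PC(+) = 0.5:", roc_points[best_line[100]])
```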

3.1. Evaluation of misclassification costs

Misclassification costs are site-specific and vary significantly within the study area. A rigorous analysis would estimate them for each terrain unit independently, and evaluate the total costs arising from the adoption of each model by summing up these costs. This requires the contribution of administrators and policy makers from local (municipality) and national authorities: a task beyond the capabilities of most investigators. In order to estimate the average cost of false negatives and false positives for the study area, the existing land-use map is used to calculate both the area occupied by elements potentially at risk (e.g., buildings, lifelines, roads, and ski-lifts; Table 4) and the area potentially suitable for building development (Table 4). The latter is defined through a multi-criteria approach (a raster sketch is given after this list):

• slope gradient < 25°;
• elevation < 2000 m;
• areas not protected by administrative regulations.

To establish the total value of the elements at risk, Wel, and the total value of the building areas, Wdev, the calculated areas are multiplied by the mean local price per square metre of the exposed elements and of the building areas, respectively (Table 4). These total values are then divided by the total area of Val di Fassa, in order to obtain the average misclassification costs per square metre (c(+|−)sqm, c(−|+)sqm) for the entire model. The misclassification cost for Error Type I (false positives) is further divided by a corrective factor of 2, to account for the uncertainty in the a-priori classification of stable terrain units, assuming that half of the terrain units could be misclassified (Ardizzone et al., 2002). In fact, a terrain unit interpreted as stable is subject to a high degree of uncertainty, because the evidence of instability can be temporarily invisible due to factors such as forest cover, the low quality of the photographs, or man-made reworking of the terrain. In other words, an Error Type I can be a non-error, and further investigation is required to discern between a real and a false error. In general, a correction is not needed for Error Type II, because a terrain unit interpreted as unstable normally presents strong evidence of instability. Taking into account the uncertainty in these calculations, three scenarios of relative costs are considered, combining minimum, maximum and average costs (Table 4).
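As a rough sketch of this procedure, under illustrative placeholder rasters and totals (only Wdev and Wel are taken from Table 4; the study area and all array values are made up, so the printed per-square-metre figures will not match Table 4, which was computed on the real data):

```python
# Minimal sketch of the Section 3.1 cost estimation: (i) a multi-criteria mask
# of land potentially suitable for building (slope < 25 deg, elevation < 2000 m,
# not under administrative protection); (ii) average per-square-metre
# misclassification costs obtained by dividing total values by the study area.
import numpy as np

slope_deg = np.array([[12.0, 30.5], [22.1, 8.4]])        # slope gradient raster
elevation_m = np.array([[1450.0, 2100.0], [1800.0, 1950.0]])
protected = np.array([[False, False], [True, False]])    # regulatory protection mask

suitable = (slope_deg < 25.0) & (elevation_m < 2000.0) & ~protected
suitable_area_m2 = suitable.sum() * 100.0                # 10 m x 10 m grid cells

W_dev = 12.401e9          # total value of building areas, Wdev (EUR, Table 4)
W_el = 14.360e9           # total value of elements at risk, Wel (EUR, Table 4)
total_area_m2 = 3.0e8     # placeholder for the ~300 km2 study area

c_fp_sqm = (W_dev / total_area_m2) / 2.0  # Error Type I cost, halved for mapping uncertainty
c_fn_sqm = W_el / total_area_m2           # Error Type II cost
print(suitable, suitable_area_m2, c_fp_sqm, c_fn_sqm)
```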

Table 4. Classification costs adopted in the analysis for the Val di Fassa area.

Parameter                                         Value
Building area (m²)                                14,590,000
Buildings (m²)                                    4,530,000
Infrastructures (m²)                              1,040,000
Value of building area, Wdev (million €)          12,401 ± 2,188
Value of elements at risk, Wel (million €)        14,360 ± 7,180
c(+|−)sqm (€/m²)                                  30.2 ± 15.1
Corrected c(+|−)sqm (€/m²)                        15.1 ± 7.5
c(−|+)sqm (€/m²)                                  49.5 ± 24.9
Scenario 1: c(−|+)min:c(+|−)max                   0.5:0.5
Scenario 2: c(−|+)mean:c(+|−)mean                 0.7:0.3
Scenario 3: c(−|+)max:c(+|−)min                   0.8:0.2

4. Results

Accuracy statistics show similar results for the evaluation of the different models (Fig. 4, Table 5). The coarse slope-unit discriminant model (cSU_DIS) outperforms the others with all the statistics, whereas the physically-based model is always the worst. Overall, Accuracy and the Threat score show smaller differences among the models, thus making the choice of the best model more difficult. The other statistics are practically equivalent. A comparison of the grid-cell discriminant models built using training and validation sets shows slight differences, demonstrating a good predictive capability of the models. This also supports the choice to evaluate the other models using the entire set of data.

The coarse slope-unit discriminant model (cSU_DIS) also appears as the best model using ROC curves (Fig. 5). The model outperforms the others over a large range of cutoffs, but it is not optimal when it is forced to over-predict positives (i.e., in the upper-right corner of the ROC space). Thus, under particular cost and a-priori probability conditions, other models could perform better than the cSU_DIS.

Fig. 4. Comparison of the debris-flow susceptibility models using the cutoff-dependent statistics of Table 5.


Table 5. Performance of the models evaluated using cutoff-sensitive statistics. For the validation and training sets, the reported values are the average of the statistics over the different trials. A = Accuracy, TS = Threat score, GSS = Gilbert's skill score, PSS = Pierce's skill score, HSS = Heidke's skill score, Y'Q = Yule's Q.

Model     Population       A      TS     GSS    PSS    HSS    Y'Q
fSU_DIS   All              0.73   0.61   0.30   0.47   0.46   0.77
fSU_DIS   Training         0.73   0.61   0.30   0.47   0.46   0.77
fSU_DIS   Validation       0.73   0.60   0.29   0.47   0.45   0.77
fSU_LOG   All              0.74   0.64   0.30   0.46   0.46   0.77
cSU_DIS   All              0.77   0.64   0.36   0.54   0.53   0.83
cSU_DIS   Training         0.78   0.65   0.38   0.56   0.55   0.85
cSU_DIS   Validation       0.75   0.63   0.33   0.50   0.50   0.80
PIX_DIS   Training         0.72   0.56   0.29   0.45   0.45   0.75
PIX_DIS   Validation       0.71   0.55   0.27   0.43   0.43   0.72
PHY_BAS   All grid cells   0.62   0.52   0.14   0.24   0.24   0.53

This highlights a major limitation of ROC analysis in comparing models: the difficulty of accounting for different costs and a-priori probabilities. By using the Success-Rate curve, the performance of the grid-cell discriminant model (PIX_DIS) is the best, followed by the two statistical models developed using fine slope units (Fig. 6). This essentially inverts the results obtained with the ROC curves, as discussed later.

Cost curves enhance the capability to analyse and compare model performance in a cost-sensitive setting. Ten coarse slope-unit models (cSU_DIS) developed for cross-validation are compared and analysed (Fig. 7). The models are very similar, with an operating range covering the whole range of the Probability-Cost function. As mentioned, the operating range corresponds to the range of cutoff values for which a certain model outperforms the trivial model. The average of the models can be simply calculated by averaging the normalized expected cost for each value of the Probability-Cost function (Fig. 7).

All the models are then compared using the Cost curves (Fig. 8). The coarse slope-unit discriminant model (cSU_DIS) is the best model over most of the operating range; only for values of the Probability-Cost function higher than 0.8 is it outperformed by the fine slope-unit models. The performance of the fine slope-unit models (discriminant and logistic) is very similar, with the logistic model (fSU_LOG) performing slightly better. The two grid-cell models (discriminant and physically-based) show high expected costs over the whole operating range. Between these two, the discriminant model (PIX_DIS) performs better, with an operating range from 0.17 to 1. The operating range of the physically-based model is smaller, from 0.325 to 0.7. Outside these ranges, both models perform as trivial models.

Fig. 5. ROC curves for the debris-flow susceptibility models. Area Under Curve (AUC) for each model is reported. (Modified after Carrara et al., 2008).

Fig. 6. Success-Rate curves for the debris-flow susceptibility models. Results for the logistic model are not available.

In this analysis, the three cost–probability scenarios presented in Table 4 are considered. For c(−|+):c(+|−) = 0.5:0.5, the best model is the coarse slope-unit model (cSU_DIS) with a cutoff of 0.43 (Fig. 9), and all the models are better than the trivial model. For c(−|+):c(+|−) = 0.7:0.3 the results are equivalent, but the optimal cutoff for the cSU_DIS model is 0.27 (Fig. 9). For c(−|+):c(+|−) = 0.8:0.2, the best model is the fine slope-unit logistic model (fSU_LOG) with a cutoff of 0.12 (Fig. 9), and the physically-based model is worse than the trivial model.

5. Discussion

5.1. Accuracy statistics

Traditional cutoff-dependent approaches reveal slight differences among the models (Fig. 4). Only the physically-based model shows a significantly lower performance. This result indicates that the accuracy statistics have a limited capability to discriminate the performance of the models, at least for the present case study. In addition, the application of each statistic is reliable only under specific conditions (e.g., rare events or frequent events) that should be evaluated case by case in order to select the most appropriate method (Stephenson, 2000). This is a limitation for a general application to landslide susceptibility models.

For multivariate statistical models, the application of cutoff-dependent accuracy statistics is straightforward and scientifically correct because the cutoff value is statistically significant. However, this holds only when assuming equal a-priori probabilities and equal misclassification costs, conditions that are normally violated by landslide susceptibility models.

Fig. 7. Cost curves for coarse slope-unit models used for cross-validation. Ten different trials are analysed and averaged in the figure.


For other kinds of models (e.g., physically-based, heuristic and fuzzy) there is no theoretical reason to select a certain cutoff, and the application of accuracy statistics is therefore unfeasible.

Fig. 8. Cost curves for the debris-flow susceptibility models. The dots identify the best models for the different scenarios: cSU_DIS with probability cutoff (Pcutoff) = 0.43 for Scenario 1, c(−|+)min:c(+|−)max; cSU_DIS with Pcutoff = 0.27 for Scenario 2, c(−|+)mean:c(+|−)mean; fSU_LOG with Pcutoff = 0.12 for Scenario 3, c(−|+)max:c(+|−)min. See Table 4 for scenario definitions.

5.2. ROC and Cost curves

Evaluating landslide models with cutoff-independent criteria has the advantage that an a-priori cutoff value is not required, and the performance can be assessed over the entire range of cutoff values. Using ROC and Success-Rate curves, different results have been obtained (Figs. 5 and 6). The difference is due to the following reasons. The first curve type is based on the analysis of the classification of the statistical units, and describes the capability of the statistical model to discriminate between two classes of objects. The Success-Rate curve, on the other hand, is based on the analysis of the spatial matching between actual landslides and susceptibility maps; it thus considers the area of both the landslides and the terrain units, and not only the number of units correctly or incorrectly classified.

According to the ROC curves, the best model is the discriminant model with coarse slope units (cSU_DIS, Fig. 5). Statistically, this model is more robust than the others because the proportion of a-priori classified stable and unstable units is almost the same. Moreover, since the exact location of landslides is affected by a certain degree of uncertainty, the adoption of larger terrain units reduces the noise in the classification and improves the performance. This explains the worse performance of fine slope units. The discriminant grid-cell model is penalized by the fact that unstable units are numerically fewer than stable ones. This requires a random selection of different sub-sets of stable units for the classification analysis.

Fig. 9. The best debris-flow susceptibility models classified as stable/unstable using the optimal cutoff values for three different cost and probability scenarios: a) cSU_DIS for Scenario 1, b) cSU_DIS for Scenario 2, and c) fSU_LOG for Scenario 3. See Table 4 for scenario definitions.


As a consequence, the representativeness of all the conditions that can favour slope stability is not complete. This is a typical problem when dealing with the statistical analysis of grid-cell models, and could be partly solved by guiding the selection of stable units with a supervised procedure (Melchiorre et al., 2008).

According to the Success-Rate curves, the discriminant grid-cell model (PIX_DIS) is the best performer (Fig. 6). In other words, this model is the one that optimizes the classification of landslides within a small area classified as unstable. The slope-unit models are penalized by the fact that they classify entire slopes as stable or unstable. Although large slope units can be internally heterogeneous (e.g., small flat areas can be included within an unstable slope), the susceptibility is generalized, and sometimes overestimated. For this reason, the authors believe that Success-Rate curves are not suitable for evaluating the performance of slope-unit models. On the other hand, the high resolution of grid cells makes it possible to identify unstable areas with greater spatial detail, resulting in good performance with Success-Rate curves. This applies only to certain landslide typologies (e.g., small shallow landslides) that are compatible with the cell-size resolution. The high resolution of grid cells is an advantage only when the spatial uncertainty of the models is very low. When the location of landslides or the mapping limits of the geoenvironmental units used as predictors are inaccurate, the grid-cell model can encompass high uncertainties. In these cases, the spatial resolution of the model is a false resolution that can be misleading for administrators who need to apply the results of the model.

The Success-Rate curve also presents some theoretical problems when applied to grid-cell models. The number of true positives, in fact, contributes to both the x- and y-axes. An increase in true positives causes an upward (toward better performance) and rightward (toward worse performance) shift of the curve. In some cases (Fig. 10a) the rightward shift can be faster than the upward one, causing an apparent loss of performance with increasing true positives; this is clearly a misleading evaluation of model performance. Moreover, the Success-Rate curve is sensitive to the initial proportion of positives and negatives. For example, the four models shown in Fig. 10b have the same proportion of correct classifications, but a different total number of positives and negatives. ROC analysis correctly shows the same performance for all the models, whereas Success-Rate curves favour the cases with a small proportion of positives (a small unstable area). Hence, the application of Success-Rate curves to areas with a low degree of hazard (e.g., flat areas with small steep portions of the landscape; Fabbri et al., 2003; Van den Eeckhaut et al., 2006) will always give better results than the application to areas with a high hazard (e.g., alpine valleys with steep slopes), even if the quality of the classification is exactly the same. Again, this is misleading. A simple numerical illustration is given below.
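The prior-sensitivity problem can be reproduced with a few lines of code: the two hypothetical grid-cell models below share the same ROC point but plot at very different Success-Rate points (all counts are illustrative, not those of Fig. 10):

```python
# Illustrative demo of the Success-Rate sensitivity discussed above: two
# grid-cell models with identical TP and FP rates (same ROC point) but a
# different proportion of positives plot at different Success-Rate points,
# so the model applied to the less hazardous area looks spuriously better.

def roc_point(tp, fp, fn, tn):
    return fp / (fp + tn), tp / (tp + fn)            # (FP rate, TP rate)

def success_rate_point(tp, fp, fn, tn):
    x = (tp + fp) / (tp + fp + fn + tn)              # fraction of area classified unstable
    y = tp / (tp + fn)                               # fraction of landslides captured
    return x, y

low_hazard = dict(tp=80, fn=20, fp=90, tn=810)       # 10% of cells are unstable
high_hazard = dict(tp=400, fn=100, fp=50, tn=450)    # 50% of cells are unstable

for name, m in [("low hazard", low_hazard), ("high hazard", high_hazard)]:
    print(name, "ROC:", roc_point(**m), "Success-Rate:", success_rate_point(**m))
# Both models sit at ROC point (0.10, 0.80), but the Success-Rate x-coordinate
# is 0.17 for the low-hazard area and 0.45 for the high-hazard one.
```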

5.3. Explicit cost representation

The assessment of the costs that can arise after the adoption of the susceptibility models is an important issue for their practical application, the best model being the one that minimizes these costs. The best model for Scenario 1 is the coarse slope-unit model (cSU_DIS) with a cutoff of 0.43 (Figs. 8 and 9a). The unstable area is concentrated in the upper part of the slopes, and a large part of the study area is classified as stable, and available for building. A cutoff of 0.27 (Scenario 2) leads to a reduction in the stable area. In fact, the false negative costs are more than twice the false positive costs for Scenario 2 (ratio 0.7:0.3, Table 4). This forces the classification toward instability in order to keep the false negative costs low (Fig. 9b). The third scenario magnifies this behaviour, leading to a classification that strongly favours instability (Fig. 9c). Even if it may seem exaggerated, it is worth considering that the land planning regulation currently adopted by the Provincial Authority is very close to this last scenario.

Cost curves (Drummond and Holte, 2006) prove to be suitable for the evaluation of cost-sensitive models. In particular, they allow:

– assessment of the optimal cutoff value of each model, in order to optimize the classification of positives and negatives as a function of the a-priori probability and the misclassification costs; this provides a rationale for the selection of a cutoff value also for those models that do not have any scientifically significant cutoff (e.g., empirical models, physically-based models, and neural network models);
– comparison of different models and easy measurement of the difference in performance, which is represented by the vertical distance between the curves;
– identification of the operating range of the models;
– averaging of many models resulting from different trials, in order to assess the mean and the standard deviation of the performance; this last parameter can be considered as an indicator of model robustness.

The application of Cost curves to landslide susceptibility models is complicated by the fact that neither the a-priori classification nor the costs of the different error types can be readily determined. In other disciplines, the a-priori probability is derived from the observed distribution of the population: the portion of actual positives is assumed as the a-priori probability of positives, and likewise for negatives. Unfortunately, in landslide susceptibility models, the actual distribution of landslides that is used for the a-priori classification of units into positives or negatives has a large margin of uncertainty, especially for shallow landslides and debris flows, where the evidence is rapidly concealed by erosion or agricultural activity. Given this limitation, the a-priori probabilities need to be assigned on the basis of other considerations (e.g., expert knowledge).

Fig. 10. Different behaviour of the ROC (grey circles and numbers) and Success-Rate (black diamonds and numbers) curves in two simple examples (reported in the tables): a) increase of the true positive rate; b) change of the proportion of positives and negatives with the same percentages of correct classification. In both situations, the Success-Rate curves show an erroneous behaviour.


In the present application, a condition of equal probability is assumed. A similar problem exists for the misclassification costs. In this analysis, the average misclassification costs are calculated based on the specific socio-economic conditions of the area, which control the value of both the elements at risk, relevant for the assessment of the false negative costs, and the land available for further development, which controls the false positive costs. This approach simplifies the problem, because the elements at risk are probably concentrated in the safer zones. Moreover, it does not consider the temporal probability. The misclassification costs could also be assigned according to the risk attitude of the decision makers, which reflects the risk perception of society. If society is willing to tolerate a high level of risk, the decision makers would lower the false negative cost, thus decreasing the cost ratio (false negative/false positive). If society tolerates low risk, the decision makers need to increase the false negative cost. In the first case (low cost ratio), a cost-sensitive criterion would favour a model which classifies a small portion of the area as unstable, and of course the opposite holds in the second case.

6. Conclusions

In this paper, different approaches for the evaluation of landslide susceptibility models, and for the comparison of their performance, are analysed. These approaches are tested on debris-flow susceptibility models that were developed by the authors (Carrara et al., 2008) using different methods and different terrain units. From the results of the analysis it is possible to conclude that:

• cutoff-dependent accuracy statistics (Accuracy, Threat score, Gilbert's skill score, Pierce's skill score, Heidke's skill score, Odds ratio skill score) require an a-priori choice of the cutoff value that is not always trivial; in addition, the application of each statistic is reliable only under specific conditions that should be evaluated case by case, which represents a strong limitation for a general application to landslide susceptibility models;
• ROC curves are cutoff-independent and can evaluate the performance of models over a large range of cutoffs; their major limitation lies in the fact that they do not include costs explicitly;
• Success-Rate curves share the limitations of ROC curves, with the advantage of being easier to understand for geomorphological applications; however, they show some inconsistencies (both practical and theoretical) for grid-cell models;
• Cost curves include costs explicitly, and use relative costs that are easier to assess than absolute costs; this performance approach is advisable for the evaluation and comparison of susceptibility models when a practical application of the model in land management is expected.

Acknowledgments

Thanks to Paolo Campedel from the Provincia Autonoma di Trento for useful discussions on the models. Also thanks to Jonathan Godt and Candan Gokceoglu for their helpful reviews.

References

Adams, N.M., Hand, D.J., 1999. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition 32, 1139–1147.
Ardizzone, F., Cardinali, M., Carrara, A., Guzzetti, F., Reichenbach, P., 2002. Impact of mapping errors on the reliability of landslide hazard models. Natural Hazards and Earth System Sciences 2, 3–14.
Atkinson, P.M., Massari, R., 1998. Generalised linear modelling of susceptibility to landsliding in the Central Apennines, Italy. Computers & Geosciences 24 (4), 373–385.
Baeza, C., Corominas, J., 2001. Assessment of shallow landslide susceptibility by means of multivariate statistical techniques. Earth Surface Processes and Landforms 26 (12), 1251–1263.
Barredo, J.I., Benavides, A., Hervas, J., Van Westen, C.J., 2000. Comparing heuristic landslide hazard assessment techniques using GIS in the Tirajana basin, Gran Canaria Island, Spain. JAG 2 (1), 9–23.


Begueria, S., 2006. Validation and evaluation of predictive models in hazard assessment and risk management. Natural Hazards 37 (3), 315–329.
Binaghi, E., Luzi, L., Madella, P., Pergalani, F., Rampini, A., 1998. Slope instability zonation: a comparison between certainty factor and fuzzy Dempster–Shafer approaches. Natural Hazards 17, 77–97.
Brabb, E.E., 1984. Innovative approaches to landslide hazard mapping. 4th International Symposium on Landslides, Toronto, Vol. 1, pp. 307–324.
Briggs, W.M., Ruppert, D., 2005. Assessing the skill of yes/no predictions. Biometrics 61, 799–807.
Carrara, A., 1983. Multivariate models for landslide hazard evaluation. Mathematical Geology 15, 403–427.
Carrara, A., Crosta, G.B., Frattini, P., 2003. Geomorphological and historical data in assessing landslide hazard. Earth Surface Processes and Landforms 28 (10), 1125–1142.
Carrara, A., Crosta, G.B., Frattini, P., 2008. Comparing models of debris-flow susceptibility in the alpine environment. Geomorphology 94 (3–4), 353–378.
Chung, C.F., Fabbri, A.G., 2003. Validation of spatial prediction models for landslide hazard mapping. Natural Hazards 30 (3), 451–472.
Chung, C.F., Fabbri, A.G., Van Westen, C.J., 1995. Multivariate regression analysis for landslide hazard zonation. In: Carrara, A., Guzzetti, F. (Eds.), Geographical Information Systems in Assessing Natural Hazards. Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 107–133.
Crosta, G.B., Frattini, P., 2003. Distributed modeling of shallow landslides triggered by intense rainfall. Natural Hazards and Earth System Sciences 3 (1–2), 81–93.
Dai, F.C., Lee, C.F., 2002. Landslide characteristics and slope instability modelling using GIS, Lantau Island, Hong Kong. Geomorphology 42, 213–228.
Davis, P.A., Goodrich, M.T., 1990. A proposed strategy for the validation of ground-water flow and solute transport models. Technical Report, Sandia National Labs., Albuquerque, NM (USA).
Drummond, C., Holte, R.C., 2000. Explicitly representing expected cost: an alternative to ROC representation. Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 198–207.
Drummond, C., Holte, R.C., 2006. Cost curves: an improved method for visualizing classifier performance. Machine Learning 65 (1), 95–130.
Egan, J.P., 1975. Signal Detection Theory and ROC Analysis. Series in Cognition and Perception. Academic Press, New York.
Ercanoglu, M., Gokceoglu, C., 2002. Assessment of landslide susceptibility for a landslide-prone area (north of Yenice, NW Turkey) by fuzzy approach. Environmental Geology 41 (6), 720–730.
Ercanoglu, M., Gokceoglu, C., 2004. Use of fuzzy relations to produce landslide susceptibility map of a landslide prone area (West Black Sea Region, Turkey). Engineering Geology 75, 229–250.
Ermini, L., Catani, F., Casagli, N., 2005. Artificial neural networks applied to landslide susceptibility assessment. Geomorphology 66, 327–343.
Fabbri, A.G., Chung, C.F., Cendrero, A., Remondo, J., 2003. Is prediction of future landslides possible with a GIS? Natural Hazards 30 (3), 487–499.
Fernández, T., Irigaray, C., El Hamdouni, R., Chacón, J., 2003. Methodology for landslide susceptibility mapping by means of a GIS. Application to the Contraviesa Area (Granada, Spain). Natural Hazards 30, 297–308.
Finley, J.P., 1884. Tornado predictions. American Meteorological Journal 1, 85–88.
Frattini, P., Crosta, G.B., Fusi, N., Dal Negro, P., 2004. Shallow landslides in pyroclastic soil: a distributed modeling approach for hazard assessment. Engineering Geology 73, 277–295.
Frattini, P., Crosta, G.B., Carrara, A., Agliardi, F., 2008. Assessment of rockfall susceptibility by integrating statistical and physically-based approaches. Geomorphology 94 (3–4), 419–437.
Gilbert, G.K., 1884. Finley's tornado predictions. American Meteorological Journal 1, 166–172.
Godt, J.W., Baum, R.L., Savage, W.Z., Salciarini, D., Schulz, W.H., Harp, E.L., 2008. Requirements for integrating transient deterministic shallow landslide models with GIS for hazard and susceptibility assessments. Engineering Geology 103, 214–226.
Gökceoglu, C., Aksoy, H., 1996. Landslide susceptibility mapping of the slopes in the residual soils of the Mengen region (Turkey) by deterministic stability analyses and image processing techniques. Engineering Geology 44, 147–161.
Goodenough, D.J., Rossmann, K., Lusted, L.B., 1974. Radiographic applications of receiver operating characteristic (ROC) analysis. Radiology 110, 89–95.
Gorsevski, P.V., Gessler, P.E., Foltz, R.B., Elliot, W.J., 2006. Spatial prediction of landslide hazard using logistic regression and ROC analysis. Transactions in GIS 10 (3), 395–415.
Guzzetti, F., Reichenbach, P., Ardizzone, F., Cardinali, M., Galli, M., 2006. Estimating the quality of landslide susceptibility models. Geomorphology 81 (1–2), 166–184.
Hanley, J.A., McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), 29–36.
Hanssen, A.W., Kuipers, W.J.A., 1965. On the relationship between the frequency of rain and various meteorological parameters. Mededelingen en Verhandelingen 81, 2–15.
Heidke, P., 1926. Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungdienst (Calculation of the success and goodness of strong wind forecasts in the storm warning service). Geografiska Annaler 8, 301–349.
Irigaray, C., Fernández, T., El Hamdouni, R., Chacón, J., 1999. Verification of landslide susceptibility mapping: a case study. Earth Surface Processes and Landforms 24, 537–544.
Lee, S., 2005. Application of logistic regression model and its validation for landslide susceptibility mapping using GIS and remote sensing data. International Journal of Remote Sensing 26 (7), 1477–1491.
Lee, S., Min, K., 2001. Statistical analysis of landslide susceptibility at Yongin, Korea. Environmental Geology 40, 1095–1113.


Lee, S., Ryu, J.H., Min, K., Won, J.S., 2003. Landslide susceptibility analysis using GIS and artificial neural network. Earth Surface Processes and Landforms 28, 1361–1376.
Mason, I.B., 2003. Binary events. In: Jolliffe, I.T., Stephenson, D.B. (Eds.), Forecast Verification. A Practitioner's Guide in Atmospheric Science. Wiley & Sons Ltd, Chichester, pp. 37–76.
Melchiorre, C., Matteucci, M., Remondo, J., 2006. Artificial neural networks and robustness analysis in landslide susceptibility zonation. Proc. International Joint Conference on Neural Networks, Vancouver, BC, Canada, July 16–21, 2006.
Melchiorre, C., Matteucci, M., Azzoni, A., Zanchi, A., 2008. Artificial neural networks and cluster analysis in landslide susceptibility zonation. Geomorphology 94 (3–4), 379–400.
Montgomery, D.R., Dietrich, W.E., 1994. A physically based model for the topographic control on shallow landsliding. Water Resources Research 30 (4), 1153–1171.
Murphy, A.H., 1996. The Finley affair: a signal event in the history of forecast verification. Weather Forecasting 11, 3–20.
Nefeslioglu, H.A., Gokceoglu, C., Sonmez, H., 2008. An assessment on the use of logistic regression and artificial neural networks with different sampling strategies for the preparation of landslide susceptibility maps. Engineering Geology 97 (3–4), 171–191.
Ohlmacher, G.C., Davis, J.C., 2003. Using multiple logistic regression and GIS technology to predict landslide hazard in northeast Kansas, USA. Engineering Geology 69 (3–4), 331–343.
Peirce, C.S., 1884. The numerical measure of the success of predictions. Science 4, 453–454.
Pepe, M.S., 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, New York.
Phoon, K.K., Kulhawy, F.H., 1999. Characterization of geotechnical variability. Canadian Geotechnical Journal 36 (4), 612–624.
Provost, F., Fawcett, T., 1997. Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, pp. 43–48.
Provost, F., Fawcett, T., 2001. Robust classification for imprecise environments. Machine Learning 42 (3), 203–231.

Rawls, W.J., Brakensiek, D.L., Miller, N., 1983. Green-Ampt infiltration parameters from soils data. Journal of Hydraulic Engineering 109 (11), 62–70.
Remondo, J., González, A., Díaz de Terán, J.R., Cendrero, A., Fabbri, A., Chung, C.F., 2003. Validation of landslide susceptibility maps; examples and applications from a case study in northern Spain. Natural Hazards 30 (3), 437–449.
Schaefer, J.T., 1990. The critical success index as an indicator of warning skill. Weather Forecasting 5, 570–575.
Stephenson, D.B., 2000. Use of the "odds ratio" for diagnosing forecast skill. Weather Forecasting 15, 221–232.
Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240 (4857), 1285–1293.
Van den Eeckhaut, M., Vanwalleghem, T., Poesen, J., Govers, G., Verstraeten, G., Vandekerckhove, L., 2006. Prediction of landslide susceptibility using rare events logistic regression: a case-study in the Flemish Ardennes (Belgium). Geomorphology 76 (3–4), 392–410.
Van Westen, C.J., Terlien, M.T.J., 1996. An approach towards deterministic landslide hazard analysis in GIS. A case study from Manizales (Colombia). Earth Surface Processes and Landforms 21, 853–868.
Yesilnacar, E., Topal, T., 2005. Landslide susceptibility mapping: a comparison of logistic regression and neural networks methods in a medium scale study, Hendek region (Turkey). Engineering Geology 79 (3–4), 251–266.
Yule, G.U., 1900. On the association of attributes in statistics. Philosophical Transactions of the Royal Society of London 194A, 257–319.
Zêzere, J.L., Reis, E., Garcia, R., Oliveira, S., Rodrigues, M.L., Vieira, G., Ferreira, A.B., 2004. Integration of spatial and temporal data for the definition of different landslide hazard scenarios in the area north of Lisbon (Portugal). Natural Hazards and Earth System Sciences 4 (1), 133–146.
Zinck, J.A., López, J., Metternicht, G.I., Shrestha, D.P., Vázquez-Selem, L., 2001. Mapping and modelling mass movements and gullies in mountainous areas using remote sensing and GIS techniques. International Journal of Applied Earth Observation and Geoinformation 3 (1), 43–53.