Landslide susceptibility mapping in Injae, Korea, using a decision tree


Engineering Geology 116 (2010) 274–283

Contents lists available at ScienceDirect

Engineering Geology journal homepage: www.elsevier.com/locate/enggeo

Young-Kwang Yeon a, Jong-Gyu Han a,⁎, Keun Ho Ryu b,⁎

a Geoscience Information Department, Korea Institute of Geoscience and Mineral Resources (KIGAM), Gwahang-no 92, Yusung-gu, Daejeon, 305-350, South Korea
b College of Electrical & Computer Engineering, Chungbuk National University, 410 Seongbong-ro, Heungdeok-gu, Cheongju, Chungbuk, South Korea

Article info

Article history: Received 23 February 2010; received in revised form 11 August 2010; accepted 10 September 2010; available online 19 September 2010

Keywords: Landslide predictability; Decision tree; Spatial events; C4.5 algorithm; Korea

Abstract

A data mining classification technique can be applied to landslide susceptibility mapping. The decision tree is a popular classification algorithm because of its advantages, but it has rarely been used to analyze landslide susceptibility because decision tree algorithms assume a uniform class distribution, whereas landslide spatial event data represented on a grid raster layer are highly class imbalanced. For this study of South Korean landslides, a decision tree was constructed using Quinlan's C4.5 algorithm. The susceptibility of landslide occurrence was then deduced using leaf-node ranking with m-branch smoothing. The area studied at Injae suffered substantial landslide damage after heavy rains in 2006. Landslide-related factors for nearly 600 landslides were extracted from local maps: topographic factors, including curvature, slope, distance to ridge, and aspect; forest factors, including age, type, density, and diameter; and soil factors, including texture, drainage, effective thickness, and material. For the quantitative assessment of landslide susceptibility, the accuracy of the twofold cross-validation was 86.08%, and the accuracy using all known data was 89.26%, based on a cumulative lift chart. A decision tree can therefore be used efficiently for landslide susceptibility analysis and might be widely used for the prediction of various spatial events. Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.

⁎ Corresponding authors. J.-G. Han: Geoscience Information Department, Korea Institute of Geoscience and Mineral Resources (KIGAM), Gwahang-no 92, Yusung-gu, Daejeon, 305-350, South Korea. Tel.: +82 42 868 3297; fax: +82 42 868 3413. K.H. Ryu: College of Electrical & Computer Engineering, Chungbuk National University, 410 Seongbong-ro, Heungdeok-gu, Cheongju, Chungbuk, South Korea. Tel.: +82 43 267 2254; fax: +82 44 275 2254. E-mail addresses: [email protected] (J.-G. Han), [email protected] (K.H. Ryu).

0013-7952/$ – see front matter. Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved. doi:10.1016/j.enggeo.2010.09.009

1. Introduction

Landslides occur mainly because of heavy rain, and their recurrence year after year has caused heavy damage to property and loss of life, not only in Korea but throughout the world. To mitigate landslide damage, it is necessary to assess and manage areas that are susceptible to landslides. Hence, in recent years, the assessment of landslide hazard and risk has become a topic of major interest (Aleotti and Chowdhury, 1999). Landslide susceptibility is defined as the propensity of an area to generate landslides (Guzzetti et al., 2006), with susceptibility represented by a relative value in a given area.

Recently, with the development of GIS data-processing techniques, quantitative studies have been applied to landslide susceptibility analysis using various techniques. Such studies can be grouped by the technique used: probabilistic methods (Luzi et al., 2000; Lee and Min, 2001; Donati and Turrini, 2002; Lee and Chol, 2003; Neuhäuser and Terhorst, 2007), logistic regression (Atkinson and Massari, 1998; Dai et al., 2001; Dai and Lee, 2001; Nefeslioglu et al., 2008), and artificial neural network methods (Ermini et al., 2004; Lee et al., 2004; Gómez and Kavzoglu, 2005; Melchiorre et al., 2008). Most of these studies aimed to increase the accuracy of landslide prediction by finding suitable techniques for the respective study area.

The objective of this study was to suggest a method for landslide susceptibility analysis using a decision tree, a popular classification technique. Unlike other statistical methods, a decision tree makes no statistical assumptions, can handle data that are represented on different measurement scales, and is computationally fast (Pal and Mather, 2003). Such a tree also represents a good compromise between comprehensibility, accuracy, and efficiency (Ferri et al., 2003). However, the decision tree has been considered unsuitable for spatial event prediction such as landslide susceptibility analysis, because most decision tree algorithms, including C4.5 (Quinlan, 1993), normally produce a discrete output class, whereas susceptibility needs to be represented as a continuous value. The Classification and Regression Tree algorithm (CART) (Breiman et al., 1984), which can estimate probability, assumes a uniform distribution of the training data set. Thus, previous studies (Saito et al., 2009; Nefeslioglu et al., 2010) carried out only limited applications without overcoming these problems. To estimate probability from a decision tree in a class-imbalanced data set, Provost and Domingos (2003), Zadrozny and Elkan (2001), and Ferri et al. (2003) used leaf node ranking methods, which are achieved by smoothing the class frequencies. They used C4.5 and demonstrated better accuracy by growing a large tree. In this paper, we apply these methods to spatial event data and suggest a method for landslide susceptibility mapping using a decision tree.

2. Study area and data

The selected study area, shown in Fig. 1, covers 34,696,750 m² and is located between Inje-eup and Buk-myeon in the middle of Kwangwon, Korea. The site lies between latitude 38° 5' 52.19''N and 38° 3' 42.43''N, and longitude 128° 12' 4.92''E and 128° 17' 56.08''E. It is mainly a granite-based, rocky, mountainous area. Landslides in the study area were caused by heavy rainfall during the period July 11–18, 2006. The average annual rainfall of this area was about 1400 mm from 1995 to 2005 and increased to 1740 mm in 2006. During the 8-day period of landslide occurrence, about 559 mm of rain fell.

To extract the landslide causal factors for the area, we used a 1:5000-scale topographic map, a 1:25,000-scale soil map, and a 1:25,000-scale forest map. A 5 × 5 m Digital Elevation Model (DEM) derived from the topographic map was used to generate slope, aspect, curvature, and distance from the mountain ridge. From the soil map, data on texture, drainage, material, and effective thickness were extracted. From the forest map, forest type, diameter class, density, and age data were extracted. All landslide factors were converted into 5 × 5 m float-type raster images with 1,387,870 pixels. Among the layers, slope, curvature, and distance from ridge have continuous values, whereas the others have discrete values, as shown in Table 1. As for the event data set, a total of 590 landslides were identified within the study area by analyzing a 0.4 m resolution airborne image and a Triangulated Irregular Network (TIN), as shown in Fig. 2. In this paper, we used ArcGIS 9.2 software for preparing the image data set.
The correlation between landslide occurrence and the classes of each extracted attribute layer can be derived by calculating the frequency ratio (Bonham-Carter, 1994), i.e., the ratio of the share of landslide occurrences that fall in an attribute class to the share of the whole study area that falls in that class. If the ratio is greater than 1, the relationship between a landslide event and the factor's range or type is strong. If the ratio is less than 1, the relationship is weak (Lee and Sambath, 2006).

Table 1 Data set used for landslide susceptibility mapping.

| Map source | Thematic layer | Type | Scale (resolution) | Conversion |
|---|---|---|---|---|
| Airborne image | Landslide point | Class | 0.4 m | 5 m × 5 m float-type image |
| DEM (from topographic map) | Slope | Continuous | 5 × 5 m (1:5,000) | 5 m × 5 m float-type image |
| DEM (from topographic map) | Aspect | Discrete | 5 × 5 m (1:5,000) | 5 m × 5 m float-type image |
| DEM (from topographic map) | Curvature | Continuous | 5 × 5 m (1:5,000) | 5 m × 5 m float-type image |
| DEM (from topographic map) | Distance from ridge | Continuous | 5 × 5 m (1:5,000) | 5 m × 5 m float-type image |
| Soil map | Texture | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Soil map | Drainage | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Soil map | Material | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Soil map | Effective thickness | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Forest map | Forest type | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Forest map | Diameter class | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Forest map | Density | Discrete | 1:25,000 | 5 m × 5 m float-type image |
| Forest map | Age | Discrete | 1:25,000 | 5 m × 5 m float-type image |

For the topographic map, the relationship between each attribute layer extracted from the map and the landslides was analyzed. For slope, the analysis determines whether a particular slope interval has a strong relationship with landslides. Slope angles between 20° and 39° have a stronger relationship than other intervals. With respect to aspect, landslides are concentrated on east-, southeast-, and south-facing areas. The “curvature” of the topography refers to the degree of convexity or concavity of the geomorphology. For curvature values from −17 to −3, the frequency ratio is higher than 1 (Table 2); hence, this interval is highly susceptible to landslides. The buffered ridge layer gives the distance from the ridge. In general, the closer an area is to the ridge, the higher the susceptibility it shows; however, the interval closest to the ridge (1 to 25 m) has a frequency ratio below 1, and the strong relationship appears in the interval from 26 to 100 m (Table 2).

Certain relationships have been discerned between landslides and forest factors. In the case of timber diameter, the frequency ratio of landslide occurrence is high when the timber is thin, and an especially strong relationship is observed for young trees. As for forest type, among the 11 types considered, pine, planted pine, Korean pine, larch, and poplar are highly susceptible to landslides.

The relationships between landslides and soil factors are as follows. In the case of texture, “Coarse loamy” and “Loamy skeletal” soils showed susceptibility to landslides. Soil material refers to the origin of the soil, and several materials, such as “Colluvium from granite,” “Colluvium from porphyry,” and “Residuum on granite,” showed susceptibility to landslides. The effective thickness of soil is related to the environment of plant growth: plants grow well in deep soil and poorly in shallow soil. In the study area, the shallow soil area exhibits susceptibility to landslides. Conversely, however, no landslide was found in areas of very shallow soil depth.

3. Methods

3.1. Decision tree

Fig. 1. Study area (Inje, Korea).

Fig. 2. Landslide location with TIN (Triangulated Irregular Network) image.

The decision tree algorithm C4.5 is widely used for classification tasks (Wu et al., 2008). It is based on ID3 (Quinlan, 1986) and adds functions such as the handling of continuous attributes. C4.5 consists of a tree growth step and a tree pruning step. Tree growth begins from a single node, which is split by selecting the attribute that best classifies the set of examples according to an attribute selection measure. The attribute selection measure uses the concept of entropy, which is defined as the degree of disorder. Thus, a tree grows by selecting the attribute with the smallest Entropy. At a node N, Entropy is calculated by

$$\mathrm{Entropy}(N) = -\sum_{j} p(C_j \mid N)\,\log_2 p(C_j \mid N) \tag{1}$$

where p(C_j|N) is the relative frequency of class C_j at node N. When attribute A splits N into k subsets N_1, ..., N_k, the Entropy for selecting attribute A is given by

$$\mathrm{Entropy}_A(N) = \sum_{j=1}^{k} \frac{|N_j|}{|N|} \times \mathrm{Entropy}(N_j) \tag{2}$$

InfoGain is the gain obtained from the difference between the Entropy of the original node and the Entropy of the newly split nodes:

$$\mathrm{InfoGain}(A) = \mathrm{Entropy}(N) - \mathrm{Entropy}_A(N) \tag{3}$$

Thus, C4.5 selects the attribute with the smallest Entropy or, equivalently, the biggest InfoGain. InfoGain has a tendency to select an attribute with many split points, which makes the tree grow toward continuous attributes. To solve this problem, InfoGain is normalized by SplitInfo, a kind of Entropy on the split points of an attribute, which has a high value for an attribute with many splits. When node N is divided into k subsets, SplitInfo is

$$\mathrm{SplitInfo}(A) = -\sum_{j=1}^{k} \frac{|N_j|}{|N|} \times \log_2 \frac{|N_j|}{|N|} \tag{4}$$

InfoGain compensated by SplitInfo is GainRatio, which is defined as follows:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{InfoGain}(A)}{\mathrm{SplitInfo}(A)} \tag{5}$$
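The attribute selection measures above can be illustrated with a minimal sketch. It assumes discrete attributes stored as dictionaries and a two-class label list; the function names (`entropy`, `split_by`, `gain_ratio`) and the toy pixels are hypothetical and are not taken from the authors' Java implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of the class labels at a node, as in Eq. (1)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def split_by(rows, labels, attribute):
    """Group the labels by the value each row takes for `attribute`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    return subsets


def gain_ratio(rows, labels, attribute):
    """GainRatio of Eq. (5): InfoGain (Eq. 3) divided by SplitInfo (Eq. 4)."""
    total = len(labels)
    subsets = split_by(rows, labels, attribute).values()
    entropy_a = sum(len(s) / total * entropy(s) for s in subsets)              # Eq. (2)
    info_gain = entropy(labels) - entropy_a                                    # Eq. (3)
    split_info = -sum(len(s) / total * log2(len(s) / total) for s in subsets)  # Eq. (4)
    # An attribute with a single value gives SplitInfo = 0; report no gain.
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): 8 pixels, 4 events ("e") and 4 nonevents ("n").
rows = [
    {"aspect": "S", "texture": "a"}, {"aspect": "S", "texture": "b"},
    {"aspect": "S", "texture": "a"}, {"aspect": "S", "texture": "b"},
    {"aspect": "N", "texture": "a"}, {"aspect": "N", "texture": "b"},
    {"aspect": "N", "texture": "a"}, {"aspect": "N", "texture": "b"},
]
labels = ["e", "e", "e", "e", "n", "n", "n", "n"]

best = max(["aspect", "texture"], key=lambda a: gain_ratio(rows, labels, a))
```

In this toy case `aspect` separates the classes perfectly and `texture` carries no information, so `aspect` would be chosen for the split. Continuous attributes, which C4.5 handles by searching for threshold split points, are outside the scope of this sketch.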

For landslide susceptibility mapping, we can consider the probability of the event class in a leaf node. When n_nonevent is the number of nonevent-class samples in the node and n_event is the number of event-class samples, the probability of the event class can be estimated as follows:

$$P(\mathrm{node}) = \frac{n_{event}}{n_{nonevent} + n_{event}} \tag{6}$$

Fig. 3. Procedure of landslide susceptibility mapping using decision tree.

Fig. 4. Twofold cross-validation step.

Table 2 Data distribution according to the layers.

| Layer (variable) | Value domain | No. of pixels in domain | No. of landslides | Frequency ratio | Landslide SetA | Landslide SetB |
|---|---|---|---|---|---|---|
| t_aspect | FLAT | 1177 | 0 | 0.00 | 0 | 0 |
| | NORTH | 169295 | 46 | 0.64 | 25 | 21 |
| | NORTHEAST | 150783 | 24 | 0.37 | 9 | 15 |
| | EAST | 115706 | 84 | 1.71 | 35 | 49 |
| | SOUTHEAST | 151134 | 155 | 2.41 | 95 | 60 |
| | SOUTH | 165153 | 104 | 1.48 | 54 | 50 |
| | SOUTHWEST | 193688 | 74 | 0.90 | 28 | 46 |
| | WEST | 222224 | 61 | 0.65 | 31 | 30 |
| | NORTHWEST | 218710 | 42 | 0.45 | 18 | 24 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| t_slope | 0 ~ 4 | 16376 | 0 | 0.00 | 0 | 0 |
| | 5 ~ 9 | 48348 | 0 | 0.00 | 0 | 0 |
| | 10 ~ 14 | 86785 | 4 | 0.11 | 1 | 3 |
| | 15 ~ 19 | 132934 | 36 | 0.64 | 18 | 18 |
| | 20 ~ 24 | 194440 | 95 | 1.15 | 58 | 37 |
| | 25 ~ 29 | 251315 | 132 | 1.24 | 67 | 65 |
| | 30 ~ 34 | 268259 | 144 | 1.26 | 63 | 81 |
| | 35 ~ 39 | 219885 | 112 | 1.20 | 47 | 65 |
| | 40 ~ 44 | 117105 | 48 | 0.96 | 29 | 19 |
| | 45 ~ 49 | 40957 | 17 | 0.98 | 10 | 7 |
| | 50 ~ 54 | 9928 | 1 | 0.24 | 1 | 0 |
| | 55 ~ 59 | 1449 | 1 | 1.62 | 1 | 0 |
| | 60 ~ 64 | 84 | 0 | 0.00 | 0 | 0 |
| | Over 65 | 5 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| t_curvature | Below −28 | 6 | 0 | 0.00 | 0 | 0 |
| | −27 ~ −23 | 17 | 0 | 0.00 | 0 | 0 |
| | −22 ~ −18 | 80 | 0 | 0.00 | 0 | 0 |
| | −17 ~ −13 | 789 | 1 | 2.98 | 0 | 1 |
| | −12 ~ −8 | 9052 | 32 | 8.32 | 6 | 26 |
| | −7 ~ −3 | 149623 | 292 | 4.59 | 99 | 193 |
| | −2 ~ 2 | 953923 | 260 | 0.64 | 185 | 75 |
| | 3 ~ 7 | 255806 | 5 | 0.05 | 5 | 0 |
| | 8 ~ 12 | 17256 | 0 | 0.00 | 0 | 0 |
| | 13 ~ 17 | 1158 | 0 | 0.00 | 0 | 0 |
| | 18 ~ 22 | 141 | 0 | 0.00 | 0 | 0 |
| | 23 ~ 27 | 16 | 0 | 0.00 | 0 | 0 |
| | Over 28 | 3 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| t_ridgebuffer (distance from ridge: meter) | 1 ~ 25 | 314988 | 75 | 0.56 | 34 | 41 |
| | 26 ~ 50 | 292307 | 192 | 1.55 | 98 | 94 |
| | 51 ~ 75 | 250570 | 154 | 1.45 | 90 | 64 |
| | 76 ~ 100 | 204480 | 104 | 1.20 | 48 | 56 |
| | 101 ~ 125 | 150759 | 48 | 0.75 | 18 | 30 |
| | 126 ~ 150 | 99397 | 13 | 0.31 | 3 | 10 |
| | 151 ~ 175 | 51535 | 4 | 0.18 | 4 | 0 |
| | 176 ~ 200 | 17706 | 0 | 0.00 | 0 | 0 |
| | 201 ~ 225 | 4863 | 0 | 0.00 | 0 | 0 |
| | 226 ~ 250 | 1145 | 0 | 0.00 | 0 | 0 |
| | 251 ~ 275 | 120 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_diameter (cm) | Non-forest | 106105 | 12 | 0.27 | 4 | 8 |
| | 6 ~ 16 | 365945 | 255 | 1.64 | 128 | 127 |
| | 18 ~ 28 | 735844 | 304 | 0.97 | 153 | 151 |
| | Over 30 | 179976 | 19 | 0.25 | 10 | 9 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_age | Non-forest | 106105 | 12 | 0.27 | 4 | 8 |
| | 11–20 year | 214605 | 184 | 2.02 | 93 | 91 |
| | 21–30 year | 151340 | 71 | 1.10 | 35 | 36 |
| | 31–40 year | 203433 | 86 | 0.99 | 41 | 45 |
| | 41–50 year | 570157 | 225 | 0.93 | 115 | 110 |
| | Over 51 year | 142230 | 12 | 0.20 | 7 | 5 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_density | Non-forest | 106105 | 12 | 0.27 | 4 | 8 |
| | Less than 50% | 19390 | 6 | 0.73 | 2 | 4 |
| | 51–70% | 682056 | 323 | 1.11 | 159 | 164 |
| | Over 71% | 580319 | 249 | 1.01 | 130 | 119 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| f_type | Non-forest | 101418 | 8 | 0.19 | 3 | 5 |
| | Pine | 465626 | 237 | 1.20 | 117 | 120 |
| | Non-conifer | 322600 | 109 | 0.79 | 54 | 55 |
| | Agricultural land | 2049 | 0 | 0.00 | 0 | 0 |
| | Mixed (nonconifer, conifer) | 206404 | 23 | 0.26 | 12 | 11 |
| | Planted pine | 17897 | 27 | 3.55 | 12 | 15 |
| | Planted nonconifer | 1261 | 0 | 0.00 | 0 | 0 |
| | Korean pine | 141118 | 123 | 2.05 | 60 | 63 |
| | Larch | 122973 | 55 | 1.05 | 33 | 22 |
| | Poplar | 3886 | 4 | 2.42 | 3 | 1 |
| | Field | 2638 | 4 | 3.57 | 1 | 3 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_texture | Coarse loamy | 1231824 | 536 | 1.02 | 259 | 277 |
| | Fine loamy | 54928 | 6 | 0.26 | 5 | 1 |
| | Fine loamy or coarse loamy | 3512 | 0 | 0.00 | 0 | 0 |
| | Loamy skeletal | 94508 | 47 | 1.17 | 30 | 17 |
| | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Sandy skeletal | 3065 | 1 | 0.77 | 1 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_material | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Alluvium | 6498 | 3 | 1.09 | 0 | 3 |
| | Alluvium-colluvium from acid rock | 3077 | 0 | 0.00 | 0 | 0 |
| | Alluvium-colluvium from granite | 8980 | 3 | 0.79 | 2 | 1 |
| | Colluvium | 39992 | 5 | 0.29 | 4 | 1 |
| | Colluvium from granite | 74310 | 34 | 1.08 | 25 | 9 |
| | Colluvium from porphyry | 11693 | 7 | 1.41 | 3 | 4 |
| | Local alluvium | 6212 | 1 | 0.38 | 1 | 0 |
| | Local alluvium-colluvium | 21278 | 4 | 0.44 | 3 | 1 |
| | Residuum on granite | 1210155 | 532 | 1.03 | 257 | 275 |
| | Residuum on granite gneiss | 5642 | 1 | 0.42 | 0 | 1 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_drainage | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Imperfectly | 4200 | 0 | 0.00 | 0 | 0 |
| | Moderately well | 15341 | 5 | 0.77 | 2 | 3 |
| | Somewhat excessively | 1210330 | 532 | 1.03 | 257 | 275 |
| | Well | 157966 | 53 | 0.79 | 36 | 17 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
| s_thickness | Deep | 32206 | 9 | 0.66 | 7 | 2 |
| | Moderately deep | 1108115 | 455 | 0.97 | 232 | 223 |
| | River overflow area | 33 | 0 | 0.00 | 0 | 0 |
| | Shallow | 241114 | 126 | 1.23 | 56 | 70 |
| | Very shallow | 6402 | 0 | 0.00 | 0 | 0 |
| | Sum | 1,387,870 | 590 | 1.00 | 295 | 295 |
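The frequency-ratio column of Table 2 can be reproduced from the pixel and landslide counts. The sketch below is one way to do so, assuming the frequency ratio is the share of landslides in a class divided by the share of pixels in that class; the function name `frequency_ratio` is hypothetical, and the counts are copied from the t_aspect rows of Table 2.

```python
def frequency_ratio(class_events, class_pixels, total_events, total_pixels):
    """Share of events falling in a class divided by the share of area in that class."""
    event_share = class_events / total_events
    area_share = class_pixels / total_pixels
    return event_share / area_share


TOTAL_LANDSLIDES = 590
TOTAL_PIXELS = 1_387_870

# (number of landslides, number of pixels) for three aspect classes in Table 2
aspect_counts = {
    "SOUTHEAST": (155, 151_134),
    "NORTH": (46, 169_295),
    "FLAT": (0, 1_177),
}

aspect_fr = {
    name: round(frequency_ratio(events, pixels, TOTAL_LANDSLIDES, TOTAL_PIXELS), 2)
    for name, (events, pixels) in aspect_counts.items()
}
```

With these inputs the southeast-facing class has a ratio above 1 (a strong relationship with landslides) and the north-facing class a ratio below 1, matching the values listed in the table.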

However, the leaf-node frequency in Eq. (6) should not be used directly as the estimated probability of an event, because tree nodes are split by a purity measure and the probability estimated from the frequencies of a leaf node may therefore be an extreme value: 0 or 1. Thus, instead of estimating the probability directly from the frequencies of leaf nodes, it is more desirable to estimate a relative probability by ranking the leaf nodes, which can be achieved by smoothing the frequencies.

3.2. Leaf node ranking methods

The methods outlined here were developed for applications with class-imbalanced data sets, and they can be applied in the evaluation of reliability and in cost-sensitive learning. Leaf node ranking methods commonly use the ratio of the target class in the leaf node, but they differ in how the smoothing is done.

Fig. 5. AUC values according to the parameter M of m-branch smoothing in the goodness of fit and twofold validation.

Laplace smoothing (Provost and Domingos, 2003) applies the Laplace correction to avoid probability values of 1 or 0 at leaf nodes. Another method, m-estimate smoothing (Cussents, 1993; Zadrozny and Elkan, 2001), uses the prior probability of events to smooth the probabilities so that the estimates are shifted toward the minority class base rate. Both of these methods assume a uniform class distribution of the sample (Ferri et al., 2003). To improve predictive accuracy on a class-imbalanced data set, Ferri et al. (2003) introduced m-branch smoothing, a recursive root-to-leaf extension of the m-estimate. On each path, the probability estimate at a parent node is propagated downward to all of its children. When the target class is the event class, the rank of a child node is expressed by m-branch smoothing as follows:

$$\mathrm{Rank}(node.child) = \frac{n_{event} + m \times \mathrm{Rank}(node.parent)}{n_{event} + n_{nonevent} + m} \tag{7}$$

where the parameter m is calculated by

$$m = M + \frac{d-1}{d} \times M \times \sqrt{N} \tag{8}$$

where M is a constant, N is the global cardinality of the data set, and d is the depth of the node.
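The three smoothing schemes can be compared with a short sketch. It is illustrative only: the function names are hypothetical, the root is assumed to sit at depth 1, and the rank passed into the root (`root_prior`, set here to 0.5 for two classes) is an assumption that the text does not specify.

```python
from math import sqrt


def raw_probability(n_event, n_nonevent):
    """Unsmoothed leaf frequency, Eq. (6); can be exactly 0 or 1."""
    return n_event / (n_event + n_nonevent)


def laplace(n_event, n_nonevent, n_classes=2):
    """Laplace correction: add one pseudo-count per class."""
    return (n_event + 1) / (n_event + n_nonevent + n_classes)


def m_estimate(n_event, n_nonevent, prior, m):
    """m-estimate: shrink the leaf frequency toward a prior event rate."""
    return (n_event + m * prior) / (n_event + n_nonevent + m)


def m_for_depth(M, depth, N):
    """Depth-dependent m of Eq. (8); N is the size of the whole data set."""
    return M + (depth - 1) / depth * M * sqrt(N)


def m_branch_rank(path_counts, M, N, root_prior=0.5):
    """Apply Eq. (7) along a root-to-leaf path.

    path_counts lists (n_event, n_nonevent) for each node on the path,
    starting at the root (assumed depth 1) and ending at the leaf.
    """
    rank = root_prior  # assumption: uniform prior fed into the root
    for depth, (n_event, n_nonevent) in enumerate(path_counts, start=1):
        m = m_for_depth(M, depth, N)
        rank = (n_event + m * rank) / (n_event + n_nonevent + m)
    return rank


# Hypothetical path: root with 5 events among 100 samples, then a pure leaf.
pure_leaf_rank = m_branch_rank([(5, 95), (3, 0)], M=2, N=100)
```

For the pure leaf in this example the raw frequency would be 1, whereas the m-branch rank stays well below 1 because the low event rate at the root is propagated down the path.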

4. Mapping and validation of landslide susceptibility

We followed the process of landslide susceptibility mapping shown in Fig. 3. The C4.5 algorithm was used to construct the decision tree, as in previous studies (Provost and Domingos, 2003; Zadrozny and Elkan, 2001; Ferri et al., 2003). After tree construction, the leaf nodes were ranked by the m-branch smoothing method. To find the best accuracy of the tree model, we tested the accuracy for different values of the parameter M of the m-branch smoothing. To assess accuracy, we carried out a goodness-of-fit test using the set of all known landslides and a twofold cross-validation to test the predictive ability of the decision tree. In the twofold cross-validation, two independent subsets were used to construct and to evaluate the model. The full-grown decision tree based on C4.5, without a pruning step, was used. We programmed the tree algorithm in the Java programming language.

For the assessment of results, the minority event data were regarded as confirmed, whereas the majority nonevent data were not, because events might occur in nonevent areas in the future. One widely used assessment technique, the Receiver Operating Characteristic (ROC) (Swets, 1988), can be considered for model evaluation, but it does not take this aspect into account because its evaluation includes the nonevent data, which are not confirmed. As an alternative, a lift chart can be used, which evaluates the degree of classification on the target class. Lift charts were introduced in the business data mining area by Berry and Linoff (1997), and Chung and Fabbri (1999) used one to evaluate a landslide prediction model. Generally, a lift chart is drawn by accumulating the lift value. What lift actually measures is the change in concentration of a particular class when the model is used to select a group from the general population (Berry and Linoff, 1997). Therefore, the more the resulting curve is biased toward the left side, the higher the accuracy of the prediction result, and the performance is quantified by the area under the curve (AUC).

To test the goodness of fit of the model, we used the set of 590 landslides both to construct and to evaluate the prediction model. The constructed tree had 828 leaf nodes. A series of nodes from the root to a single leaf can be converted into a rule.

For the twofold cross-validation, two groups of 295 landslides were selected from the 590 landslides; the distribution of both groups, Landslide SetA and Landslide SetB, is given in Table 2. In the first fold, Landslide SetA was used to build a decision tree, and Landslide SetB was used as the validation data set. In the second fold, the roles of the two data sets were swapped. The constructed trees had 393 leaves in the first fold and 486 leaves in the second fold. This procedure is described in Fig. 4.

In the goodness-of-fit test, the best accuracy, an AUC of 89.26%, was obtained when M was 2500. In the twofold cross-validation, the AUC was 86.08% when M was 8000, as shown in Fig. 5. The susceptibility maps from the twofold cross-validation are shown in Fig. 6. The susceptibility map trained using the set of all landslides is shown in Fig. 7. The cumulative lift charts for each result are shown in Fig. 8.

The landslide susceptibility results can also be assessed by the distribution of the percentile value of susceptibility at the landslide locations. Fig. 9 represents the distribution of the percentile value of susceptibility obtained from both the twofold cross-validation and the goodness of fit, in the 95% confidence interval. In the twofold cross-validation result, the mean was 15.01% (Std. Dev. = 15.94) and the median was 12.58%. In the goodness-of-fit result, the mean was 11.07% (Std. Dev. = 11.75) and the median was 7.63%. Thus, the result of the goodness of fit was better than the result of the twofold cross-validation.

Fig. 6. Twofold cross-validation results; (a) the result of first fold, and (b) the result of second fold.

Fig. 7. Landslide susceptibility map using the all-known data set. Rectangular area is selected to represent the rules.

Fig. 8. Cumulative lift charts of goodness of fit and twofold cross-validation.

Fig. 9. Box and whisker plots representing percentiles of landslide susceptibility. The box represents second and third quartiles. Whiskers represent first and fourth quartiles. The thicker line in the box represents the median. An open circle represents extreme values. Star points represent outliers.

5. Discussion

A decision tree is built by selecting attributes; thus, prior knowledge of the attributes is not needed. This feature helps in gaining knowledge from a real-world phenomenon, because many factors are correlated in the real world. Thus, we were able to collect and use all 12 landslide factors for this study. The use of continuous attributes is another advantage of a decision tree that improves its prediction ability. We used m-branch smoothing to induce a relative probability from the tree. This method can estimate the relative value of event occurrence for landslide susceptibility mapping. However, for better accuracy, the parameter value of the m-branch smoothing should be searched experimentally.

Because landslides occur through the interaction of causal factors, analyzing the factors of an event requires explaining the relationships among them. A rule consists of an “AND” combination of the nodes from the root to the leaf. When a rule is interpreted, all of the nodes in the combination must be used. Thus, the relationship among causal factors is implicitly included in the rule. We selected four places to represent the configuration of nodes in the result trained with the all-landslide data set, shown within the rectangular area of Fig. 7. The places marked (1) and (2) are of relatively low susceptibility, as shown in Fig. 10. Sites (3) and (4) are locations where landslides occurred. The node information of each location is described in Fig. 11.

A series of nodes can be represented as a rule. For example, at location (1), the m-branch and percentile values are 0.58092 and 43.83%, respectively, and the rule is represented as t_curvature > –1 & s_texture = “Loamy skeletal” & t_slope <= 33.0. For the event occurrence locations (3) and (4), the m-branch values are 0.89634 (percentile = 1.63) and 0.89556 (percentile = 1.80), respectively, which are higher than those of (1) and (2). The two event locations share the 1st to 8th nodes. Among the nodes, “t_curvature” and “t_slope” appear several times because continuous attributes can be selected repeatedly. When we interpret the series of nodes as a rule, a node at a shallower level can be ignored when the same attribute is found at a deeper level. For example, location (3) can be represented as t_ridgebuffer <= 27.0 & s_material = “Residuum on granite gneiss” & f_diameter = “18 ~ 28” & t_slope > 25 & t_curvature <= –6.0 & t_aspect = “South” & f_density = “51 ~ 70%” & s_thickness = “Moderately deep” & f_age = “30 ~ 40 year”.
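The reading of a root-to-leaf path as an “AND” rule can be sketched as follows. The tuple encoding, the `matches` and `simplify` helpers, and the example pixel are hypothetical; only the rule for location (1) is taken from the text. The simplification step assumes that a condition on the same attribute with the same comparison at a deeper level is at least as strict as the shallower one, which holds for thresholds chosen along a single tree path.

```python
import operator

# Comparison symbols used in the rules and the functions that evaluate them
OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}

# Rule for location (1), written as (attribute, comparison, value) from root to leaf
rule_location_1 = [
    ("t_curvature", ">", -1),
    ("s_texture", "=", "Loamy skeletal"),
    ("t_slope", "<=", 33.0),
]


def matches(rule, pixel):
    """A pixel satisfies a rule only if every condition on the path holds (AND)."""
    return all(OPS[op](pixel[attr], value) for attr, op, value in rule)


def simplify(rule):
    """Keep only the deepest condition for each (attribute, comparison) pair."""
    deepest = {}
    for attr, op, value in rule:  # later entries are deeper in the tree
        deepest[(attr, op)] = value
    return [(attr, op, value) for (attr, op), value in deepest.items()]


# Hypothetical pixel values, for illustration only
pixel = {"t_curvature": 0.5, "s_texture": "Loamy skeletal", "t_slope": 28.0}
pixel_matches = matches(rule_location_1, pixel)

# A path that tests t_slope twice collapses to the deeper (stricter) threshold
path = [("t_slope", ">", 20), ("t_curvature", "<=", -2), ("t_slope", ">", 25)]
short_rule = simplify(path)
```

Here `short_rule` keeps `t_slope > 25` and drops the shallower `t_slope > 20`, mirroring how repeated continuous attributes are read in the rules above.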
From a predictive point of view, we carried out a twofold cross-validation. Given the class-imbalanced landslide data set, the performance of the prediction model may have been underestimated, because the twofold cross-validation does not consider two aspects. First, landslides generally tend to occur again at places where they previously occurred, whereas the cross-validation method performs a leave-one-out process or tests without replacement. Second, if the number of folds is small, the predictive result will be pessimistic because the amount of training data used to construct the prediction model is small. These two aspects can explain the difference in results between the goodness of fit and the twofold cross-validation.
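The cumulative lift chart and its AUC, used for the validation in Section 4, can be sketched as follows. This is one common reading of such a chart, assumed here rather than taken from the authors' code: pixels are sorted from most to least susceptible, the x-axis is the cumulative share of pixels, and the y-axis is the cumulative share of landslides captured. The function names and the toy scores are hypothetical, and tied scores are simply kept in input order.

```python
def cumulative_lift_curve(scores, is_event):
    """Return (xs, ys): cumulative share of pixels vs. cumulative share of events."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_events = sum(is_event)
    xs, ys, found = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        found += is_event[i]
        xs.append(rank / len(scores))
        ys.append(found / total_events)
    return xs, ys


def area_under_curve(xs, ys):
    """Trapezoidal area under the cumulative lift curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


# Toy example: four pixels, two of which are landslide pixels (1 = event)
scores = [0.9, 0.8, 0.3, 0.1]
mixed = area_under_curve(*cumulative_lift_curve(scores, [1, 0, 1, 0]))
ideal = area_under_curve(*cumulative_lift_curve(scores, [1, 1, 0, 0]))
```

A model that ranks all landslide pixels first pushes the curve toward the left and gives a larger area (`ideal` exceeds `mixed`). In a twofold setting, the scores would come from a tree trained on one landslide set and the events from the other set.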

Fig. 10. The same rectangular area as in Fig. 7 with marked locations (1) and (2) of relatively low susceptibility, and (3) and (4) where landslides occurred.


Fig. 11. Configuration of nodes at locations (1), (2), (3), and (4) seen in Fig. 10.

6. Conclusions

Landslides are caused mainly by heavy rains or earthquakes, but their occurrence and scale differ depending on geo-environmental conditions. A landslide is explained by the environmental conditions at the location where the event occurred, and thus a landslide event can be predicted at a location when specific conditions are satisfied. A decision tree was not previously considered to be a suitable method for analyzing landslide susceptibility because such trees assume a uniform class distribution in the data. However, the ratio between event and nonevent classes in spatial event data sets is highly imbalanced, because landslides represented in grid raster spatial data are composed of a small number of pixels. Thus, the minority event class is treated as noise. Moreover, it is not desirable to estimate probability directly from a decision tree in a class-imbalanced data set.

In this paper, we used a full-grown decision tree because the minority event class can otherwise be ignored in the tree-building process. The minority event class, however, has more meaning than the majority nonevent class in spatial data. The leaf node ranking method for representing susceptibility is achieved by smoothing frequencies, and the smoothing technique played an important role in estimating relative rank in the imbalanced data set. This study showed that a decision tree can be used efficiently for spatial prediction problems. Furthermore, it is expected that a decision tree will be widely used for various other spatial prediction problems.

Acknowledgments

This work was supported in part by the cooperative research program of the Korea Institute of Geoscience and Mineral Resources (KIGAM) and the Korea Aerospace Research Institute (KARI), and in part by a grant (#07-KLSG-C05) from the Cutting-edge Urban Development - Korean Land Spatialization Research Project funded by the Ministry of Land, Transport and Maritime Affairs (MLTM) of the Korean government, and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF No. 2010-0001732). Constructive comments and suggestions by anonymous reviewers also helped us improve the presentation of this paper.

References

Aleotti, P., Chowdhury, R., 1999. Landslide hazard assessment: summary review and new perspectives. Bull Eng Geo Environ 58, 21–44.
Atkinson, P.M., Massari, R., 1998. Generalized linear modeling of susceptibility to landsliding in the central Apennines, Italy. Computer & Geosciences 24, 373–385.
Berry, M.J.A., Linoff, G., 1997. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons.
Bonham-Carter, G.F., 1994. Geographic information system for geoscientist, modeling with GIS. Pergamon Press, Oxford. 398.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Chapman & Hall, Wadsworth, Inc., New York.
Chung, C.F., Fabbri, A.G., 1999. Probabilistic prediction models for landslide hazard mapping. Photogrammetric Engineering & Remote Sensing (PE&RS) 65 (12), 1388–1399.
Cussents, J., 1993. Bayes and pseudo-Bayes estimates of conditional probabilities and their reliabilities. Proceedings of European Conference on Machine Learning.
Dai, F.C., Lee, C.F.J., Li, J., Xu, Z.W., 2001. Assessment of landslide susceptibility on the natural terrain of Lantau Island, Hong Kong. Environmental Geology 40, 381–391.
Donati, L., Turrini, M.C., 2002. An objective method to rank the importance of the factors predisposing to landslides with the GIS methodology: application to an area of the Apennines (Valnerina; Perugia, Italy). Eng. Geol. 63, 277–289.
Ermini, L., Catani, L., Casagli, N., 2004. Artificial Neural Networks applied to landslide susceptibility assessment. Geomorphology 66 (1–4), 327–343.
Ferri, C., Flach, P.A., Hernández-Orallo, J., 2003. Improving the AUC of probabilistic estimation trees. Proc. of the 14th European Conf. on Machine Learning, pp. 121–132.
Gómez, H., Kavzoglu, T., 2005. Assessment of shallow landslide susceptibility using artificial neural networks in Jabonosa River Basin, Venezuela. Eng. Geol. 78, 11–27.
Guzzetti, F., Reichenbach, P., Ardizzone, F., Cardinali, M., Galli, M., 2006. Estimating the quality of landslide susceptibility models. Geomorphology 81, 166–184.
Lee, S., Chol, U.C., 2003. Development of GIS-based geological hazard information system and its application for landslide analysis in Korea. Geosci. J. 7, 243–252.
Lee, S., Min, K., 2001. Statistical analysis of landslide susceptibility at Yongin, Korea. Environmental Geology 40, 1095–1113.
Lee, S., Sambath, T., 2006. Landslide susceptibility mapping in the Damrei Romel area, Cambodia using frequency ratio and logistic regression models. Environ. Geol. 50, 847–855.
Lee, S., Ryu, J.H., Won, J.S., Park, H.J., 2004. Determination and application of the weights for landslide susceptibility mapping using an artificial neural network. Eng. Geol. 71 (3–4), 289–302.
Luzi, L., Pergalani, F., Terlien, M.T.J., 2000. Slope vulnerability to earthquakes at subregional scale, using probabilistic techniques and geographic information systems. Eng. Geol. 58, 313–336.
Melchiorre, C., Matteucci, M., Azzoni, A., Zanchi, A., 2008. Artificial neural networks and cluster analysis in landslide susceptibility zonation. Geomorphology 94, 379–400.
Nefeslioglu, H., Duman, T., Durmaz, S., 2008. Landslide susceptibility mapping for a part of tectonic Kelkit Valley (Eastern Black Sea region of Turkey). Geomorphology 94, 401–418.
Nefeslioglu, H., Sezer, E., Gokceoglu, C., Bozkir, A., Duman, T., 2010. Assessment of landslide susceptibility by decision trees in the metropolitan area of Istanbul, Turkey. Mathematical Problems in Engineering 2010, Article ID 901095.
Neuhäuser, B., Terhorst, B., 2007. Landslide susceptibility assessment using “weights-of-evidence” applied to a study area at the Jurassic escarpment (SW-Germany). Geomorphology 86, 12–24.
Pal, M., Mather, P.M., 2003. An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sens. Environ. 86, 554–556.
Provost, F.J., Domingos, P., 2003. Tree Induction for Probability-based Ranking. Machine Learning, Kluwer Academic Publisher, 52 (3), 199–215.
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Saito, H., Nakayama, D., Matsuyama, H., 2009. Comparison of landslide susceptibility based on a decision-tree model and actual landslide occurrence: the Akaishi Mountains, Japan. Geomorphology 109 (3–4), 108–121.
Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., MacLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D., 2008. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14 (1), 1–37.
Zadrozny, B., Elkan, C., 2001. Learning and making decisions when costs and probabilities are both unknown. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213.