Journal of Hydrology 533 (2016) 343–352
Applying a weighted random forests method to extract karst sinkholes from LiDAR data

Junfeng Zhu*, William P. Pierskalla Jr.

Kentucky Geological Survey, University of Kentucky, Lexington, Kentucky 40506, USA

* Corresponding author. E-mail: [email protected]
Article info

Article history: Received 29 July 2015; received in revised form 3 December 2015; accepted 7 December 2015; available online 17 December 2015. This manuscript was handled by Andras Bardossy, Editor-in-Chief, with the assistance of Wolfgang Nowak, Associate Editor.

Keywords: Sinkhole; LiDAR; Weighted random forests; Karst; Machine learning
Summary

Detailed mapping of sinkholes provides critical information for mitigating sinkhole hazards and understanding groundwater and surface water interactions in karst terrains. LiDAR (Light Detection and Ranging) measures the earth's surface at high resolution and high density and has shown great potential to drastically improve the locating and delineating of sinkholes. However, processing LiDAR data to extract sinkholes requires separating sinkholes from other depressions, which can be laborious because of the sheer number of depressions commonly generated from LiDAR data. In this study, we applied random forests, a machine learning method, to automatically separate sinkholes from other depressions in a karst region in central Kentucky. The sinkhole-extraction random forest was grown on a training dataset built from an area where LiDAR-derived depressions were manually classified through a visual inspection and field verification process. Based on the geometry of depressions, as well as natural and human factors related to sinkholes, 11 parameters were selected as predictive variables to form the dataset. Because the training dataset was imbalanced, with the majority of depressions being non-sinkholes, a weighted random forests method was used to improve the accuracy of predicting sinkholes. The weighted random forest achieved an average accuracy of 89.95% for the training dataset, demonstrating that the random forest can be an effective sinkhole classifier. Testing of the random forest in another area, however, resulted in moderate success, with an average accuracy rate of 73.96%. This study suggests that an automatic sinkhole extraction procedure such as the random forest classifier can significantly reduce time and labor costs and makes it more tractable to map sinkholes using LiDAR data for large areas. However, the random forests method cannot totally replace manual procedures, such as visual inspection and field verification.

© 2015 Elsevier B.V. All rights reserved.
1. Introduction

In karst terrains, sinkholes are a common geologic hazard, often resulting in damage to buildings and roads and imposing challenges for various land-use planning activities (Currens, 2002; Waltham, 2008). Sinkholes are also major connections between surface water and groundwater (Taylor and Greene, 2008). Sinkholes commonly appear as depressions or cover-collapses on the surface and thus can be mapped if detailed topographic information on the earth's surface is available. Light Detection and Ranging (LiDAR) is a remote sensing technology that uses lasers to measure earth surface elevation with very high density and accuracy (NOAA, 2012). LiDAR can see details of the ground surface in most forested areas because abundant laser points reach the ground through small openings in the canopy (NOAA, 2012), making LiDAR a suitable tool for creating highly detailed bare-earth elevation models.
LiDAR has broad applications and is especially appealing for studying processes controlled by bare-earth elevation, such as hydrologic modeling and wetland delineation (e.g., Jones et al., 2008; Maxa and Bolstad, 2009). In recent years, LiDAR has also been applied to identify sinkholes in karst areas with promising results (e.g., Seale et al., 2008; Rahimi and Alexander, 2013; Zhu et al., 2014). However, the process of extracting sinkholes from LiDAR data can be tedious for several reasons. First, numerous sinkholes can occur in mature karst regions; in Kentucky, for example, more than 100,000 sinkholes were digitized from 1:24,000 scale topographic maps alone (Paylor et al., 2003). Second, LiDAR can reveal even more sinkholes than are displayed on topographic maps owing to its high-accuracy and high-density data; Zhu et al. (2014) revealed three times as many sinkholes from LiDAR as were found on the 1:24,000 maps in a small area in central Kentucky. Furthermore, processing LiDAR to find sinkholes often results in numerous depression features that are not sinkholes (false positives). Zhu et al. (2014) found that approximately 85% of the
extracted depression features from LiDAR were not sinkholes in their study area. To identify sinkholes from numerous depressions, Zhu et al. (2014) used a visual inspection method, which involved checking the depressions one by one to judge whether or not each depression was a sinkhole. Although the method was demonstrated to be effective for a small area, expanding it to county- or state-level investigations could be labor-intensive and costly because thousands or even millions of depression features may be found in a large area. To make LiDAR a more applicable technique for locating sinkholes, a more efficient depression processing method is needed.

The process of visually separating sinkholes from other depression features can be considered a classification problem, which may be automated on a computer. Rahimi and Alexander (2013) developed an automatic sinkhole mapping procedure to find new sinkholes using a depth-to-area ratio criterion established from existing sinkholes in Winona County, Minnesota. Their procedure identified 127 sinkholes in test areas, 97 of them correct. However, their method tends to discard sinkholes outside the range of morphometric consideration, such as small sinkholes and large shallow sinkholes. Filin and Baruch (2010) used area, depth, compactness, and a surface fitting test to extract sinkholes along the Dead Sea coastline, identifying approximately 350 sinkholes with 97% accuracy. The variables in Filin and Baruch (2010) were applied separately, and the criterion for each variable was a single cut-off value subjectively defined based on site conditions. These automatic sinkhole extraction methods considered only a few variables related to sinkhole geometry and therefore are not comparable to a visual inspection, which takes into account many more variables. The visual inspection process in Zhu et al. (2014) considered surface features in aerial images, characteristics of shaded-relief maps, and depression geometries simultaneously. Human eyes excel at detecting geometric shapes, distinguishing subtle variations in hue and tone, and processing and comparing multiple types of information. Another aspect of visual inspection is that there are no clear-cut criteria for many of the variables being considered, making it difficult to translate the manual procedure into an automatic process that can be performed on a computer.

Statistical learning, on the other hand, aims to construct algorithms that learn patterns and structures from input data (i.e., training data) and apply the learned patterns and structures to make predictions for new data (Romberg and Saffran, 2010). One of the common statistical learning methods is decision trees, which represent patterns and structures in the input data with hierarchical and sequential nodes that form tree-like structures (Hastie et al., 2009). The process of constructing the tree-like structure is commonly referred to as tree-growing. Decision trees are widely used for automated classification in diverse scientific disciplines (Murthy, 1998). Based on the principles of decision trees, Breiman (2001) developed the random forests method, which constructs many decision trees from a dataset and combines the results from all the trees to make predictions.
The random forests method offers substantial improvements in classification accuracy and robustness to noise over common classification methods (Cutler et al., 2007) and has become a popular choice for data classification (Hastie et al., 2009). The method is also computationally efficient because it can be easily implemented for parallel computing (Breiman, 2001). Recently, random forests have been applied to study natural resources and hydrology (e.g., Cutler et al., 2007; Peters et al., 2007; Mutanga et al., 2012). Random forests can potentially be an excellent tool for automatically identifying sinkholes among LiDAR-derived depressions. Miao et al. (2013) tested random forests for extracting sinkholes from depressions based on geometry and depth variables.
Their random forest achieved 87.9% classification accuracy for a training dataset of 66 records with roughly equal numbers of sinkholes and non-sinkholes; such a dataset is referred to as balanced in classification. However, the ratio of sinkholes to non-sinkholes can be far from equal in other karst areas. In the Floyds Fork watershed in central Kentucky, Zhu et al. (2014) found that more than 80% of LiDAR-derived depressions were not sinkholes; this type of dataset is referred to as imbalanced because the class of interest is the "rare" class. Imbalanced data can be problematic for many classification algorithms, including random forests, because these algorithms aim to minimize the overall error rate, yielding better prediction accuracy for the majority class and poorer accuracy for the minority class (Chen et al., 2004). Additionally, all the variables in the dataset used by Miao et al. (2013) were derived from the shape and depth of the depressions, whereas many natural and human factors, such as geology, hydrologic conditions, soil type and depth, and land-use, are known to affect the occurrence of sinkholes (White et al., 1986; Tharp, 1999).

In this study, we applied a weighted random forests method to automatically extract sinkholes from an imbalanced dataset of depressions. We considered parameters related not only to the geometry of depressions but also to natural and human factors deemed to affect sinkhole occurrence. The main purpose of this study is to test whether the random forests method can effectively identify sinkholes among depressions extracted from high-resolution, high-accuracy LiDAR data. The random forests method relies on a training dataset to learn sinkhole characteristics that may separate sinkholes from other depressions and applies the learned characteristics to make predictions for another dataset. That other dataset, called the test dataset, can be used to assess the effectiveness of the random forest grown from the training dataset when its classes are already known from other means.
2. Weighted random forests

The random forests method is a modification of bootstrap aggregation, or bagging, a popular machine learning technique for reducing variance in prediction (Hastie et al., 2009). A single decision tree is well known to have high variance and to be prone to noise. The basic idea of bootstrap aggregation is to build many decision trees, each grown on a randomly-sampled subset of the training data, and to make predictions for unseen data by aggregating the results from all trees. The random forests method advances bootstrap aggregation by growing individual trees using not only a subset of the training data but also a randomly-sampled subset of the features, or variables, for splitting tree nodes. We give a brief summary of the random forests method here; detailed descriptions of the algorithm can be found in Breiman (2001) and Breiman and Cutler (2004).

Given a training dataset with N records and P variables per record, the random forests method grows individual classification or regression trees by randomly drawing n (n < N) bootstrap samples and growing a tree from the bootstrapped data. The tree is grown by repeating a three-step process for each node until a minimum node size is reached (Hastie et al., 2009): (1) select p (p < P) variables at random; (2) pick the best split among the p variables; and (3) split the node into two nodes. Subsequent trees are grown similarly, each on a different bootstrap sample. Once all trees are grown, a prediction for a new data point x is made by running x down all the trees and aggregating the results: by majority vote for classification and by averaging for regression.
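To make the three-step procedure concrete, the following is a minimal sketch of the core loop, written with scikit-learn decision trees as the base learners; it illustrates the general algorithm rather than reproducing the authors' FORTRAN implementation, and the function names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=300, seed=0):
    """Grow a forest: each tree fits a bootstrap sample of the records and
    considers a random subset of sqrt(P) variables at every split."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")  # p < P variables per split
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict(forest, X_new):
    """Classify new records (classes coded 0/1) by majority vote over all trees."""
    votes = np.stack([tree.predict(X_new) for tree in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```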
The regular random forests method tends to be biased toward the majority class because it is constructed to achieve an overall minimum error rate with equal consideration for all classes. This can be problematic for imbalanced training datasets, as a classification missing all of the minority class can still yield an overall low error rate (Chen et al., 2004). The weighted random forests method aims to reduce this bias by assigning different weights to different classes: misclassifying a class with a higher weight carries a higher penalty than misclassifying a class with a lower weight, so the error-rate calculation weighs a higher-weighted class more heavily. The class weights are incorporated into the criterion for finding splits and into the determination of each terminal node's class by weighted majority vote (Chen et al., 2004). Imbalance in error rates among classes can be used to select the weights. Breiman and Cutler (2004) implemented the random forests method in FORTRAN codes and made the codes freely available online. The codes include the weighted random forests method, allowing users to assign different weights to the classes. We used these FORTRAN codes for this study.
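The same weighting idea is exposed directly in common libraries; as an illustration only, with scikit-learn standing in for the FORTRAN codes used in this study, a class-weighted forest can be configured as follows.

```python
from sklearn.ensemble import RandomForestClassifier

# Misclassifying a sinkhole (class 1) costs four times as much as
# misclassifying a non-sinkhole (class 0); cf. the 4:1 ratio in Section 4.
weighted_rf = RandomForestClassifier(
    n_estimators=300,
    class_weight={0: 1, 1: 4},  # weights enter the split criterion and the votes
    oob_score=True,             # report out-of-bag accuracy after fitting
    random_state=0,
)
```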
3. Building a weighted random forest for sinkhole detection

3.1. Variable selection

A training dataset with multiple variables and multiple records is needed to grow a random forest. We selected 11 variables to build a random forest for sinkhole detection: three based on the surficial shape of depressions, four on the depth of depressions, two on natural and human factors, and the remaining two on field observations made in Zhu et al. (2014). To calculate the variables for the training dataset, we first followed the procedure in Zhu et al. (2014) to extract depressions, as polygons, from LiDAR point clouds. The LiDAR data were provided by the Louisville/Jefferson County Information Consortium (LOJIC) through the Kentucky Division of Geographic Information; they were collected in March 2009 with an average point spacing of 1 m and a vertical root-mean-square error of 8.8 cm. To extract depressions, we first created a digital elevation model (DEM) with a cell size of 1.5 m from the LiDAR point clouds. Then, we used the fill tool in ArcGIS to extract surface depressions up to 6 m deep in raster format. These depressions were converted to polygons, and the polygons were further smoothed. The polygons represent the surficial shape of the depressions and were used for processing and identification purposes.
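The depression-extraction step might be scripted roughly as follows with the arcpy Spatial Analyst module; the raster and output names are hypothetical placeholders, and the polygon smoothing step is omitted.

```python
import arcpy
from arcpy.sa import Con, Fill, Raster

arcpy.CheckOutExtension("Spatial")

dem = Raster("lidar_dem_1p5m.tif")    # 1.5 m DEM interpolated from LiDAR points
filled = Fill(dem, 6)                 # fill surface depressions up to 6 m deep
depth = filled - dem                  # positive inside depressions, zero elsewhere
sinks = Con(depth > 0, 1)             # binary integer raster of depressions
arcpy.RasterToPolygon_conversion(sinks, "depressions.shp", "SIMPLIFY")
```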
The three surficial shape-based variables were area, perimeter, and compactness. Compactness measures how closely a shape resembles a circle and is defined here as

$$C = \frac{A}{A_c}, \qquad (1)$$
where A is the area of the polygon and Ac is the area of the smallest circle circumscribing the polygon. Eq. (1) is one of many ways to calculate compactness (Li et al., 2013); it was introduced by Cole (1964) and later called the digital compactness measure (Kim and Anderson, 1984). Values of the digital compactness measure range from 0 to 1 and equal 1 when the polygon is a perfect circle. Natural sinkholes tend to have a circular shape with a high compactness value, although the shapes of many sinkholes are also strongly influenced by the underlying geology (Miao et al., 2013). We chose this compactness definition because it is intuitive and easy to implement in ArcGIS. The variables area and perimeter are readily available in ArcGIS.
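As a sketch, Eq. (1) can also be computed outside ArcGIS; the version below assumes Shapely 2.0, whose minimum_bounding_circle() returns a polygon approximating the smallest circumscribing circle.

```python
from shapely import minimum_bounding_circle
from shapely.geometry import Polygon

def compactness(poly: Polygon) -> float:
    """Digital compactness measure: polygon area divided by the area of the
    smallest circumscribing circle; equals 1 for a perfect circle."""
    return poly.area / minimum_bounding_circle(poly).area

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(round(compactness(square), 3))  # ~0.637, i.e., 2/pi for a square
```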
The Minimum Bounding Geometry tool was used to calculate Ac in Eq. (1). The four depth-based variables were maximum depth, mean depth, volume, and depth index. The first three were calculated using the Zonal Statistics as Table tool in ArcGIS. The depth index, Di, is defined as
$$D_i = \frac{D_{max}}{\sqrt{A/\pi}}, \qquad (2)$$

where Dmax is the maximum depth and A is the surficial area. The depth index reflects the overall slope of the depression by treating a sinkhole as an upside-down cone (Miao et al., 2013).
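A sketch of the depth-based variables for a single depression, assuming the per-cell depths (filled DEM minus original DEM) are available as a NumPy array and approximating the surficial area A by the cell count:

```python
import numpy as np

CELL_AREA = 1.5 * 1.5  # m^2 per DEM cell (1.5 m cell size)

def depth_variables(depths: np.ndarray) -> dict:
    """Depth-based predictors for one depression; `depths` holds the positive
    depth of every DEM cell inside the depression polygon."""
    area = depths.size * CELL_AREA       # approximation of the surficial area
    d_max = float(depths.max())
    return {
        "max_depth": d_max,
        "mean_depth": float(depths.mean()),
        "volume": float(depths.sum()) * CELL_AREA,
        "depth_index": d_max / np.sqrt(area / np.pi),  # Eq. (2)
    }
```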
The two variables based on natural and human factors were land-use and soil clay content. Land-use practices, such as groundwater pumping and construction, are known to induce new sinkholes (Tihansky, 1999). On the other hand, many depressions in urban areas are man-made structures, such as drains and culverts, whereas depressions located in forested areas are more likely to be real sinkholes (Zhu et al., 2014). We used the dominant land-use to represent the land-use for each depression polygon. The land-use was extracted from the National Land Cover Database (Jin et al., 2013). The clay content variable was chosen to reflect soil cohesion, or how well soil particles stick together. Soil with higher clay content has higher cohesion and tends to be less prone to forming sinkholes. Soil clay content for each depression was extracted from SSURGO (Soil Survey Staff, 2015). A value of 999 was assigned to depressions where soil clay content is not available in SSURGO; this value was interpreted as missing data by the random forest codes.

The two variables based on field observations from Zhu et al. (2014) were average berm slope and bottom roughness. The term berm refers to a prominent man-made ridge along the rim of a depression; its existence often indicates a man-made water-holding pond, i.e., a non-sinkhole. Although a berm-like shape is readily visible on a shaded-relief map, using a computer algorithm to determine whether a berm exists is not trivial. Using the DEM generated from LiDAR, we calculated the average slope of a 30-foot-wide ring surrounding each depression and called it the average berm slope. A high average berm slope indicates the possible existence of a berm. The bottom roughness variable was selected to reflect whether holes existed at the bottom of a depression (Miao et al., 2013); such holes are strong indicators of sinkholes. We calculated the bottom roughness as the variance of the 10% deepest cells in a depression. The Extract Values to Table tool in ArcGIS was used to extract the depression pixels from each polygon for this calculation.

3.2. Training dataset

We used the depression data generated in the Floyds Fork watershed to build the training dataset. The Floyds Fork watershed is approximately 16 km east of Louisville, Kentucky and drains parts of Bullitt, Henry, Jefferson, Oldham, Shelby, and Spencer Counties, covering approximately 736 km² (Fig. 1). Most of the watershed is in the Outer Bluegrass physiographic region, which is underlain by limestones, dolomites, and shales of Late Ordovician and Silurian age (Newell, 2001). A small southwestern portion of the watershed is in the Knobs region, which is non-karst and underlain by diverse shale, mudstone, and limestone of Silurian and Mississippian age. Because the LiDAR data were only available in Bullitt, Jefferson, and Oldham counties, the actual study area in Zhu et al. (2014) covered approximately 580 km².

Depression polygons that overlay or neighbor stream channels, ponds, and roads are evident non-sinkholes and can be filtered out with existing data on hydrology and roads. Note that in karst areas with losing streams, sinkholes can occur along river channels, but this is not evident in the study area.
Fig. 1. Location and geology of the study area. One part of the area was used for the training dataset and the other for the test dataset.
The National Hydrography Dataset (NHD) for Kentucky (USGS, 2014) was used to remove polygons that overlay or neighbor streams and waterbodies, with search distances of 6 m and 3 m, respectively. Polygons close to roads were removed using Kentucky road centerline data (KYTC, 2014) with a search distance of 6 m. Zhu et al. (2014) identified 1696 probable sinkholes from around 10,720 depressions extracted from LiDAR data in the watershed. With the evident non-sinkhole polygons removed, 8427 records remained in the training dataset, 1571 of them probable sinkholes. Notice that the training data had fewer sinkholes than in Zhu et al. (2014): the filtering process removed some depressions classified as probable sinkholes because they were too close to streams, ponds, or roads. It is possible to keep all the depression polygons in the training dataset, but the purpose of this study is to apply the random forests method to depressions that would otherwise need to be inspected manually; removing depressions associated with roads, streams, and waterbodies reduces noise in the training dataset.
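A hedged sketch of this proximity filter using GeoPandas; the layer names are hypothetical, and all layers are assumed to share a metric coordinate system so that buffer distances are in meters.

```python
import geopandas as gpd

depressions = gpd.read_file("depression_polygons.shp")
streams = gpd.read_file("nhd_flowlines.shp")
waterbodies = gpd.read_file("nhd_waterbodies.shp")
roads = gpd.read_file("road_centerlines.shp")

def near(polys, features, dist):
    """Boolean Series: True where a polygon lies within `dist` m of any feature."""
    buffered = features.assign(geometry=features.buffer(dist))
    hits = gpd.sjoin(polys, buffered[["geometry"]], predicate="intersects", how="left")
    return hits["index_right"].notna().groupby(level=0).any()

keep = ~(near(depressions, streams, 6)
         | near(depressions, waterbodies, 3)
         | near(depressions, roads, 6))
training_polys = depressions[keep]
```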
3.3. Test dataset

We used the area of Oldham County outside the Floyds Fork watershed to test the sinkhole-extraction random forest (Fig. 1). The test area is approximately 350 km² and lies entirely in the Outer Bluegrass physiographic region, with geology similar to that of the Floyds Fork watershed.
We applied the sinkhole mapping procedure of Zhu et al. (2014) to the test area, extracting 8914 depressions, 2879 of which were classified as probable sinkholes. A limited field check of 25 randomly selected probable sinkholes confirmed 22 as sinkholes (88% accuracy), suggesting a high reliability for the test dataset. We applied the depression polygon filtering process used for the training dataset to remove depressions in the test dataset that were associated with streams, ponds, and roads. The resulting test dataset had 4746 records with 2822 of them classified as sinkholes.

3.4. Random forest cases

The training dataset built from the Floyds Fork watershed is an imbalanced dataset, with the sinkhole class making up about 19% of the total records. A weighted random forests method was used to avoid favoring the non-sinkhole class. The weighted random forest assigned a higher weight to the sinkhole class than to the non-sinkhole class; in other words, misclassifying a sinkhole carries a higher penalty than misclassifying a non-sinkhole. The optimal weight varies from dataset to dataset. To find an appropriate weight for the sinkhole extraction training dataset, we ran seven cases with sinkhole to non-sinkhole weight ratios ranging from 1:1 to 10:1. For each case, we grew enough trees to ensure the error stabilized; 300 trees were sufficient for all cases, and we subsequently used this number. The weight ratio generating the best performance for the training dataset was applied to the test dataset.

3.5. Random forest performance metrics

The accuracy of a random forest classifier is evaluated through a confusion matrix (Table 1), which contains the actual and predicted classifications from the random forest. For the training dataset, the confusion matrix is made of errors evaluated from out-of-bag data, i.e., the records not used in growing a given decision tree. Many metrics have been developed from the confusion matrix to evaluate the performance of classification methods. A common metric is the overall accuracy, AC, defined as
$$AC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (3)$$
where TP is the number of true positive predictions, i.e., correct predictions for sinkholes; TN is the number of true negative predictions, i.e., correct predictions for non-sinkholes; FP is the number of false positive predictions, i.e., non-sinkholes predicted as sinkholes; and FN is the number of false negative predictions, i.e., sinkholes predicted as non-sinkholes. The overall accuracy is used in the random forests method as the convergence criterion, but it can be misleading for very imbalanced data, where a classification missing all of the minority class can still achieve high overall accuracy. Here, we used the average accuracy as the primary metric for the sinkhole data (Chen et al., 2004), defined as
$$AAC = \frac{AC^{+} + AC^{-}}{2}, \qquad (4)$$
where AC+ = TP/(TP + FN) is the true positive rate (often called recall) and AC− = TN/(TN + FP) is the true negative rate. The AAC is essentially the average of the true positive rate and the true negative rate.
Table 1. Definition of the confusion matrix for sinkhole prediction.

                          Actual sinkhole       Actual non-sinkhole
Predicted sinkhole        True Positive (TP)    False Positive (FP)
Predicted non-sinkhole    False Negative (FN)   True Negative (TN)
In addition, we also used the G-mean (Kubat et al., 1998), precision, and the F-measure as performance measures. The G-mean is defined as
$$\text{G-mean} = \sqrt{AC^{+} \cdot AC^{-}}, \qquad (5)$$
which is considered independent of the class distribution in the samples. Precision is defined as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad (6)$$
which measures correct positive classifications penalized by incorrect positive classifications. The F-measure is a weighted average of precision and the true positive rate, defined as

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot AC^{+}}{\text{Precision} + AC^{+}}. \qquad (7)$$
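For reference, Eqs. (3)-(7) reduce to a few lines of code (a plain illustration, not part of the random forest codes):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Performance measures from the sinkhole confusion matrix, Eqs. (3)-(7)."""
    ac_pos = tp / (tp + fn)            # true positive rate AC+
    ac_neg = tn / (tn + fp)            # true negative rate AC-
    precision = tp / (tp + fp)         # Eq. (6)
    return {
        "overall_accuracy": (tp + tn) / (tp + tn + fp + fn),          # Eq. (3)
        "average_accuracy": (ac_pos + ac_neg) / 2,                    # Eq. (4)
        "g_mean": (ac_pos * ac_neg) ** 0.5,                           # Eq. (5)
        "precision": precision,
        "f_measure": 2 * precision * ac_pos / (precision + ac_pos),   # Eq. (7)
    }
```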
4. Results and discussion

4.1. Training dataset performance

Table 2 compares the performance of the random forest for the training dataset under different weights. With equal weights (1:1) for the sinkhole and non-sinkhole classes, the random forest classifier had a high overall accuracy of 91.55%, but missed almost 30% of the true sinkholes, with a true positive rate of 70.72%. As the weight for the sinkhole class increased, the overall accuracy decreased but the true positive rate increased. The precision also decreased with increasing weights, indicating that more non-sinkholes were incorrectly classified as sinkholes; this consequence was attributed to the lowered cost of misclassifying non-sinkholes when the weight for the sinkhole class increased. The F-measure increased from weight 1:1 to weight 3:1 and decreased with higher weights. Both the G-mean and the average accuracy increased from weight 1:1 to weight 4:1 and then decreased with higher weights. Based on the G-mean and the average accuracy, the weight 4:1 produced the best performance among all the weights tested. The average accuracy for the 4:1 weight was 89.95%, showing that the random forest built on the training dataset was an effective sinkhole classifier. With the sinkhole class making up 19% of the training dataset, the optimum weight of 4:1 coincided with a suggestion in Breiman and Cutler (2004) that weights be selected inversely proportional to the class populations. Fig. 2 illustrates the spatial distribution of the prediction results for the training dataset.

The weighted random forest improved the average accuracy and G-mean metrics by approximately 6% over the random forest with equal weights for the sinkhole and non-sinkhole classes. When the class weights increased from 3:1 to 10:1, the two performance metrics were relatively stable, changing less than 1% (Table 2). The insensitivity of the two metrics to class weights indicates that a reasonable class weight for an imbalanced dataset can be found easily by testing several weights. Different weights, however, did affect the true positive and true negative rates: when the class weight increased from 3:1 to 10:1, the true positive rate increased by approximately 8% and the true negative rate decreased by about 9%. In cases where either the true positive rate or the true negative rate is the primary concern, selecting a class weight requires more careful consideration.
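The inverse-proportionality heuristic can be checked directly from the training counts in Section 3.2:

```python
# Breiman and Cutler (2004): choose weights inversely proportional to the
# class populations. Training data: 8427 records, 1571 of them sinkholes.
n_sinkhole, n_non_sinkhole = 1571, 8427 - 1571
print(n_non_sinkhole / n_sinkhole)  # ~4.36, close to the 4:1 optimum found above
```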
Table 2. Performance comparison of the random forest for the training dataset under different weights.

Weight   Overall accuracy (%)   True positive rate (%)   True negative rate (%)   Precision (%)   F-measure (%)   G-mean (%)   Average accuracy (%)
1:1      91.55                  70.72                    96.32                    81.51           75.73           82.53        83.52
3:1      90.03                  88.61                    90.36                    67.80           76.82           89.48        89.48
4:1      89.07                  91.09                    88.61                    64.69           75.65           89.84        89.85
5:1      87.97                  92.68                    86.89                    61.83           74.17           89.74        89.78
6:1      86.93                  93.95                    85.33                    59.47           72.83           89.54        89.64
8:1      85.44                  95.61                    83.11                    56.47           71.00           89.14        89.36
10:1     83.89                  96.50                    80.99                    53.78           69.07           88.41        88.75
Fig. 2. Spatial distribution of the results for the training dataset with the sinkhole to non-sinkhole class weight ratio of 4:1. Green dots represent correct prediction for sinkholes; blue dots represent correct prediction for non-sinkholes; red dots represent sinkholes predicted as non-sinkholes; and magenta dots represent non-sinkholes predicted as sinkholes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
4.2. Variable importance

A very useful feature of the random forests method is its capability of evaluating the relative importance of the variables in predictions. The method calculates importance through a permutation process, in which the importance of a variable is assessed by evaluating the increase in the out-of-bag error when the variable is permuted (the permutation error). The random forest codes calculate two permutation importance scores, the raw importance score and the normalized importance score (Breiman and Cutler, 2004). The raw score is the average of the permutation errors over all the trees, and the normalized score is the raw score divided by its standard error.
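A comparable permutation measure is available in scikit-learn; note that it permutes over a supplied dataset rather than the out-of-bag records, so the sketch below (on synthetic stand-in data) approximates rather than reproduces the FORTRAN scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the 11-variable training data; two informative variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
raw = result.importances_mean                        # analogous to the raw score
normalized = raw / (result.importances_std + 1e-12)  # raw score / standard error
```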
The normalized importance scores (Fig. 3a) show that average berm slope, land-use, compactness, volume, bottom roughness, mean depth, and maximum depth were the top seven variables in the sinkhole classification. The importance results suggested that the inclusion of parameters not associated with spatial features of the depressions helped improve the classifier.
It is not surprising that average berm slope and land-use were identified as important predictors, but their ranking as the two most important variables was counterintuitive. Strobl et al. (2007) observed that the normalized importance score is biased in favor of variables with a large number of categories. Tolosi and Lengauer (2011) suggested that ranking variables by normalized importance is also biased in that groups of correlated variables receive lower importance ranks. Table 3 shows that high correlations existed between the maximum depth and the mean depth, and also between the polygon area and the volume. Strobl et al. (2008) suggested that the raw importance score has better statistical properties than the normalized importance score. The raw importance scores (Fig. 3b) show that the same seven variables remained the most relevant predictors, but their relative ranks changed: land-use ranked fourth and average berm slope fifth, while volume and maximum depth ranked first and second, respectively. These ranks were similar to the variable importance ranks in Miao et al. (2013), where maximum depth ranked first and volume second.

The differences between the raw and normalized importance scores stem from the variation of the permutation error among the trees: if a variable has a large variation in permutation errors among trees, its normalized importance score can be low even with a high raw importance score. Both importance measures suggested that area, perimeter, depth index, and clay content were less useful in predicting sinkholes. Sinkholes in the study area had various sizes and shapes, so it was no surprise that area and perimeter ranked low in importance. The low rank for the depth index reflects that the sinkhole depressions were far too complex to be represented by the shape of an upside-down cone. The low rank for clay content may be attributed to data availability in SSURGO, where 24% of the depressions lacked clay content values.

Fig. 3. Variable importance plots for the random forest on the sinkhole extraction training dataset: (a) normalized importance and (b) raw importance. The importance values are plotted as a percent of the maximum.

4.3. Test dataset performance
The case with a sinkhole to non-sinkhole weight of 4:1 produced the best average accuracy and G-mean for the training dataset and was therefore applied to the test dataset from Oldham County (Fig. 4). The performance metrics (Table 4) show an average accuracy of 73.96%, suggesting that the random forest classifier worked, but much less accurately than on the training dataset. Additionally, the true positive rate was merely 65.17%, although the true negative rate was better at 82.74%. The mediocre test results may be attributed to the limited extent of the training dataset, which was derived from a single small watershed; a random forest grown from it may perform differently in areas outside the watershed. In fact, the test area was adjacent to the area where the training data were generated, and the topography and geology were similar; less favorable results are possible when applying the same random forest classifier to other karst regions.

To see whether another class weight could make better predictions for the test dataset, we ran the test dataset down all seven random forests grown from the training dataset. Table 4 shows that the 4:1 weight yielded the best G-mean and average accuracy among the seven weights. If another metric were the primary concern, a different weight might be preferred; for example, if the true positive rate were the main factor, the 10:1 weight would be a better choice than the 4:1 weight.
Table 3. Correlation between nine predictive variables. Dmax: maximum depth; Dmean: mean depth; Di: depth index; CP: compactness; BR: bottom roughness; BSavg: average berm slope. Land-use and clay content were not included. Values marked with an asterisk (bold in the original) indicate high correlations between variables.

Variable    Dmax   Dmean  Di     Volume  Area   Perimeter  CP     BR     BSavg
Dmax        1.00
Dmean       0.91*  1.00
Di          0.54   0.47   1.00
Volume      0.24   0.27   0.03   1.00
Area        0.33   0.35   0.15   0.82*   1.00
Perimeter   0.34   0.32   0.25   0.35    0.77*  1.00
CP          0.06   0.14   0.16   0.01    0.04   0.25       1.00
BR          0.66   0.42   0.30   0.35    0.33   0.23       0.00   1.00
BSavg       0.28   0.36   0.23   0.07    0.06   0.06       0.06   0.08   1.00
Fig. 4. Spatial distribution of the results for the test dataset with the sinkhole to non-sinkhole class weight ratio of 4:1. Green dots represent correct prediction for sinkholes; blue dots represent correct prediction for non-sinkholes; red dots represent sinkholes predicted as non-sinkholes; and magenta dots represent non-sinkholes predicted as sinkholes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 4. Performance metrics of the random forest for the test dataset.

Weight   Overall accuracy (%)   True positive rate (%)   True negative rate (%)   Precision (%)   F-measure (%)   G-mean (%)   Average accuracy (%)
1:1      66.14                  47.52                    93.45                    91.41           62.53           66.64        70.49
3:1      71.85                  60.91                    86.24                    85.34           71.08           72.48        73.57
4:1      72.29                  65.17                    82.74                    84.71           73.66           73.43        73.96
5:1      72.36                  66.83                    80.46                    83.38           74.19           73.33        73.64
6:1      72.36                  68.14                    78.53                    82.32           74.56           73.15        73.34
8:1      72.21                  69.35                    76.40                    81.17           74.79           72.79        72.88
10:1     72.61                  71.08                    74.84                    80.56           75.53           72.94        72.96
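The weight sweep behind Tables 2 and 4 can be scripted; a sketch assuming arrays X_train, y_train, X_test, and y_test have already been assembled from the 11 variables, again with scikit-learn standing in for the FORTRAN codes:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# X_train, y_train, X_test, y_test: assumed prepared elsewhere (see Section 3).
for w in (1, 3, 4, 5, 6, 8, 10):
    rf = RandomForestClassifier(n_estimators=300, class_weight={0: 1, 1: w},
                                random_state=0).fit(X_train, y_train)
    pred = rf.predict(X_test)
    tpr = recall_score(y_test, pred)                 # true positive rate
    tnr = recall_score(y_test, pred, pos_label=0)    # true negative rate
    print(f"{w}:1  TPR {tpr:.2%}  TNR {tnr:.2%}  AAC {(tpr + tnr) / 2:.2%}")
```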
4.4. Discussion

The prediction results from the training and test datasets suggest that the sinkhole-extraction random forest performed well. However, the average accuracy of 73.96% for the test dataset suggests that the random forest cannot fully substitute for a manual method such as visual inspection, and its automatic predictions need to be treated with care. In places where high accuracy is required, a visual inspection method should be preferred. Even though it cannot entirely replace a manual method, the random forests method provides a practical means to map sinkholes at regional and state levels because of its high efficiency and minimal requirement for human intervention.
The high accuracy achieved in the visual inspection process in Zhu et al. (2014) came at the price of a laborious process that involved three steps: an initial classification, a review, and a discussion. While the exact labor cost is difficult to quantify, the visual inspection method took multiple months to process approximately 10,720 depressions in the Floyds Fork watershed, an area of about 580 km². Expanding the same method to larger areas would be labor-intensive and time-consuming. On the other hand,
applying the random forests method to other areas took much less effort. The main task in applying a random forest to a new area is preparing a new dataset for the area, which can be programmed and automated; for instance, we developed a script using ArcPy, a Python-based scripting package for ArcGIS, to automate our data preparation. Once a proper sinkhole-extraction random forest is grown, using it to predict sinkholes in another area is just a matter of running the random forest program and the data preparation script again.

In areas where high accuracy is needed, the method can also be used as a first step to reduce the number of depressions requiring visual inspection. Tables 2 and 4 show that increasing the sinkhole to non-sinkhole class weight ratio reduced the chance of missing true sinkholes while misclassifying more non-sinkholes as sinkholes. In other words, processing depressions with a random forest with a high sinkhole to non-sinkhole weight ratio can filter out apparent non-sinkholes, reducing the number of depressions that need to be processed manually. This feature can improve efficiency during highway construction planning, where detailed and accurate sinkhole coverage is needed for large areas.
5. Conclusions

LiDAR is transforming the way research is conducted in many scientific fields. For karst regions, LiDAR has been shown to drastically improve the locating and delineating of sinkholes. However, the process of classifying LiDAR-derived depressions to extract sinkholes can be laborious with manual methods. To seek a more efficient way to process the depressions, we applied the random forests method to automatically extract sinkholes in a karst region in central Kentucky. The sinkhole-extraction random forest was grown on a training dataset built from an area where LiDAR-derived depressions were manually classified through a visual inspection and field verification process. Based on the geometry of depressions, as well as natural and human factors, we selected 11 parameters as predictive variables to form the dataset. We then applied a weighted random forests method to account for the imbalance in the training dataset and tested the random forest in another area with similar topography and geology to the area where the training dataset was developed. Our study resulted in the following:

(1) The results from the training dataset suggest that the random forests method can be an excellent sinkhole classifier. With a sinkhole to non-sinkhole class weight ratio of 4:1, the method achieved an average accuracy of 89.95%.

(2) For imbalanced datasets, applying a weighted random forest is needed to improve performance for the sinkhole class. A higher weight for the sinkhole class resulted in higher rates of identifying true sinkholes, but also in more non-sinkholes misclassified as sinkholes. Finding the best class weight therefore depends on the focus of the specific problem.

(3) The 11 selected variables were sufficient for a sinkhole-extraction random forest classifier. Including variables other than the shape and depth of depressions improved the classifier. Variable importance analysis suggested that land-use and average berm slope were useful in predicting sinkholes.

(4) Results of applying the random forest to another area showed only moderate success, suggesting caution should be taken when applying a random forest sinkhole classifier to other areas, especially areas with different hydrogeology from the area where the classifier was developed.

(5) As Breiman and Cutler (2004) said, "the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem", and we need to be cautious with automatic predictions. An automatic sinkhole extraction procedure, such as the random forest classifier, cannot totally replace the visual inspection and field-verification procedure. However, it can dramatically reduce time and labor costs and makes mapping sinkholes using LiDAR data tractable for large areas.

Acknowledgements

This study was partially supported by Kentucky Transportation Center Project SPR15-502. We thank Bart Davidson, Mike Lynch, Richard Smath, and Steve Webb of the Kentucky Geological Survey, University of Kentucky, and Michael Priddy of the Department of Earth and Environmental Sciences, University of Kentucky, for assisting with field verification of sinkholes. We thank the two anonymous reviewers and the associate editor for their constructive comments, which improved this manuscript. Our gratitude is also extended to Adam Nolte for editing.

References
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., Cutler, A., 2004. Setting up, using, and understanding random forests v5.1.
Chen, C., Liaw, A., Breiman, L., 2004. Using random forest to learn imbalanced data. Dept. Statistics, Univ. California Berkeley, Tech. Rep., p. 12.
Cole, J.P., 1964. Study of major and minor civil divisions in political geography. In: The 20th International Geographical Congress, London.
Currens, J.C., 2002. Kentucky is karst country! What you should know about sinkholes and springs. Kentucky Geological Survey, Information Circular 4, Series XII, 35p.
Cutler, D.R., Edwards, T.C., Jr., Beard, K.H., Cutler, A., Hess, K.T., Gibson, J., Lawler, J.J., 2007. Random forests for classification in ecology. Ecology 88 (11), 2783–2792.
Filin, S., Baruch, A., 2010. Detection of sinkhole hazards using airborne laser scanning data. Photogram. Eng. Remote Sens. 76 (5), 577–587.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed. Springer, p. 745, ISBN: 978-0-387-84857-0.
Jin, S., Yang, L., Danielson, P., Homer, C., Fry, J., Xian, G., 2013. A comprehensive change detection method for updating the National Land Cover Database to circa 2011. Remote Sens. Environ. 132, 159–175.
Jones, K.L., Poole, G.C., O'Daniel, S.J., Mertes, L.A.K., Stanford, J.A., 2008. Surface hydrology of low-relief landscapes: assessing surface water flow impedance using LIDAR-derived digital elevation models. Remote Sens. Environ. 112 (11), 4148–4257. http://dx.doi.org/10.1016/j.rse.2008.01.024.
Kim, C.E., Anderson, T.A., 1984. Digital disks and a digital compactness measure. In: Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing. ACM Press, New York, NY, pp. 117–124.
Kubat, M., Holte, R., Matwin, S., 1998. Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30, 195–215.
KYTC, 2014. Transportation Geospatial Data: Road Centerlines. Kentucky Transportation Cabinet (accessed 09.30.14).
Li, W., Goodchild, M.F., Church, R.L., 2013. An efficient measure of compactness for 2D shapes and its application in regionalization problems. Int. J. Geograph. Inform. Sci. 27 (6), 1227–1250.
Maxa, M., Bolstad, P., 2009. Mapping northern wetlands with high resolution satellite images and LiDAR. Wetlands 29, 248–260.
Miao, X., Qiu, X., Wu, S.-S., Luo, J., Gouzie, D.R., Xie, H., 2013. Developing efficient procedures for automated sinkhole extraction from lidar DEMs. Photogram. Eng. Remote Sens. 79 (6), 545–554. http://dx.doi.org/10.14358/PERS.79.6.545.
Murthy, S.K., 1998. Automatic construction of decision trees from data: a multidisciplinary survey. Data Min. Knowl. Disc. 2, 345–389.
Mutanga, O., Adam, E., Cho, M.A., 2012. High density biomass estimation for wetland vegetation using WorldView-2 imagery and random forest regression algorithm. Int. J. Appl. Earth Obs. Geoinf. 18, 399–406. http://dx.doi.org/10.1016/j.jag.2012.03.012.
National Oceanic and Atmospheric Administration (NOAA) Coastal Services Center, 2012. Lidar 101: An Introduction to Lidar Technology, Data, and Applications. Revised. NOAA Coastal Services Center, Charleston, SC.
Newell, W.L., 2001. Physiography. In: McDowell, R.C. (Ed.), The Geology of Kentucky – A Text to Accompany the Geologic Map of Kentucky. U.S. Geological Survey Professional Paper 1151-H, pp. 79–83.
Paylor, R.L., Florea, L.J., Caudill, M.J., Currens, J.C., 2003. A GIS Coverage of Sinkholes in the Karst Areas of Kentucky. Kentucky Geological Survey, metadata file and shapefiles of highest elevation closed contours, 1 CD-ROM.
Peters, J., De Baets, B., Verhoest, N.E.C., Samson, R., Degroeve, S., De Becker, P., Huybrechts, W., 2007. Random forests as a tool for ecohydrological distribution modelling. Ecol. Model. 207 (2–4), 304–318. http://dx.doi.org/10.1016/j.ecolmodel.2007.05.011.
Rahimi, M., Alexander, E.C., Jr., 2013. Locating sinkholes in LiDAR coverage of a glacio-fluvial karst, Winona County, MN. In: Proceedings of the Thirteenth Multidisciplinary Conference on Sinkholes and the Engineering and Environmental Impacts of Karst. University of South Florida, pp. 469–480.
Romberg, A.R., Saffran, J.R., 2010. Statistical learning and language acquisition. Wiley Interdiscip. Rev. Cognit. Sci. 1 (6), 906–914. http://dx.doi.org/10.1002/wcs.78.
Seale, L.D., Florea, L.J., Brinkmann, R., Vacher, H.L., 2008. Using ALSM to identify closed depressions in the urbanized, covered karst of Pinellas County, Florida—1, methodological considerations. Environ. Geol. 54, 995–1005. http://dx.doi.org/10.1007/s00254-007-0890-8.
Soil Survey Staff, Natural Resources Conservation Service, United States Department of Agriculture. Web Soil Survey (accessed 03.24.15).
Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T., 2007. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform. 8, 25.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A., 2008. Conditional variable importance for random forests. BMC Bioinform. 9, 307.
Taylor, C.J., Greene, E.A., 2008. Hydrogeologic characterization and methods used in the investigation of karst hydrology. In: Rosenberry, D.O., LaBaugh, J.W. (Eds.), Field Techniques for Estimating Water Fluxes Between Surface Water and Groundwater, chap. 3. Techniques and Methods 4-D2. U.S. Department of the Interior, U.S. Geological Survey.
Tharp, T.M., 1999. Mechanics of upward propagation of cover-collapse sinkholes. Eng. Geol. 52, 23–33.
Tihansky, A.B., 1999. Sinkholes, west-central Florida. In: Galloway, D., Jones, D.R., Ingebritsen, S.E. (Eds.), Land Subsidence in the United States. U.S. Geological Survey Circular 1182, pp. 121–140.
Tolosi, L., Lengauer, T., 2011. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btr300.
USGS, 2014. The National Hydrography Dataset for Kentucky (accessed 09.30.14).
Waltham, T., 2008. Sinkhole hazard case histories in karst terrains. Q. J. Eng. Geol. Hydrogeol. 41, 291–300.
White, E.L., Aron, G., White, W.B., 1986. The influence of urbanization on sinkhole development in central Pennsylvania. Environ. Geol. Water Sci. 8 (1–2), 91–97.
Zhu, J., Taylor, T.P., Currens, J.C., Crawford, M.M., 2014. Improved karst sinkhole mapping in Kentucky using LiDAR techniques: a pilot study in Floyds Fork Watershed. J. Cave Karst Stud. 76 (3), 207–216. http://dx.doi.org/10.4311/2013ES0135.