Journal of Hydrology 477 (2013) 119–128
Advancing monthly streamflow prediction accuracy of CART models using ensemble learning paradigms
Halil Ibrahim Erdal (Turkish Cooperation and Coordination Agency (TIKA), Atatürk Bulvarı No. 15, Ulus, Ankara, Turkey)
Onur Karakurt (Gazi University, Engineering Faculty, Civil Engineering Department, Celal Bayar Bulvarı, 06570 Maltepe, Ankara, Turkey)
Article history: Received 3 July 2012; received in revised form 5 November 2012; accepted 6 November 2012; available online 16 November 2012. This manuscript was handled by Geoff Syme, Editor-in-Chief.
Keywords: Bagging (bootstrap aggregating); classification and regression trees; ensemble learning; stochastic gradient boosting; streamflow prediction; support vector regression
Summary
Streamflow forecasting is one of the most important steps in water resources planning and management, and ensemble techniques such as bagging, boosting and stacking have gained popularity in hydrological forecasting in recent years. This study investigates the potential of two ensemble learning paradigms (bagging and stochastic gradient boosting) for building ensembles of classification and regression trees (CART) that advance streamflow prediction accuracy. The study first investigates the use of CART for monthly streamflow forecasting, employing a support vector regression (SVR) model as the benchmark. The analytic results indicate that CART outperforms SVR in both the training and testing phases. However, while the results of the CART model in the training phase are considerable, those in the testing phase are not. Thus, to optimize the prediction accuracy of CART for monthly streamflow forecasting, we incorporate bagging and stochastic gradient boosting, which are rooted in the same philosophy: advancing the prediction accuracy of weak learners. The comparison shows that the bagged regression trees (BRT) and stochastic gradient boosted regression trees (GBRT) models possess more satisfactory monthly streamflow forecasting performance than the CART and SVR models. Overall, it is found that ensemble learning paradigms can remarkably advance the prediction accuracy of CART models in monthly streamflow forecasting.
Crown Copyright © 2012 Published by Elsevier B.V. All rights reserved.
1. Introduction
Forecasting streamflow at daily, monthly or longer time intervals is of great importance both for managing existing water resources systems and for planning new ones effectively. Studies that forecast streamflow data with different methods, in order to operate water resources systems and carry out re-planning activities effectively, continue to increase. Realistic forecasting of streamflow, which depends on a number of hydrologic factors, contributes to planning short- and long-term water resources systems (such as flood control and reservoir operation) in the most effective manner. Therefore, the prediction of flow in water systems planning studies is essential, and the variety of models developed for this purpose continues to grow.
In recent years, ensemble learning techniques (also known as committee machines) (Anctil and Lauzon, 2004; Snelder et al.,
2009) have been employed in the modeling and estimation of hydrologic variables in different research areas. In general, ensemble methods (e.g., bagging, boosting) are designed to overcome problems with weak predictors (Hancock et al., 2005). Artificial neural networks (ANNs) and decision trees (DTs) are commonly used as base predictors in building ensemble machine learning models (Zhang et al., 2008). This study investigates bagging and boosting techniques, which are widely used in the machine learning literature, to build strong predictors for streamflow estimation. Bagging (an acronym for bootstrap aggregating) was proposed by Breiman (1996) to improve the prediction accuracy of weak learners; it aims to minimize prediction variance by generating bootstrapped replica datasets. Boosting (also known as arcing) is another very popular ensemble method that comes from the same philosophy. Boosting creates a linear combination out of many models (Hancock et al., 2005), with each new model dependent on the preceding model (Friedman, 2002). These techniques differ in the ways they process the training data and combine the predictions coming from their base predictors (Zhang et al., 2008). To the best of our knowledge, ensemble methods have not been applied extensively in hydrological time series analysis, especially in streamflow estimation. The available implementations of ensemble methods in hydrological time series are as follows: Jeong
and Kim (2005) applied an ensemble neural network (ENN) to forecasting monthly inflows to the Daecheong dam in Korea; the ENN combined the outputs of member neural networks using the bagging method, and the overall results showed that the ENN outperformed a simple ANN among the three rainfall–runoff models. Cannon and Whitfield (2002) studied the use of ensemble neural network modeling in streamflow forecasting. Li et al. (2010) applied bagging to construct various support vector machine (SVM) models for streamflow prediction; the results showed that the bagged SVM model outperforms bagged multiple linear regression (MLR), simple SVM and simple MLR models in all of the adopted evaluation scores. Tiwari and Chatterjee (2011) investigated hybrid wavelet bootstrapped artificial neural network models in daily discharge forecasting; the results revealed that the model which uses the capabilities of both bootstrap and wavelet methods was the best. Araghinejad et al. (2011) investigated both generation and combination techniques of ANN ensembles in forecasting the peak discharge of floods; the results indicated that ANN ensembles could enhance the probabilistic forecast skill for hydrological events. Tiwari and Chatterjee (2010) built a hybrid wavelet–bootstrap–ANN model for hourly flood forecasting and showed that the proposed model can enhance flood forecasting results. Snelder et al. (2009) developed a method of mapping flow regime classes using boosted regression trees; its performance was compared with other methods (i.e., linear discriminant analysis, LDA, and classification and regression trees, CARTs), and they found that boosted regression trees could increase the confidence in decisions associated with setting environmental flows and the ability to undertake broad-scale ecohydrological research. Boucher et al. (2010) used bagged multi-layer perceptrons for 1-day-ahead streamflow forecasting on three watersheds. Shu and Burn (2004) incorporated bagging and boosting in building ANN ensembles for estimating the index flood and the 10-year flood quantile; the comparison between ANN ensembles and a single ANN showed that the ensembles were more accurate in flood estimation.
To our knowledge there are limited applications of classification and regression trees in hydrological forecasting: Vezza et al. (2010) applied multiple regression with morphoclimatic catchment characteristics in sub-regions obtained through four classification methods: seasonality indices (SIs), classification and regression trees (CRTs), the residual pattern approach (RPA) and weighted cluster analysis (WCA). They found that the CRT model outperforms the models obtained by the other classification techniques in terms of explained variance.
SVM models have been used successfully to estimate seasonal, monthly and hourly streamflows and have shown good generalization performance: Kisi and Cimen (2011) investigated a wavelet–support vector regression (WSVR) conjunction model for monthly streamflow forecasting; the test results, compared with those of a single support vector regression model, showed that the discrete wavelet transform could increase the accuracy of the SVR model in forecasting monthly streamflows. Guo et al. (2011) worked to improve the performance of the SVM model in predicting monthly streamflow and compared the results of the modified SVM with those of an ANN model and a conventional SVM model. Asefa et al. (2006) used SVMs in predicting both seasonal and hourly streamflows; the results showed a promising performance in predicting site-specific, real-time streamflows. Samsudin et al. (2010) demonstrated how monthly river flow could be well represented by hybrid models combining the group method of data handling and least squares support vector machines.
2. Method
In this study, we first use a classification and regression trees (CART) model for monthly streamflow forecasting. A conventional, well-known machine learning model, support vector regression (SVR), is employed as the benchmark. Then, the bagging and stochastic gradient boosting ensemble learning methods are incorporated to build tree-based ensembles that optimize the prediction accuracy of the single CART model. In building the tree-based ensembles, a training sample is first drawn randomly from the entire observed data, and a CART model is built using this replica dataset. This procedure is repeated 100 times to obtain 100 individual forecast models that capture the variability in sampling. Each of these CART models is then used to forecast the current flow, and the linear combination of these forecasts is used as the final forecast.
Forecasting streamflow at daily, monthly or longer time intervals is of great importance for the effective operation of a water resources system: many activities related to planning and operating the components of such a system require decent forecasts of future events. Storage–yield sequences are generally related to monthly periods; hence, monthly river flow forecasting is very important for water resources system planning (Kisi et al., 2011). In the applications, the first 30 years of data (360 months) are used for training and the remaining 5 years (60 months) for testing.
The surface water hydrographs of rivers exhibit large variations due to many natural phenomena. One of the most commonly used approaches for interpolating and extending streamflow records is to fit the observed data with an analytic power model. However, such analytic models may not adequately represent the flow process, because they are based on many simplifying assumptions about the natural phenomena that influence river flow. This paper demonstrates how a simple ensemble model can be used as an adaptive model as well as a predictor. Three attribute combinations based on preceding monthly streamflows are therefore developed to forecast current streamflow values: (i) Qt–1; (ii) Qt–1 and Qt–2; and (iii) Qt–1, Qt–2 and Qt–3. In all combinations, the output is the discharge Qt for the current month.
The following three performance measures are used to evaluate the proposed predictive models. The correlation coefficient (R) is a common measure of how well the predictions fit the actual data; a value of 1 indicates a perfect fit between actual and predicted values. The mathematical formula for computing R is:
$$R = \frac{n\sum yy' - \left(\sum y\right)\left(\sum y'\right)}{\sqrt{\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]\left[n\sum y'^{2} - \left(\sum y'\right)^{2}\right]}} \qquad (1)$$
where y is the actual value, y' the predicted value, and n the number of data samples. The root mean squared error (RMSE) is the square root of the mean squared error; it is the average distance of a data point from the fitted line, measured along a vertical line, and is given by:
$$RMSE = \sqrt{\frac{\sum \left(y' - y\right)^{2}}{n}} \qquad (2)$$
The mean absolute error (MAE) measures how close forecasts or predictions are to the eventual outcomes:
$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left|y - y'\right| \qquad (3)$$
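For concreteness, Eqs. (1)–(3) can be computed with a few lines of NumPy. This is a minimal illustrative sketch, not code from the original study; `y` and `y_pred` are assumed arrays of observed and predicted monthly flows.

```python
import numpy as np

def r_rmse_mae(y, y_pred):
    """Correlation coefficient, RMSE and MAE per Eqs. (1)-(3)."""
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    n = len(y)
    num = n * np.sum(y * y_pred) - np.sum(y) * np.sum(y_pred)
    den = np.sqrt((n * np.sum(y**2) - np.sum(y)**2) *
                  (n * np.sum(y_pred**2) - np.sum(y_pred)**2))
    r = num / den                               # Eq. (1)
    rmse = np.sqrt(np.mean((y_pred - y) ** 2))  # Eq. (2)
    mae = np.mean(np.abs(y - y_pred))           # Eq. (3)
    return r, rmse, mae
```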
In addition, six numerical descriptors were computed to investigate the statistical relation between the observed and predicted streamflow data (the last two are sketched in code below):
Maximum discharge (Max Q).
Minimum discharge (Min Q).
Mean of discharge (Mean Q).
Variance of discharge (Var Q).
Maximum under-prediction (MUP).
Maximum over-prediction (MOP).
Moreover, the accuracy of peak flow estimates has a major effect on the design of drainage or flood control facilities. Thus, this study also examines the peak flow prediction accuracy of the proposed predictive models.
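The paper does not state formulas for MUP and MOP. The sketch below reflects one plausible reading, the largest single under-prediction and over-prediction in the series, and should be taken as an assumption rather than the authors' definition.

```python
import numpy as np

def mup_mop(y, y_pred):
    """Assumed definitions (not stated in the paper):
    MUP = largest under-prediction, MOP = largest over-prediction."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y, dtype=float)
    mup = float(np.max(-err))   # worst under-prediction, y - y'
    mop = float(np.max(err))    # worst over-prediction, y' - y
    return mup, mop
```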
2.1. Classification and regression trees
Classification and regression trees (CART) were proposed by Breiman et al. (1984) and have gained popularity in recent years. However, CART is identified as an unstable learner because it is prone to overfitting (Breiman, 1996); more specifically, it is very sensitive to small changes in the training dataset (Hastie et al., 2001). It works as follows (Hancock et al., 2005): each node within the tree has a partitioning rule, and the partitioning rule is defined through minimization of the relative error (RE), which for regression problems is the minimization of the sums-of-squares of a split:
$$RE(d) = \sum_{l=0}^{L} \left(y_l - \bar{y}_L\right)^{2} + \sum_{r=0}^{R} \left(y_r - \bar{y}_R\right)^{2} \qquad (4)$$
where y_l and y_r are the left and right partitions with L and R observations of y in each, with respective means ȳ_L and ȳ_R. The decision rule d is a point in some estimator variable x that is used to determine the left and right branches. The partitioning rule that minimizes the RE is then used to construct a node in the tree. The primary parameters for the CART are the number of folds, the minimum total weight and the seed; the values for these parameters are 3, 2 and 1, respectively. A CART structure is depicted in Fig. 1.
Fig. 1. A CART structure.
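As an illustration of the single-CART setup, the sketch below builds the lagged attribute matrix for combination (iii), applies the paper's 360/60-month split, and fits a regression tree. It uses scikit-learn's DecisionTreeRegressor as a stand-in for the authors' CART implementation (the folds/weight/seed settings above suggest a Weka-style tree and have no exact scikit-learn equivalents); the data file name is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lagged_matrix(q, n_lags=3):
    """Inputs [Q(t-1), ..., Q(t-n_lags)] and target Q(t) from a flow series."""
    q = np.asarray(q, dtype=float)
    X = np.column_stack([q[n_lags - k : len(q) - k] for k in range(1, n_lags + 1)])
    return X, q[n_lags:]

# Hypothetical file holding the 420 monthly flows (m3/s); first 30 years train,
# last 5 years test, as in the paper.
q = np.loadtxt("coruh_monthly_flows.txt")
X, y = lagged_matrix(q, n_lags=3)
X_tr, y_tr = X[:-60], y[:-60]
X_te, y_te = X[-60:], y[-60:]

cart = DecisionTreeRegressor(min_samples_leaf=2, random_state=1)  # rough analog of
cart.fit(X_tr, y_tr)                    # "minimum total weight = 2, seed = 1"
cart_pred = cart.predict(X_te)
```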
2.2. Bagging
Bagging (an acronym for bootstrap aggregating) is one of the earliest and most popular ensemble methods. The bootstrap resampling method (Efron, 1979) and aggregating form the basis of bagging. Variety in bagging is derived by using bootstrapped replicas of the original data: different training sub-datasets are drawn at random with replacement from the training dataset, separate models are produced from these subsets and used to predict the entire data, and the various estimated models are then aggregated, using the mean for regression problems or majority voting for classification problems (Pino-Mejias et al., 2008). For a regression problem it works as follows (Bühlmann and Yu, 2002). A training set D consists of data {(X_i, Y_i); i = 1, 2, ..., n}, with Y_i the real-valued response and X_i a p-dimensional explanatory variable for the ith instance. A predictor E(Y|X = x) = f(x) is denoted by
$$C_n(x) = h_n(D_1, \ldots, D_n)(x) \qquad (5)$$
where h_n is the nth hypothesis. Theoretically, bagging is defined as follows. First, construct a bootstrapped sample
$$D_i^{*} = (Y_i^{*}, X_i^{*}) \qquad (6)$$
according to the empirical distribution of the pairs D_i = (X_i, Y_i), where i = 1, 2, ..., n. Secondly, estimate the bootstrapped predictor by the plug-in principle:
$$C_n^{*}(x) = h_n(D_1^{*}, \ldots, D_n^{*})(x) \qquad (7)$$
where C_n*(x) is the predictor built on the bootstrapped samples. Finally, the bagged predictor is:
$$C_{n;B}(x) = E^{*}\left[C_n^{*}(x)\right] \qquad (8)$$
where E* denotes expectation with respect to the bootstrap resampling.
To sum up, bagging is one of the simplest ensemble techniques to implement, and it can reduce variance when combined with base learner generation while achieving good performance (Wang et al., 2011). A more detailed version of bagging is described in Breiman (1999). The bagging ensemble model structure developed in the present study is shown in Fig. 2. A CART is employed as the base predictor of the BRT, and its primary parameters are identical to Section 2.1. The bagging parameters are the size of each bag (as a percentage), the number of iterations (number of trees) and the seed; in this case, the values for these parameters are 100, 100 and 1, respectively.
Fig. 2. Bagging ensemble model structure.
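A minimal sketch of the BRT procedure described above: 100 bootstrap replicas, each the size of the training set (bag size 100%), one CART per replica, and the mean as the final forecast, mirroring Eqs. (5)–(8). It reuses X_tr, y_tr and X_te from the earlier sketch and is not the authors' code; scikit-learn's BaggingRegressor offers the same behavior ready-made.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def brt_predict(X_tr, y_tr, X_te, n_trees=100, seed=1):
    """Bagged regression trees: average of CARTs trained on bootstrap replicas."""
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    preds = np.zeros((n_trees, len(X_te)))
    for m in range(n_trees):
        idx = rng.integers(0, n, size=n)   # bootstrap replica, drawn with replacement
        tree = DecisionTreeRegressor(min_samples_leaf=2, random_state=m)
        tree.fit(X_tr[idx], y_tr[idx])
        preds[m] = tree.predict(X_te)
    return preds.mean(axis=0)              # Eq. (8): mean over bootstrapped predictors

brt_pred = brt_predict(X_tr, y_tr, X_te)
```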
2.3. Stochastic gradient boosting
Boosting is an important machine learning meta-algorithm because it enhances the prediction accuracy of weak predictors such as decision (regression) trees and artificial neural networks. The first boosting algorithm was introduced by Schapire (1990). This study employs the stochastic gradient boosting (also known as TreeBoost) algorithm introduced by Friedman (2001, 2002) to construct the gradient boosted regression trees model. Stochastic gradient boosting is a statistical method that fits an additive model of base functions and generates replica datasets from the original dataset to optimize prediction accuracy. It can be defined as follows (Hancock et al., 2005). A training sample D consists of data {(x_i, y_i), i = 1, 2, ..., n}, where x_i is a variable within the predictor set and y_i is the response variable. First, initialize the model F_0(x) with a constant value:
$$F_0(x) = \arg\min_{c} \sum_{i=1}^{N} \Psi(y_i, c) \qquad (9)$$
Next, compute the so-called pseudo-residuals:
$$z_{im} = -\left[\frac{\partial \Psi(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \qquad (10)$$
where z_im is the direction of steepest descent, Ψ is the loss function and m is the model index (m = 1, ..., M). This direction is used to constrain each new model entering the boosted ensemble. Then, compute the parameter a_m:
$$a_m = \arg\min_{a,\rho} \sum_{i=1}^{N} \left[z_{im} - \rho\, h(x_i; a)\right]^{2} \qquad (11)$$
where a is the split point of x_i, h(x_i; a) is the model (in this study, a CART) and ρ is the weighting of the tree. After that, calculate
the optimal value of the expansion coefficient β_m for the model (base learner) h(x_i; a_m):
$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \Psi\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a_m)\big) \qquad (12)$$
where β is the weight of each new tree in the direction of z_im. After the above two-step least-squares process, we can update F_m:
$$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m) \qquad (13)$$
The shrinkage parameter ν controls the learning rate of the procedure and reduces the risk of overfitting:
$$F_m(x) = F_{m-1}(x) + \nu\, \beta_m h(x; a_m) \qquad (14)$$
where 0 < ν ≤ 1. Tree-based gradient boosting ensembles increase their accuracy by generating a series of models and then combining them into an ensemble model with higher performance. Moreover, gradient boosted regression trees inherit the advantages of regression trees while overcoming their inaccuracy (Chou et al., 2011). The gradient boosting ensemble model structure built in the present study is shown in Fig. 3. A CART is used as the base learner, with parameters identical to Section 2.1; the best parameters for the GBRT are 100 iterations (number of trees) and a shrinkage ν of 0.5.
Fig. 3. Gradient boosting ensemble model structure.
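The recursion of Eqs. (9)–(14) is easy to make concrete for squared loss, where the pseudo-residuals of Eq. (10) reduce to y − F(x) and F_0 is the training mean. The sketch below follows that special case with the paper's ν = 0.5 and 100 trees; the subsample fraction and tree depth are illustrative assumptions (the paper does not report them), and for squared loss the coefficient β_m of Eq. (12) is absorbed into the fitted tree's leaf values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_predict(X_tr, y_tr, X_te, n_trees=100, nu=0.5, subsample=0.5, seed=1):
    """Stochastic gradient boosting for squared loss, per Eqs. (9)-(14)."""
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    F_tr = np.full(n, y_tr.mean())            # Eq. (9): loss-minimizing constant
    F_te = np.full(len(X_te), y_tr.mean())
    for m in range(n_trees):
        z = y_tr - F_tr                       # Eq. (10): pseudo-residuals (squared loss)
        idx = rng.choice(n, size=int(subsample * n), replace=False)  # stochastic step
        h = DecisionTreeRegressor(max_depth=3, random_state=m)       # base learner h(x; a_m)
        h.fit(X_tr[idx], z[idx])              # Eqs. (11)-(12): least-squares fit
        F_tr += nu * h.predict(X_tr)          # Eq. (14): shrunken update
        F_te += nu * h.predict(X_te)
    return F_te

gbrt_pred = gbrt_predict(X_tr, y_tr, X_te)
```

scikit-learn's GradientBoostingRegressor(n_estimators=100, learning_rate=0.5, subsample=0.5) packages the same procedure.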
2.4. Support vector regression
As a statistical method, support vector machines were originally utilized by Vapnik (1995) for the solution of binary classification problems. In the following years, SVM regression was developed by Vapnik, Drucker, Burges, Kaufman and Smola (Schölkopf et al., 1999); this method is named support vector regression (SVR). It can be defined as follows (Erdal and Ekinci, 2012). Considering a set of training data {(x_1, y_1), ..., (x_ℓ, y_ℓ)}, where each x_i ∈ R^n and y_i ∈ R, the decision function is given by
$$f(x) = \left(w \cdot \Phi(x)\right) + b \qquad (15)$$
with respect to w ∈ R^n and b ∈ R, where Φ denotes a non-linear transformation from R^n to a high-dimensional feature space. The primal optimization problem is given by:
$$\text{minimize } R_{reg}(f) = \frac{1}{2}\left\|w\right\|^{2} + C \sum_{i=0}^{\ell} S\big(f(x_i) - y_i\big) \qquad (16)$$
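For the benchmark, a polynomial-kernel SVR with the best settings reported below (complexity 1, epsilon 1.0E−12, exponent 1) can be approximated in scikit-learn as follows, reusing X_tr, y_tr and X_te from the earlier sketch. The original study appears to have used a different implementation, so treat this as an analog rather than a reproduction; the feature-scaling step is our assumption (polynomial-kernel SVRs are scale-sensitive).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Analog of the reported best SVR configuration: poly kernel, C = 1,
# epsilon = 1.0E-12, degree (exponent) = 1.
svr = make_pipeline(StandardScaler(),
                    SVR(kernel="poly", degree=1, C=1.0, epsilon=1e-12))
svr.fit(X_tr, y_tr)
svr_pred = svr.predict(X_te)
```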
Decent parameter settings are vital for the forecasting accuracy of learning machines. The SVR parameters considered are as follows: the kernel is the polynomial kernel; the complexity parameter is 1, 2 or 3; epsilon is 1.0E−11 or 1.0E−12; and the exponent is 1, 2 or 3. The best configuration for this technique is a complexity parameter of 1, an epsilon of 1.0E−12 and an exponent of 1.
3. Study area
In this study, the monthly streamflow data of the Karşıköy observation station on the Çoruh River in the Eastern Black Sea Region of Turkey are used. The location of the station is shown in Fig. 4; its coordinates are 41°42′38″ E longitude and 41°27′07″ N latitude. The drainage area of the station is 19654.4 km² and its elevation is 57 m. Streamflow was measured at the Karşıköy Station between 1964 and 2002 (39 years); however, 35 years of measured data (1968–2002; 420 months) are used in the application. The observed data are for hydrologic years, i.e., the first month of the year is October and the last month is September. This study focuses on the Çoruh River because a total of 13 hydro-electric dams are planned as part of the Çoruh River Development Plan, while a total of 27 are proposed for the Çoruh River Catchment. Under the plan, two dams have been completed (Muratlı Dam and Tortum Dam), another is under construction (Deriner Dam), and Yusufeli Dam, just upstream, is in its final planning phase.
Fig. 4. The Karşıköy Station on the Çoruh River in Turkey.
4. Results
The correlation coefficient (R), root mean squared error (RMSE) and mean absolute error (MAE) statistics of the classification and regression trees (CART) and support vector regression (SVR) models in the training and testing phases are given in Tables 1 and 2. The tables show that the CART and SVR models whose inputs are the flows of the three previous months have the best accuracy in both the training and test periods. Tables 1 and 2 show that the CART model (Rtraining = 0.9096, Rtesting = 0.7349) is superior to the SVR model (Rtraining = 0.7792, Rtesting = 0.7033) in terms of the correlation coefficient. The relative correlation coefficient (R) differences between the CART (attribute combination iii) and the SVR (attribute combination iii) models are 16.74% in the training phase and 4.49% in the testing phase. We found a direct relationship between R, MAE and RMSE in the training phase, where CART is superior to SVR in minimizing RMSE (CART = 79.46 m3/s; SVR = 126.32 m3/s) and MAE (CART = 49.88 m3/s; SVR = 73.88 m3/s). However, the RMSE (CART = 137.01 m3/s; SVR = 132.77 m3/s) and MAE (CART = 77.76 m3/s; SVR = 75.28 m3/s) statistics are inconsistent with the R statistics in the testing phase.
Even though the CART model is superior to the SVR model, its results in the testing phase are not satisfactory. Thus, we use bagging and stochastic gradient boosting to build CART ensemble models (bagged regression trees, BRT; stochastic gradient boosted regression trees, GBRT) to enhance the forecasting accuracy. The performance statistics of the BRT and GBRT models in the training and testing phases are given in Tables 3 and 4, which show that the BRT and GBRT models with three attributes yield the best results in both the training and test periods. Tables 1–4 indicate that the results of the CART (Rtraining = 0.9096), BRT (Rtraining = 0.9086) and GBRT (Rtraining = 0.9163) models are very close and better than the SVR (Rtraining = 0.7792) model in the training phase. However, the ensemble models (BRT = 0.8085; GBRT = 0.8054) are superior to the single machine learning models (CART = 0.7349; SVR = 0.7033) in terms of the R statistic in the testing phase. The relative correlation coefficient (R) difference between the BRT (attribute combination iii) and the CART (attribute combination iii) models is 10.01%, and the difference between the GBRT (attribute combination iii) and the CART (attribute combination iii) models is 9.59% in the testing phase. It can thus be understood from the empirical results that ensemble learning paradigms can improve the prediction accuracy of their base predictors.
Table 1
R, MAE and RMSE statistics for CART (MAE and RMSE in m3/s).

                                 Training                  Testing
Model attributes                 MAE     RMSE    R         MAE     RMSE    R
(i) Q(t–1)                       95.54   145.33  0.6479    92.86   137.35  0.6493
(ii) Q(t–1) & Q(t–2)             52.42   88.87   0.8851    83.21   141.12  0.7038
(iii) Q(t–1), Q(t–2) & Q(t–3)    49.88   79.46   0.9096    77.76   137.01  0.7349
Table 2
R, MAE and RMSE statistics for SVR (MAE and RMSE in m3/s).

                                 Training                  Testing
Model attributes                 MAE     RMSE    R         MAE     RMSE    R
(i) Q(t–1)                       88.40   148.55  0.6695    89.85   152.56  0.5833
(ii) Q(t–1) & Q(t–2)             73.94   126.79  0.7778    75.30   133.17  0.7018
(iii) Q(t–1), Q(t–2) & Q(t–3)    73.88   126.32  0.7792    75.28   132.77  0.7033
Table 3
R, MAE and RMSE statistics for BRT (MAE and RMSE in m3/s).

                                 Training                  Testing
Model attributes                 MAE     RMSE    R         MAE     RMSE    R
(i) Q(t–1)                       86.95   127.90  0.7457    96.11   143.99  0.6077
(ii) Q(t–1) & Q(t–2)             49.90   79.57   0.9122    68.51   117.17  0.7818
(iii) Q(t–1), Q(t–2) & Q(t–3)    50.21   81.33   0.9086    67.65   106.86  0.8085

Table 4
R, MAE and RMSE statistics for GBRT (MAE and RMSE in m3/s).

                                 Training                  Testing
Model attributes                 MAE     RMSE    R         MAE     RMSE    R
(i) Q(t–1)                       89.24   130.19  0.7441    93.23   136.27  0.6587
(ii) Q(t–1) & Q(t–2)             67.02   99.18   0.8907    76.71   116.52  0.7651
(iii) Q(t–1), Q(t–2) & Q(t–3)    53.86   80.39   0.9163    69.15   108.38  0.8054
In addition, the relative correlation coefficient (R) difference between the BRT (attribute combination iii) and the SVR (attribute combination iii) models is 14.96%, and the difference between the GBRT (attribute combination iii) and the SVR (attribute combination iii) models is 14.52% in the testing phase. These results indicate that tree-based ensemble models are superior alternatives to conventional machine learning models. Moreover, the MAE and RMSE statistics are inconsistent with the correlation statistics in the training and testing phases. In minimizing the training-phase RMSE, CART (79.46 m3/s) is the best model, GBRT (80.39 m3/s) the second, BRT (81.33 m3/s) the third and SVR (126.32 m3/s) the worst; in minimizing the training-phase MAE, CART (49.88 m3/s) is the best, BRT (50.21 m3/s) the second, GBRT (53.86 m3/s) the third and SVR (73.88 m3/s) the worst. In the testing phase, the best model for minimizing RMSE (106.86 m3/s) and MAE (67.65 m3/s) is BRT, the second is GBRT (RMSE = 108.38 m3/s, MAE = 69.15 m3/s), the third is SVR (RMSE = 132.77 m3/s, MAE = 75.28 m3/s) and the worst is CART (RMSE = 137.01 m3/s, MAE = 77.76 m3/s).
The BRT, GBRT, CART and SVR forecasts and residuals in the test period are shown in Fig. 5. The BRT and GBRT models approximate the hydrographs better than the CART and SVR models, and the underestimation of the peaks by the proposed predictive models can be seen from the residuals. Numerical descriptors (Max Q, Min Q, Mean Q, Var Q, MUP and MOP) for the proposed models are summarized in Tables 5 and 6.
The streamflow peak estimates obtained by the BRT, GBRT, CART and SVR models and the corresponding observed values are compared in Table 7 and Fig. 6. The table and figure show that the tree-based models (CART, BRT and GBRT) are more accurate than the SVR model. The BRT estimate of the maximum peak is 566.6 m3/s against an observed 740.8 m3/s, an underestimation of 23.5%, while the GBRT, SVR and CART give 546.2 m3/s (26.3%), 524.8 m3/s (29.2%) and 520.8 m3/s (29.7%), respectively. The CART estimate of the second maximum peak is 520.8 m3/s against an observed 701.4 m3/s, an underestimation of 25.8%, while the BRT, GBRT and SVR give 447.6 m3/s (36.2%), 444.5 m3/s (36.6%) and 205.4 m3/s (70.7%), respectively.
The box plots in Figs. 7 and 8 depict the distributions of the monthly observed and predicted streamflows for the training and test phases. The box height corresponds to the interquartile range, the whiskers mark the 5th and 95th percentiles, the horizontal line within each box is the median, and dots indicate values outside the range. Observed and predicted streamflow data are positively skewed in both phases. Compared with the distribution of the observed streamflow data, the BRT model performs better than the SVR, CART and GBRT models: the distribution of the streamflow data predicted by the BRT model is the most similar to that of the observed data, and the BRT model does the best job of capturing the observed data in the training and test phases.
Figs. 9 and 10 depict the distributions of the model predictions against the measured flow data for the training and testing phases. The results of BRT and GBRT give a better fit to a straight line than those of SVR and CART, indicating that these techniques are more accurate for predicting streamflow.
5. Discussion
In this study, a classification and regression tree (CART) is first employed in monthly streamflow forecasting and compared with the support vector regression (SVR) model. In terms of the correlation coefficient, the CART model yields better results than the SVR model: the relative R differences between the CART and SVR models are 16.74% in the training phase and 4.49% in the testing phase.
Fig. 5. Monthly streamflow estimates of SVR, CART, BRT and GBRT models in test period.
The results indicate that the CART model is significantly better than the SVR model, especially in the training phase. This may be because, first of all, CART does not need a priori information about the dataset, which allows a bigger variety of possible model specifications to be considered. Even if the training set holds some irrelevant information (e.g., measurement errors, misspecification), the model will choose correct splits by itself and hence account for disturbances automatically. Although all possible data splits are analyzed, the CART architecture is flexible enough to account for all of them and to do so quickly. CART can use any combination of continuous and categorical data, so researchers are no longer limited to a particular class of data, and the created models are able to capture more real-life effects (Andriyashin, 2005). Moreover, it is well known that decent parameter settings are vital for the forecasting accuracy of SVMs, and that the performance of an SVM depends strongly on the selection of the kernel function, which forces researchers to make additional assumptions. The architecture of CART, on the other hand, is non-parametric; when no hypotheses about the data structure are available, non-parametric analysis becomes a very effective data mining tool. Besides this, when using CART for predictive modeling, researchers do not need to make any additional assumptions concerning the distribution of model errors (Andriyashin, 2005). Ultimately, the flexible structure of CART helps it fit the target correctly, whereas the SVR is a more rigid method and its stiffness (bias) may increase the prediction error of the model.
However, in the testing phase, the result obtained by CART (R = 0.7349) is not remarkable. Thus, two ensemble learning paradigms (bagging and stochastic gradient boosting) are incorporated to build CART ensembles. In bagging and boosting, multiple versions of the CART are formed by making bootstrap replicas of the learning set and using these as new learning sets; the predicted value generated by the ensembles is an average over these multiple versions of the predictors (Grunwald et al., 2009). Although bagging and boosting both combine the outputs from different predictors, they differ in the ways they permutate the training data and combine the estimates coming from their base learners (Zhang et al., 2008). In bagging, each new resample is drawn at random with replacement from the entire learning dataset, and the resamples are all independent; in stochastic gradient boosting, however, the resampling for the next CART depends on the performance of the previous CART. More precisely, in boosting the algorithm trains the first CART with the original sample, and the training set of each new CART is assembled based on the performance of the prior CART, emphasizing observations whose predictions from the previous CART differ significantly from the observed values.
Table 5
Numerical descriptors for the four predictive models and the observed data for the training phase (discharges in m3/s).

          Min Q    Max Q      Mean Q    Var Q       MOP      MUP
Observed  48.96    1119.58    210.63    44365.70    –        –
SVR       34.43    893.51     174.30    30379.87    313.36   799.15
CART      63.94    984.44     211.99    44938.54    380.74   300.81
BRT       72.85    774.81     210.94    44495.49    340.04   399.29
GBRT      67.79    842.09     212.07    44974.77    304.46   316.19
Table 6
Numerical descriptors for the four predictive models and the observed data for the testing phase (discharges in m3/s).

          Min Q    Max Q      Mean Q    Var Q       MOP      MUP
Observed  56.89    740.84     204.83    41953.42    –        –
SVR       12.95    613.23     170.93    29215.56    254.30   495.98
CART      63.94    984.44     218.99    47957.53    681.34   390.16
BRT       73.11    663.32     210.20    44184.57    355.47   405.15
GBRT      80.00    722.04     215.50    46438.58    418.94   410.10
Table 7
The comparison of SVR, CART, BRT and GBRT peak estimates for the test period.

Peak   Observed        Estimates (m3/s)                    Relative error (%)
no.    peaks (m3/s)    BRT     GBRT    CART    SVR         BRT    GBRT   CART   SVR
1      318.6           168.0   222.5   171.0   161.3       47.3   30.2   46.3   49.4
2      581.9           530.4   476.4   518.6   293.2       8.9    18.1   10.9   49.6
3      251.8           139.3   159.2   171.0   111.4       44.7   36.8   32.1   55.7
4      659.8           538.8   559.1   520.8   255.8       18.3   15.3   21.1   61.2
5      561.2           156.0   151.1   171.0   130.4       72.2   73.1   69.5   76.8
6      740.8           566.6   546.2   520.8   524.8       23.5   26.3   29.7   29.2
7      701.4           447.6   444.5   520.8   205.4       36.2   36.6   25.8   70.7
8      391.7           152.9   154.1   159.0   136.5       61.0   60.7   59.4   65.1
9      589.5           550.7   559.1   520.8   367.7       6.6    5.2    11.7   37.6
Fig. 6. Monthly streamflow peak estimates of SVR, CART, BRT and GBRT models in test period.
Fig. 7. Box plots of monthly observed and predicted streamflows for training phase.
Fig. 8. Box plots of monthly observed and predicted streamflows for test phase.
The patterns of the new sample are adjusted to have a higher probability of being sampled, so they have a greater chance of appearing in the new sample than those correctly predicted; this means that different CARTs specialize in different parts of the observation space (Shu and Burn, 2004). In this study, 100 randomly drawn replica datasets are generated for building both the bagged and the boosted regression trees. These datasets are then used to build individual CART models, and the outputs of the CART models are combined linearly to produce the final forecast. To the best of our knowledge, this is the first study that employs these ensemble learning paradigms in building CART ensembles for monthly streamflow forecasting.
In the training phase, the results of CART (R = 0.9096), the bagged regression trees (R = 0.9086) and the stochastic gradient boosted regression trees (R = 0.9163) are very close to each other; it appears that the ensemble learning methods (bagging and boosting) do not help in the training phase. This result is not surprising, because the training result of CART is already remarkable, so the ensemble learning methods, which are expected to advance the accuracy of a weak learning process, may not contribute further. However, good training accuracy does not guarantee good testing accuracy. Indeed, in the testing phase the empirical results suggest that the bagged regression trees (BRT) and stochastic gradient boosted regression trees (GBRT) models (RBRT = 0.8085, RGBRT = 0.8054) are superior to the single CART model. Although BRT slightly outperforms GBRT, there is no significant difference between bagging and boosting; similar results were reported by Ismail and Mutanga (2010) in an empirical comparison of ensemble learning methods based on CART. The bagging and boosting methods increase the R statistic of the single CART model by 10.01% and 9.59%, respectively. In addition, the relative correlation coefficient (R) difference between the BRT and SVR models is 14.96%, and the difference between the GBRT and SVR models is 14.52%, in the testing phase. These results indicate that the ensemble methods can noticeably improve the accuracy of the single CART model and that they are superior alternatives to the well-known support vector machine model. Our findings are in line with previously reported results: Jeong and Kim (2005) used a bagged neural network in monthly inflow forecasting, and the overall results indicated that the bagged neural network outperformed a simple ANN among the three rainfall–runoff models.
Besides this, the peak estimates of the tree-based models are also more accurate than those of the SVR model. In general, however, the CART model performed better than the proposed ensemble models in forecasting monthly peak flows.
Fig. 9. Observed versus predicted monthly streamflow data for training phase.
Fig. 10. Observed versus predicted monthly streamflow data for testing phase.
This may be because the proposed ensemble models try to reduce noise by generating sub-datasets to enhance the overall prediction accuracy, a procedure that usually smooths the data and may lead to a lack of fit of the model for peak estimation. Visually, however, the distributions of the streamflow data predicted by the BRT and GBRT models fit the distribution of the observed data remarkably well in both the training and testing phases, while those of CART and SVR do not.
A question then comes to mind: why do CART ensembles work? As argued before, the main philosophy of the bagging and stochastic gradient boosting ensemble learning methods is to enhance the prediction accuracy of conventional weak predictors. In general, tree-based ensembles inherit almost all the advantages of tree-based models while overcoming their primary problem, which is inaccuracy. Moreover, tree-based ensemble models can reasonably increase their accuracy by generating many replica datasets, creating various models which have a lower bias, and then integrating them into an ensemble model with higher performance (Chou et al., 2011). Since CART and SVR are data-driven techniques, they require a large number of input–output data sets for their training process compared with ensemble models. This is why the tree-based ensemble models perform better than the single CART and SVR models: using only one model to predict the streamflow time series usually may not illuminate the internal mechanism of the phenomenon.
6. Conclusions
In this study, 35 years of measured data (1968–2002) from the Karşıköy observation station on the Çoruh River in Turkey are used, and three attribute combinations based on preceding monthly streamflows are developed to forecast current streamflow values. The correlation coefficient (R), mean absolute error (MAE) and root
mean squared error (RMSE) statistics are used to evaluate the proposed predictive models. The results indicate that: (i) classification and regression trees (CART) are a promising technique for monthly streamflow forecasting and yield better results than the support vector regression (SVR) model; (ii) the results of the CART model in the testing phase are noticeable but not overwhelming; (iii) ensemble learning methods (bagging, stochastic gradient boosting) can noticeably increase the accuracy of the single CART model; and (iv) the bagging ensemble model yields slightly better results than the boosting ensemble model. As a result of this empirical study, tree-based ensemble models can be implemented to successfully forecast monthly streamflow time series, and the proposed models are very easy to use.
In this study, bagging and stochastic gradient boosting are used to build the ensemble models; other ensemble learning methods (e.g., stacking, AdaBoost) could be used for this process. Moreover, this study employs a single CART model as the base predictor; in constructing ensemble models, other machine learning models (e.g., support vector machines, artificial neural networks) could be used as base predictors. These may be subjects of future work. Finally, only three input–output combinations based on preceding monthly streamflows are used here to forecast current streamflow values. A broader analysis of the environmental factors that influence streamflow is beyond the scope of this paper, but it could be the subject of another important future study.
Acknowledgement
The authors would like to thank the editor and the anonymous referees for their valuable and helpful comments.
References
Anctil, F., Lauzon, N., 2004. Generalisation for neural networks through data sampling and training procedures with applications to streamflow predictions. Hydrol. Earth Syst. Sci. 8 (5), 940–958. http://dx.doi.org/10.5194/hess-8-940-2004.
Andriyashin, A., 2005. Financial Applications of Classification and Regression Trees. Master Thesis, Humboldt University, Berlin, pp. 6–8.
Araghinejad, S., Azmi, M., Kholghi, M., 2011. Application of artificial neural network ensembles in probabilistic hydrological forecasting. J. Hydrol. 407 (1–4), 94–104. http://dx.doi.org/10.1016/j.jhydrol.2011.07.011.
Asefa, T., Kemblowski, M., Mckee, M., Khalil, A., 2006. Multi-time scale stream flow predictions: the support vector machines approach. J. Hydrol. 318 (1–4), 7–16. http://dx.doi.org/10.1016/j.jhydrol.2005.06.001.
Boucher, M.-A., Laliberté, J.-P., Anctil, F., 2010. An experiment on the evolution of an ensemble of neural networks for streamflow forecasting. Hydrol. Earth Syst. Sci. 14, 603–612.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth Int. Group, Belmont, California, pp. 357–358.
Breiman, L., 1996. Bagging predictors. Machine Learning 24 (2), 123–140. http://dx.doi.org/10.1023/A:1018054314350.
Breiman, L., 1999. Using Adaptive Bagging to Debias Regressions. Technical Report No. 547, University of California, Berkeley.
Bühlmann, P., Yu, B., 2002. Analyzing bagging. Ann. Stat. 30 (4), 927–961.
Cannon, A.J., Whitfield, P.H., 2002. Downscaling recent streamflow conditions in British Columbia, Canada using ensemble neural network models. J. Hydrol. 259, 136–151. http://dx.doi.org/10.1016/S0022-1694(01)00581-9.
Chou, J.S., Chiu, C.K., Farfoura, M., Al-Taharwa, I., 2011. Optimizing the prediction accuracy of concrete compressive strength based on a comparison of data-mining techniques. J. Comput. Civil Eng. 25 (3), 242–263. http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000088.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7 (1), 1–26. http://dx.doi.org/10.1214/aos/1176344552.
Erdal, H.I., Ekinci, A., 2012. A comparison of various artificial intelligence methods in the prediction of bank failures. Comput. Econ. http://dx.doi.org/10.1007/s10614-012-9332-0.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29 (5), 1189–1232.
Friedman, J.H., 2002. Stochastic gradient boosting. Comput. Stat. Data Anal. 38 (4), 367–378. http://dx.doi.org/10.1016/S0167-9473(01)00065-2.
Grunwald, S., Daroub, S.H., Lang, T.A., Diaz, O.A., 2009. Tree-based modeling of complex interactions of phosphorus loadings and environmental factors. Sci. Total Environ. 407 (12), 3772–3783.
Guo, J., Zhou, J., Qin, H., Zou, Q., Li, Q., 2011. Monthly streamflow forecasting based on improved support vector machine model. Expert Syst. Appl. 38 (10), 13073–13081. http://dx.doi.org/10.1016/j.eswa.2011.04.114.
Hancock, T., Put, R., Coomans, D., Vander Heyden, Y., Everingham, Y., 2005. A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemomet. Intell. Lab. Syst. 76 (2), 185–196. http://dx.doi.org/10.1016/j.chemolab.2004.11.001.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York, p. 500.
Ismail, R., Mutanga, O., 2010. A comparison of regression tree ensembles: predicting Sirex noctilio induced water stress in Pinus patula forests of KwaZulu-Natal, South Africa. Int. J. Appl. Earth Obs. Geoinf. 12S, S45–S51.
Jeong, D.-I., Kim, Y.-O., 2005. Rainfall–runoff models using artificial neural networks for ensemble streamflow prediction. Hydrol. Process. 19 (19), 3819–3835. http://dx.doi.org/10.1002/hyp.5983.
Kisi, O., Cimen, M., 2011. A wavelet-support vector machine conjunction model for monthly streamflow forecasting. J. Hydrol. 399 (1–2), 132–140. http://dx.doi.org/10.1016/j.jhydrol.2010.12.041.
Li, P.-H., Kwon, H.-H., Sun, L., Lall, U., Kao, J.-J., 2010. A modified support vector machine based prediction model on streamflow at the Shihmen Reservoir, Taiwan. Int. J. Climatol. 30 (8), 1256–1268. http://dx.doi.org/10.1002/joc.1954.
Pino-Mejias, R., Jimenez-Gamero, M.D., Cubiles-de-la-Vega, M.D., Pascual-Acosta, A., 2008. Reduced bootstrap aggregating of learning algorithms. Pattern Recogn. Lett. 29 (3), 265–271. http://dx.doi.org/10.1016/j.patrec.2007.10.002.
Samsudin, R., Saad, P., Shabri, A., 2010. A hybrid least squares support vector machines and GMDH approach for river flow forecasting. Hydrol. Earth Syst. Sci. Discuss. 7, 3691–3731. http://dx.doi.org/10.5194/hessd-7-3691-2010.
Schapire, R.E., 1990. The strength of weak learnability. Machine Learning 5, 197–227. http://dx.doi.org/10.1023/A:1022648800760.
Schölkopf, B., Burges, C.J.C., Smola, A., 1999. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA.
Shu, C., Burn, D.H., 2004. Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resour. Res. 40 (W09301). http://dx.doi.org/10.1029/2003WR002816.
Snelder, T.H., Lamouroux, N., Leathwick, J.R., Pella, H., Sauquet, E., Shankar, U., 2009. Predictive mapping of the natural flow regimes of France. J. Hydrol. 373, 57–67. http://dx.doi.org/10.1016/j.jhydrol.2009.04.011.
Tiwari, M.K., Chatterjee, C., 2010. Development of an accurate and reliable hourly flood forecasting model using Wavelet–Bootstrap–ANN (WBANN) hybrid approach. J. Hydrol. 394 (3–4), 458–470. http://dx.doi.org/10.1016/j.jhydrol.2010.10.001.
Tiwari, M.K., Chatterjee, C., 2011. A new Wavelet–Bootstrap–ANN hybrid model for daily discharge forecasting. J. Hydroinform. 13 (3), 500–519. http://dx.doi.org/10.2166/hydro.2010.142.
Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vezza, P., Comoglio, C., Rosso, M., Viglione, A., 2010. Low flows regionalization in North-Western Italy. Water Resour. Manage. 24, 4049–4074. http://dx.doi.org/10.1007/s11269-010-9647-3.
Wang, G., Hao, J., Ma, J., Jiang, H., 2011. A comparative assessment of ensemble learning for credit scoring. Expert Syst. Appl. 38 (1), 223–230. http://dx.doi.org/10.1016/j.eswa.2010.06.048.
Zhang, C.X., Zhang, J.S., Wang, G.W., 2008. An empirical study of using rotation forest to improve regressors. Appl. Math. Comput. 195 (2), 618–629. http://dx.doi.org/10.1016/j.amc.2007.05.010.