Constructing neural network sediment estimation models using a data-driven algorithm


Available online at www.sciencedirect.com

Mathematics and Computers in Simulation 79 (2008) 94–103

Özgür Kisi ∗

Civil Engineering Department, Engineering Faculty, Erciyes University, 38039 Kayseri, Turkey

Received 10 May 2007; received in revised form 5 September 2007; accepted 9 October 2007

Available online 18 October 2007

Abstract

Artificial neural network (ANN) models are designed for suspended sediment estimation using statistical pre-processing of the data. Statistical properties such as the cross-, auto- and partial auto-correlation of the data series are used for identifying a unique input vector to the ANN that best represents the sediment estimation process for a basin. The methodology is evaluated using the flow and sediment data from the stations Quebrada Blanca and Rio Valenciano in the USA. The results of the study indicate that statistical pre-processing of the data can significantly reduce the effort and computational time required in developing an ANN model. Three ANN training algorithms are also compared with each other for the selected input vector.

© 2007 IMACS. Published by Elsevier B.V. All rights reserved.

Keywords: A data-driven algorithm; Sediment estimation; Neural networks

1. Introduction

The transport of sediment in rivers is important with respect to pollution, channel navigability, reservoir filling, hydroelectric-equipment longevity, fish habitat, river aesthetics and scientific interests. The assessment of the volume of sediment being transported by a river is of vital interest in hydraulic engineering due to its importance in the design and management of water resources projects.

ANNs have been successfully applied in a number of diverse fields, including water resources. The application of ANNs to water resources problems is rapidly gaining popularity due to their immense power and potential in the mapping of nonlinear system data [1,3,4,6,8,11,12,15,16,19,21]. However, there are few published works in the field of suspended sediment data prediction using the ANN approach [5,7,14,17,18,25,26].

One of the most important problems in modelling the sediment phenomenon using ANNs is which architecture should be used to map the phenomenon effectively. In data-driven approaches like ANNs, the sets of variables that influence the system are not known a priori, unlike in physically based models. Therefore, the selection of an appropriate input vector that will allow an ANN to map to the desired output vector successfully is not a trivial task. In most of the reported applications this has been done by trial and error [14,22]. When developing models of the autoregressive moving average (ARMA) or multiple linear regression (MLR) type, the order of the inputs can be determined using empirical and/or analytical approaches [13]. Taking a statistical perspective is especially important

∗ Tel.: +90 352 437 0080; fax: +90 352 437 5784. E-mail address: [email protected].

0378-4754/$32.00 © 2007 IMACS. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.matcom.2007.10.005


for atheoretical models like ANNs, because the reason for applying them is that they do not require knowledge about an adequate functional form. However, analytical approaches are not used to determine the inputs for multivariate ANN models, because ANNs belong to the class of data-driven approaches, whereas the conventional statistical methods are model-driven [2]. In model-driven approaches, the model structure can be determined using empirical or analytical approaches before the unknown model parameters are determined. In data-driven approaches, however, the determination of the model structure is critical. Presenting a large number of inputs to ANN models and relying on the network to determine the critical model inputs usually increases the network size. This causes a number of disadvantages, such as increasing the training time, increasing the amount of data required for efficiently estimating the connection weights and increasing the number of local minima in the error surface, which makes it more difficult to obtain a near-optimal combination of the weights for the problem under consideration [24].

Sudheer et al. [24] proposed a data-driven algorithm based on statistical pre-processing of the data set for constructing ANN rainfall-runoff models. They used a radial basis neural network (RBNN) in their study and used data from only one sample for the rainfall-runoff case study. In the present study, the proposed data-driven algorithm is investigated for the selection of an appropriate input vector for ANN sediment estimation models. MLP neural networks are used instead of RBNN and the methodology is successfully applied to the flow and suspended sediment data of two stations in the USA. After determining the input vector, three ANN training algorithms are compared with each other.

2. Methodology

2.1. Neural networks

Artificial neural networks (ANNs) are based on the present understanding of the biological nervous system, though much of the biological detail is neglected. ANNs are massively parallel systems composed of many processing elements connected by links of variable weights. Of the many ANN paradigms, the multi-layer back propagation network (MLP) is by far the most popular [20]. The network consists of layers of parallel processing elements, called neurons, with each layer being fully connected to the preceding layer by interconnection strengths, or weights, W. Fig. 1 illustrates a three-layer neural network consisting of layers i, j and k, with the interconnection weights $W_{ij}$ and $W_{jk}$ between layers of neurons. Initial estimated weight values are progressively corrected during a training process that compares predicted outputs to known outputs, and back propagates any errors (from right to left in Fig. 1) to determine the appropriate weight adjustments necessary to minimize the errors. The Levenberg–Marquardt (LM) training algorithm is used here for adjusting the weights.

Fig. 1. A three-layer neural network structure.

2.2. Data-driven method

Let $Q_t$ and $S_t$ denote the discharge and suspended sediment values at time t. The sediment estimation model can be represented in the functional form

$$S_t = f\left(Q_t, Q_{t-1}, \ldots, Q_{t-n}, S_{t-1}, S_{t-2}, \ldots, S_{t-(n-1)}\right) \qquad (1)$$

where n is the number of data sets.
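As an illustration of Eq. (1), the lagged input vectors can be assembled from the two daily series as in the minimal sketch below. The function name `build_lagged_inputs` and its arguments are hypothetical (they do not appear in the paper), and the number of flow lags and sediment lags are passed separately as an assumption, since the study later selects them independently.

```python
import numpy as np

def build_lagged_inputs(q, s, n_flow_lags, n_sed_lags):
    """Build inputs [Q_t, Q_{t-1}, ..., S_{t-1}, ...] and targets S_t.

    q, s        : 1-D discharge and sediment series of equal length
    n_flow_lags : number of antecedent flow values (Q_t is always included)
    n_sed_lags  : number of antecedent sediment values (S_t itself is the target)
    """
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    if q.ndim != 1 or q.shape != s.shape:
        raise ValueError("q and s must be 1-D series of equal length")
    start = max(n_flow_lags, n_sed_lags)  # first t for which every lag exists
    rows = []
    for t in range(start, len(s)):
        flow = [q[t - k] for k in range(0, n_flow_lags + 1)]
        sed = [s[t - k] for k in range(1, n_sed_lags + 1)]
        rows.append(flow + sed)
    return np.array(rows), s[start:]

# Toy series (not the USGS data), chosen so the rows are easy to read.
q = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
s = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
X, y = build_lagged_inputs(q, s, n_flow_lags=1, n_sed_lags=2)
print(X[0], y[0])  # first row is [Q_2, Q_1, S_1, S_0] with target S_2
```

Each row of `X` is one candidate input vector and the matching entry of `y` is the sediment value to be estimated, so different input combinations can be generated by changing the two lag counts.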


In sediment estimation models, the input values may be the discharge and suspended sediment with different lags. However, how many antecedent Q and S values should be included in the input vector is not known a priori. Determining the number of Q and S values in the input vector involves finding the lags of discharge and sediment that have a significant influence on the output, $S_t$. The cross-correlation analysis between the discharge and sediment series reveals which antecedent discharges significantly influence the sediment at a given time. Similarly, the influential antecedent sediment values can be determined using the auto-correlation and partial auto-correlation analyses. The cross-, auto- and partial auto-correlation functions of the data series can be examined against their 95% confidence levels, and the modeller can then decide the number of antecedent discharge and sediment values that should be in the input vector. In the present study, the methodology is checked by using different input combinations consisting of various antecedent discharge and sediment values for the ANN models. To investigate the influence of variation in the input vector, the mean absolute relative error (MARE) statistic of the sediment estimates is analyzed.

3. Case study

The daily flow-sediment time series data of Quebrada Blanca Station at Jagual (USGS Station No: 50051150, Latitude 18°09′40″, Longitude 65°58′58″) and Rio Valenciano Station near Juncos (USGS Station No: 50056400, Latitude 18°12′58″, Longitude 65°55′34″), operated by the U.S. Geological Survey (USGS), are used in the study. The locations of the stations are shown in Fig. 2. The gage datums are 426.5 and 320.0 feet above sea level and the drainage areas at these sites are 43.57 km² and 8.63 km² for the Quebrada Blanca and Rio Valenciano Station, respectively. For these stations, the river flow and sediment concentration data were downloaded from the web server of the USGS.
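The correlation screening described in Section 2.2 could be applied to daily series such as these roughly as sketched below. This is not the author's code: the function names are hypothetical, the 95% band uses the common large-sample approximation ±1.96/√N (the paper does not state how its bands were computed), and the PACF is estimated by ordinary least squares, which is one of several standard estimators.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Pearson correlation between x[t-k] and y[t] for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.array([np.corrcoef(x[: n - k], y[k:])[0, 1]
                     for k in range(max_lag + 1)])

def partial_autocorrelation(s, max_lag):
    """PACF at lag k: last coefficient of an OLS fit of s[t] on s[t-1..t-k]."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    pacf = [1.0]
    for k in range(1, max_lag + 1):
        lagged = [s[k - j: n - j] for j in range(1, k + 1)]
        design = np.column_stack([np.ones(n - k)] + lagged)  # intercept + lags
        coef, *_ = np.linalg.lstsq(design, s[k:], rcond=None)
        pacf.append(coef[-1])
    return np.array(pacf)

def significant_lags(corr, n_obs):
    """Lags whose correlation lies outside the approximate 95% band."""
    band = 1.96 / np.sqrt(n_obs)  # assumption: white-noise large-sample band
    return [k for k, r in enumerate(corr) if abs(r) > band], band

# Synthetic illustration (not the USGS data): sediment follows flow by one day.
rng = np.random.default_rng(0)
flow = rng.gamma(shape=2.0, scale=1.0, size=365)
sediment = np.roll(flow, 1) * 10.0  # sediment[t] = 10 * flow[t-1] for t >= 1

ccf = cross_correlation(flow, sediment, max_lag=10)
acf = cross_correlation(sediment, sediment, max_lag=10)  # ACF as a special case
pacf = partial_autocorrelation(sediment, max_lag=10)
lags, band = significant_lags(ccf, n_obs=len(flow))
print("significant flow lags:", lags, "band: +/-%.3f" % band)
```

In this synthetic case the lag-1 cross-correlation is essentially 1 by construction, so lag 1 is always flagged; with real data the modeller would read the flagged lags of the CCF, ACF and PACF to decide how many antecedent flow and sediment values enter the input vector.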
After examining the data and noting the periods in which there are gaps in one or more of the two variables, the periods for calibration and validation are chosen. The data of 01 October 1993 to 30 September 1994 (1994 water year) are used for the calibration and the data of 01 October 1994 to 30 September 1995 (1995 water year) are chosen for the validation for both stations. It may be noted that the periods from which calibration and validation data are chosen span the same temporal seasons (October–September). The statistical parameters of the streamflow and sediment concentration data of the Quebrada Blanca and Rio Valenciano stations are shown in Tables 1 and 2. In these tables, Sx, Cv and Csx denote the standard deviation, variation and

Fig. 2. The locations of the Quebrada Blanca (USGS Station: 50051150) and Rio Valenciano (USGS Station: 50056400) stations in Puerto Rico.


Table 1
The statistical parameters of Quebrada Blanca Station data

| Data | Mean | Sx | Cv | Csx | Minimum | Maximum |
|---|---|---|---|---|---|---|
| Calibration period: Streamflow (m³/s) | 0.10 | 0.33 | 3.3 | 14.0 | 0.01 | 5.69 |
| Calibration period: Sediment concentration (mg/l) | 14.9 | 51.9 | 3.5 | 9.1 | 1 | 713 |
| Validation period: Streamflow (m³/s) | 0.13 | 0.26 | 2.0 | 5.42 | 0.01 | 2.38 |
| Validation period: Sediment concentration (mg/l) | 20.8 | 63.1 | 3.0 | 7.08 | 1 | 635 |

Table 2
The statistical parameters of Rio Valenciano Station data

| Data | Mean | Sx | Cv | Csx | Minimum | Maximum |
|---|---|---|---|---|---|---|
| Calibration period: Streamflow (m³/s) | 0.60 | 2.07 | 3.5 | 13.6 | 0.04 | 219 |
| Calibration period: Sediment concentration (mg/l) | 42.0 | 106 | 2.5 | 7.45 | 2 | 1200 |
| Validation period: Streamflow (m³/s) | 1.05 | 2.47 | 2.4 | 5.72 | 0.05 | 24.6 |
| Validation period: Sediment concentration (mg/l) | 71.1 | 148 | 2.1 | 4.31 | 4 | 1090 |

skewness coefficients, respectively. It can be seen from the tables that the sediment and flow data show highly skewed distributions. The ratio between the standard deviation and the mean, Cv, is also high for both stations.

4. Application and results

A multi-layer ANN can have more than one hidden layer; however, theoretical works have shown that a single hidden layer is sufficient for ANNs to approximate any complex nonlinear function [9]. Therefore, in this study, an ANN with one hidden layer is used. A difficult task with ANNs involves choosing parameters such as the number of hidden nodes and the learning rate. Determining an appropriate architecture of a neural network for a particular problem is an important issue, since the network topology directly affects its computational complexity and its generalization capability. Here, the number of hidden layer nodes of each model is determined after trying various network structures, since there is no theory yet to tell how many hidden units are needed to approximate any given function. The tangent sigmoid and pure linear functions are found appropriate for the hidden and output node activation functions, respectively.
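The network structure just described (one hidden layer with tangent sigmoid nodes and a linear output node) can be written as a short forward pass. The sketch below is illustrative only: the function name `mlp_forward`, the bias terms, the layer sizes and the random weights are assumptions, and no training step (LM, CG or GD) is shown.

```python
import numpy as np

def mlp_forward(x, w_ij, b_j, w_jk, b_k):
    """Forward pass of a three-layer network.

    x    : (n_samples, n_inputs) input vectors
    w_ij : (n_inputs, n_hidden) input-to-hidden weights
    b_j  : (n_hidden,) hidden biases (assumed; not discussed in the paper)
    w_jk : (n_hidden, 1) hidden-to-output weights
    b_k  : (1,) output bias (assumed)
    """
    hidden = np.tanh(x @ w_ij + b_j)  # tangent sigmoid hidden layer
    return hidden @ w_jk + b_k        # pure linear output node

rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 3             # hypothetical sizes
w_ij = rng.normal(scale=0.5, size=(n_inputs, n_hidden))
b_j = np.zeros(n_hidden)
w_jk = rng.normal(scale=0.5, size=(n_hidden, 1))
b_k = np.zeros(1)

x = rng.uniform(0.2, 0.8, size=(5, n_inputs))  # illustrative scaled inputs
y_hat = mlp_forward(x, w_ij, b_j, w_jk, b_k)
print(y_hat.shape)
```

Because the output node is linear, the estimate is not bounded by the hidden activation range, which is one common reason for pairing a saturating hidden layer with a linear output in regression problems.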

Fig. 3. The cross-correlation function of the flow-sediment series—Quebrada Blanca Station.


Before applying the ANN to the data, the training input and output values are normalized using the equation

$$a\,\frac{x_i - x_{\min}}{x_{\max} - x_{\min}} + b \qquad (2)$$

where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum of the training and testing data. Different values can be assigned for the scaling factors a and b. There are no fixed rules as to which standardization approach should be used in particular circumstances [10]. In this study the a and b are taken as 0.6 and 0.2, respectively. The learning and

Fig. 4. The cross-correlation function of the flow-sediment series—Rio Valenciano Station.

Fig. 5. The auto-correlation function of the sediment series—Quebrada Blanca Station.

Fig. 6. The auto-correlation function of the sediment series—Rio Valenciano Station.


momentum rates are taken as 0.01 and 0.9, respectively. It is seen that choosing high values like 0.5 for the learning rate, as done by Raman and Sunilkumar [23], throws the network into oscillations or saturates the neurons [24].

The correlation analyses of the data of each station are employed for selecting appropriate input vectors for the ANN sediment estimation models. The cross-correlation (CC) statistics and the corresponding 95% confidence bands from lag 0 to lag 10 are estimated for the flow-sediment series (Figs. 3 and 4). The Pearson cross-correlation between the flow and sediment series (Fig. 3) shows that, for the Quebrada Blanca Station, the flow is significantly correlated with the sediment at the 95% confidence level up to a lag of 1 day and thereafter falls within the confidence band. The flow series of the Rio Valenciano Station, on the other hand, has significant correlation up to a lag of 2 days. The auto-correlation functions (ACF) of the sediment data are shown in Figs. 5 and 6. It can be clearly seen from the figures that there is a significant auto-correlation up to lag 5 (5 days) for the Quebrada Blanca Station and up to lag 2 for the Rio Valenciano Station. The gradually decaying pattern of the auto-correlation indicates the presence of a dominant autoregressive process. Similarly, the estimated partial auto-correlation function (PACF) and corresponding

Fig. 7. The partial auto-correlation function of the sediment series—Quebrada Blanca Station.

Fig. 8. The partial auto-correlation function of the sediment series, Rio Valenciano Station.

Table 3
The MARE statistics of the testing results of each ANN model, Quebrada Blanca Station (rows: antecedent streamflow values in input vector; columns: antecedent sediment values in input vector)

| Antecedent streamflow | – | 1 | 2 | 3 |
|---|---|---|---|---|
| – | – | 167.7 | 171.0 | 127.5 |
| 0 | 120.0 | 139.7 | 142.1 | 130.4 |
| 1 | 121.0 | 112.8 | 82.55 | 104.2 |
| 2 | 122.4 | 107.7 | 109.9 | 104.1 |


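Table 4
The MARE statistics of the testing results of each ANN model, Rio Valenciano Station (rows: antecedent streamflow values in input vector; columns: antecedent sediment values in input vector)

| Antecedent streamflow | – | 1 | 2 | 3 |
|---|---|---|---|---|
| – | – | 134.2 | 160.4 | 162.1 |
| 0 | 80.35 | 95.81 | 94.47 | 100.5 |
| 1 | 76.89 | 82.40 | 79.84 | 86.51 |
| 2 | 98.46 | 78.72 | 66.76 | 91.92 |
| 3 | 96.41 | 93.53 | 93.53 | 85.93 |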

Table 5
Training and testing results of each ANN training algorithm

| Station | Algorithm | Iterations | Time (s) | Training MARE | Training R² | Testing MARE | Testing R² |
|---|---|---|---|---|---|---|---|
| Quebrada Blanca | Levenberg–Marquardt (LM) | 50 | 1.2 | 75 | 0.964 | 83 | 0.940 |
| Quebrada Blanca | Conjugate gradient (CG) | 1000 | 3.3 | 156 | 0.937 | 167 | 0.896 |
| Quebrada Blanca | Gradient descent (GD) | 20000 | 218 | 196 | 0.893 | 199 | 0.704 |
| Rio Valenciano | Levenberg–Marquardt (LM) | 50 | 1.3 | 74 | 0.627 | 67 | 0.869 |
| Rio Valenciano | Conjugate gradient (CG) | 1000 | 3.7 | 71 | 0.600 | 61 | 0.846 |
| Rio Valenciano | Gradient descent (GD) | 20000 | 277 | 99 | 0.517 | 87 | 0.751 |

Fig. 9. Suspended sediment estimates of ANN LM, ANN CG and ANN GD models in test period—Quebrada Blanca Station.


95% confidence limits between lag 0 and lag 10 are presented in Figs. 7 and 8. The PACF indicates significant correlation up to lag 2 and, thereafter, falls within the confidence limits for both stations. The rapidly decaying pattern of the PACF confirms the dominance of the autoregressive process relative to the moving average process. The auto- and partial auto-correlation coefficients suggest incorporating sediment values of up to 2 days lag in the input vector to the ANN models. The above analyses suggest that the current flow, 1 antecedent flow and 2 antecedent sediment values are adequate in the input vector to the ANN for the Quebrada Blanca Station. For the Rio Valenciano Station, the current flow, 2 antecedent flow values and 2 antecedent sediment values are found appropriate for the input vector of the ANN. These analyses of the data relieve the modeller of a long trial and error procedure in identifying the appropriate input vector that will allow an ANN to map to the desired output. In these analyses, however, the input vector selection procedure relies on the linear relationship between the variables, and the effect of an additional variable in capturing any nonlinear residual dependencies is not assessed.

The ANNs are trained with various combinations of flow and sediment values and the MARE statistic of the testing results is computed to quantify the influence of variation in the input vector. The MARE values of the sediment estimates for each station are given in Tables 3 and 4. As can be seen from Table 3, the ANN model whose inputs are the current flow, 1 antecedent flow and 2 antecedent sediment values has the lowest MARE (82.55). For the Rio Valenciano Station, the MARE (66.76) is minimum for the ANN model with an input vector containing the current flow, 2 antecedent flow values and 2 antecedent sediment values. These results are in direct agreement with the input vector selected based on the statistical analyses above.
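The paper does not write out the error statistics, so the sketch below shows one plausible reading: MARE as the mean of |observed − estimated| / observed expressed in percent, and the determination coefficient as the squared Pearson correlation between observed and estimated values. Both definitions and the function names `mare` and `r_squared` are assumptions.

```python
import numpy as np

def mare(observed, estimated):
    """Mean absolute relative error in percent (assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    # Relative to the observed value; observations are assumed to be nonzero.
    return 100.0 * np.mean(np.abs(observed - estimated) / np.abs(observed))

def r_squared(observed, estimated):
    """Determination coefficient as squared Pearson correlation (assumed)."""
    r = np.corrcoef(observed, estimated)[0, 1]
    return r ** 2

# Toy values (not from the study): relative errors of 20%, 10% and 0%.
obs = np.array([10.0, 20.0, 40.0])
est = np.array([12.0, 18.0, 40.0])
print(mare(obs, est))       # about 10 (percent)
print(r_squared(obs, est))
```

Under this definition a small absolute error on a low observed concentration still contributes a large relative error, which would be consistent with MARE values near or above 100 on strongly skewed sediment data.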
It is clearly seen that a single input vector to the ANN model, which maps the process satisfactorily, can be identified using the statistical pre-processing of the data set.

After determining the best ANN structure, two other algorithms, conjugate gradient (CG) and gradient descent (GD), are used for training the network and their performances are compared with that of the network trained by the Levenberg–Marquardt (LM) algorithm. The performance comparison of these three algorithms in suspended sediment estimation for the Quebrada Blanca and Rio Valenciano stations is given in Table 5. The comparison is displayed in terms

Fig. 10. Suspended sediment estimates of ANN LM, ANN CG and ANN GD models in test period—Rio Valenciano Station.


of MARE and determination coefficient (R²). The iteration numbers and computation times (in seconds) in the training period are also compared in the same table. From the table, it is obvious that the GD algorithm takes an unusually high number of iterations and time (on a Pentium IV PC) in training for both stations. The LM algorithm gives the lowest MARE (83) and the highest R² (0.964) for the Quebrada Blanca Station. For the Rio Valenciano Station, however, the CG algorithm has a slightly better MARE value (61) than the LM algorithm. The performance of the GD algorithm is the worst for both stations.

For the Quebrada Blanca Station, the suspended sediment estimates of the ANN LM, ANN CG and ANN GD models are shown in Fig. 9 in the form of hydrographs and scatterplots (the latter plotted on double logarithmic axes for better representation). The hydrographs show that the ANN LM model approximates the corresponding observed suspended sediment values better than the other ANN models. It can also be seen from the scatterplots that the ANN LM model predictions are much closer to the exact fit line than those of the ANN CG and ANN GD models. The ANN CG model also performs better than the ANN GD model, especially for the peak values. For the Rio Valenciano Station, the suspended sediment estimates of each model and the observed values are compared in Fig. 10. Here the ANN LM and ANN CG estimates seem to be similar to each other and their accuracies are better than that of the ANN GD.

5. Conclusions

The ANN sediment estimation models have been constructed using a procedure based on statistical pre-processing of the data set. The procedure has been applied to two stations in the USA and checked by using different input combinations consisting of various antecedent discharge and sediment values for the ANN models. The MARE statistic of the sediment estimates has been analyzed to investigate the influence of variation in the input vector.
The results were found to be in direct agreement with the input vector selected based on the procedure employed in the present study. It can be concluded that the statistical pre-processing of the data could significantly reduce the effort and computational time required in developing an ANN model. Using this procedure, the modeller can simply determine the appropriate input vector that will allow an ANN to map to the desired output vector successfully, and can acquire valuable information about the relationship between the input and output time series. The results obtained from this study justify the methodology proposed by Sudheer et al. [24]. Three ANN algorithms were compared with each other for the selected input vectors. The comparison results indicated that the LM and CG algorithms performed much better than the GD in suspended sediment estimation. The GD algorithm takes an unusually high number of iterations and much more time than the other two algorithms for training of the network.

References

[1] D.-H. Bae, D.M. Jeong, G. Kim, Monthly dam inflow forecasts using weather forecasting information and neuro-fuzzy technique, Hydrol. Sci. J. 52 (1) (2007) 99–113.

[2] K. Chakraborty, K. Mehrotra, C.K. Mohan, S. Ranka, Forecasting the behavior of multivariate time series using neural networks, Neural Netw. 5 (1992) 961–970.

[3] P. Chaves, T. Tsukatani, T. Kojiri, Operation of storage reservoir for water quality by using optimization and artificial intelligence techniques, Math. Comput. Simulat. 43 (2004) 377–386.

[4] H.K. Cigizoglu, Incorporation of ARMA models into flow forecasting by artificial neural networks, Environmetrics 14 (4) (2003) 417–427.

[5] H.K. Cigizoglu, Estimation and forecasting of daily suspended sediment data by multi layer perceptrons, Adv. Water Resour. 27 (2004) 185–195.

[6] H.K. Cigizoglu, O. Kisi, Flow prediction by three back propagation techniques using k-fold partitioning of neural network training data, Nord. Hydrol. 36 (1) (2005) 49–64.

[7] H.K. Cigizoglu, O. Kisi, Methods to improve the neural network performance in suspended sediment estimation, J. Hydrol. 317 (2006) 221–238.

[8] P. Coulibaly, F. Anctil, B. Bobée, Real time neural network based forecasting system for hydropower reservoirs, in: E.T. Miresco (Ed.), Proceedings of the First International Conference on New Information Technologies for Decision Making in Civil Engineering, University of Quebec, Montreal, Canada, 10–13 October, 1998, pp. 1001–1011.

[9] G. Cybenko, Approximation by superposition of a sigmoidal function, Math. Control Signals Syst. 2 (1989) 303–314.

[10] W.C. Dawson, R. Wilby, An artificial neural network approach to rainfall-runoff modeling, Hydrol. Sci. J. 43 (1) (1998) 47–66.

[11] M. Firat, M. Gungor, River flow estimation using adaptive neuro fuzzy inference system, Math. Comput. Simulat. 75 (2007) 87–96.

[12] O. Giustolisi, D. Laucelli, Improving generalization of artificial neural networks in rainfall–runoff modelling, Hydrol. Sci. J. 50 (3) (2005) 439–457.

[13] J.P. Haltiner, J.D. Salas, Short-term forecasting of snowmelt runoff using ARMAX models, Water Resour. Bull. 24 (5) (1988) 1083–1089.

[14] S.K. Jain, Development of integrated sediment rating curves using ANNs, ASCE J. Hydraulic Eng. 127 (1) (2001) 30–37.


[15] A.W. Jayawardena, P.C. Xu, F.L. Tsang, W.K. Li, Determining the structure of a radial basis function network for prediction of nonlinear hydrological time series, Hydrol. Sci. J. 51 (1) (2006) 21–44.

[16] O. Kisi, River flow modeling using artificial neural networks, ASCE J. Hydraulic Eng. 9 (1) (2004) 60–63.

[17] O. Kisi, Daily suspended sediment modeling using a fuzzy-differential evolution approach, Hydrol. Sci. J. 49 (1) (2004) 183–197.

[18] O. Kisi, Suspended sediment estimation using neuro-fuzzy and neural network approaches, Hydrol. Sci. J. 50 (4) (2005) 683–696.

[19] R. Linker, I. Seginer, Greenhouse temperature modeling: a comparison between sigmoid neural networks and hybrid models, Math. Comput. Simulat. 65 (2004) 19–29.

[20] R. Lippman, An introduction to computing with neural nets, IEEE ASSP Magazine 4 (1987) 4–22.

[21] H.R. Maier, G.C. Dandy, Modelling cyanobacteria (blue-green algae) in the River Murray using artificial neural networks, Math. Comput. Simulat. 43 (1997) 377–386.

[22] H.M. Nagy, K. Watanabe, M. Hirano, Prediction of sediment load concentration in rivers using artificial neural network model, ASCE J. Hydraulic Eng. 128 (6) (2002) 588–595.

[23] H. Raman, N. Sunilkumar, Multivariate modelling of water resources time series using artificial neural networks, Hydrol. Sci. J. 40 (2) (1995) 145–163.

[24] K.P. Sudheer, A.K. Gosain, K.S. Ramasastri, A data-driven algorithm for constructing artificial neural network rainfall-runoff models, Hydrol. Processes 16 (2002) 1325–1330.

[25] G. Tayfur, Artificial neural networks for sheet sediment transport, Hydrol. Sci. J. 47 (6) (2002) 879–892.

[26] G. Tayfur, V. Guldal, Artificial neural networks for estimating daily total suspended sediment in natural streams, Nord. Hydrol. 37 (2006) 69–79.