A genetic programming approach to suspended sediment modelling

A genetic programming approach to suspended sediment modelling

Journal of Hydrology (2008) 351, 288– 298 available at www.sciencedirect.com journal homepage: www.elsevier.com/locate/jhydrol A genetic programmin...

469KB Sizes 0 Downloads 91 Views

Journal of Hydrology (2008) 351, 288– 298

available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/jhydrol

A genetic programming approach to suspended sediment modelling Ali Aytek a b

a,*

¨ zgu ¨r Kisßi ,O

b

Gaziantep University, Civil Engineering Department, Hydraulics Division, 27310 Gaziantep, Turkey Erciyes University, Civil Engineering Department, Hydraulics Division, 38039 Kayseri, Turkey

Received 5 June 2007; received in revised form 29 November 2007; accepted 4 December 2007

KEYWORDS Suspended sediment load; Rating curves; Soft computing; Genetic programming

This study proposes genetic programming (GP) as a new approach for the explicit formulation of daily suspended sediment–discharge relationship. Empirical relations such as sediment rating curves are often applied to determine the average relationship between discharge and suspended sediment load. This type of models generally underestimates or overestimates the amount of sediment. During recent decades, some black box models based on artificial neural networks have been developed to overcome this problem. But these type of models are implicit that can not be simply used by other investigators. Therefore it is still necessary to develop an explicit model for the discharge–sediment relationship. It is aimed in this study, to develop an explicit model based on genetic programming. Explicit models obtained using the GP are compared with rating curves and multi-linear regression techniques in suspended sediment load estimation. The daily streamflow and suspended sediment data from two stations on Tongue River in Montana are used as case studies. The results indicate that the proposed GP formulation performs quite well compared to sediment rating curves and multi-linear regression models and is quite practical for use. ª 2007 Elsevier B.V. All rights reserved.

Summary

Introduction Correct estimation of sediment volume carried by a river is very important for many water resources projects. The prediction of river sediment load also constitutes an important issue in hydraulic and sanitary engineering. It is well known * Corresponding author. Tel.: +90 532 605 43 46; fax: +90 342 360 11 07. E-mail address: [email protected] (A. Aytek).

fact that all reservoirs are designed to a volume known as ‘‘the dead storage’’ to accommodate the sediment income that will accumulate over a specified period called the economic life. The underestimation of sediment yield results in insufficient reservoir capacities while the overestimation will lead to over-capacity reservoirs. Only the appropriate reservoir design and operation is sufficient to justify every effort to determine sediment yield accurately, but in sanitary engineering the prediction of river sediment load has an additional significance, especially if the particles also

0022-1694/$ - see front matter ª 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.jhydrol.2007.12.005

A genetic programming approach to suspended sediment modelling transport pollutants. On the other hand, the sediment can aggrades channel beds with excess sand and gravel for tens to hundreds of kilometers downstream. Such aggradations promote lateral migration of channels and may cause serious flooding during rainstorms, due to loss of channel capacity necessary to convey floodwaters. The assessment of the volume of sediment carries significance also for the flooding problem. A number of attempts have been made to relate the amount of sediment transported by a river to flow conditions such as discharge, velocity and shear stress. However, none of the equations derived have received universal acceptance. Usually, either the weight or the concentration of sediment is related to the discharge. These two forms are often used interchangeably. McBean and Al-Nassri (1988) examined this issue and concluded that the practice of using sediment load vs. discharge is misleading because the goodness of fit implied by this relationship is spurious. Instead they recommended that the regression link be established. Over the years, researchers have proposed several rating curves to determine the average relationship between discharge and suspended sediment load (Thomas, 1985; Asselman, 2000; Picoet et al., 2001; Overleir, 2004; Crowder et al., 2007). The physically based models are based on the simplified partial differential equations of flow and sediment flux as well as on some unrealistic simplifying assumptions for flow and empirical relationships for erosive effects of rainfall and flow. Examples of such models are presented by Wicks and Bathurst (1996), Kothyari et al. (1997), Refsgaard (1997), and others. They are highly sophisticated and complex models that have the advantages of having components that correspond to physical processes and of being theoretically capable of taking into account the spatial variation of catchment properties as well as uneven distribution of precipitation and evapotranspiration. The sophistication and complexity of the model should, however, be keyed to utilizable information about the catchment characteristics and density and frequency of the available input data. Especially because the real spatial distribution of precipitation is not presently measurable for much of the world, process-oriented distributed models offer no practically significant advantage over lumped models and have many practical disadvantages (Guldal and Muftuoglu, 2001). The focus of this paper is the use of genetic programming (GP) for suspended sediment modelling. GP has been applied to a wide range of problems in artificial intelligence, engineering and science applications, industrial, and mechanical models. GP can be successfully applied to areas, where (i) the interrelationships among the relevant variables are poorly understood (or where it is suspected that the current understanding may well be wrong), (ii) finding the size and shape of the ultimate solution is hard and a major part of the problem, (iii) conventional mathematical analysis does not, or cannot, provide analytical solutions, (iv) an approximate solution is acceptable (or is the only result that is ever likely to be obtained), (v) small improvements in performance are routinely measured (or easily measurable) and highly prized, (vi) there is a large amount of data, in computer readable form, that requires examination, classification, and integration (such as molecular biology for protein and DNA sequences, astronomical data,

289

satellite observation data, financial data, marketing transaction data, or data on the World Wide Web) (Banzhaf et al., 1998). It was observed that only a few studies existed in the literature related to the use of GP in the field of water resources engineering. Cousin and Savic (1997), Savic et al. (1999), Drecourt (1999), Whigham and Crapper (1999, 2001), Babovic and Keijzer (2002) applied GP to rainfall-runoff modeling. Babovic et al. (2001) applied GP to sedimentary particle settling velocity equations. Harris et al. (2003) studied on velocity predictions in compound channels with vegetated floodplains using GP. Dorado et al. (2003) studied on prediction and modeling of the rainfall-runoff transformation of a typical urban basin using artificial neural networks (ANNs) and GP. Giustolisi (2004) determined Chezy resistance coefficient in corrugated channels by using GP. Rabunal et al. (2007) determined the unit hydrograph of a typical urban basin using GP. Only two studies were observed for sediment modeling using GP approach; Babovic (2000) used experimental flume data utilized by Zyserman and Fredsoe (1994) and expressed a new formulation for bed concentration of suspended sediment. Kizhisseri et al. (2005) used GP methodology to explore a better correlation between the temporal pattern of fluid field and sediment transport by utilizing two datasets; one from numerical model results and other from Sandy Duck field data. The purpose of this study is to develop a mathematical model for estimation of suspended sediment load based on GP. For this aim, two sediment stations on Tongue River in Montana, USA, were used as case studies. The performance of the GP models was compared with sediment rating curves and multi-linear regression techniques.

Model development Overview of genetic programming In this section, a brief overview of the GP and Gene Expression Programming (GEP) is given for motivation. GP is first proposed by Koza (1992). It is a generalization of genetic algorithms (GAs) (Goldberg, 1989). GP starts with an initial population of randomly generated computer programs composed of functions and terminals appropriate to the problem domain. The functions may be standard arithmetic operations, standard programming operations, standard mathematical functions, logical functions, or domain-specific functions. Depending on the particular problem, the computer program may be boolean-valued, integer-valued, real-valued, complex-valued, vector-valued, symbolic-valued, or multiple-valued. GEP is, like GAs and GP, a genetic algorithm as it uses populations of individuals, selects them according to fitness, and introduces genetic variation using one or more genetic operators (Ferreira, 2006). GEP is an extension to GP that evolves computer programs of different sizes and shapes encoded in linear chromosomes of fixed length. One strength of the GEP approach is that the creation of genetic diversity is extremely simplified as genetic operators work at the chromosome level. Another strength of GEP consists of its unique, multigenic nature which allows the evolution of more complex programs composed of several subprograms. As a result GEP surpasses the old GP

¨ . Kisßi A. Aytek, O

290 system in 100–10,000 times (Ferreira, 2001a; Ferreira, 2001b). The fundamental difference between GAs, GP and GEP is due to the nature of the individuals: in GAs the individuals are linear strings of fixed length (chromosomes); in GP the individuals are nonlinear entities of different sizes and shapes (parse trees); and in GEP the individuals are encoded as linear strings of fixed length (the genome or chromosomes) which are afterwards expressed as nonlinear entities of different sizes and shapes (i.e., simple diagram representations or expression trees). There are five major preliminary steps for solving a problem by using GP; (i) ‘‘the set of terminals’’; the independent variables of the problem, the state variables of the system and the functions with no arguments (ii) ‘‘the set of functions’’; arithmetic operations, testing functions (such as IF and CASE statements) and boolean functions (iii) ‘‘the fitness measure’’ which identifies the way of evaluating how good a given program solves a particular problem (iv) ‘‘control parameters’’; the values of the numerical parameters and qualitative variables for controlling the run, and (v) ‘‘stop condition’’; the criterion for designating a result and terminating a run. The fundamental steps of the GEP are schematically represented in Fig. 1. The process begins with the random generation of the chromosomes of a certain number of individuals (the initial population). Then these chromosomes are expressed and the fitness of each individual is evaluated against a set of fitness cases (also called selection environment which, in fact, is the input to a problem). The individuals are then selected according to their fitness (their

performance in that particular environment) to reproduce with modification, leaving progeny with new traits. These new individuals are, in their turn, subjected to the same developmental process: expression of the genomes, confrontation of the selection environment, selection, and reproduction with modification. The process is repeated for a certain number of generations or until a good solution has been found (Ferreira, 2006). The reproduction in Fig. 1, sequentially, consists of replication, mutation, inversion, insertion sequence (IS) transposition, root IS (RIS) transposition, gene transposition, one-point recombination, twopoint recombination, and gene recombination.

Derivation of suspended sediment models based on GP Generally, selection of input parameters does not completely define the environment from which the system will learn. The researcher must also choose specific past examples from the learning domain. Each example should contain data that represent one instance of the relationship between the chosen inputs and the outputs. These examples are often referred to as ‘‘training cases’’ or ‘‘training instances’’ while they are called ‘‘fitness cases’’ in the case of GP. Collectively, all of the training instances are referred to as the ‘‘training set’’. Once the training set is selected, one could say that, the learning environment of the system is defined (Banzhaf et al., 1998). The procedure to model suspended sediment by using GP approach is as follows: the first step is the fitness function. For this problem, the fitness, fi, of an individual program, i, is expressed by fi ¼

Create Chromosomes of Initial Population

n X 

R  jP ðijÞ  T j j



ð1Þ

j¼1

Express Chromosomes Execute Each Program

Evaluate Fitness

Iterate or Terminate

Terminate

End

Iterate Best of Generation?

Yes

No Selection

Reproduction

New Chromosomes of Next Generation Figure 1

The Flowchart of gene expression programming.

where R is the range of selection, P(ij) is the value predicted by the individual program i for fitness case j and Tj is the target value for fitness case j. The absolute value corresponds the error and is called precision. Thus for a perfect fit, P(ij) = Tj for all fitness cases. In this problem the precision is arranged as 0.01 (equal or less then). The second step consists of choosing the set of terminals T and the set of functions F to create the chromosomes. In this problem, the terminal set obviously consists of the independent variables, i.e., St = {Qt, Qt1, Qt2. . .. . .. , St1, St2. . .., where Qt and St denote the discharge and suspended sediment load at time t, respectively}. The choice of the appropriate function set is not so obvious; however, a good guess can always be helpful in order to include all the necessary functions. In this study, four basic arithmetic operators (+, , *, /) and p some basic mathematical functions ( , ln(x), log(x), ex, x 10 , power) were utilized. The third step is to choose the chromosomal architecture, i.e., the length of the head and the number of genes. Length of the head, h = 8, and three genes per chromosome were employed. The fourth step is to choose the linking function. In this study, the sub expression trees (ETs) were linked by addition. Finally, the fifth step is to choose the set of genetic operators and their rates. In this case a combination of all the modification operators such as mutation, inversion, the three kinds of transposition, and the three kinds of recombination were

A genetic programming approach to suspended sediment modelling used. The parameters used per run are summarized as; chromosomes = 30, head size = 8, number of genes=3, linking function = addition, fitness function error type = mean absolute error, mutation rate = 0.044, inversion rate = 0.1, one-point recombination rate = 0.3, two-point recombination rate = 0.3, gene recombination rate = 0.1 and gene transposition rate = 0.1.

Sediment rating curve (SRC)

S ¼ aQ b

ð2Þ

in which Q is stream discharge and S is either suspended sediment load (Sandy, 1990). Values of a and b for a particular stream are determined from data via a linear regression between (log S) and (log Q). After log-transformation to the arithmetic domain and application of the Ferguson (1986) correction factor, the sediment load occurring at a specific discharge can be estimated using the following expression: S ¼ CF  a  Q b

ð3Þ

where CF is the log-transformation bias correction factor. Specifically, CF ¼ e2:65s

y ¼ a þ b1 x 1 þ b2 x 2 þ . . . þ bm x m

2

ð4Þ

where e is the exponential function and s is the standard error of the regression equation. In the applications, first sediment rating curve (Eq. (2)) is denoted as SRC1 and the second one with bias correction factor (Eq. (3)) is denoted as SRC2.

Multi-linear regression (MLR) If it is assumed that the dependent variable Y is effected by m independent variables X1, X2, . . . , Xm and a linear equation is selected for the relation among them, the regression equation of Y can be written as

ð5Þ

y in this equation shows the expected value of the variable Y when the independent variables take the values X1 = x1, X2 = x2, . . . , Xm = xm. The regression coefficients a, b1, b2, . . . , bm are evaluated, similar to simple regression, by minimizing the sum of the eyi distances of observation points from the plane expressed by the regression equation (Bayazıt and Oguz, 1998) N X

A rating curve consists of a graph or equation, relating sediment discharge or concentration to stream discharge, which can be used to estimate sediment loads from the streamflow record. The sediment rating curve generally represents a functional relationship of the form

291

e2yi ¼

i¼1

N X ðy i  a  b1 x 1i  b2 x 2i  bm x mi Þ2

ð6Þ

i¼1

In this study, the coefficients a, b1, b2, . . . , bm are determined using least squares method.

Data The data set used in this study was obtained from US Geological Survey (USGS). The time series of daily streamflow and suspended sediment data from two stations on the Tongue River in Montana, USA, are used; the upstream station (station no: 6307830) below Brandenberg Bridge near Ashland and the downstream station (station no: 6308500) at Miles City. The drainage areas at these sites are 10.521 km2 for the upstream station and 13.932 km2 for the downstream station. Information on the daily time series for these stations can be acquired from the USGS web server (http:// webserver.cr.usgs.gov/sediment). The data of October 01, 1977–September 30, 1980 (75% of the whole data) were chosen for calibration and the data of October 01, 1980– September 30, 1981 (25% of the whole data) were chosen for validation. The periods from which calibration and validation data were chosen span the same temporal seasons (October–September). The scatter plots of the downstream and upstream station data are given in Fig. 2. It can be seen that there is a nonlinear and scattered relationship between discharge and sediment data for both stations. Fig. 2 seems to indicate the presence of outliers (also seem in Table 1). In the downstream data set, the suspended sediment load values of 84,400 ton day1 and 81,600 ton day1 are observed while the other sediment values are below 50,000 ton day1. In the upstream data set, a suspended sediment load

90000

30000 25000

70000

Sediment (ton day-1)

Sediment (ton day-1)

80000

60000 50000 40000 30000 20000

20000 15000 10000 5000

10000 0

0

0

100

200

Discharge (m 3s -1)

Figure 2

300

0

100

200

Discharge (m 3s -1)

Scatterplots of the (a) downstream and (b) upstream data.

300

¨ . Kisßi A. Aytek, O

292 Table 1

The daily statistical parameters of data set for the stations

Data set

Station

Basin area (km2)

Data type

xmean

Sx

Cv (Sx/xmean)

Csx

xmax

xmin

xmax xmean

Training

Downstream (6308500) Upstream (6307830)

13,932

Flow (m3 s1) Sediment (ton day1) Flow (m3 s1) Sediment (ton day1)

15.6 996 15.1 444

24.9 4860 20.6 1726

1.60 4.88 1.36 3.89

4.1 11.5 4.4 8.5

218 84,400 215 27,200

1.9 3.4 1.8 1.5

14 85 14 61

Downstream (6308500) Upstream (6307830)

13,932

Flow (m3 s1) Sediment (ton day1) Flow (m3 s1) Sediment (ton day1)

9.0 369 10.3 235

12.9 1235 12.5 835

1.43 3.35 1.21 3.55

3.1 4.0 3.1 5.4

62.3 7400 62.9 8000

0.1 0.1 1.9 3

7 20 6 34

Testing

Table 2

10,521

10,521

The RMSEs and determination coefficients of GP models in test period

Input combinations

Upstream station 1

(i) Qt (ii) St1 (iii) Qt and St1 (iv) St1 and St2 (v) Qt, St1, and St2 (vi) Qt, Qt1, and St1 (vii) Qt, Qt1, St1, and St2

Downstream station 2

RMSE (ton day )

R

421 333 331 333 348 231 262

0.767 0.851 0.848 0.842 0.851 0.941 0.903

value of 27,200 ton day1 is observed while the other values are below 20,000 ton day1. These outliers are also used in the calibration period. These values give an additional difficulty to the models in estimation. The models calibrated using such outliers generally give overestimations of the low sediment values. The daily statistical parameters of the streamflow and sediment data for the stations are given in Table 1. In the table, the xmean, Sx, Cv, Csx,xmax and xmin denote the mean, standard deviation, coefficient of variation, skewness, maximum and minimum, respectively. The skewness and coefficient of variation of flow and sediment data of the downstream and upstream stations are high, particularly for the training (calibration) data. In the calibration flow data, xmin and xmax values fall in the ranges 1.9–218 m3s1 for the downstream. However, the testing flow data set extremes are xmin = 0.1 m3 s1, xmax = 62.3 m3 s1. The value of xmin for the calibration flow data is higher than that for the corresponding validation set for the downstream station. This may cause extrapolation difficulties in estimation of low sediment values. The maximum–mean ratios (xmax/xmean) for sediment series in the training period are also quite high (85 and 61 for the downstream and upstream, respectively).

Application and results Several input combinations (Table 2) are tried using GP to estimate suspended sediment load for the upstream and downstream stations. The RMSE and determination coefficient of GP models in test period are given in Table 2 for the upstream station. As seen from the table, the GP model whose inputs are current discharge and one previous discharge and suspended sediment load (input combination

RMSE (ton day1)

R2

621 343 381 357 398 331 338

0.878 0.937 0.930 0.937 0.932 0.934 0.933

(vi)) has the lowest RMSE (231) and the highest R2 (0.941). The GP performance for the first input combination (only current discharge) is the worst due to the hysteresis effect between sediment load and discharge. That is to say, the suspended sediment for a given level of streamflow discharge in the rising stage of a streamflow hydrograph is greater than in the falling stage. This confirms that the practice of using sediment load vs. discharge is misleading as stated by McBean and Al-Nassri (1988). Fig. 3 shows the expression tree of GP model formulation (input combination (vi)) for the upstream station which actually is St ¼ ðdð0Þ  sqrtðdð0ÞÞÞ þ dð0Þ þ absððððdð0Þ  G2C0Þ þ ðdð2Þ=dð1ÞÞÞ  ððdð0Þ  dð1ÞÞ  ðG2C1=dð1ÞÞÞÞÞ þ ðððdð0Þ þ dð0ÞÞ þ dð2ÞÞ=dð1ÞÞ =ððG3C0 þ G3C1Þ=absðdð0ÞÞÞ

ð7Þ

where the actual parameters are dð0Þ ¼ Q t ;

dð1Þ ¼ Q t1

and dð2Þ ¼ St1

ð8Þ

and the constants in the formulation are G1C0 ¼ 6:058; G1C1 ¼ 9:641; G2C0 ¼ 5:135; G2C1 ¼ 9:641; G3C0 ¼ 1:636;

G3C1 ¼ 0:021

ð9Þ

After putting the corresponding values, the final equation becomes        5:135Q t þ St1 x 9:641 Q t  Q t1  þ Q þ St ¼ Q 3=2 t t   Q Q t1  t1    2Q t þ St1 Qt þ ð10Þ x Q t1 1:657

A genetic programming approach to suspended sediment modelling

Figure 3

293

Expression tree for upstream station.

SRC1, SRC2 and MLR formulas obtained for the upstream station are S ¼ 0:4296  Q 2:1022

ð11Þ

S ¼ 0:4296CF  Q 2:1022 S ¼ 131:9  Q t  110:52  Q t1 þ 0:641  St1  163:1

ð12Þ ð13Þ

CF is calculated as 1.389 in Eq. (12). The GP model is compared with the SRC1, SRC2 and MLR models in Table 3 for the upstream station. It can be obviously seen from this table that the GP model performs much better than the rating curve techniques. However, the MLR technique has a RMSE value slightly better than the GP model. The accuracy of the SRC2 model seems to be better than the SRC1 model. It can be observed from Tables 2 and 3 that the performance of the GP model (input combination (i)) is inferior than the SRC2 technique when input to both GP and rating curve model is only the current discharge. In this case, the RMSE and R2 values for the GP model are 421 and 0.767 respectively. On the other hand the RMSE and R2 for the SRC2 model are 396 and 0.777. The suspended sediment estimates of the GP and SRC models are shown in Fig. 4 in the form of hydrograph and scatterplot. As seen from the figure that the GP model approximates the corresponding observed suspended sedi-

ment values better than the rating curve and MLR techniques. The MLR also performs better than the SRC1 and SRC2 models. However, the MLR method provides negative approximations for some of the low sediment values, whereas the other model estimates are all positive. The SRC2 seems to have a better accuracy than the SRC1 model. The estimation of total sediment load is also considered for comparison due to its importance in reservoir management. The GP model estimated the observed total sediment load of 85,731 ton as 93,213 ton, with overestimations of 8.7%, while the SRC1, SRC2 and MLR methods respectively computed total sediment load as 59,754 ton, 82,972 ton and 75,841 ton, with underestimations of 30.3%, 3.2% and 11.5%. The SRC2 model provides better total sediment load estimate than the GP model. For the downstream station, the RMSE and determination coefficient of GP models are given in Table 2. Alike the preceding application, the GP model provides best accuracy for the input combination (vi). The GP model whose input is only the current discharge performs the worst as found in the upstream station. Fig. 5 shows the expression tree of GP model formulation (input combination (vi)) for the downstream station which actually is

¨ . Kisßi A. Aytek, O

294 Table 3 The RMSEs and determination coefficients of GP models in test period Model

Upstream station

Downstream station

1

R

RMSE (ton day1)

R2

231 465 396 226

0.941 0.777 0.777 0.928

331 650 449 508

0.934 0.878 0.878 0.897

dð0Þ ¼ Q t ;

dð1Þ ¼ Q t1

8000

Observed GP

6000

y = 0.813x + 64.441 R2 = 0.941

4000 2000

2000

0

0 30

60

0

90 120 150 180 210 240 270 300 330 360

8000

8000

Observed SRC1

y = 0.5272x + 39.871 R2 = 0.777

6000

SRC1

6000

2000 4000 6000 8000

Observed

Time (Day)

Sediment (ton/day)

and dð2Þ ¼ St1

6000

4000

0

4000

4000 2000

2000

0 0

0 0

30

60

Observed 8000

8000

y = 0.7321x + 55.364 R2 = 0.777

Observed

6000

SRC2

SRC2

Sediment (ton/day)

2000 4000 6000 8000

90 120 150 180 210 240 270 300 330 360

Time (Day)

6000 4000

4000 2000

2000

0

0 0

30

60

0

90 120 150 180 210 240 270 300 330 360

Time (Day)

y = 0.9429x - 13.694 R2 = 0.928

8000 Observed

6000

MLR

MLR

4000

4000

2000

2000

0

0 0

-2000

Figure 4

30 60

2000 4000 6000 8000

Observed

8000 6000

90 120 150 180 210 240 270 300 330 360

Time (Day)

ð14Þ

where the actual parameters are

GP

Sediment (ton/day)

8000

Sediment (ton/day)

GP SRC1 SRC2 MLR

RMSE (ton day )

2

St ¼ ðdð0Þ=ððððdð1Þ  G1C0Þ  ðG1C0=dð0ÞÞÞ =ðdð0Þ  G1C1ÞÞ  ð1:0=3:0ÞÞÞ þ ððððG2C0 þ dð2ÞÞ  ðdð1Þ  dð0ÞÞÞ=ððG2C0  G2C1Þ  ðdð1Þ  2ÞÞÞ  2Þ þ ðððdð0Þ  ðsqrtðdð2ÞÞ =G3C1ÞÞ=absðdð1ÞÞÞ  2Þ

0

2000 4000 6000 8000

-2000

Observed

The observed and estimated suspended sediments of the upstream station in test period.

ð15Þ

A genetic programming approach to suspended sediment modelling

SRC1, SRC2 and MLR formulas obtained for the downstream station are

and the constants in the formulation are G1C0 ¼ 0:055; G2C0 ¼ 1:415; G3C0 ¼ 2:526;

295

G1C1 ¼ 7:995;

S ¼ 0:7066  Q 2:0589

G2C1 ¼ 4:465; G3C1 ¼ 1:379

ð16Þ

After putting the corresponding values, the final equation becomes  1=3 0:0031Q t1 St ¼ Q t Qd t ðQ t þ 7:995Þ !2 ð1:4149 þ St1 ÞðQ t1  Q t Þ þ 0:1075 Q 2t1 !2 Q t S1=2 t1 þ 0:5257 ð17Þ Q 2t1

Figure 5

2:0589

S ¼ 0:7066CF  Q S ¼ 401:11  Q t  362:32  Q t1 þ 0:661  St1  272:8

ð18Þ ð19Þ ð20Þ

CF is calculated as 1.496 in Eq. (19). The comparison between the GP and the other models are presented in Table 3. For the downstream station also the GP model shows better accuracy than the rating curve and MLR techniques. Here also the SRC2 with bias term performs better than the SRC1 model. In case of downstream data also, the performance of the SRC2 model is better than the GP model when input to the models is only the current discharge (see Tables 2 and 3). It seems that the perfor-

Expression tree for downstream station.

¨ . Kisßi A. Aytek, O

296 10000 Observed

8000

y = 0.864x + 89.106 R2 = 0.934

GP

6000

GP

Sediment (ton/day)

10000

5000

4000 2000

0

0 0

30

0

60 90 120 150 180 210 240 270 300 330 360

10000 SRC1

SRC1

Sediment (ton/day)

y = 0.5266x + 22.567 R2 = 0.878

Observed

8000 6000 4000

5000

2000

0

0 0

30

0

60 90 120 150 180 210 240 270 300 330 360

8000

Observed

8000

SRC2

6000 4000

4000 2000

2000

0

0 0

30

0

60 90 120 150 180 210 240 270 300 330 360

8000 Observed

6000

MLR

MLR

4000

4000

2000

2000

0

0 0

Figure 6

30 60

10000

y = 1.1403x - 98.578 R2 = 0.897

8000

6000

5000

Observed

Time (Day)

-2000

10000

y = 0.7878x + 33.756 R2 = 0.878

6000

SRC2

Sediment (ton/day)

10000

5000

Observed

Time (Day)

Sediment (ton/day)

10000

Observed

Time (Day) 10000

5000

90 120 150 180 210 240 270 300 330 360

Time (Day)

0

2000 4000 6000 8000

-2000

Observed

The observed and estimated suspended sediments of the downstream station in test period.

mance of the GP model is not superior when level of input is same for both GP model and rating curve technique. The suspended sediment estimates of GP, SRC and MLR models and observed values are compared in Fig. 6. It can be seen from this figure that the GP model estimates are closer to the observed suspended sediment values than those of the rating curve and MLR techniques (especially the SRC1). The SRC2 seems to be much better than the SRC1 model. The negative estimates of the MLR model are obviously seen

from the hydrograph and scatterplot graphs. The total sediment load estimates of the GP, SRC1, SRC2 and MLR methods are 10.5% higher and 41.2%, 12.1% and 12.7% lower, respectively, than the observed value of 134,689 ton.

Conclusions This study indicates the ability of genetic programming (GP) technique to model the streamflow-suspended sedi-

A genetic programming approach to suspended sediment modelling ment relationship. The GP model is explicit and simple that can be used by anyone not necessarily being familiar with GP. The model gives a practical way for sediment estimation to obtain accurate results and encourages use of GP in other aspects of water engineering studies. The suspended sediment estimates based on GP models are compared with two different sediment rating curves and multi-linear regression. The results obtained with GP models are better than those obtained using the conventional rating curve and multi-linear regression techniques and confirm the ability of this approach to provide a useful tool in solving specific problems in hydrology, such as suspended sediment estimation. The results suggest that the GP approach may provide a superior alternative to the sediment rating curve and multi-linear regression techniques. The MLR method was found to provide negative approximations for some of the low sediment values, whereas the other model estimates are all positive. The difficulties in the estimation of suspended sediment load using only current discharge, resulting from the hysteresis effect, are also indicated. The SRC2 model with a bias term is found to be much better than the unbiased SRC1 model. The results also indicate that the SRC2 model may provide better performance than the GP in the estimation of total sediment load. The study only used data from two areas and further work using more data from various areas may be required to strengthen these conclusions. In the present study, genetic programming that is a generalization of genetic algorithms was used for the explicit formulation of daily suspended sediment–discharge relationship. Other optimization techniques such as Tabu Search, Differential Evolution, Particle Swarm and Ant Colony may also be used for the derivation of formulas instead of genetic programming and their accuracies may be compared with each other. This may be a subject of another study.

Acknowledgements The first author would like to thanks the Research Foundation of Gaziantep University for the support provided during this study. The data used in this study were downloaded from the web server of the USGS. The authors wish to thank the staff of the USGS who are associated with data observation, processing, and management of USGS Web sites.

References Asselman, N.E.M., 2000. Fitting and interpretation of sediment rating curves. J. Hydrol. 234, 228–248. Babovic, V., 2000. Data mining and knowledge discovery in sediment transport. Comput.-Aided Civ. Infrastruct. Eng. 15 (5), 383–389. Babovic, V., Keijzer, M., 2002. Declarative and preferential bias in GP-based scientific discovery. Genet. Program. Evol. Mach. 3 (1). Babovic, V., Keijzer, M., Aguilera, D.R., Harrington, J., 2001. Automatic Discovery of Settling Velocity Equations. D2K Technical Report, D2K-0201-1. Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.D., 1998. Genetic Programming. Morgan Kaufmann, San Francisco, CA.

297

Bayazıt, M., Oguz, B., 1998. Probability and Statistics for Engineers. Birsen Publishing House, Istanbul, Turkey, p. 159. Cousin, N., Savic, D.A., 1997. A rainfall-runoff model using genetic programming. Centre for Systems and Control Engineering, Report No. 97/03, School of Engineering, University of Exeter, Exeter, United Kingdom, p. 70. Crowder, D.W., Demissie, M., Markus, M., 2007. The accuracy of sediment loads when log-transformation produces nonlinear sediment load–discharge relationships. J. Hydrol. 336, 250– 268. Dorado, J., Rabunal, J.R., Pazos, A., Rivero, D., Santos, A., Puertas, J., 2003. Prediction and modelling of the rainfall-runoff transformation of a typical urban basin using ANN and GP. Appl. Artif. Intell. 17, 329–343. Drecourt, J.P., 1999. Application of neural networks and genetic programming to rainfall-runoff modeling. D2K Technical Report 0699-1-1, Danish Hydraulic Institute, Denmark. Ferguson, R.I., 1986. River loads underestimated by rating curves. Water Resour. Res. 22 (1), 74–76. Ferreira, C., 2001a. Gene expression programming in problem solving. In: 6th Online World Conference on Soft Computing in Industrial Applications (invited tutorial). Ferreira, C., 2001b. Gene expression programming: a new adaptive algorithm for solving problems. Complex Syst. 13 (2), 87–129. Ferreira, C., 2006. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer, Berling Heidelberg New York, 478 p. Giustolisi, O., 2004. Using genetic programming to determine Chezy resistance coefficient in corrugated channels. J. Hydroinform., 157–173. Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, Mass. Guldal, V., Muftuoglu, R.F., 2001. 2D unit sediment graph theory. J. Hydrol. Eng. 6 (2), 132–140. Harris, E.L., Babovic, V., Falconer, R.A., 2003. Velocity predictions in compound channels with vegetated floodplains using genetic programming. Int. J. River Basin Manage. 1 (2), 117–123. Kizhisseri, A.S., Simmonds, D., Rafiq, Y., Borthwick, M., 2005. An Evolutionary computation approach to sediment transport modeling. In: Fifth International Conference on Coastal Dynamics, April 4–8, 2005, Barcelona, Spain. Kothyari, U.C., Tiwari, A.K., Singh, R., 1997. Estimation of temporal variation of sediment yield from small catchments through the kinematic method. J. Hydrol., Amsterdam 203, 39– 57. Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by means of Natural Selection. The MIT Press, Cambridge, MA. McBean, E.A., Al-Nassri, S., 1988. Uncertainty in suspended sediment transport curves. J. Hydr. Eng. ASCE 114 (1), 63–74. Overleir, A.P., 2004. Accounting for heteroscedasticity in rating curve estimates. J. Hydrol. 292, 173–181. Picoet, C., Hingray, B., Olivry, J.C., 2001. Empirical and conceptual modeling of the suspended sediment dynamics in large tropical African River: The Upper Niger River Basin. J. Hydrol. 250, 19– 39. Rabunal, J.R., Puertas, J., Suarez, J., Rivero, D., 2007. Determination of the unit hydrograph of a typical urban basin using genetic programming and artificial neural networks. Hydrol. Process. 21, 476–485. Refsgaard, J.C., 1997. Parameterization, calibration and validation of distributed hydrological models. J. Hydrol., Amsterdam 198, 69–97. Sandy, R., 1990. Statistics for Business and Economics. McGraw-Hill Publishing, New York, p. 1117. Savic, A.D., Walters, A.G., Davidson, J.W., 1999. A genetic programming approach to rainfall-runoff modeling. Water Resour. Manage. 13, 219–231.

298 Thomas, R.B., 1985. Estimating total suspended sediment yield with probability sampling. Water Resour. Res. 21, 1381–1388. Wicks, J.M., Bathurst, J.C., 1996. SHESED: a physically based, distributed erosion and sediment yield component for the SHE hydrological modeling system. J. Hydrol. 175, 213–238. Whigham, P.A., Crapper, P.F., 1999. Time series modeling using genetic programming: an application to rainfall-runoff models.

¨ . Kisßi A. Aytek, O In: Spector, L. et al. (Eds.), Advances in Genetic Programming. The MIT Press, Cambridge, MA, pp. 89–104. Whigham, P.A., Crapper, P.F., 2001. Modeling rainfall-runoff using genetic programming. Math. Comput. Model. 33, 707–721. Zyserman, J.A., Fredsoe, J., 1994. Data analysis of bed concentration of suspended sediment. J. Hydr. Eng., ASCE 120 (9), 1021– 1042.