Accepted Manuscript

Research papers

Physically sound formula for longitudinal dispersion coefficients of natural rivers

Yu-Fei Wang, Wen-Xin Huai, Wei-Jie Wang

PII: S0022-1694(16)30774-0
DOI: http://dx.doi.org/10.1016/j.jhydrol.2016.11.058
Reference: HYDROL 21672
To appear in: Journal of Hydrology
Received Date: 13 July 2016
Revised Date: 23 November 2016
Accepted Date: 27 November 2016
Please cite this article as: Wang, Y-F., Huai, W-X., Wang, W-J., Physically sound formula for longitudinal dispersion coefficients of natural rivers, Journal of Hydrology (2016), doi: http://dx.doi.org/10.1016/j.jhydrol.2016.11.058
Physically sound formula for longitudinal dispersion coefficients of natural rivers

Yu-Fei Wang, Wen-Xin Huai*, Wei-Jie Wang

State Key Laboratory of Water Resources and Hydropower Engineering Science, Wuhan University, Wuhan, Hubei 430072, China.

*Corresponding author: Wen-Xin Huai, State Key Laboratory of Water Resources and Hydropower Engineering Science, Wuhan University, Wuhan 430072, China ([email protected])
Abstract: The longitudinal dispersion coefficient (k) is needed in a wide range of mass transport applications in fluids, but a general formulation for k remains lacking. In this study, we propose a canonical form for k that reflects the physics of dispersion and suits the complex flow conditions encountered in natural streams. This general form is much more concise than previous predictors. A predictor for k in natural streams is also obtained using genetic programming (GP) without pre-specified correlations among field data or a pre-specified form of the predictor. This predictor is physically sound (i.e. it exhibits the aforementioned canonical form) and appears to be commensurate with or better than previous estimates of k. A grey model, which measures the proximity of data to a target shape (i.e. the proposed physically sound form), is also used to verify that the canonical form is appropriate. A formulation for k in natural rivers is obtained by utilising GP. Its form is consistent with the canonical form.
Keywords: Canonical form, Contaminant transport, Genetic programming, Longitudinal dispersion coefficients, Natural rivers
1. Introduction
Understanding the transport of matter (mass, momentum and heat) in a fluid (gas or liquid) is necessary in many applications, such as contamination control, sediment deposition, flow with vegetation, water intake, and thermal discharge (Burn and Meiburg, 2012; Cassol et al., 2009; Chen et al., 2011; Deng et al., 2001; Escobar, 2015; Guerrero and Skaggs, 2010; Jin et al., 2015; Mino et al., 2013). In the ideal case of passive scalars (mass) in still water, the mass flux q is given by Fick’s law, q = −D dc/dx, where c is the scalar concentration, x is the distance along the longitudinal direction and D is the diffusion coefficient caused by Brownian motion, which is influenced by the fluid temperature and the size of the molecules of scalar c. In a moving fluid, the transport of scalar mass is conventionally explored in a coordinate system moving at the same average velocity as the fluid but without changing the molecular properties of c (i.e. diffusion coefficient = D). However, the effective ‘diffusion coefficient’, defined as −q/(dc/dx) in moving fluids along x, appears to be much larger than D (Abderrezzak et al., 2015; Aris, 1959; Chen et al., 2012; Fischer et al., 1979; Ng and Zhou, 2012; Taylor, 1953) and is the theme of the present study. This ‘virtual diffusion coefficient’, which is commonly referred to as the longitudinal dispersion coefficient k, is not associated with molecular motion. It is the result of macroscopic flow properties associated with the average of the advective acceleration in the longitudinal direction and bulk mixing in the lateral direction (Taylor, 1953; Wu and Chen, 2014a, 2014b; Zeng et al., 2015).
Under certain circumstances, k can be derived by solving the advection–diffusion equation: in laminar flow within a circular pipe, in turbulent flow within a circular pipe, in laminar flow in an elliptical pipe, in laminar flow between two planes, in laminar flow in an open channel and in turbulent flow in an open channel. However, k in natural streams does not have a complete theoretical predictor because natural streams have many irregular factors, such as dead zones and vegetation (Huai et al., 2012; Lees et al., 2000).
Nevertheless, because dispersion in natural rivers shows the same mechanics (i.e. k results from the combination of concentration and velocity gradients in the lateral direction), this study begins with dispersion in laminar flow in a circular pipe (the deduction for dispersion in natural rivers can be found in Text S1). We attempt to discover the generality among the analytical formulae and discuss the dispersion in the aforementioned circumstances. These analytical formulae are the foundation for obtaining the general formula for k because they are all theoretical solutions and share an interesting identity.
One of the earliest formulations for k, put forward by Taylor (1953), was derived for laminar flow in a pipe. Both theoretical and experimental results showed that k/D >> 1 as a result of the mean velocity gradient in the cross section. The theoretical solution for k is obtained by solving the advection–diffusion equation given as
$$D\left(\frac{\partial^2 c}{\partial r^2} + \frac{1}{r}\frac{\partial c}{\partial r} + \frac{\partial^2 c}{\partial x^2}\right) = \frac{\partial c}{\partial t} + u\frac{\partial c}{\partial x} \qquad (1)$$
The velocity (u) distribution in a laminar pipe can be derived from the Navier–Stokes equation in radial coordinates; it is given as u = u0(1 − r²/a²), where r is the radial distance from the centre line, a is the radius of the pipe and u0 is the velocity along the centre line (i.e. maximum velocity). Assuming quasi-steady state conditions and noting that the diffusion gradients along r are much larger than their counterparts along x (i.e. boundary-layer approximation) in long pipes, the third term on the left-hand side and the first term on the right-hand side of Equation (1) can be neglected. By setting z1 = r/a and x1 = x − u0t/2, the advection–diffusion equation for the coordinate system moving at a mean speed U (= u0/2) is
$$\frac{\partial^2 c}{\partial z_1^2} + \frac{1}{z_1}\frac{\partial c}{\partial z_1} = \frac{a^2 u_0}{D}\left(\frac{1}{2} - z_1^2\right)\frac{\partial c}{\partial x_1} \qquad (2)$$
The formula for k in a laminar pipe flow (see Formula (T1.a) in Table 1) is obtained by solving Equation (2) with the boundary condition ∂c/∂z1 = 0 at z1 = 1 and by letting ∂c/∂x1 be independent of z1. However, k in a turbulent circular pipe flow is different from that in the laminar circular pipe case. The differences arise because the mean velocity distribution has no complete theoretical formula and because, as a result of turbulence, the mixing coefficient in the lateral direction differs from the diffusion coefficient in laminar flow. Taylor (1954) obtained k for a turbulent circular pipe flow using two experimental results. (1) The empirical velocity distribution over a cross section of a pipe was (u0 − u)/u∗ = f(z1), where u∗ is the shear velocity given as u∗ = (τ0/ρ)^{1/2} = (gaJ/2)^{1/2} for the pipe; here, τ0 is the shear stress on the wall, g is the gravitational acceleration and J is the energy slope (f(z1) is omitted here) (Taylor, 1954). (2) Reynolds analogy (the transfers of mass, momentum and heat by turbulence can be connected) was used, i.e. εr = τ/(ρ∂u/∂r) = −qr/(∂c/∂r), where εr is the transfer coefficient (a turbulent mixing coefficient in the radial direction, which is much larger than D), τ is the shear stress at radius r and qr is the concentration flux in the radial direction (Taylor, 1954). By substituting the diffusion coefficient D with the radial mixing coefficient εr and using the experimental mean velocity distribution, k for a turbulent circular pipe flow can be obtained. Specifically, k (see Formula (T2.a) in Table 1) comprises two parts, namely, longitudinal advection and longitudinal mixing, the combination of which results in the longitudinal dispersion (Taylor, 1954).
Aris (1956) obtained a formula for k (see Formula (T3.a) in Table 1) for a laminar elliptical pipe flow; in this formula, the ratio of the major axis (a1) to the minor axis (b1) is rs. The formula recovers the one obtained by Taylor (1953) (see Formula (T1.a) in Table 1) when rs = 1. Extensive studies have also been conducted on k for the laminar flow between two wide planes, the distance between which is Ht, with consideration of the velocity variations in the direction perpendicular to the planes. Another formula for k (see Formula (T4.a) in Table 1) was given by Dewey and Sullivan (1979), and Formula (T4.a) reduces to a formula for k for a laminar flow in an open channel (see Formula (T5.a) in Table 1) when depth H is half of Ht (Chatwin and Sullivan, 1982).
Only considering velocity variations along the channel depth, Elder (1959) obtained a formula for k (see Formula (T6.a) in Table 1) through Reynolds analogy and by using a prescribed mean velocity distribution based on experiments. By using the N-zone model, Chikwendu (1986) obtained the same formula as those given by Taylor (1953), Elder (1959) and Chatwin and Sullivan (1982) (see Formulae (T7.a), (T7.b) and (T7.c) in Table 1). The formula obtained did not omit the considerably small longitudinal diffusion.
Fischer (1967) found that k in natural rivers is caused by depth-average velocity gradient in the transverse direction because of the large width-to-depth ratio (rBH) (for the field data in Table S1, the mean value of rBH is 72.7) (Figure 1). The longitudinal dispersion caused by the velocity gradient in the depth direction can be neglected compared to that caused by velocity gradient in the width direction (Fischer, 1975), so velocity gradient in the width direction is the major factor (see the top view in Figure 1). The mixing in the depth direction, which is obtained by
Reynolds analogy (Elder, 1959), is assumed to take place in a short time (see the longitudinal section in Figure 1). This is because the depth is quite small compared with the width. By using the method of Taylor (1953, 1954), Fischer (1967) derived a triple integral expression to predict k; this expression is given as
$$k = -\frac{1}{A}\int_0^B h(y)\,u'(y)\int_0^y \frac{1}{\varepsilon_y(y)\,h(y)}\int_0^y h(y)\,u'(y)\,\mathrm{d}y\,\mathrm{d}y\,\mathrm{d}y \qquad (3)$$
where A is the cross-sectional area, B is the channel width, h(y) is the local flow depth, y is the lateral direction, u′(y) is the deviation of the local depth-averaged longitudinal velocity from the cross-sectional mean velocity U (i.e. u′(y) = u(y) − U) and εy(y) is the local transverse mixing coefficient. Fischer et al. (1979) suggested that εy(y) = 0.15h(y)u∗ in uniform flows, where u∗ = (gHs)^{1/2} is the friction velocity, s is the bed slope and H is the mean water depth. On the basis of the above equation and experiments, Fischer (1975) provided a predictor for k in natural rivers as a result of u(y) (Formula (T8.a) in Table 1). Similar studies on k in natural rivers were conducted and reviewed elsewhere (Bogle, 1997; Liu, 1977) (see Table 1). The predictions obtained by these studies show obvious variations. The process of obtaining k in a rectangular flume is provided in supporting information Text S1 for completeness.
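As an illustration of how a triple integral of this kind can be evaluated from a measured cross section, the sketch below applies the trapezoidal rule to the commonly quoted form of Fischer's expression, k = −(1/A) ∫ h u′ ∫ 1/(εy h) ∫ h u′ dy dy dy. The function names, the rectangular channel and the cosine velocity profile are hypothetical choices made only so that the result can be checked against a closed-form value; they are not taken from the paper.

```python
import math

def cumulative_trapezoid(x, f):
    """Running trapezoidal integral of f over x, starting from zero."""
    out = [0.0]
    for i in range(1, len(x)):
        out.append(out[-1] + 0.5 * (f[i] + f[i - 1]) * (x[i] - x[i - 1]))
    return out

def fischer_k(y, h, u, eps_y):
    """Evaluate the triple integral for k on lateral stations y.

    h: local depth, u: depth-averaged velocity, eps_y: transverse mixing
    coefficient (all lists of the same length as y; h must be non-zero).
    """
    area = cumulative_trapezoid(y, h)[-1]
    discharge = cumulative_trapezoid(y, [d * v for d, v in zip(h, u)])[-1]
    u_mean = discharge / area                      # cross-sectional mean U
    hu_dev = [d * (v - u_mean) for d, v in zip(h, u)]   # h(y) u'(y)
    inner = cumulative_trapezoid(y, hu_dev)
    middle = cumulative_trapezoid(
        y, [q / (e * d) for q, e, d in zip(inner, eps_y, h)])
    outer = cumulative_trapezoid(
        y, [f * m for f, m in zip(hu_dev, middle)])[-1]
    return -outer / area

# Hypothetical rectangular channel: width 10 m, depth 1 m, constant mixing.
B, H, n = 10.0, 1.0, 2001
y = [B * i / (n - 1) for i in range(n)]
h = [H] * n
u = [1.0 + 0.5 * math.cos(math.pi * yi / B) for yi in y]
eps_y = [0.1] * n

k_numeric = fischer_k(y, h, u, eps_y)
# For u' = U0 cos(pi y / B) with constant h and eps_y, the integral gives
# k = U0^2 B^2 / (2 pi^2 eps_y).
k_exact = 0.5 ** 2 * B ** 2 / (2.0 * math.pi ** 2 * 0.1)
print(k_numeric, k_exact)
```

For a natural cross section, h, u and εy would come from field measurements, and stations with zero depth at the banks would need to be excluded or treated separately.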
The aforementioned analytical formulae exhibit a similar form, as discussed in Section 2.1. Current studies focus on longitudinal dispersion in natural rivers (mainly caused by velocity variation in the width direction), and researchers have utilised genetic models to obtain the predictor for k in natural rivers (Seo and Cheong, 1998; Deng et al., 2001; Kashefipour and Falconer, 2002; Rajeev and Dutta, 2009; Azamathulla and Ghani, 2011; Etemad-Shahidi and Taghipour, 2012; Sahay, 2013; Zeng and Huai, 2014; Disley et al., 2015; Sattar and Gharabaghi, 2015). Obtaining a predictor from experimental data has become popular in hydraulic engineering (Najafzadeh et al., 2016a; Najafzadeh et al., 2016b; Najafzadeh et al., 2016c). Empirical predictors are obtained on the basis of the form of k given by Seo and Cheong (1998) by using the M5’ model tree, genetic algorithm (GA), differential evolution (DE), and gene expression programming (GEP) (Etemad-Shahidi and Taghipour, 2012; Sahay and Dutta, 2009, 2013; Sattar and Gharabaghi, 2015; Li et al., 2013). In other studies (Sahay, 2011; Toprak and Cigizoglu, 2008; Azamathulla and Wu, 2011; Riahi-Madvar et al., 2009), k is predicted by utilizing different data-driven methods, such as artificial neural networks (ANNs), adaptive neuro-fuzzy inference system (ANFIS), and support vector machine (SVM); however, explicit formulae are not provided. The neuro-fuzzy-based group method of data handling (NF-GMDH), which has been widely employed in hydraulic engineering, is applied to predict k in natural rivers on the basis of particle swarm optimization (Najafzadeh et al., 2014; Najafzadeh et al., 2015; Najafzadeh, 2015; Najafzadeh and Zahiri, 2015; Najafzadeh and Bonakdari, 2016; Najafzadeh and Tafarojnoruz, 2016; Najafzadeh and Azamathulla, 2013; Najafzadeh et al., 2013). Analytical solutions for k in natural rivers have been obtained by Seo and Baek (2004) by applying a beta velocity distribution equation. Table 1 shows that the forms of current predictors for natural rivers are similar, particularly those obtained by Seo and Cheong (1998) and subsequent researchers.
The form of current predictors for k is determined with a classic method on the basis of theory provided by Buckingham (Seo and Cheong, 1998). This method is reasonable when the system is complex and the theory is incomplete. In the present study, we aim to determine whether the forms of current predictors for natural rivers are appropriate, whether a general form for k in natural rivers exists and whether we can obtain a predictor of this general form by using a GP without a pre-given form (Nee et al. 2005; Poli et al. 2008).
The remainder of the manuscript is arranged as follows. The general form of k is derived in Section 2.1. The predictor based on the data-driven method is featured in Section 2.2. Results and discussions are presented in Section 3. Finally, conclusions are given in Section 4.
2. Materials and methods
2.1. Obtaining the general form for k
Previous studies, some of which were based on the data-driven method, derived various formulae for predicting k in natural rivers (see Table 1, which lists some typical formulae). The aforementioned formulae, some of which were obtained on the basis of reasonable hypotheses and the mechanics of dispersion, can predict field
data and are dimensionally appropriate. However, they cannot provide obvious information on dispersion. Moreover, the precision of such formulae does not always equate to physical soundness. Thus, a predictor obtained in one case may not be applicable to other cases. ‘Reproducibility is a core principle of science’ (Open Science Collaboration, 2015), and predictive ability is a criterion of science.
Furthermore, the complex form of existing formulae may cause confusion and hinder the easy identification of the primary cause for k. In a number of studies, the exponents of predictors are decimals, and some of the exponents contain variables; thus, these predictors might fail to clearly reflect dispersion (Seo and Cheong, 1998; Deng et al., 2001; Kashefipour and Falconer, 2002; Rajeev and Dutta, 2009; Azamathulla and Ghani, 2011; Azamathulla and Wu, 2011; Etemad-Shahidi and Taghipour, 2012; Sahay, 2013; Zeng and Huai, 2014; Disley et al., 2015; Sattar and Gharabaghi, 2015; Wang and Huai, 2016). If a predictor is not physically sound, it might exhibit poor predictive ability (i.e., it can only predict data in some certain cases). These formulae estimate the data that were used to obtain them, but they may not perform well when estimating k in conditions with distinguishing characteristics (e.g. width, mean velocity and depth) because physical meaning might be eliminated in the numerical process.
The physical meaning of a predictor is significant even for empirical formulae, but for longitudinal dispersion, no general formulae exist for all types of flows, especially natural stream flows. Studies on longitudinal dispersion coefficients date back to the work of Taylor (1953), who put forward a theoretical solution for laminar flow in a pipe. Other works derived theoretical solutions for k in different types of flows, the sectional geometry of which is regular and uniform (Taylor, 1954; Aris, 1956; Dewey and Sullivan, 1979; Chatwin and Sullivan, 1982; Elder, 1959; Chikwendu, 1986). The solution for k in natural rivers, whose sectional geometry is irregular, partly resists theoretical treatment (Johnson et al., 2014; Kelleher et al., 2013; Kerr et al., 2013; Trévisan and Periáñez, 2016; Zaramella et al., 2016; Zhou et al., 2015). Fischer (1975) obtained an empirical predictor by combining a triple integral expression and experimental data. However, this predictor has limited accuracy. Others put forward their predictors for k in natural rivers on the basis of the regression analysis of experiments or on the basis of a pre-given form (Seo and Cheong, 1998; Deng et al., 2001; Kashefipour and Falconer, 2002; Rajeev and Dutta, 2009; Azamathulla and Ghani, 2011; Etemad-Shahidi and Taghipour, 2012; Sahay, 2013; Zeng and Huai, 2014; Disley et al., 2015; Sattar and Gharabaghi, 2015; Wang and Huai, 2016). These predictors lose some theoretical underpinnings through the process, although their dimension is often correct. For example, the solution given by Seo and Cheong (1998) shows a different form from that given by Fischer (1975), with k being proportional to Hu∗; in the theoretical solution given by Fischer (1975), k is proportional to (Hu∗)^{−1}, where Hu∗ is related to the lateral mixing coefficient obtained from experiments (i.e. 0.15Hu∗). Later formulae show forms similar to that of Seo and Cheong (1998) (Sattar and Gharabaghi, 2015; Li et al., 2013;
Etemad-Shahidi and Taghipour, 2012). Thus, the most appropriate formulae merit careful consideration.
As is widely known, k results from the combination of the concentration gradient and the velocity gradient in the lateral direction. On the one hand, the lateral concentration gradient is decreased by lateral mixing, thus decreasing the longitudinal dispersion. Thus, k is proportional to a negative power of D (for natural rivers, D = ε(y)), and this characteristic is in line with the form in previous theoretical results (Taylor, 1954; Aris, 1956; Dewey and Sullivan, 1979; Chatwin and Sullivan, 1982; Chikwendu, 1986). On the other hand, the lateral velocity gradient increases with mean velocity because the minimum velocity is zero on the boundary; thus, k is proportional to a positive power of U. Furthermore, both concentration and velocity gradients are related to the lateral scale; thus, longitudinal dispersion is proportional to L^β, where L is the major scale in the lateral direction of the system. Finally, we find that k scales with a negative power of D, a positive power of U and L^β; however, the mixing coefficient in natural rivers has no specific formula, and the effect of D is always associated with the Peclet number Pe. Thus, we can simplify this relation to a dependence of k on powers of U and L only. However, we cannot directly find such relations in the equations by Seo and Cheong (1998) and other succeeding studies.
Specifically, by changing Formulae (T1.a)–(T6.a) to (T1.b)–(T6.b), where C∗ = U/u∗ = Cs g^{−1/2} and Cs = R^{1/6}/n, one can find that the longitudinal dispersion coefficient can be written in the form k = φ(π1, π2, ..., πi)RdU, where πi is a non-dimensional variable, i = 1, 2, ..., and Rd is defined as the major lateral scale. Here, φ is related to the Peclet number (Pe = Ey/(URd)) for k in laminar flows, where Ey is the mean lateral mixing coefficient. Pe reflects the ratio of the lateral mixing coefficient to the longitudinal advection, i.e. the ratio of the diffusive character of the solute to the flow scale. Moreover, φ is related to C∗ (a constant for a certain flow) in turbulent flow; Pe does not appear in k for turbulent flow because the lateral mixing coefficient is a function of velocity and distance. To achieve a universal form, we use the formula k = φ(π1, π2, ..., πi)RdU in predicting k in all types of flows. Using U (commonly measured) instead of u∗ does not cause information loss because U = C∗u∗. The formula for the longitudinal dispersion coefficient can now be written as
$$k = \eta U R_d \qquad (4)$$
where η = φ(π1, π2, ..., πi), i = 1, 2, .... Table 2 provides η for different flows. In the formula, π denotes either sinuosity or irregularity for natural rivers, or it may simply represent the width-to-depth ratio for natural rivers. Rd is the larger of B and H for a rectangular channel, and it is the major axis for an elliptical pipe. With regard to the difference between Formula (4) and the formulae by Seo and Cheong (1998) and other studies, we find that the exponents in Formula (4) are both unity, whereas those in the formula of Seo and Cheong (1998) are decimals. Certainly, we cannot transform all factors in Formula (4) into integers because natural rivers comprise many irregular elements. Most exponents of physical formulae are integers, and thus, the form of Formula (4) seems physically sound compared with the form of the formulae in previous studies on k in natural rivers. Section 3 shows that the equation is
plausible for natural rivers.
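To make the canonical form concrete, the following minimal sketch rewrites one analytical result as k = ηURd. Because Table 1 is not reproduced in this section, the sketch assumes the widely quoted laminar pipe result of Taylor (1953), k = a²U²/(48D), and uses the definition of Pe given above with D in place of Ey and the pipe radius a in place of Rd; under these assumptions η = 1/(48Pe). The function names and numerical values are illustrative only.

```python
import math

def taylor_laminar_k(a, U, D):
    """Assumed standard Taylor (1953) result for laminar pipe flow."""
    return a ** 2 * U ** 2 / (48.0 * D)

def eta_laminar_pipe(a, U, D):
    """Dimensionless coefficient eta for the laminar pipe case."""
    pe = D / (U * a)        # Pe as defined in the text: mixing / (U * Rd)
    return 1.0 / (48.0 * pe)

def canonical_k(eta, U, Rd):
    """Canonical form k = eta * U * Rd."""
    return eta * U * Rd

# Hypothetical values: 1 cm pipe radius, 5 cm/s mean velocity,
# molecular diffusivity of order 1e-9 m^2/s.
a, U, D = 0.01, 0.05, 1.0e-9
k_direct = taylor_laminar_k(a, U, D)
k_canonical = canonical_k(eta_laminar_pipe(a, U, D), U, a)
print(k_direct, k_canonical)
```

Both routes give the same value, which shows that the dependence on D is absorbed entirely into the dimensionless coefficient η while U and Rd appear with unit exponents.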
2.2. Data-driven methods
2.2.1. Genetic programming (GP)
Previous data-driven methods assume a mathematical structure first and proceed to find optimal parameters using a multi-objective optimization method or trial and error (Yapo et al., 1998). Some studies have attempted to derive predictors without an empirical model while leaving the recognition task to programming (Koza, 2010). Schmidt and Lipson (2009) found that machine learning (ML) techniques could obtain physical correlations between variables without any pre-given correlations among data. The GP has now evolved into an efficient method for extracting physical relations from experimental data instead of remaining a pure numerical or statistical method; it has also demonstrated the ability to resolve satisfactory solutions in complicated systems (Tinoco et al., 2015; Goldstein et al., 2013; Kambekar and Deo, 2012; Limber et al., 2014; Goldstein and Coco, 2014). These successes are the motivation behind the present work on k.
The GP treats sub-expressions of formulae as individuals of population evolution. Data are first divided into three groups: the training group (the population) that produces generations of individual sub-expressions, the validation group that selects the best individual sub-expressions while weeding out poor sub-expressions and the testing group that judges the results (Koza, 1992). Elimination, mutation (a random process) and crossover occur during the process of evolution, and genetic programming obtains the meaningful connections in the evolution. The GP yields a set of formulae with different complexities, but the formulae that are neither too simple nor too complex are preferred. An oversimplified formula is not accurate, whereas an excessively complex formula over-fits the data. A balance between simplicity and precision must be considered in the selection of formulae (Tinoco et al., 2015). The software used here is Eureqa developed by Schmidt and Lipson (2009; 2013).
2.2.2. Data pre-processing for GP
We explore the use of the data-driven GP to obtain a formula for k on the basis of the experiments in Table s1. However, the use of ML requires data pre-processing (Bowden et al., 2002; Campolo et al., 1999; Hudson et al., 1996; Kaski et al., 1998; Kohonen, 1982, 1990, 2001; Liu and Weisberg, 2005; Roweis and Saul, 2000; Solidoro et al., 2007; Tourassi et al., 2001). The following four main reasons explain the need to employ such pre-processing: (1) data are limited in flow conditions and geometry (May et al., 2008); (2) the data bank may be too large because of repeated runs (Dawson and Wilby, 1998); (3) the data obtained through experiments always exhibit biases and variances (such as noise) (May et al., 2010), which are caused by the field techniques used and emphasise the need to perform de-noising prior to the use of ML; (4) the effects caused by sampling variance may be more conspicuous than other factors (e.g. initialising training) (May et al., 2010). Hence, an appropriate selection
and clustering method is significant to the GP because it is conducted only once as a first step of the data-driven method (Maier and Dandy, 1996).
Here, the maximum dissimilarity algorithm (MDA) by Camus et al. (2011) is used and discussed. The MDA is a dissimilarity-based compound selection method (Kennard and Stone, 1969) described thoroughly by Snarey et al. (1997). The aim of the method is to obtain a subset in which the data show the largest dissimilarities. The distance (or dissimilarity) can be the Euclidean distance or the inner product (Roweis and Saul, 2000), but the Euclidean distance is preferred here. The data selected with the MDA can represent the data bank, as they do not concentrate in a certain area (Lajiness and Watson, 2008).
The data bank (see Table S1 and Figures 2 and 4), collected by Zeng and Huai (2014) from previous studies (Carr and Rehmann, 2005; Deng et al., 2001; Kashefipour and Falconer, 2002), comprises 116 field data of natural rivers. Figure 2 shows that the velocity distribution is relatively uniform but that the distributions of other variables show odd data that are far from the concentrated area. Using all field data (each datum comprises five variables) directly without transforming them into non-dimensional variables hampers the identification of the correlations among the variables, and the dimensions of the resulting predictors are often incorrect (see Table 3). After 1.88e7 generations and the validation of 1.8e12 formulae, 11 formulae are generated by the software, with the largest size being 44. The first seven formulae, together with the corresponding correlation coefficient (r²), mean squared error (MSE) and mean absolute error (MAE), are shown in Table 3, which indicates that the accuracy of the predictor increases with size. Among the formulae, k = BU shows the correct dimension and the same form as the general formula (Formula (4)). The accuracy of the formula is not discussed further, but it does show that the GP can identify physical correlations among variables on the basis of field data.
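The three accuracy measures named above can be computed as in the following minimal sketch. It assumes that r² denotes the squared Pearson correlation between observed and predicted values, which is one common convention and is not confirmed by the text; the function names and the sample numbers are hypothetical.

```python
import math

def mse(obs, pred):
    """Mean squared error."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def r_squared(obs, pred):
    """Squared Pearson correlation (assumed meaning of r^2)."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)

# Hypothetical observed and predicted dispersion coefficients (m^2/s).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(mse(observed, predicted), mae(observed, predicted),
      r_squared(observed, predicted))
```

In this example the prediction is biased by a constant offset, so MSE and MAE are non-zero while the squared correlation equals one; this is why the error measures and r² are usually reported together.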
As often suggested, data should be transformed into dimensionless variables to reduce programme workload. The theorem put forward by Buckingham in 1915 (Fischer et al., 1979) is a good choice when correlations among variables are unknown. Five dimensional variables (H, B, U, u∗ and k) in the system and two different physical dimensions (length and time) equate to three (= 5−2) independent, dimensionless variables. We can use k/Hu∗ (or k/HU, k/Bu∗, k/BU), B/H, u∗/U. As provided in Formula (4), the longitudinal dispersion coefficient is k=ηUB or k/UB=η. If we use k/BU as the dimensionless variable for the longitudinal dispersion coefficient, then the result automatically yields a general form. To prove that the GP can produce a predictor of a general form (i.e. k = ηUB) without this form pre-specified, we use k/Hu∗ as a dimensionless variable for k without B or U. The resulting distributions of the dimensionless variables are shown in Figure 3. The distributions of three dimensionless variables concentrate in an area, but these variables show some odd data that are far from the concentrated area.
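The choice of dimensionless variables can be illustrated with a short sketch. It computes the three groups k/(Hu∗), B/H and u∗/U for one record and shows that the coefficient η = k/(BU) of the general form is recovered as their combination [k/(Hu∗)](u∗/U)/(B/H), so no information is lost by withholding B and U from the dependent variable. The function name and the numerical values are hypothetical.

```python
import math

def dimensionless_groups(H, B, U, u_star, k):
    """Return the three groups used as GP inputs and output."""
    return {
        "k/(H u*)": k / (H * u_star),
        "B/H": B / H,
        "u*/U": u_star / U,
    }

# Hypothetical river reach: depth 2 m, width 50 m, mean velocity 0.5 m/s,
# shear velocity 0.05 m/s, dispersion coefficient 30 m^2/s.
H, B, U, u_star, k = 2.0, 50.0, 0.5, 0.05, 30.0
groups = dimensionless_groups(H, B, U, u_star, k)

# eta of the general form k = eta * U * B, obtained from the groups only.
eta_from_groups = groups["k/(H u*)"] * groups["u*/U"] / groups["B/H"]
eta_direct = k / (B * U)
print(groups, eta_from_groups, eta_direct)
```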
We plot the data in a logarithmic coordinate system and find an interesting feature in Figure 4. Plotting data in a uniform coordinate system (see the top left panel of Figure 4) makes finding correlations among variables difficult, especially when the data concentrate in a small area. Hence, we plot data in logarithmic coordinates (see the rest of the panels in Figure 4). We find from the two-dimensional logarithmic coordinate system that $k/(Hu_*) \propto U/u_*$ (or $k \propto U$) (see the bottom left panel of Figure 4) and $k/(Hu_*) \propto B/H$ (or $k \propto B$) (see the last panel in Figure 4), which are consistent with the general form k = ηBU and thus demonstrate the reasonability of this general form. Odd data are important in determining the overall trend, and they should not be treated as outliers and omitted. Indeed, the data selected with the MDA contain these odd data.
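A proportionality of this kind appears as a straight line of unit slope in logarithmic coordinates. The sketch below fits a line to log-transformed data by ordinary least squares and returns the exponent and the prefactor; the data are synthetic, generated from an exact linear law purely for illustration, and the function name is hypothetical.

```python
import math

def loglog_fit(x, y):
    """Least-squares fit of log10(y) = slope * log10(x) + intercept.

    Returns (slope, prefactor) so that y is approximately
    prefactor * x ** slope.
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    intercept = my - slope * mx
    return slope, 10.0 ** intercept

# Synthetic data: k/(H u*) = 3 * (B/H), i.e. exponent 1 and prefactor 3.
ratio = [2.0, 5.0, 10.0, 20.0, 50.0]
k_scaled = [3.0 * r for r in ratio]
slope, prefactor = loglog_fit(ratio, k_scaled)
print(slope, prefactor)
```

With field data the fitted slope would scatter around the underlying exponent, and points far from the main cluster have a strong influence on the fit, which is consistent with the remark that odd data should be retained.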
2.2.3. Data grouping for GP
A subset cannot reflect the whole data bank if it contains too few data because noise may lead to a wrong formula. Conversely, too many data increase the computational burden and may make the formula too complex and over-fitted. Too many data also dramatically increase the number of function evaluations required (Yapo et al., 1998). The amount of data that can represent the whole data bank has been the subject of much debate. Matter (1997) argued that the number should be 0.35N, whereas Brown and Martin (1997) suggested that it should be 0.2N, where N is the size of the data bank. Flood and Kartam (1994) noted that using more data leads to more accurate formulae. In this work, we use all the data mainly because the data bank is small.
When we use the MDA for data selection, we select 47 (40%) data sets from the data bank as the training group, 47 (40%) data sets as the validation group and the remaining data as the testing group. We use X to represent the data bank, i.e. X = (x1, x2, …, xN), xi = {x1i, x2i, x3i}, i = 1, 2, …, N. To eliminate the effects caused by different variable scales, the data should be normalized. We use xn = (xi − xmin)/(xmax − xmin) to normalize the data and then use xi = xn(xmax − xmin) + xmin to de-normalize them. We use x to represent the normalized data for convenience.
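The normalization and de-normalization formulae above amount to min-max scaling of each variable. A minimal sketch for one variable follows; the function names and the sample values are hypothetical, and a constant column is rejected because the formula would divide by zero.

```python
import math

def min_max_normalise(values):
    """Scale values to [0, 1]; also return the minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant column")
    return [(v - lo) / (hi - lo) for v in values], lo, hi

def denormalise(normalised, lo, hi):
    """Invert min_max_normalise using the stored minimum and maximum."""
    return [v * (hi - lo) + lo for v in normalised]

# Hypothetical width-to-depth ratios.
raw = [12.0, 50.0, 200.0]
scaled, lo, hi = min_max_normalise(raw)
restored = denormalise(scaled, lo, hi)
print(scaled, restored)
```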
The data are selected one by one, i.e. subset M(0) comprises only one data point. The criterion of dissimilarity was described by Holliday and Willett (1996), and the main criteria are the maximum-sum and maximum-minimum; simple but effective versions of the two criteria were provided by Holliday et al. (1995) and Polinsky et al. (1996). The maximum-minimum criterion is used in this study. The steps are as follows:
(a) M is initialized by selecting one datum from X as M(0) (M(0) only contains m1). Four methods can be used to initialize M: 1) selecting one datum randomly, 2) selecting the datum with the largest dissimilarity relative to the rest of the data, 3) selecting one data point in the central part of the data bank and 4) selecting the data point with the largest or smallest value. We use the data point with the largest value, i.e. the sum of squares of the three variables of the data point is the maximum.
(b) The data point with the largest dissimilarity (i.e. Euclidean distance) to m1 is selected as m2.
(c) After t selections, M(t) comprises t data, and the rest of the data bank comprises (N−t) data. Pairwise dissimilarities between the subset and the rest of the data bank are calculated, so each remaining xi has t dissimilarities dij. For each xi, the minimum value d_i^min is taken, which gives (N−t) values of d_i^min. The data point with the maximum d_i^min is selected as m(t+1).
(d) Step (c) is repeated until t = MN.
The selected data in the training set are distributed mostly on the edge of the data bank (see Figure 5). The most important characteristic of the MDA is that the subset contains the odd data on the boundary. For a large data bank, Camus et al. (2011) suggested the use of the method given by Polinsky et al. (1996) (Ferguson et al., 1996; Matter, 1997).
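The selection steps above can be sketched as follows. This is a minimal illustration of the maximum-minimum criterion with Euclidean distance, initialized with the point of largest sum of squares; it keeps a running minimum distance for every point instead of recomputing all pairwise distances, which is an implementation choice of this sketch and not necessarily how the authors or Camus et al. (2011) coded it. The function name and the sample points are hypothetical.

```python
import math

def mda_select(data, n_select):
    """Return indices of n_select points chosen by the max-min criterion.

    data: list of equal-length tuples, assumed already normalized.
    """
    if not 0 < n_select <= len(data):
        raise ValueError("n_select must be between 1 and len(data)")
    # Step (a): start from the point with the largest sum of squares.
    start = max(range(len(data)), key=lambda i: sum(v * v for v in data[i]))
    selected = [start]
    # Minimum distance from every point to the current subset.
    d_min = [math.dist(p, data[start]) for p in data]
    while len(selected) < n_select:
        # Steps (b) and (c): pick the unselected point whose minimum
        # distance to the subset is the largest.
        candidates = (i for i in range(len(data)) if i not in selected)
        nxt = max(candidates, key=lambda i: d_min[i])
        selected.append(nxt)
        # Update the running minimum distances with the new member.
        for i, p in enumerate(data):
            d_min[i] = min(d_min[i], math.dist(p, data[nxt]))
    return selected

# Hypothetical normalized points in two dimensions.
points = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.0, 1.0), (0.1, 0.1)]
chosen = mda_select(points, 3)
print(chosen)
```

In this small example the corner points are chosen before the interior ones, which mirrors the observation that MDA-selected data lie mostly on the edge of the data bank.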
The statistical properties of the data sets, which are selected with the MDA, and those from the data bank, are listed in Table 4. The data in Table 4 have been de-normalized into the original data.
As shown in Table 4, the maximum and minimum values belong to the training group, which is selected first. Thus, the MDA can find the data on the edges. Variances decrease from the training group to the testing group.
2.2.4. Genetic programming
We use the software Eureqa (Schmidt and Lipson, 2009; 2013) to find solutions with the MDA-selected data. A generic step-wise implementation of GP is described as follows:
Step 1: Feed the program with the training and validation data selected by the MDA in Section 2.2.3.
Step 2: Initialize a group of candidate solutions randomly. An encoded random symbolic function generator creates solutions by combining operands (e.g., constants and variables) and operators (binary and unary). Binary operators, such as addition, subtraction, multiplication, and division, are basic arithmetic operators that act on two terminals, whereas unary operators, such as trigonometric functions, exponential functions, logarithms, square root, and power, are advanced operators that act on one terminal. The operators used in this study are addition, subtraction, multiplication, division, square root, and power. The initialized solutions are not used directly because of the maximum size restriction. If the output of a solution remains unchanged within an encoded range when a subexpression is removed, then this subexpression is eliminated. Thus, the program reduces the complexity of the solutions, i.e. the total number of mathematical operators and variables. The complexity changes during evolution.
Step 3: Compare the solutions with the field data used for validation by using a chosen error metric. Poor solutions are abandoned, whereas retained solutions are combined with the crossover probability function encoded in the software. Subexpressions are changed, and new subexpressions are added with the mutation probability function encoded in the software. In this study, the mean absolute error is chosen as the error metric.
Step 4: Terminate the computation when satisfactory solutions emerge, because the program does not stop automatically. A series of solutions with different complexities is provided, and each solution has the best accuracy among the candidate solutions of the same complexity. The best solutions lie on the Pareto front, which describes accuracy against complexity. Simple solutions are not discarded because of their poorer accuracy, and the most complex solution is not necessarily the best one. The following factors are considered in selecting the best solution: 1) a physically sound form; 2) a position on the cliff of the Pareto front, where the accuracy increases significantly with a slight change in complexity; and 3) the accuracy of the solution (Goldstein et al., 2013, and the references therein).
The results are listed in Table 5. When the MDA-obtained data are used, the solutions become steady after 2.1e12 formulae have been evaluated. The software generates 10 formulae, with 26 as the largest size and 1 as the smallest size. Figure 6 shows the Pareto front, which plots accuracy against complexity. In general, predictive ability increases with complexity, and the rate of increase declines as complexity grows. The predictive ability does not always increase with complexity (e.g., from complexity 8 to complexity 14). The best formula is the one at which the predictive ability rises sharply towards its asymptotic value. Here, the predictive ability increases significantly at complexity 8, whereas it does not increase significantly thereafter, even when the complexity reaches 26. Following previous studies, we select the solution of complexity 8 (i.e., the solution in the gray zone in Table 5) as the final predictor (Schmidt and Lipson, 2009; Goldstein et al., 2013; Tinoco et al., 2015).
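The idea of a Pareto front and its cliff can be sketched as follows. This is only one possible reading: it assumes that each candidate is summarized by a (complexity, error) pair and that the cliff is the point reached by the steepest error drop per unit of added complexity. The function names are hypothetical, and the software may rank solutions differently.

```python
def pareto_front(solutions):
    """Keep the solutions that no simpler-or-equal solution beats on error.

    solutions: iterable of (complexity, error) pairs; lower is better for both.
    """
    front = []
    best_error = float("inf")
    # Sorting by complexity first means each kept point must improve the error.
    for complexity, error in sorted(solutions):
        if error < best_error:
            front.append((complexity, error))
            best_error = error
    return front


def cliff(front):
    """Return the front point reached by the steepest error drop per unit complexity."""
    best_point, best_rate = None, 0.0
    for (c0, e0), (c1, e1) in zip(front, front[1:]):
        rate = (e0 - e1) / (c1 - c0)
        if rate > best_rate:
            best_point, best_rate = (c1, e1), rate
    return best_point
```

A point such as complexity 14 with a larger error than complexity 8 would be dropped from the front by `pareto_front`, which mirrors the observation that predictive ability does not always increase with complexity.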
Simple formulae do not capture the necessary relations among the dimensionless variables, whereas complex formulae over-fit. A complex solution is malformed even though it achieves the best accuracy (see Table 3). It also lacks physical meaning; thus, it cannot predict data outside the data bank. However, previous studies tended to choose the most precise formula, which has quite a complex form, as the final predictor. The software requires researchers to balance the complexity and physical meaning of solutions instead of considering only their accuracy (Tinoco et al., 2015).
3. Results and discussions
3.1. Results
Changing the form of a selected formula in Table 5 yields Formula (9), which agrees with the expected general form (i.e. Formula (4)).
$$k = (\gamma_1 + \gamma_2 / r_{BH})\, U B \qquad (9)$$
where $\gamma_1 = 0.718$, $\gamma_2 = 47.9$ and $r_{BH} = B/H$. The factor $(\gamma_1 + \gamma_2 / r_{BH})$ is represented by the characteristic parameter of the general form. When H is constant, the value of this parameter decreases as B increases because the effect of the side walls on the velocity weakens as B increases (Fischer, 1979). This formula has a more concise form than the formulae given by Seo and Cheong (1998) and subsequent researchers (Deng et al., 2001; Kashefipour and Falconer, 2002; Sahay and Dutta, 2009; Etemad-Shahidi and Taghipour, 2012; Li et al., 2013; Sahay, 2013; Zeng and Huai, 2014; Disley et al., 2015; Sattar and Gharabaghi, 2015; Wang and Huai, 2016). In natural rivers, the formulae for k given by previous studies share a similar form, in which k/(Hu*) is a product of powers of U/u* and B/H (see T11–T21 in Table 1), and similar parameters; the exponents in particular are neither integers nor 0.5, so the physical meaning of the previous formulae is not obvious.
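As a worked illustration, Formula (9) can be evaluated directly from the coefficients 0.718 and 47.9 quoted above. The sketch below assumes SI units; the function and argument names are hypothetical, and the river dimensions in the usage note are invented for illustration rather than taken from the data bank.

```python
def characteristic_factor(width, depth, c1=0.718, c2=47.9):
    """Bracketed factor of Formula (9): c1 + c2 / r_BH, with r_BH = B / H."""
    if width <= 0 or depth <= 0:
        raise ValueError("width and depth must be positive")
    return c1 + c2 * depth / width


def dispersion_coefficient(width, depth, velocity, c1=0.718, c2=47.9):
    """Formula (9): k = (c1 + c2 * H / B) * U * B, in m^2/s for SI inputs."""
    return characteristic_factor(width, depth, c1, c2) * velocity * width
```

For a hypothetical reach with B = 50 m, H = 2 m and U = 0.5 m/s, the factor is 0.718 + 47.9 × 0.04 = 2.634, so k = 2.634 × 0.5 × 50 ≈ 65.85 m²/s. Doubling B at constant H lowers the factor, in line with the side-wall argument above.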
3.2. Discussion
Table 6 compares the accuracy of Formula (9) with that of previous formulae from other studies. The table lists the Mean Error (ME), Mean Absolute Error (MAE) and Scatter Index (SI) (Hanson et al., 2009); the number of data used (Nd) to obtain each predictor is also listed. Plots of predicted versus measured values can be found in Figure s1 of the supporting information.
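The three metrics can be sketched as below. The definitions are assumptions made for illustration: ME is taken as the mean of predicted minus measured values, MAE as the mean of their absolute differences, and SI as the root-mean-square error divided by the mean measured value. The exact definitions used for Table 6 (e.g. sign convention or any transformation of k) may differ.

```python
import math


def mean_error(predicted, measured):
    """Mean of (predicted - measured); the sign convention is an assumption."""
    return sum(p - m for p, m in zip(predicted, measured)) / len(measured)


def mean_absolute_error(predicted, measured):
    """Mean of |predicted - measured|."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def scatter_index(predicted, measured):
    """Assumed form of SI: root-mean-square error over the mean measured value."""
    n = len(measured)
    rmse = math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)
    return rmse / (sum(measured) / n)
```

Because ME lets positive and negative errors cancel, a predictor can score well on ME and poorly on MAE, which is one reason why the ranking of predictors changes with the metric.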
Overall, different predictors offer different advantages depending on the data set predicted, and the performance of a predictor varies with the metric used. A formula that performs well for a certain data group may fail to predict the data in another group. No single formula achieves the best accuracy for all data sets under all metrics. For instance, the formula by Wang and Huai (2016) yields the highest accuracy in terms of ME for all data, whereas Formula (9) provides a slight advantage in terms of MAE for all data; the predictor by Seo and Cheong (1998) performs well in terms of SI for the data in the testing group, whereas Formula (9) shows a slight advantage when all data are estimated. Table 6 shows that a large data bank does not necessarily yield an accurate formula (Etemad-Shahidi and Taghipour, 2012; Sattar and Gharabaghi, 2015). By comparison, a small data bank may yield a formula that is accurate for a certain data set (Seo and Cheong, 1998). This observation can be attributed to the following factors: 1) the obtained formulae perform well only for certain cases, and 2) the data contain considerable noise, and the data sets are so small that the noise may significantly affect the results. The accuracy of the highly nonlinear predictor by Sattar and Gharabaghi (2015) is not greater than that of Formula (9). The formula by Etemad-Shahidi and Taghipour (2012) does not effectively improve the accuracy even though the predictor has two parts. The performance of the different predictors varies only marginally. Therefore, a specific formula with the highest accuracy is not easily determined when the noise in the limited field data is considered. Formula (9) has accuracy commensurate with that of the formulae in previous studies, and it is more accurate than some of them under certain metrics. However, Formula (9) has a different form, which is simpler than those of the formulae in previous studies. The formula obtained with the GP exhibits the general form. Thus, this formula indicates the physical properties of k to some extent, as verified later in Section 3.2.
We next examine whether Formula (9) is the best predictor for k in natural rivers. Our results indicate that there is a series of 'best predictors' for k in natural rivers. Natural rivers are characterized by many uncertain factors, such as vegetation and irregular shapes, and the data bank is small. The software produces different solutions from the same training and validation data because evolutionary processes, such as mutation and crossover, are based on the encoded probability functions. This phenomenon exists, although it has not been described in previous studies. A predictor of complexity 8 appears in all of the results, even though the parameters differ marginally from one run to another. In this study, 10 formulae of complexity 8 are obtained when the GP is fed with the same training and validation data points ten times (Table 7). Nevertheless, we cannot assume that the formula with the highest accuracy is the best. In all of the runs, the predictive ability increases significantly at complexity 8, before an asymptotic value is reached (Figure 7). Although the formulae of complexity 8 differ from one another, they all mark the point at which the accuracy approaches its asymptotic value. Therefore, the form of complexity 8 is the best candidate for this study. Formula (9) is a representative of these ten formulae. Formula (9) is expected to become more accurate if the GP is fed with more field data.
In genetic programming, the training group produces the formulae to be tested by the validation group. Thus, the training group is important in finding correlations among variables and producing the right formulae. The formulae are then tested and selected by the validation group. Thus, the validation group plays an important role in the evolution of formulae. We cannot determine which group should be given priority because the data are very limited (only 116 data points) and were obtained in different rivers. Further studies on data splitting are needed to determine which group should be prioritized in genetic programming.
The GP can provide a set of solutions of different sizes. A complex solution achieves good accuracy in predicting the data used, but it may not be usable for prediction in other systems because it lacks physical meaning. Conversely, an overly simple solution cannot reflect the intrinsic correlations among variables. Complexity and accuracy should be balanced in choosing the best and most predictive solution (Schmidt and Lipson, 2009).
The general form of k (i.e. Formula (4)) shows that k is closely related to the major lateral scale (Rd) and mean velocity (U). In a natural river, whose ratio of width (B) to depth (H) is large, k is closely related to B and U (see Formula (9)). The validity of the universal form for natural rivers is verified by using a grey model.
In a grey system, the proximity of a series of variables (vi, i = 1, 2, …) to a target variable (v0) is analysed, and proximity is defined as the grey relational grade (r0i). A high grey relational grade means a close relation between a variable and the target variable. The grade reflects proximity among variables in a multidimensional space; thus, the grey model is a geometrical analysis method. On the basis of geometrical proximity, the grey model can determine the grey relational grades between a series of variables and the target variable and rank them.
In a complex grey model, dimensionless variables are used for calculation, which gives the terms physical meaning and reduces the number of terms. The common dimensionless variables in fluid mechanics are as follows.
a) The Strouhal number (St, $St = L/(UT)$, where T is the dimension of time) reflects the ratio of local acceleration to convective acceleration. St is considerably large in unsteady flow because local acceleration is significantly high. In natural rivers, whose longitudinal and lateral scales are immense, flow is regarded as steady and uniform, i.e. it has neither convective acceleration as a whole nor local acceleration at a point. St has limited utility in natural rivers, but we still obtain $St = (L/T)/U$ by changing the form of St, where L is the dimension of length and L/T has the dimension of velocity. The shear velocity u* is frequently used in addition to the mean velocity U. Thus, u*/U is used as a dimensionless variable in the grey system, and u*/U is used to obtain a dimensionless number when using the Π theorem (see Figure 3). u*/U equals 1/C*, i.e. $[R^{1/6}/(g^{0.5} n)]^{-1}$.
b) The Froude number (Fr, $Fr = U/(gH)^{0.5}$) reflects (inertia force/gravity)$^{0.5}$. Fr (see Figure 7) is proportional to mean velocity, which is the indirect driving force of k. When H is constant, k increases with Fr, as shown in Formula (9).
c) The Reynolds number (Re, $Re = UH/\nu$) reflects the ratio of inertia force to viscous force, where $\nu$ is the kinematic viscosity. When Re is small, the viscous force is larger than the inertia force, i.e. the viscous term is larger than the convective term.
d) B/H determines which direction (i.e. depth or width) exerts the largest effect on k. Given that H is smaller than B in natural rivers (see Figure 3), we assume that the mixing time in the depth direction is smaller than that in the width direction.
e) Slope (s) reflects the gravity component in the longitudinal direction. It is the driving force of velocity.
f) The Peclet number (Pe, $Pe = D/(UL)$) can be k/(BU) and k/(HU); k/(Bu*) and k/(Hu*) can also be used because they have the same form.
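The dimensionless variables listed in a) to f) can be computed from field quantities as in the following sketch. The function name, the dictionary keys and the default values of g and the kinematic viscosity (taken as 10^-6 m²/s for water) are illustrative assumptions; slope s is omitted because it is already dimensionless.

```python
import math

G = 9.81      # gravitational acceleration, m/s^2
NU = 1.0e-6   # assumed kinematic viscosity of water, m^2/s


def dimensionless_groups(width, depth, velocity, shear_velocity, k):
    """Return the dimensionless variables used in the grey analysis (SI inputs)."""
    return {
        "u*/U": shear_velocity / velocity,
        "Fr": velocity / math.sqrt(G * depth),
        "Re": velocity * depth / NU,
        "B/H": width / depth,
        # Four candidate dimensionless forms of the dispersion coefficient.
        "k/(Bu*)": k / (width * shear_velocity),
        "k/(Hu*)": k / (depth * shear_velocity),
        "k/(BU)": k / (width * velocity),
        "k/(HU)": k / (depth * velocity),
    }
```

Applying this function to every record of a data bank yields the sequences that serve as targets and comparison variables in the grey relational analysis.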
A grey model is used to recognize the correlation degrees of {u*/U, Fr, Re, B/H} when the target is k/(Bu*) (or k/(Hu*), k/(BU), k/(HU)).
Four steps must be fulfilled to obtain the grey relational grades: 1) Set target and independent variables; 2) Change original data; 3) Calculate grey relational coefficients; 4) Calculate grey relational grades.
The details are shown in Text s2. Set k/(Bu*) as the target sequence, and set {u*/U, Fr, Re, B/H} as the comparison sequence. Then, grey relational grades are obtained (see the first row of Table 8). In the same way, grey relational grades of {u*/U, Fr, Re, B/H, s} towards k/(Hu*), k/(BU) and k/(HU) are also acquired (see the second to fourth rows of Table 8).
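The four steps can be sketched with the commonly used form of grey relational analysis. Since the details are given in Text s2, the choices below are assumptions: each sequence is scaled by its mean in step 2, the distinguishing coefficient is 0.5, and the extreme differences are taken over all comparison sequences. The function name and arguments are hypothetical.

```python
def grey_relational_grades(target, comparisons, rho=0.5):
    """Return {name: grade} of each comparison sequence towards the target.

    target: list of positive values; comparisons: dict of name -> list of values.
    """
    def scale(seq):
        # Step 2 (assumed): divide by the mean to remove units and magnitude.
        mean = sum(seq) / len(seq)
        return [v / mean for v in seq]

    x0 = scale(target)
    # Absolute differences between the target and each comparison sequence.
    deltas = {
        name: [abs(a - b) for a, b in zip(x0, scale(seq))]
        for name, seq in comparisons.items()
    }
    flat = [d for row in deltas.values() for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        return {name: 1.0 for name in deltas}
    # Step 3: relational coefficients; step 4: their mean is the grade.
    return {
        name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for name, row in deltas.items()
    }
```

A sequence that is proportional to the target receives a grade of 1, and the grade falls as the scaled sequences diverge, so the grades can be ranked directly.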
Table 8 shows that Re has the smallest grey relational grades, i.e. Re has the smallest correlation with k/(Bu*), k/(Hu*), k/(BU) and k/(HU), because flows in natural rivers are turbulent (Re > 500) and Re exerts little impact on dispersion in turbulent flow. (Re is actually considerably high in a natural river, e.g., the smallest Re in this paper is 5,460.) Thus, in recent studies, Re is not always considered when studying k in natural rivers.
To obtain the matrix of the grey relational grades of {k/(Bu*), k/(Hu*), k/(BU), k/(HU)} towards u*/U (and likewise towards Fr, B/H, s and Re), we switch the target and independent variables. The matrix is listed in Table 9.
Table 9 shows that k/(BU) has the largest average correlation grade with {u*/U, Fr, B/H, s, Re}. Thus, k should be written in the form k/(BU)=f(u*/U, Fr, B/H, Re, s), which verifies the accuracy of Equation (4).
Theoretical solutions are easy to obtain in a simple system and difficult to acquire in a complex system (e.g. a flume with an irregular shape and transient storage) (Baker et al., 2012; Jackson et al., 2014; Noori et al., 2015). When a predictor for a complex system is to be derived from a pre-given form, the choice of that form takes priority in obtaining an accurate predictor. In this work, we provide a physically sound form for the longitudinal dispersion coefficient, in which k is the product of a characteristic parameter, U and Rd. Thus, this work could serve as a reference for future studies on k, especially those in complex systems.
4. Conclusions
The analytical and empirical formulas for longitudinal dispersion coefficients in different cases are analyzed in this study, and a new formula for the longitudinal dispersion coefficients in natural rivers is proposed. This formula provides a commensurate accuracy, and the form is validated as physically sound by using a grey model. A concise summary is presented as follows:
1. On the basis of previous studies on longitudinal dispersion coefficients in circular pipe flows (turbulent or laminar), elliptical pipe flows (laminar), open channel flows (turbulent or laminar) and flows between two planes, we put forward a general form for longitudinal dispersion coefficients, in which k is the product of a characteristic parameter, U and Rd. The general form is verified for natural rivers (turbulent open channel flow), where Rd = B, by utilising the grey model.
2. We obtain a concise formula, k = (0.718 + 47.9H/B)UB, for the longitudinal dispersion coefficient in natural rivers by utilising the GP. Unlike previous predictors, whose exponents are not integers, this formula has a concise form in which all exponents are integers. The concise formula shows commensurate or better accuracy compared with those in previous studies. The coefficients in the formula may change when more field data are used to obtain k, but the form of Equation (9) will not change.
3. The GP can obtain the correlations among variables without a pre-specified form. Researchers, however, need to choose the optimal solution on the basis of their knowledge of the system. The balance between complexity and accuracy should also rest on the theoretical background.
4. The study of k in open channel flow caused by velocity variation in the width direction faces two obstacles: the lack of a velocity distribution for turbulent flow, and the treatment of the mixing coefficient in the width direction as a constant, 0.15Hu*. The mixing coefficient, however, should be a function of velocity and distance (Elder, 1959; Taylor, 1954). Future work should focus on the velocity distribution of turbulent open channel flow in the width direction and on the mixing coefficient in the width direction when studying longitudinal dispersion caused by velocity variation in the width direction. Table 8 shows that longitudinal dispersion has the weakest correlation with Re because the kinematic viscosity of water in natural rivers is considerably small ($10^{-6}$ m$^2$/s) and Re is immensely large. The kinematic viscosity of water, like its density, is usually assumed to be constant. In stratified flow, density is treated as a variable and kinematic viscosity as a constant. In the future, more effort should be devoted to k when kinematic viscosity is a variable. Although this factor is not necessary in natural rivers, it is important in low-Re flow involving more than one substance.
Acknowledgments: The authors thank G. Katul for commenting on an earlier draft of this manuscript. Wen-Xin Huai, Yu-Fei Wang and Wei-Jie Wang acknowledge support from the National Natural Science Foundation of China (Nos. 11372232 and 51439007) and the Specialized Research Fund for the Doctoral Program for Higher Education (No. 20130141110016).
References: Abderrezzak, K. E. K., Ata, R., & Zaoui, F. (2015). One-dimensional numerical modelling of solute transport in streams: The role of longitudinal dispersion coefficient. Journal of Hydrology, 527, 978-989. Andrianasolo, F. N., Casadebaig, P., Maza, E., Champolivier, L., Maury, P., & Debaeke, P. (2014). Prediction of sunflower grain oil concentration as a function of variety, crop management and environment using statistical models. European Journal of Agronomy, 54, 84-96. Aris, R. (1956, April). On the dispersion of a solute in a fluid flowing through a tube. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences (Vol. 235, No. 1200, pp. 67-77). The Royal Society. Azamathulla, H. M., & Ghani, A. A. (2011). Genetic programming for predicting longitudinal dispersion coefficients in streams. Water resources management, 25(6), 1537-1544. Azamathulla, H. M., & Wu, F. C. (2011). Support vector machine approach for longitudinal dispersion coefficients in natural streams. Applied Soft Computing, 11(2), 2902-2905.
Azamathulla, H. M. (2012). Gene expression programming for prediction of scour depth downstream of sills. Journal of Hydrology, 460, 156-159. Baker, D. W., Bledsoe, B. P., & Price, J. M. (2012). Stream nitrate uptake and transient storage over a gradient of geomorphic complexity, north‐central Colorado, USA. Hydrological Processes, 26(21), 3241-3252. Bogle, G. V. (1997). Stream velocity profiles and longitudinal dispersion. Journal of Hydraulic Engineering, 123(9), 816-820. Bowden, G. J., Maier, H. R., & Dandy, G. C. (2002). Optimal division of data for neural network models in water resources applications. Water Resources Research, 38(2), 2-1. Brown, R. D., & Martin, Y. C. (1997). The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. Journal of Chemical Information and Computer Sciences, 37(1), 1-9. Burns, P., & Meiburg, E. (2012). Sediment-laden fresh water above salt water: linear stability analysis. Journal of Fluid Mechanics, 691, 279-314. Campolo, M., Soldati, A., & Andreussi, P. (1999). Forecasting river flow rate during low-flow
periods using neural networks. Water Resources Research, 35(11), 3547-3552. Camus, P., Mendez, F. J., Medina, R., & Cofiño, A. S. (2011). Analysis of clustering and selection algorithms for the study of multivariate wave climate. Coastal
Engineering, 58(6), 453-462. Cassol, M., Wortmann, S., & Rizza, U. (2009). Analytic modeling of two-dimensional transient atmospheric pollutant dispersion by double GITT and Laplace transform techniques. Environmental Modelling & Software, 24(1), 144-151. Chatwin, P. C., & Sullivan, P. J. (1982). The effect of aspect ratio on longitudinal diffusivity in rectangular channels. Journal of Fluid Mechanics, 120, 347-358. Chen, J. S., Chen, J. T., Liu, C. W., Liang, C. P., & Lin, C. W. (2011). Analytical solutions to two-dimensional advection–dispersion equation in cylindrical coordinates in finite domain subject to first- and third-type inlet boundary conditions. Journal of Hydrology, 405(3), 522-531. Chen, G. Q., Wu, Z., & Zeng, L. (2012). Environmental dispersion in a two-layer wetland: analytical solution by method of concentration moments. International Journal of Engineering Science, 51(2), 272-291. Chikwendu, S. C. (1986). Calculation of longitudinal shear dispersivity using an N-zone model as N → ∞. Journal of Fluid Mechanics, 167, 19-30. Dandy, G., & Crawley, P. (1992). Optimum operation of a multiple reservoir system including salinity effects. Water Resources Research, 28(4), 979-990. Dandy, G. C., Simpson, A. R., & Murphy, L. J. (1996). An improved genetic algorithm for pipe network optimization. Water Resources Research, 32(2), 449-458. Dawson, C. W., & Wilby, R. (1998). An artificial neural network approach to rainfall-runoff modelling. Hydrological Sciences Journal, 43(1), 47-66.
Deng, Z. Q., Singh, V. P., & Bengtsson, L. (2001). Longitudinal dispersion coefficient in straight rivers. Journal of Hydraulic Engineering, 127(11), 919-927. Dewey, R., & Sullivan, P. J. (1979). Longitudinal dispersion in flows that are homogeneous in the streamwise direction. Zeitschrift für angewandte Mathematik und Physik ZAMP, 30(4), 601-613. Disley, T., Gharabaghi, B., Mahboubi, A. A., & McBean, E. A. (2015). Predictive equation for longitudinal dispersion coefficient. Hydrological Processes, 29(2), 161-172. Escobar, H. (2015). Mud tsunami wreaks ecological havoc in Brazil. Science, 350. Etemad-Shahidi, A., & Taghipour, M. (2012). Predicting longitudinal dispersion coefficient in natural streams using M5′ model tree. Journal of Hydraulic Engineering, 138(6), 542-554. Ferguson, A. M., Patterson, D. E., Garr, C. D., & Underiner, T. L. (1996). Designing chemical libraries for lead discovery. Journal of Biomolecular Screening, 1(2), 65-73. Fischer, H. B. (1967). The mechanics of dispersion in natural streams. Journal of the Hydraulics Division, 93(6), 187-216. Fischer, H. B. (1975). Discussion of "Simple Method for Predicting Dispersion in Streams". Journal of the Environmental Engineering Division, 101(3), 453-455. Fischer, H. B. (1979). Mixing in Inland and Coastal Waters. Academic Press. Goldstein, E. B., Coco, G., & Murray, A. B. (2013). Prediction of wave ripple characteristics using genetic programming. Continental Shelf Research, 71, 1-15. Goldstein, E. B., Coco, G., Murray, A. B., & Green, M. O. (2014). Data-driven
components in a model of inner-shelf sorted bedforms: a new hybrid model. Earth Surface Dynamics, 2(1), 67.
Guerrero, J. P., & Skaggs, T. H. (2010). Analytical solution for one-dimensional advection–dispersion transport equation with distance-dependent coefficients. Journal of Hydrology, 390(1), 57-65. Hanson, J. L., Tracy, B. A., Tolman, H. L., & Scott, R. D. (2009). Pacific hindcast performance of three numerical wave models. Journal of Atmospheric and Oceanic Technology, 26(8), 1614-1633. Holliday, J. D., Ranade, S. S., & Willett, P. (1995). A fast algorithm for selecting sets of dissimilar molecules from large chemical databases. Quantitative Structure‐Activity Relationships, 14(6), 501-506. Holliday, J. D., & Willett, P. (1996). Definitions of "dissimilarity" for dissimilarity-based compound selection. Journal of Biomolecular Screening, 1(3), 145-151. Huai, W., Hu, Y., Zeng, Y., & Han, J. (2012). Velocity distribution for open channel flows with suspended vegetation. Advances in Water Resources, 49, 56-61. Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J., & Osman, J. (1996). Parameter based methods for compound selection from chemical databases. Quantitative Structure‐Activity Relationships, 15(4), 285-289. Jackson, T. R., Apte, S. V., & Haggerty, R. (2014). Effect of multiple lateral cavities on stream solute transport under non-Fickian conditions and at the Fickian asymptote. Journal of Hydrology, 519, 1707-1722.
Jin, G., Tang, H., Li, L., & Barry, D. A. (2015). Prolonged river water pollution due to variable‐density flow and solute transport in the riverbed. Water Resources Research, 51(4), 1898-1915. Johnson, Z. C., Warwick, J. J., & Schumer, R. (2014). Factors affecting hyporheic and surface transient storage in a western US river. Journal of Hydrology, 510, 325-339. Kambekar, A. R., & Deo, M. C. (2010). Wave prediction using genetic programming and model trees. Journal of Coastal Research, 28(1), 43-50. Kashefipour, S. M., & Falconer, R. A. (2002). Longitudinal dispersion coefficients in natural channels. Water Research, 36(6), 1596-1608. Kaski, S., Kangas, J., & Kohonen, T. (1998). Bibliography of self-organizing map (SOM) papers: 1981–1997. Neural computing surveys, 1(3&4), 1-176. Kelleher, C., Wagener, T., McGlynn, B., Ward, A. S., Gooseff, M. N., & Payn, R. A. (2013). Identifiability of transient storage model parameters along a mountain stream. Water Resources Research, 49(9), 5290-5306. Kerr, P. C., Gooseff, M. N., & Bolster, D. (2013). The significance of model structure in one-dimensional stream solute transport models with multiple transient storage zones–competing vs. nested arrangements. Journal of Hydrology, 497, 133-144. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1), 59-69. Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE,78(9), 1464-1480. Kohonen, T. (2001). Self-organizing maps, vol. 30 of Springer Series in Information
Sciences. Koussis, A. D., &Rodríguez-Mirasol, J. (1998). Hydraulic estimation of dispersion coefficient for streams. Journal of hydraulic Engineering, 124(3), 317-320. Koza, J. R. (1992). Genetic programming: on the programming of computers by means of natural selection (Vol. 1). MIT press. Koza, J. R. (2010). Human-competitive results produced by genetic programming. Genetic Programming and Evolvable Machines, 11(3-4), 251-284. Lajiness, M., & Watson, I. (2008). Dissimilarity-based approaches to compound acquisition. Current opinion in chemical biology, 12(3), 366-371. Lees, M. J., Camacho, L. A., & Chapra, S. (2000). On the relationship of transient storage and aggregated dead zone models of longitudinal solute transport in streams. Water Resources Research, 36(1), 213-224. Li, X., Liu, H., & Yin, M. (2013). Differential evolution for prediction of longitudinal dispersion coefficients in natural streams. Water resources management, 27(15), 5245-5260. Limber, P. W., Brad Murray, A., Adams, P. N., & Goldstein, E. B. (2014). Unraveling the dynamics that scale cross‐shore headland relief on rocky coastlines: 1. Model development. Journal of Geophysical Research: Earth Surface, 119(4), 854-873. Liu, H. (1977). Predicting dispersion coefficient of streams. Journal of the Environmental Engineering Division, 103(1), 59-69. Liu, Y., & Weisberg, R. H. (2005). Patterns of ocean current variability on the West Florida Shelf using the self‐organizing map. Journal of Geophysical Research: Oceans
(1978–2012), 110(C6). Maier, H. R., & Dandy, G. C. (1996). The use of artificial neural networks for the prediction of water quality parameters. Water resources research, 32(4), 1013-1022. Matter, H. (1997). Selecting optimally diverse compounds from structure databases: a validation
study of two-dimensional and three-dimensional molecular descriptors. Journal of Medicinal Chemistry, 40(8), 1219-1229. May, R. J., Maier, H. R., Dandy, G. C., & Fernando, T. G. (2008). Non-linear variable selection for artificial neural networks using partial mutual
information. Environmental Modelling & Software, 23(10), 1312-1326. May, R. J., Maier, H. R., & Dandy, G. C. (2010). Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Networks,23(2), 283-294. Miño, G. L., Dunstan, J., Rousselet, A., Clement, E., & Soto, R. (2013). Induced diffusion of tracers in a bacterial suspension: theory and experiments. Journal of Fluid Mechanics, 729, 423-444. Najafzadeh, M., & Azamathulla, H. M. (2013). Neuro-fuzzy GMDH to predict the scour pile groups due to waves. Journal of Computing in Civil Engineering, 29(5), 04014068. Najafzadeh, M., Barani, G. A., & Azamathulla, H. M. (2013). GMDH to predict scour depth around a pier in cohesive soils. Applied Ocean Research,40, 35-41. Najafzadeh, M., Barani, G. A., & Hessami Kermani, M. R. (2014). Estimation of pipeline scour due to waves by GMDH. Journal of Pipeline Systems Engineering and Practice, 5(3), 06014002. Najafzadeh, M. (2015). Neurofuzzy-Based GMDH-PSO to Predict Maximum Scour
Depth at Equilibrium at Culvert Outlets. Journal of Pipeline Systems Engineering and Practice, 7(1), 06015001. Najafzadeh, M., & Zahiri, A. (2015). Neuro-fuzzy GMDH-based evolutionary algorithms to predict flow discharge in straight compound channels. Journal of Hydrologic Engineering, 20(12), 04015035. Najafzadeh, M., Barani, G. A., & Hessami-Kermani, M. R. (2015). Evaluation of GMDH networks for prediction of local scour depth at bridge abutments in coarse sediments with thinly armored beds. Ocean Engineering, 104, 387-396. Najafzadeh, M., & Bonakdari, H. (2016). Application of a neuro-fuzzy GMDH model for predicting the velocity at limit of deposition in storm sewers.Journal of Pipeline Systems Engineering and Practice, 06016003. Najafzadeh, M., Balf, M. R., & Rashedi, E. (2016a). Prediction of maximum scour depth around piers with debris accumulation using EPR, MT, and GEP models. Journal of Hydroinformatics, jh2016212. Najafzadeh, M., Etemad-Shahidi, A., & Lim, S. Y. (2016b). Scour prediction in long contractions using ANFIS and SVM. Ocean Engineering, 111, 128-135. Najafzadeh, M., Laucelli, D. B., & Zahiri, A. (2016c). Application of model tree and Evolutionary Polynomial Regression for evaluation of sediment transport in pipes. KSCE Journal of Civil Engineering, 1-8. Najafzadeh, M., & Tafarojnoruz, A. (2016). Evaluation of neuro-fuzzy GMDH-based particle swarm optimization to predict longitudinal dispersion coefficient in rivers. Environmental Earth Sciences, 75(2), 1-12.
Nee, S., Colegrave, N., West, S. A., & Grafen, A. (2005). The illusion of invariant quantities in life histories. Science, 309(5738), 1236-1239.
Ng, C. O., & Zhou, Q. (2012). Dispersion due to electroosmotic flow in a circular microchannel with slowly varying wall potential and hydrodynamic slippage. Physics of Fluids (1994-present), 24(11), 112002.
Noori, R., Deng, Z., Kiaghadi, A., & Kachoosangi, F. T. (2015). How Reliable Are ANN, ANFIS, and SVM Techniques for Predicting Longitudinal Dispersion Coefficient in Natural Rivers? Journal of Hydraulic Engineering, 142(1), 04015039.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Poli, R., Langdon, W. B., McPhee, N. F., & Koza, J. R. (2008). A field guide to genetic programming. Lulu.com.
Polinsky, A., Feinstein, R. D., Shi, S., & Kuki, A. (1996). LiBrain: software for automated design of exploratory and targeted combinatorial libraries. Molecular Diversity and Combinatorial Chemistry: Libraries and Drug Discovery, 1(996), 219-232.
Riahi-Madvar, H., Ayyoubzadeh, S. A., Khadangi, E., & Ebadzadeh, M. M. (2009). An expert system for predicting longitudinal dispersion coefficient in natural streams by using ANFIS. Expert Systems with Applications, 36(4), 8589-8596.
Sahay, R. R., & Dutta, S. (2009). Prediction of longitudinal dispersion coefficients in natural rivers using genetic algorithm. Hydrology Research, 40(6), 544-552.
Sahay, R. R. (2011). Prediction of longitudinal dispersion coefficients in natural rivers using artificial neural network. Environmental Fluid Mechanics, 11(3), 247-261.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
Sahay, R. R. (2013). Predicting longitudinal dispersion coefficients in sinuous rivers by genetic algorithm. Journal of Hydrology and Hydromechanics, 61(3), 214-221.
Sattar, A. M., & Gharabaghi, B. (2015). Gene expression models for prediction of longitudinal dispersion coefficient in streams. Journal of Hydrology, 524, 587-596.
Schmidt, M., & Lipson, H. (2009). Distilling free-form natural laws from experimental data. Science, 324(5923), 81-85.
Schmidt, M., & Lipson, H. (2013). Eureqa (version 0.99.5 beta) [software]. Available at www.eureqa.com.
Schulz, M., Priegnitz, J., Klasmeier, J., Heller, S., Meinecke, S., & Feibicke, M. (2012). Effect of bed surface roughness on longitudinal dispersion in artificial open channels. Hydrological Processes, 26(2), 272-280.
Seo, I. W., & Cheong, T. S. (1998). Predicting longitudinal dispersion coefficient in natural streams. Journal of Hydraulic Engineering, 124(1), 25-32.
Seo, I. W., & Baek, K. O. (2004). Estimation of the longitudinal dispersion coefficient using the velocity profile in natural streams. Journal of Hydraulic Engineering, 130(3), 227-236.
Snarey, M., Terrett, N. K., Willett, P., & Wilton, D. J. (1997). Comparison of algorithms for dissimilarity-based compound selection. Journal of Molecular Graphics and Modelling, 15(6), 372-385.
Solidoro, C., Bandelj, V., Barbieri, P., Cossarini, G., & Fonda Umani, S. (2007). Understanding dynamic of biogeochemical properties in the northern Adriatic Sea by using self‐organizing maps and k‐means clustering. Journal of Geophysical Research: Oceans (1978–2012), 112(C7).
Taylor, G. (1953, August). Dispersion of soluble matter in solvent flowing slowly through a tube. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences (Vol. 219, No. 1137, pp. 186-203). The Royal Society.
Taylor, G. (1954, May). The dispersion of matter in turbulent flow through a pipe. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences (Vol. 223, No. 1155, pp. 446-468). The Royal Society.
Tinoco, R. O., Goldstein, E. B., & Coco, G. (2015). A data‐driven approach to develop physically sound predictors: Application to depth‐averaged velocities on flows through submerged arrays of rigid cylinders. Water Resources Research, 51(2), 1247-1263.
Toprak, Z. F., & Cigizoglu, H. K. (2008). Predicting longitudinal dispersion coefficient in natural streams by artificial intelligence methods. Hydrological Processes, 22(20), 4106-4129.
Tourassi, G. D., Frederick, E. D., Markey, M. K., & Floyd Jr, C. E. (2001). Application of the mutual information criterion for feature selection in computer-aided diagnosis. Medical Physics, 28(12), 2394-2402.
Trévisan, D., & Periáñez, R. (2016). Coupling catchment hydrology and transient storage to model the fate of solutes during low-flow conditions of an upland river. Journal of Hydrology.
Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. Neural Networks, IEEE Transactions on, 11(3), 586-600.
Wang, Y., & Huai, W. (2016). Estimating the Longitudinal Dispersion Coefficient in Straight Natural Rivers. Journal of Hydraulic Engineering, 04016048.
Wu, Z., & Chen, G. Q. (2014a). Analytical solution for scalar transport in open channel flow: slow-decaying transient effect. Journal of Hydrology, 519(6), 1974-1984.
Wu, Z., & Chen, G. Q. (2014b). Approach to transverse uniformity of concentration distribution of a solute in a solvent flowing along a straight pipe. Journal of Fluid Mechanics, 740, 196-213.
Yapo, P. O., Gupta, H. V., & Sorooshian, S. (1998). Multi-objective global optimization for hydrologic models. Journal of Hydrology, 204(1), 83-97.
Zaramella, M., Marion, A., Lewandowski, J., & Nützmann, G. (2016). Assessment of transient storage exchange and advection–dispersion mechanisms from concentration signatures along breakthrough curves. Journal of Hydrology, 538, 794-801.
Zeng, L., Wu, Z., Fu, X., & Wang, G. (2015). Performance of the analytical solutions for Taylor dispersion process in open channel flow. Journal of Hydrology, 528, 301-311.
Zhou, Y., Wilson, G. V., Fox, G. A., Rigby, J. R., & Dabney, S. M. (2015). Soil pipe flow tracer experiments: 2. Application of a stream flow transient storage zone model. Hydrological Processes.
Figure 1. The mechanism of longitudinal dispersion in natural rivers (in a moving coordinate system).
Figure 2. Histograms for dimensional variables.
Figure 3. Histograms for dimensionless variables.
Figure 4. Distribution of field data.
Figure 5. Distribution of data selected by MDA.
Figure 6. Pareto front for the solutions.
Figure 7. The MAE versus complexity at different operations.
Figure 8. Histograms for dimensionless variables.
Table 1. Formulae for k.

| Authors | Formulae |
|---|---|
| Taylor (1953) | $k = a^2U^2/(48D)$ (T1.a); $k = [aU/(48D)]\,aU$ (T1.b) |
| Taylor (1954) | $k = 10.05au_* + 0.052au_*$ or $k = 7.14aU\sqrt{\gamma}$, where $u_* = (gaJ/2)^{1/2}$ for a pipe, $J$ is the energy slope and $\gamma$ is the resistance coefficient (T2.a); $k = (10.1/C_*)\,aU$ (T2.b) |
| Aris (1956) | $k = U^2a_1b_1[(5+14r_s^2+5r_s^4)/(12(r_s+r_s^3))]/(192D)$ (T3.a) or $k = \{Ub_1[(5+14r_s^2+5r_s^4)/(12(r_s+r_s^3))]/(192D)\}\,a_1U$ (T3.b) |
| Dewey & Sullivan (1979) | $k = U^2H_t^2/(210D)$ (T4.a) or $k = [UH_t/(210D)]\,UH_t$ (T4.b) |
| Chatwin & Sullivan (1982) | $k = 2H^2U^2/(105D)$ (T5.a); $k = [2HU/(105D)]\,HU$ (T5.b) |
| Elder (1959) | $k = 5.93Hu_*$, where $u_*$ is the shear velocity of the channel (T6.a) or $k = (5.93/C_*)\,HU$ (T6.b) |
| Chikwendu (1986) | for laminar channel flow: $k = 2H^2U^2/(105D) + D$ (T7.a); for turbulent channel flow: $k = 0.4041Hu_*/\kappa^3 + \kappa Hu_*/6$, where $\kappa$ is the von Karman constant (T7.b); for laminar pipe flow: $k = a^2U^2/(48D) + D$ (T7.c) |
| Fischer (1975) | $k = 0.011U^2B^2/(Hu_*)$ (T8.a); $k = [0.011UB/(Hu_*)]\,BU$ (T8.b) |
| Liu (1977) | $k = 0.18(u_*/U)^{1.5}(UB)^2/(Hu_*)$ (T9.a); $k = [0.18(u_*/U)^{1.5}(UB)/(Hu_*)]\,BU$ (T9.b) |
| Bogle (1997) | $k = 0.011U^2B^2/(Hu_*)/(50\sim25)$ (T10.a); $k = [0.011UB/(Hu_*)/(50\sim25)]\,BU$ (T10.b) |
| Seo & Cheong (1998) | $k = 5.92(U/u_*)^{1.43}(B/H)^{0.62}Hu_*$ (T11) |
| Deng et al. (2001) | $k/(Hu_*) = [0.15/(8d)](U/u_*)^2(B/H)^{5/3}$, where $d = 0.145 + (1/3520)(U/u_*)(B/H)^{1.38}$ (T12) |
| Kashefipour & Falconer (2002) | $k = [7.428 + 1.775(B/H)^{0.62}(u_*/U)^{0.572}]\,HU(U/u_*)$ (T13) |
| Rajeev & Dutta (2009) | $k/(Hu_*) = 2(B/H)^{0.96}(U/u_*)^{1.25}$ (T14) |
| Azamathulla & Ghani (2011) | $k/(Hu_*) = \exp\{\exp[\cos(U/u_*)] + (U/u_*)^2/(B/H+3.956)\} + \sin[BU/(Hu_*)]\,[BU/(Hu_*)]/\exp[\sin(B/H)] + (U/u_*)/1.037 - 10.76(B/H)/(U/u_* - 11.38)$ (T15) |
| Etemad-Shahidi & Taghipour (2012) | $k = 15.49(B/H)^{0.78}(U/u_*)^{0.11}Hu_*$ if $B/H \le 30.6$; $k = 14.12(B/H)^{0.61}(U/u_*)^{0.85}Hu_*$ if $B/H > 30.6$ (T16) |
| Li et al. (2013) | $k = 2.828(B/H)^{0.7613}(U/u_*)^{1.4713}Hu_*$ (T17) |
| Zeng & Huai (2014) | $k = 5.4(B/H)^{0.7}(U/u_*)^{0.13}HU$ (T18) |
| Disley et al. (2015) | $k/(Hu_*) = 3.563\,Fr^{-0.4117}(B/H)^{0.6776}(U/u_*)^{1.0132}$, where $Fr = U/(gH)^{0.5}$ is the Froude number (T19) |
| Sattar & Gharabaghi (2015) | $k/(Hu_*) = 2.9 \times 4.6^{\sqrt{Fr}}\,Fr^{-0.5}(B/H)^{0.5-Fr}(U/u_*)^{1+\sqrt{Fr}}$ (T20) |
| Wang & Huai (2016) | $k = 17.648(B/H)^{0.3619}(U/u_*)^{1.16}Hu_*$ (T21) |
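As a rough illustration of how the empirical river predictors in Table 1 are evaluated, the sketch below codes (T8.a), (T8.b), (T11) and (T21) as plain functions. The function names and the sample reach (B = 50 m, H = 2 m, U = 0.5 m/s, u* = 0.05 m/s) are hypothetical and chosen only for illustration; they do not come from the field data.

```python
def k_fischer_a(B, H, U, u_star):
    """Fischer (1975), form (T8.a): k = 0.011 U^2 B^2 / (H u*)."""
    return 0.011 * U**2 * B**2 / (H * u_star)


def k_fischer_b(B, H, U, u_star):
    """Fischer (1975), form (T8.b): k = [0.011 U B / (H u*)] * B * U."""
    coefficient = 0.011 * U * B / (H * u_star)
    return coefficient * B * U


def k_seo_cheong(B, H, U, u_star):
    """Seo & Cheong (1998), (T11)."""
    return 5.92 * (U / u_star) ** 1.43 * (B / H) ** 0.62 * H * u_star


def k_wang_huai(B, H, U, u_star):
    """Wang & Huai (2016), (T21)."""
    return 17.648 * (B / H) ** 0.3619 * (U / u_star) ** 1.16 * H * u_star


if __name__ == "__main__":
    # Hypothetical reach, SI units (m, m/s); not taken from the data set.
    reach = dict(B=50.0, H=2.0, U=0.5, u_star=0.05)
    for predictor in (k_fischer_a, k_fischer_b, k_seo_cheong, k_wang_huai):
        print(predictor.__name__, round(predictor(**reach), 2), "m^2/s")
```

Forms (a) and (b) of each row are algebraically identical; form (b) only isolates the dimensionless coefficient that multiplies a length scale and the mean velocity.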
Table 2. Coefficients for different typical flows and for the natural river.

| Flows | Coefficient |
|---|---|
| Laminar circular pipe flow | $aU/(48D)$ |
| Turbulent circular pipe flow | $10.1/C_*$ |
| Laminar elliptical pipe flow | $Ub_1[(5+14r^2+5r^4)/(12(r+r^3))]/(192D)$ |
| Laminar flow between two planes | $UH_t/(210D)$ |
| Laminar open-channel flow | $2HU/(105D)$ |
| Turbulent open-channel flow | $5.93/C_*$ |
| Turbulent open-channel flow (natural rivers) | $47.9H/B + 0.718$ |
Table 3. Solutions obtained from dimensional data.

| Size | Solution | r² | MSE (m⁴/s²) | MAE (m²/s) |
|---|---|---|---|---|
| 1 | $k = B$ | 0.587 | 57003.82 | 97.07 |
| 2 | $k = BU$ | 0.858 | 48639.18 | 90.60 |
| 5 | $k = 16.4 + BU$ | 0.858 | 46085.34 | 86.20 |
| 7 | $k = 19.6 + BU^2$ | 0.913 | 34133.39 | 77.01 |
| 9 | $k = 25 + BU^3$ | 0.921 | 20024.84 | 62.88 |
| 11 | $k = 24.5 + U + BU^3$ | 0.921 | 19979.67 | 62.82 |
| 16 | $k = BU + U^{10.2}/(3.58 - H)$ | 0.895 | 17766.20 | 57.96 |
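Table 4. Statistical properties of the data sets.

| Method | Variable | Group | Maximum | Minimum | Mean | Variance |
|---|---|---|---|---|---|---|
| MDA | $u_*/U$ | Training | 4.5 | 0.015909 | 0.259242 | 0.414824 |
| MDA | $u_*/U$ | Validation | 0.773333 | 0.041135 | 0.177681 | 0.016049 |
| MDA | $u_*/U$ | Testing | 0.428333 | 0.065789 | 0.17548 | 0.006732 |
| MDA | $B/H$ | Training | 1000 | 13.82186 | 115.9732 | 39790.85 |
| MDA | $B/H$ | Validation | 121.1905 | 16.04938 | 45.94077 | 463.4089 |
| MDA | $B/H$ | Testing | 78.54077 | 18.01515 | 37.58463 | 204.6045 |
| MDA | $k/(Hu_*)$ | Training | 40183.91 | 6.169556 | 3864.419 | 64278888 |
| MDA | $k/(Hu_*)$ | Validation | 3465.567 | 98.52534 | 837.6326 | 695519 |
| MDA | $k/(Hu_*)$ | Testing | 3023.8 | 101.4386 | 724.1013 | 392263 |
| All data | $u_*/U$ | | 4.5 | 0.015909 | 0.21031 | 0.177486 |
| All data | $B/H$ | | 1000 | 13.82186 | 72.7312 | 17631.42 |
| All data | $k/(Hu_*)$ | | 40183.91 | 6.169556 | 2042.471 | 28663003 |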
Table 5. The solutions obtained by the GP.

| Size | Solutions |
|---|---|
| 1 | $k/(Hu_*) = 673$ |
| 3 | $k/(Hu_*) = 14.9\,B/H$ |
| 4 | $k/(Hu_*) = (B/H)/(u_*/U)$ |
| 6 | $k/(Hu_*) = (37.4 + B/H)/(u_*/U)$ |
| 8 | $k/(Hu_*) = (47.9 + 0.718\,B/H)/(u_*/U)$ |
| 10 | $k/(Hu_*) = (76.3 + 0.00107(B/H)^2)/(u_*/U)$ |
| … | … |
| 26 | $k/(Hu_*) = -1820/(84.2 - 1680(u_*/U)) - 1890/(84.8 - 1770(u_*/U)) + (52.9 + 0.718(B/H))/(u_*/U)$ |
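The size-8 solution in Table 5 can be rearranged into the natural-river coefficient listed in Table 2: multiplying by $Hu_*$ cancels $u_*$ and gives $k = (47.9 + 0.718\,B/H)\,HU = (47.9\,H/B + 0.718)\,BU$. The short sketch below checks this numerically; the function names and the sample reach are hypothetical and serve only as an illustration.

```python
def k_gp_size8(B, H, U, u_star):
    """Size-8 GP solution of Table 5: k/(H u*) = (47.9 + 0.718 B/H) / (u*/U)."""
    dimensionless_k = (47.9 + 0.718 * B / H) / (u_star / U)
    return dimensionless_k * H * u_star


def k_canonical(B, H, U):
    """Same predictor as coefficient * B * U, with coefficient 47.9 H/B + 0.718."""
    coefficient = 47.9 * H / B + 0.718
    return coefficient * B * U


if __name__ == "__main__":
    # Hypothetical reach in SI units; u* drops out of the final result.
    B, H, U = 50.0, 2.0, 0.5
    for u_star in (0.02, 0.05, 0.10):
        print(u_star, k_gp_size8(B, H, U, u_star), k_canonical(B, H, U))
```

Because the shear velocity cancels, this predictor needs only the channel width, depth and mean velocity as inputs.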
Table 6. Comparison of the formulae given by previous studies and formula (9).

| Researchers | Nd | ME (m²/s) (all data/testing group) | MAE (m²/s) (all data/testing group) | SI (all data/testing group) |
|---|---|---|---|---|
| Seo & Cheong (1998) | 35 | 108.9/43.7 | 134.2/47.4 | 3.00/1.43 |
| Deng et al. (2001) | 73 | 53.6/27.1 | 92.8/34.2 | 2.87/1.68 |
| Kashefipour & Falconer (2002) | 81 | 54.2/4.8 | 97.6/26.4 | 2.78/1.87 |
| Rajeev & Dutta (2009) | 65 | 81.9/28.3 | 117.1/35.1 | 2.95/1.47 |
| Etemad-Shahidi & Taghipour (2012) | 119 | -9.2/-3.5 | 81.0/29.0 | 2.95/1.72 |
| Li et al. (2013) | 65 | 56.6/11.0 | 98.1/23.9 | 2.69/1.66 |
| Zeng & Huai (2014) | 116 | 9.8/2.7 | 79.8/25.0 | 2.84/1.99 |
| Disley et al. (2015) | 56 | 6.8/7.9 | 93.3/29.2 | 3.17/2.15 |
| Sattar & Gharabaghi (2015) | 100 | -67.7/-45.2 | 81.8/46.4 | 3.09/2.70 |
| Wang & Huai (2016) | 93 | -4.6/-1.5 | 78.8/25.7 | 2.89/2.21 |
| Formula (9) | 94 | -27.1/-12.9 | 77.9/31.4 | 2.94/2.10 |

Table 7. Ten formulae obtained by the GP.

| Number | Coefficient of B/H | Constant term | MAE |
|---|---|---|---|
| 1 | 0.718 | 47.9 | 419.5 |
| 2 | 0.720 | 47.5 | 419.0 |
| 3 | 0.715 | 50.5 | 419.9 |
| 4 | 0.712 | 53.2 | 420.4 |
| 5 | 0.720 | 47.6 | 419.5 |
| 6 | 0.726 | 47.8 | 420.0 |
| 7 | 0.718 | 48.3 | 420.0 |
| 8 | 0.719 | 47.5 | 419.4 |
| 9 | 0.716 | 49.2 | 419.7 |
| 10 | 0.720 | 47.3 | 419.4 |
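The ME and MAE columns of Table 6 can be illustrated with a minimal sketch. It assumes that ME is the mean of (predicted − observed) and MAE the mean of the absolute differences; the sign convention of ME and the sample numbers are assumptions made here for illustration and are not taken from the field data.

```python
def mean_error(predicted, observed):
    """Mean error (bias), assumed here to be mean(predicted - observed)."""
    return sum(p - o for p, o in zip(predicted, observed)) / len(observed)


def mean_absolute_error(predicted, observed):
    """Mean absolute error, mean(|predicted - observed|)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


if __name__ == "__main__":
    # Hypothetical dispersion coefficients in m^2/s.
    k_observed = [12.0, 18.0, 33.0]
    k_predicted = [10.0, 20.0, 30.0]
    print("ME  =", mean_error(k_predicted, k_observed))
    print("MAE =", mean_absolute_error(k_predicted, k_observed))
```

Under this convention a negative ME indicates that a formula under-predicts on average, while the MAE measures the typical size of the error regardless of sign, so |ME| can never exceed MAE.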
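Table 8. Matrix for grey relational grades.

| | $u_*/U$ | $Fr$ | $Re$ | $B/H$ | $s$ |
|---|---|---|---|---|---|
| $k/(Bu_*)$ | 0.9707 | 0.9841 | 0.8892 | 0.9726 | 0.9691 |
| $k/(Hu_*)$ | 0.9614 | 0.9750 | 0.8833 | 0.9774 | 0.9591 |
| $k/(BU)$ | 0.9800 | 0.9871 | 0.8895 | 0.9768 | 0.9765 |
| $k/(HU)$ | 0.9768 | 0.9853 | 0.8876 | 0.9809 | 0.9721 |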
Table 9. Matrix for grey relational grades.

| | $k/(Bu_*)$ | $k/(Hu_*)$ | $k/(BU)$ | $k/(HU)$ |
|---|---|---|---|---|
| $u_*/U$ | 0.9316 | 0.9143 | 0.9519 | 0.9444 |
| $Fr$ | 0.9603 | 0.9423 | 0.9652 | 0.9610 |
| $B/H$ | 0.9323 | 0.9420 | 0.9408 | 0.9506 |
| $s$ | 0.9496 | 0.9343 | 0.9615 | 0.9537 |
| $Re$ | 0.8894 | 0.8837 | 0.8895 | 0.8876 |
| Average | 0.9326 | 0.9233 | 0.9418 | 0.9395 |
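The "Average" row of Table 9 is the arithmetic mean of the five grades in each column. The sketch below recomputes it from the tabulated values and reports which normalisation of k has the highest average grade; the dictionary layout and variable names are illustrative only.

```python
# Grey relational grades from Table 9, in the order u*/U, Fr, B/H, s, Re.
grades = {
    "k/(Bu*)": [0.9316, 0.9603, 0.9323, 0.9496, 0.8894],
    "k/(Hu*)": [0.9143, 0.9423, 0.9420, 0.9343, 0.8837],
    "k/(BU)": [0.9519, 0.9652, 0.9408, 0.9615, 0.8895],
    "k/(HU)": [0.9444, 0.9610, 0.9506, 0.9537, 0.8876],
}

# Column averages, matching the last row of the table after rounding.
averages = {name: sum(values) / len(values) for name, values in grades.items()}

# Normalisation with the highest average grade.
best = max(averages, key=averages.get)

if __name__ == "__main__":
    for name, value in averages.items():
        print(f"{name}: {value:.4f}")
    print("highest average grade:", best)
```

The recomputed means agree with the tabulated averages to four decimal places, and the highest average belongs to k/(BU).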
Figure 1
Figure 2 (histograms; horizontal axes: B (m), U (m/s), H (m), k (m²/s), u* (m/s); vertical axes: number)
Figure 3 (histograms; horizontal axes: u*/U, B/H, k/(Hu*) (×10⁴); vertical axes: number)
Figure 4 (scatter plots; axes: k/(Hu*) (×10⁴) against B/H and u*/U; log(k/(Hu*)) against log(B/H) and log(u*/U))
Figure 5 (scatter plots; axes: log(k/(Hu*)) against log(B/H) and log(u*/U); legend: training, validation, testing)
Figure 6 (horizontal axis: Complexity; vertical axis: MAE; legend: Frontier)
Figure 7 (horizontal axis: Complexity; vertical axis: MAE)
Figure 8 (histograms; horizontal axes: Fr, Re (×10⁷), s, k/(Bu*), k/(HU), k/(BU); vertical axes: number)
Highlights:
A concise form for longitudinal dispersion coefficients is put forward.
A formula for the longitudinal dispersion coefficients in natural rivers is obtained.
A genetic programming method is used to obtain a predictor without a pre-specified form.