Modeling chlorophyll-a concentrations using an artificial neural network for precisely eco-restoring lake basin


Ecological Engineering 95 (2016) 422–429 Contents lists available at ScienceDirect Ecological Engineering journal homepage: www.elsevier.com/locate/...



Ecological Engineering journal homepage: www.elsevier.com/locate/ecoleng

Fang Lu a, Zhi Chen a, Wenquan Liu b,∗, Hongbo Shao c,∗

a Department of Building, Civil and Environmental Engineering, Concordia University, Montreal, Quebec H3G 1M8, Canada
b Key Laboratory of Marine Sedimentology and Environmental Geology, First Institute of Oceanography, SOA, Qingdao 266061, China
c Institute of Agro-biotechnology, Jiangsu Academy of Agricultural Sciences, Nanjing 210014, China

Article info

Article history: Received 2 April 2016; received in revised form 2 June 2016; accepted 16 June 2016

Keywords: Back-propagation artificial neural network; Chlorophyll-a; Prediction; Lake water quality; Eco-restoring

Abstract

A back-propagation artificial neural network (BPANN) model was developed in this study for the prediction of chlorophyll-a concentration in Lake Champlain. Twenty-one years of monitoring data (1992–2012) for water quality parameters were used to train, validate and test the BPANN models. The optimal input parameters of the model were selected on the basis of the performance of models built with different combinations of input variables. To verify model performance, the trained models were applied to field monitoring data from Lake Champlain. Prediction accuracy was measured using the coefficient of determination (R²) and the RMSE-observations standard deviation ratio (RSR). The R² values of the best-performing model in the training set, validation set, testing set, and all-year data were 0.82, 0.93, 0.81, and 0.87, respectively. The corresponding RSR values of the three data sets and the all-year data were 0.62, 0.38, 0.53, and 0.48, respectively. The results indicate that the developed BPANN model can predict chlorophyll-a concentrations in Lake Champlain with high accuracy and provide a quick assessment of chlorophyll-a variation for lake management and eco-restoration. © 2016 Elsevier B.V. All rights reserved.

∗ Corresponding authors: W. Liu, H. Shao.
http://dx.doi.org/10.1016/j.ecoleng.2016.06.072
0925-8574/© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Lakes are significant sources of water for recreation, fishing and drinking water supply. Lake water systems are affected by municipal, agricultural and industrial waste that is discharged, directly or indirectly, into lakes. Chlorophyll-a is a widely used environmental indicator of algal biomass and of the eutrophication condition in lakes (Latif et al., 2003). Elevated levels of chlorophyll-a are interpreted as the possible presence of algal blooms, and the amount of algae in a lake has significant effects on the lake's physical, chemical, and biological processes (United States Environmental Protection Agency, 2014). Algal blooms can also degrade lake water quality. Cyanotoxins generated by cyanobacteria in lake water can pose a human health risk through recreational activities or drinking water (Kalaji et al., 2016; McQuaid et al., 2011; Watzin et al., 2006). When cyanotoxin concentrations are not immediately available, chlorophyll-a is also widely accepted as a surrogate measure of cyanobacterial density and of the potential public health risk posed by cyanobacterial blooms (Wheeler et al., 2012). Therefore, it is essential to predict the chlorophyll-a concentration and, in turn, to provide advance information for water quality management and facilitate public health risk assessment.

Several studies have recently been conducted on water quality prediction models (Chibole, 2013; Wu and Xu, 2011; Zhao et al., 2012). Physically based water quality modeling approaches are capable of simulating the internal physical processes of the aquatic system, but they require extensive information that is not easily accessible (Dogan et al., 2009). Moreover, many physically based water quality models are time-consuming (Singh et al., 2009). Artificial neural networks have several advantages over physically based models: they avoid the need for a hydraulic model and for specialized knowledge of the physical processes governing the fate and transport of pollutants (Bowden et al., 2006). The many factors that affect water quality are related in complicated, non-linear ways. An artificial neural network (ANN) tries to simulate the learning process of human beings; through training and calibration of the network, an ANN can reflect the linear or non-linear relationships among the data (Dogan et al., 2009). Once the network is trained satisfactorily, the ANN model should be able to produce output for a new data set. As a result, the application of ANNs has increased significantly in different scientific areas, such as process modeling, pattern recognition, and time series analysis (Li et al., 2007; Patil et al., 2008). The applications of ANN models in the prediction and forecasting of water resource variables were reviewed by Maier and Dandy

F. Lu et al. / Ecological Engineering 95 (2016) 422–429


Fig. 1. Study area and sampling sites (Vermont Department of Environmental Conservation Water Quality Division and New York State Department of Environmental Conservation, 2012).

(2000), and the majority of those models used back-propagation artificial neural networks (BPANN). The BPANN model showed great potential to simulate water quality parameters such as water temperature, salinity, DO, and Chl-a in Singapore coastal water (Palani et al., 2008). Sahoo et al. (2009) used a genetic algorithm-optimized BPANN model, multiple regression analysis, and chaotic non-linear dynamics algorithms to predict stream water temperature; the results indicated that the performance efficiency of the optimized BPANN was the highest among all the algorithms considered. In a comparison of a BPANN model with the MIKE 11 physically based hydrodynamic model for simulating water levels in a river, the results obtained from the BPANN model were much better than those of MIKE 11, as indicated by the Nash-Sutcliffe index and the root mean square error (Panda et al., 2010). In addition to the many reported BPANN applications, the methods used for determining model inputs, dividing data sets, and determining the best model structure, as well as comparisons between different training algorithms, have also been reviewed (Maier and Dandy, 2000; Maier et al., 2010; May et al., 2008; Piotrowski et al., 2014).

In the Lake Champlain area, physically based models have been applied to lake water quality modeling. A hydrodynamic model in conjunction with a water quality model was applied to Lake Champlain to simulate the current, transport and mixing for the

main lake basin, as well as the kinetics, distribution and concentration of phosphorus in the lake (Mendelsohn et al., 1996). By estimating the mean annual tributary loadings with the FLUX program, a mass balance model was developed and calibrated with the BATHTUB program to simulate the total phosphorus concentration in the thirteen lake segments of Lake Champlain (Smeltzer and Quinn, 1996; Vermont DEC and New York State DEC, 1997). As noted above, these models require numerous input data related to the modeling parameters, e.g., water quality sampling data in the lake and tributaries, tributary flow data, wastewater treatment facility sampling data, meteorological data (including wind, precipitation and evaporation), biological data (including phytoplankton and zooplankton data), and sediment sampling data. Some of these data may be difficult to obtain, and the simulation is time-consuming if a large amount of numerical calculation is involved.

This study presents a first attempt to predict the chlorophyll-a concentration in Lake Champlain using a back-propagation artificial neural network (BPANN) model. The optimal input variables of the model are selected on the basis of linear correlation analysis and domain knowledge. The results of the developed ANN model are validated against the observed chlorophyll-a concentration, and the model performance is evaluated using statistical


indices. These results will be valuable for the precise eco-restoration of the lake basin.

2. Material and methods

2.1. Study area

Lake Champlain is a large lake located mainly within the United States (states of Vermont and New York) and partially in the province of Quebec, Canada. The lake covers approximately 1269 km²; it is roughly 201 km long and 0.8–23 km wide (The Canadian Encyclopedia, 2014). The maximum depth is approximately 120 m. The drainage area of Lake Champlain covers about 21,326 km² and includes the area between the Adirondack Mountains in northeastern New York State and the Green Mountains in Vermont (The Lake Champlain Basin, 2009).

Fig. 2. Architecture of back-propagation neural network for the chlorophyll-a prediction in Lake Champlain.

2.2. Water quality data collection

Data used in this study were obtained from the Lake Champlain long-term water quality and biological monitoring project database (Lake Champlain, 2014). Details of sampling and analytical procedures are available in the "Lake Champlain Long-Term Water Quality and Biological Monitoring Program Description" (Vermont Department of Environmental Conservation, 2012). Water samples were collected at 15 lake stations (Fig. 1). Except for stations 9, 16 and 51, which were added after 2001, the stations were sampled consistently during the entire monitoring period from 1992 to 2012. Considering the availability of long-term monitoring data, the data collected from the epilimnion layer at 12 of the 15 lake sampling stations (all except stations 9, 16, and 51) were included in this study. Monthly average values of the nine parameters with long-term monitoring data were calculated from the lake water chemistry database for 1992 to 2012: pH, chloride, water temperature (T), Secchi depth (SD), dissolved oxygen (DO), dissolved phosphorus (DP), total phosphorus (TP), total nitrogen (TN), and chlorophyll-a. Chlorophyll-a is used as the output of the ANN model in this study.

2.3. Construction of the artificial neural network model

The most commonly used ANN is the three-layer feedforward model, which comprises three distinct layers: the input layer, the hidden layer, and the output layer. ANN architectures with one hidden layer are applied here, since a neural network with one hidden layer can approximate any finite non-linear function with high accuracy (He et al., 2011).

2.3.1. Back propagation neural network and learning algorithm

Back-propagation (BP) is a commonly used learning algorithm in ANN applications; it applies gradient descent to minimize the network error. Each layer in the BPANN has several neurons, and each neuron processes its input values and transmits the result to the next layer. As can be seen in Fig. 2, the values of the input water quality variables are multiplied by the connection weights $w'_{ij}$ between the input and hidden layers. The weighted signals and the bias from the input neurons are summed by the hidden neurons and then passed through a hyperbolic tangent sigmoid transfer function (Eq. (1)) (Sedki et al., 2009). The results of this function are weighted by the connection weights $w_j$ between the hidden and output neurons and sent to the output node. A linear transfer function (Eq. (2)) (Sedki et al., 2009) is then applied by the output neuron. The output of this neuron is the predicted concentration of chlorophyll-a. The error between the

target and the predicted concentrations is calculated at the output layer and then redistributed back through the model using the back propagation algorithm. The weights are adjusted accordingly until the error meets the pre-specified goal or the pre-determined maximum epoch is achieved through several iterations (Li et al., 2012; Sahoo and Ray, 2006).

$$X_j = \tanh\left(\sum_{i=1}^{n} x_i w'_{ij} + b_j^{h}\right) \tag{1}$$

$$y = \sum_{j=1}^{m} X_j w_j + b_k^{o} \tag{2}$$

where i, j and k index the neurons in the input, hidden, and output layers, respectively, and n and m are the numbers of neurons in the input and hidden layers. In this case, n is the number of water quality variables used in the input layer. $x_i$ represents the values of the input water quality variables; $X_j$ is the output of the activation function at the jth hidden neuron; $w'_{ij}$ is the connection weight between the input and hidden neurons, and $w_j$ is the connection weight between the hidden neuron and the output neuron; $b_j^{h}$ and $b_k^{o}$ are the biases for the jth hidden neuron and the kth output neuron, respectively. The constructed model is trained using the Levenberg-Marquardt algorithm (LMA) in this study.

2.3.2. Input variables and data processing

The techniques for selecting input variables can be classified into model-free techniques and model-based techniques (Maier et al., 2010). Model-free techniques include analytical methods based on linear correlation and non-linear mutual information, and ad-hoc methods based on available data and domain knowledge. Model-based techniques include stepwise methods, sensitivity analysis, global optimization methods, and ad-hoc methods. Among these methods, the most commonly used technique is linear correlation, which is used here in conjunction with domain knowledge to select the alternative inputs for the BPANN model. Models with different combinations of the alternative input variables are then built to determine the input set with which the BPANN model performs best in predicting chlorophyll-a concentration.

The complete lake water quality data are divided into three subsets: the training, validation, and testing data sets. The training data set is used for computing the gradient and updating the weights and biases of the network. The validation data set is used to decide when to stop training, which prevents over-fitting during training. The testing data set is a new data set that has never been used during the training process; it is used to assess the generalization ability of the trained model and to predict the selected target outputs using the trained network. The complete lake water quality data set comprises 12 stations × 21 years × 6 months/year = 1512 samples. Data for the first


thirteen years of each station were used as the training data (936 samples), the next four years as the validation data (288 samples) and the remaining four years as the testing data (288 samples). Since different variables have different ranges, the variables should be scaled to uniform ranges that coincide with the limits of the activation functions used in the output layer (Maier and Dandy, 2000). Normalization can be done with the following equation (Liu and Chen, 2012):

$$Y_N = (y_{\max} - y_{\min})\,\frac{x_i - x_{\min}}{x_{\max} - x_{\min}} + y_{\min} \tag{3}$$

$Y_N$ denotes the data value after normalization; $x_i$ is the original value; $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of each water quality variable used in the input layer; $y_{\max}$ and $y_{\min}$ denote the boundary values of the range specified by the user, which are taken as 1 and −1, respectively, in this study.

2.3.3. Optimization of the BPANN model structure

The optimal structure of the BPANN model is determined on the basis of the minimum value of the mean square error (MSE) of the training and validation data sets. MSE is used as the target error goal in network training. Training stops when it reaches the target MSE or the maximum epoch. The MSE is defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2 \tag{4}$$

$\hat{y}_i$ and $y_i$ denote the modeled concentration and the observed concentration of chlorophyll-a, respectively; n is the number of data in each data set. The number of neurons in the hidden layer ($N_h$) plays an important role in the optimization of the model structure. Palani et al. (2008) suggested that $N_h$ can lie between I and 2I + 1 and should be greater than the maximum of I/3 and O, where I and O are the numbers of input and output neurons. Within this range, the optimal number of hidden neurons is determined through a trial and error approach. The application of the BPANN model in this study aims at developing a more generalized neural network rather than one that optimally fits the training set. Thus, networks with few hidden nodes are preferable to those with many hidden nodes, because networks with fewer hidden neurons usually generalize better and are less prone to over-fitting. The network with the minimum MSE in the training and validation data sets is chosen as the optimal network structure.

2.3.4. Model performance evaluation

To determine the performance of the trained model, the estimated responses and the target outputs are compared. Both graphical techniques and quantitative statistics are used in model evaluation herein. For graphical techniques, the standard regression line and the response curves are used to show how well the estimated responses match the target outputs. The slope represents the relative relationship between estimated responses and target outputs. The y-intercept of the regression line indicates the lag between observed data and model predictions (Moriasi et al., 2007). For quantitative statistical methods, the coefficient of determination (R²) and the RMSE-observations standard deviation ratio (RSR) are used to evaluate the model performance. R² represents the percentage of variability that can be explained by the model. It is calculated as:

$$R^2 = \left[\frac{n\sum_{i=1}^{n} y_i \hat{y}_i - \left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} \hat{y}_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right] \times \left[n\sum_{i=1}^{n} \hat{y}_i^2 - \left(\sum_{i=1}^{n} \hat{y}_i\right)^2\right]}}\right]^2 \tag{5}$$


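RSR is calculated as the ratio of the RMSE and standard deviation of observed data:

$$\mathrm{RSR} = \frac{\mathrm{RMSE}}{\mathrm{STDEV}_{obs}} = \frac{\sqrt{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 / n}}{\sqrt{\sum_{i=1}^{n}\left(y_i - y_{mean}\right)^2 / n}} = \frac{\sqrt{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}}{\sqrt{\sum_{i=1}^{n}\left(y_i - y_{mean}\right)^2}} \tag{6}$$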

where $\hat{y}_i$ and $y_i$ denote the modeled concentration and the observed concentration of chlorophyll-a, respectively; $y_{mean}$ is the mean value of the observed chlorophyll-a concentration in each data set; n is the total number of observations in each data set. R² ranges from 0 to 1; higher values indicate better model performance, and R² values greater than 0.5 are considered acceptable (Singh et al., 2012). RSR ranges from zero, which indicates a perfect model simulation, to large positive values. A lower RSR indicates a lower RMSE and better model performance. The model performance can be judged as satisfactory if RSR is less than 0.70 (Moriasi et al., 2007).
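The two statistics of Eqs. (5) and (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `r_squared` and `rsr` are chosen here, and the observed and modeled values are hypothetical numbers, not data from the monitoring program.

```python
import math


def r_squared(obs, mod):
    """Coefficient of determination, written term by term as in Eq. (5)."""
    n = len(obs)
    sum_obs, sum_mod = sum(obs), sum(mod)
    sum_cross = sum(o * m for o, m in zip(obs, mod))
    sum_obs_sq = sum(o * o for o in obs)
    sum_mod_sq = sum(m * m for m in mod)
    numerator = n * sum_cross - sum_obs * sum_mod
    denominator = math.sqrt(
        (n * sum_obs_sq - sum_obs ** 2) * (n * sum_mod_sq - sum_mod ** 2)
    )
    return (numerator / denominator) ** 2


def rsr(obs, mod):
    """RMSE divided by the standard deviation of the observations, Eq. (6)."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - m) ** 2 for o, m in zip(obs, mod))
    squared_deviation = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(squared_error) / math.sqrt(squared_deviation)


# Hypothetical chlorophyll-a values in ug/L, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
modeled = [3.0, 4.0, 5.0, 8.0]

print("R2 :", round(r_squared(observed, modeled), 3))
print("RSR:", round(rsr(observed, modeled), 3))
```

With the thresholds quoted above, a model would be judged acceptable when the first value exceeds 0.5 and satisfactory when the second is below 0.70.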

3. Results

3.1. Alternative input variable selection

The alternative input variables were identified from a Pearson correlation analysis between the concentration of chlorophyll-a and the other water quality variables. The absolute value of the correlation coefficient represents the degree of correlation: more than 0.5 represents a strong correlation, 0.3 to 0.5 a moderate correlation, 0.1 to 0.3 a weak correlation, and less than 0.1 a very weak correlation (explorable.com, 2014). Therefore, the variables whose correlation coefficients had an absolute value greater than 0.1 were selected as alternative input variables. Based on the correlation analysis of the concentration of chlorophyll-a and the other eight parameters (Table 1), the variables with an absolute correlation coefficient greater than 0.1 were SD, DP, TP, TN, pH, and T. Moreover, algae produce oxygen in daylight and consume oxygen during the night. Oxygen consumption also occurs during the death and decay of algae. Therefore, dissolved oxygen is also related to chlorophyll-a concentration. Consequently, seven water quality variables, namely SD, DP, TP, TN, pH, DO and T, were chosen as alternative input variables of the BPANN model for chlorophyll-a concentration prediction.
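The selection rule described above can be written as a short sketch. The correlation coefficients are copied from Table 1; the function name `select_inputs`, the threshold constant, and the way dissolved oxygen is appended as a domain-knowledge variable are illustrative choices made here, not the authors' code.

```python
# Pearson correlation coefficients with chlorophyll-a, copied from Table 1.
correlations = {
    "DP": 0.340, "Chloride": -0.047, "SD": -0.444, "TN": 0.453,
    "DO": -0.090, "pH": 0.263, "T": 0.177, "TP": 0.567,
}

THRESHOLD = 0.1            # |r| above this value counts as at least weak
DOMAIN_KNOWLEDGE = ["DO"]  # kept for its link to algal oxygen dynamics


def select_inputs(corr, threshold, extra):
    """Keep variables with |r| > threshold, then add domain-knowledge ones."""
    chosen = [name for name, r in corr.items() if abs(r) > threshold]
    chosen += [name for name in extra if name not in chosen]
    # Order from the strongest to the weakest correlation.
    return sorted(chosen, key=lambda name: abs(corr[name]), reverse=True)


candidates = select_inputs(correlations, THRESHOLD, DOMAIN_KNOWLEDGE)

# Nested scenarios: drop the weakest remaining variable at each step.
scenarios = [candidates[:k] for k in range(len(candidates), 0, -1)]

print(candidates)
for scenario in scenarios:
    print(", ".join(scenario))
```

Under these assumptions the ranked list is TP, TN, SD, DP, pH, T, DO, and chloride is the only monitored variable left out.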

3.2. Spatial variation of the alternative input variables

Seven water quality variables, namely SD, DP, TP, TN, pH, DO, and T, were prepared as alternative input variables of the BPANN model for chlorophyll-a computation. The basic statistical parameters of the selected variables, i.e. the minimum, median, maximum, mean, standard deviation, and coefficient of variation, are shown in Table 2. Dissolved phosphorus, total phosphorus, and chlorophyll-a showed large variation between lake sampling sites. For the twelve sampling sites in Lake Champlain, measured DP, TP, and chlorophyll-a averaged 10.79 ± 7.21 µg/L (mean ± S.D.), 22.93 ± 16.15 µg/L, and 5.8 ± 4.68 µg/L, respectively. Correspondingly, the coefficients of variation of these three variables were 0.67, 0.70, and 0.81, respectively. The large variation in DP and TP in Lake Champlain is presumably attributable to the large geographical area of the lake and to the different loads from tributaries distributed around the lake, which in turn resulted in the variation in chlorophyll-a.
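The coefficients of variation quoted above follow directly from the reported means and standard deviations, since the coefficient of variation is the standard deviation divided by the mean. A minimal check, with variable and function names chosen here for illustration:

```python
# (mean, standard deviation) in ug/L, as reported in the text and Table 2.
stats = {
    "DP": (10.79, 7.21),
    "TP": (22.93, 16.15),
    "Chl-a": (5.80, 4.68),
}


def coefficient_of_variation(mean, std):
    """Ratio of the standard deviation to the mean (dimensionless)."""
    return std / mean


cv = {
    name: round(coefficient_of_variation(mean, std), 2)
    for name, (mean, std) in stats.items()
}
print(cv)
```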


Table 1 Correlation analysis of the concentration of chlorophyll-a and other water quality variables.

| Variable | Pearson correlation with Chl-a | Sig. (2-tailed) |
|---|---|---|
| Dissolved Phosphorus | 0.340* | 0.000 |
| Chloride | −0.047 | 0.068 |
| Secchi Depth | −0.444* | 0.000 |
| Total Nitrogen | 0.453* | 0.000 |
| Dissolved Oxygen | −0.090* | 0.000 |
| pH | 0.263* | 0.000 |
| Water Temperature | 0.177* | 0.000 |
| Total Phosphorus | 0.567* | 0.000 |

* Correlation is significant at the 0.01 level (2-tailed).

Table 2 Basic statistics of the selected water quality variables measured between 1992 and 2012 in Lake Champlain (n = 1512).

| Variable | Unit | Minimum | Maximum | Mean | Median | Standard Deviation | Coefficient of Variation |
|---|---|---|---|---|---|---|---|
| Secchi Depth | m | 0.30 | 9.05 | 3.76 | 4.00 | 1.80 | 0.48 |
| pH | – | 6.58 | 9.55 | 8.01 | 8.00 | 0.36 | 0.04 |
| Temperature | °C | 3.63 | 26.60 | 17.51 | 18.43 | 4.76 | 0.27 |
| Dissolved Oxygen | mg/L | 3.28 | 14.45 | 9.51 | 9.37 | 1.41 | 0.15 |
| Dissolved Phosphorus | µg/L | 3.00 | 54.00 | 10.79 | 8.00 | 7.21 | 0.67 |
| Total Phosphorus | µg/L | 5.33 | 140.50 | 22.93 | 16.00 | 16.15 | 0.70 |
| Total Nitrogen | mg/L | 0.20 | 1.35 | 0.42 | 0.40 | 0.12 | 0.28 |
| Chlorophyll-a | µg/L | 0.50 | 40.85 | 5.80 | 4.50 | 4.68 | 0.81 |
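The sample count n = 1512 given with Table 2 can be reproduced from the sampling design and the chronological split described in Section 2.3.2. The sketch below is illustrative only: the names are chosen here, and the calendar years simply assume that the 1992–2012 record is divided in order into 13, 4 and 4 years.

```python
STATIONS = 12          # lake stations with a complete 1992-2012 record
MONTHS_PER_YEAR = 6    # monthly averages available per station and year
FIRST_YEAR = 1992


def split_years(first_year, n_train, n_val, n_test):
    """Split consecutive years in order into training, validation, testing."""
    years = list(range(first_year, first_year + n_train + n_val + n_test))
    return (
        years[:n_train],
        years[n_train:n_train + n_val],
        years[n_train + n_val:],
    )


train_years, val_years, test_years = split_years(FIRST_YEAR, 13, 4, 4)

samples_per_year = STATIONS * MONTHS_PER_YEAR
sizes = {
    "training": len(train_years) * samples_per_year,
    "validation": len(val_years) * samples_per_year,
    "testing": len(test_years) * samples_per_year,
}

print(sizes, "total:", sum(sizes.values()))
```

Splitting by year rather than at random keeps the testing years entirely unseen during training, which matches the role of the testing set described in the methods.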

Table 3 Scenarios of various input variables for the chlorophyll-a prediction.

| Input Variables | Topology | Training Set R² | Training Set RSR | Validation Set R² | Validation Set RSR | Testing Set R² | Testing Set RSR |
|---|---|---|---|---|---|---|---|
| TP, TN, SD, DP, pH, T, DO | 7-12-1 | 0.82 | 0.62 | 0.93 | 0.38 | 0.81 | 0.53 |
| TP, TN, SD, DP, pH, T | 6-11-1 | 0.82 | 0.83 | 0.87 | 0.93 | 0.76 | 1.08 |
| TP, TN, SD, DP, pH | 5-11-1 | 0.79 | 1.10 | 0.88 | 0.67 | 0.74 | 1.25 |
| TP, TN, SD, DP | 4-7-1 | 0.76 | 1.49 | 0.89 | 0.78 | 0.73 | 1.45 |
| TP, TN, SD | 3-5-1 | 0.81 | 0.85 | 0.90 | 0.89 | 0.79 | 0.83 |
| TP, TN | 2-4-1 | 0.75 | 1.39 | 0.84 | 1.33 | 0.73 | 1.32 |
| TP | 1-3-1 | 0.76 | 1.26 | 0.84 | 1.10 | 0.72 | 1.53 |
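The topologies in Table 3 can be checked against the hidden-neuron rule quoted in Section 2.3.3, namely that the number of hidden neurons lies between I and 2I + 1 and exceeds the larger of I/3 and O. The following sketch enumerates that search range; the function name `hidden_candidates` is an illustrative choice, and the trial and error training over the range is not reproduced.

```python
def hidden_candidates(n_inputs, n_outputs=1):
    """Hidden-layer sizes in [I, 2I + 1] that also exceed max(I / 3, O)."""
    lower_bound = max(n_inputs / 3, n_outputs)
    return [
        n_hidden
        for n_hidden in range(n_inputs, 2 * n_inputs + 2)
        if n_hidden > lower_bound
    ]


# Number of inputs -> number of hidden neurons, read from Table 3.
table3_topologies = {7: 12, 6: 11, 5: 11, 4: 7, 3: 5, 2: 4, 1: 3}

for n_inputs, n_hidden in table3_topologies.items():
    allowed = hidden_candidates(n_inputs)
    status = "within range" if n_hidden in allowed else "outside range"
    print(f"{n_inputs}-{n_hidden}-1: {allowed[0]}..{allowed[-1]} ({status})")
```

Each reported topology falls inside the range produced by this rule, so the trial and error search only had to cover a handful of hidden-layer sizes per scenario.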

3.3. Performance of BPANN models

Seven scenarios built from the variables identified through the Pearson correlation analysis are presented in Table 3. Input variables were excluded one at a time, starting with the variable least correlated with chlorophyll-a (Tables 1 and 3). A BPANN model was then trained for each scenario. As stated previously, different BPANN models within the range [I, 2I + 1] were constructed and tested to determine the optimum number of hidden neurons for each scenario.

3.3.1. Statistical evaluation of the developed BPANN models

Table 3 shows the performance of each scenario for the training, validation, and testing data sets. Among the seven scenarios, the one that uses all seven variables as model input showed the best prediction performance in the testing data set. The values of R² and RSR were 0.82 and 0.62 in the training set, 0.93 and 0.38 in the validation set, and 0.81 and 0.53 in the testing set, respectively. The topology of this model has seven neurons in the input layer, twelve neurons in the hidden layer and one neuron in the output layer.

3.3.2. Graphical evaluation of the optimal BPANN models

For the optimal model selected above, Figs. 3–5 present the comparison of modeled and observed values of chlorophyll-a in the training, validation and testing data sets. To further illustrate the model performance, Fig. 6 shows the comparison of all the observed and modeled values of monthly chlorophyll-a concentration; the coefficient of determination (R²) for all the data was 0.87 and the RSR value was 0.48. The intercepts of the regression lines for the training data set, validation data set, testing data set and all data were 0.36, 2.4, 2.4, and 1.1, respectively. Correspondingly, the slopes of the regression lines for these data sets were 0.85, 0.73, 0.68, and 0.79, respectively.
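To make the selected 7-12-1 topology concrete, the sketch below scales one input sample to [−1, 1] and passes it through a tanh hidden layer and a linear output neuron, following Eqs. (1) and (2). Several things here are assumptions: the scaling uses the usual min-max form that maps the minimum to −1 and the maximum to 1, the ranges and the sample (the variable means) are taken from Table 2, and the weights are random placeholders rather than the trained values, so the printed output has no physical meaning.

```python
import math
import random

# (min, max) of each input from Table 2, in the order SD, pH, T, DO, DP, TP, TN.
RANGES = [
    (0.30, 9.05), (6.58, 9.55), (3.63, 26.60), (3.28, 14.45),
    (3.00, 54.00), (5.33, 140.50), (0.20, 1.35),
]


def normalize(x, x_min, x_max, y_min=-1.0, y_max=1.0):
    """Min-max scaling of x from [x_min, x_max] to [y_min, y_max]."""
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Eq. (1): tanh hidden layer, then Eq. (2): linear output neuron."""
    hidden = [
        math.tanh(sum(xi * w for xi, w in zip(x, row)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(h * w for h, w in zip(hidden, w_out)) + b_out


# Untrained placeholder weights: NOT the fitted values from the study.
rng = random.Random(0)
N_IN, N_HID = 7, 12
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b_hidden = [rng.uniform(-0.5, 0.5) for _ in range(N_HID)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HID)]
b_out = rng.uniform(-0.5, 0.5)

# Mean values from Table 2, used as one illustrative input sample.
sample = [3.76, 8.01, 17.51, 9.51, 10.79, 22.93, 0.42]
x_norm = [normalize(v, lo, hi) for v, (lo, hi) in zip(sample, RANGES)]
y_norm = forward(x_norm, w_hidden, b_hidden, w_out, b_out)

print("normalized inputs:", [round(v, 3) for v in x_norm])
print("network output (normalized, untrained):", round(y_norm, 3))
```

In a trained model the weights would come from the Levenberg-Marquardt fit, and the output would be mapped back from the normalized scale to µg/L.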

4. Discussion

Although the BPANN predictions reproduce the measured values satisfactorily, the regression lines in Figs. 3–5 cross the dotted lines that represent a perfect fit. High chlorophyll-a concentrations are underestimated and low values are overestimated in the training, validation, and testing data sets. For example, the highest observed chlorophyll-a concentration is 40.85 µg/L, while the corresponding modeled value is 30.36 µg/L. In the training set, the reason is that the objective of this study is to develop a BPANN model with better generalization ability, rather than a model that perfectly fits the training data. The reasons for the underestimation and overestimation in the validation and testing data sets can be summarized as follows. Maier et al. (2010) stated that ANN models perform best when they are applied within the range of the training data, and that, in order to develop the best-performing model, the statistical properties of the training, validation and testing data sets should be similar. Therefore, the underestimation and overestimation in the validation and testing data sets are probably due to the non-homogeneous nature of the input and output water quality variables. Because these data were measured over a long period of 21 years and the sampling sites were distributed over a large geographical


Fig. 3. Comparison of the modeled and observed concentration of chlorophyll-a in the training data set: (a) Response of the training data set; (b) Regression of the training data set.

Table 4 Statistical properties of chlorophyll-a concentration in training, validation, and testing data sets.

| Statistic | Training set | Validation set | Testing set |
|---|---|---|---|
| n | 936 | 288 | 288 |
| Mean (µg/L) | 4.90 | 8.35 | 6.17 |
| Median (µg/L) | 4.20 | 5.75 | 4.39 |
| Standard Deviation (µg/L) | 2.86 | 7.35 | 5.06 |
| Minimum (µg/L) | 0.50 | 0.64 | 0.50 |
| Maximum (µg/L) | 28.07 | 40.85 | 32.80 |

area, the values of the water quality variables varied widely in this study, and it is difficult to ensure the similarity of the statistical properties of the three data sets. As listed in Table 4, the range of chlorophyll-a concentration in the training, validation and testing data sets was 0.50–28.07 µg/L, 0.64–40.85 µg/L, and 0.50–32.80 µg/L, respectively. The median and standard deviation of the validation and testing data sets were also higher than those of the training data set. However, the developed model was trained to fit the data in the training data set, which has a smaller range as


Fig. 4. Comparison of the modeled and observed concentration of chlorophyll-a in the validation data set: (a) Response of the validation data set; (b) Regression of the validation data set.

compared to the other two data sets. Although the R² values of the testing and validation data sets were similar to or even higher than that of the training set, the y-intercepts of the validation and testing data sets are larger than that of the training data set, and the slopes of the regression lines of the validation and testing data sets are smaller than that of the training data set. Larger y-intercepts and smaller slopes of the regression lines indicate larger differences between the observed and modeled values, so the model performance in these two data sets is not as good as in the training data set. Abrupt changes in the target data, i.e. the chlorophyll-a concentration in this study, were found during the study period; they probably arose from unavoidable errors during the sampling and analysis of chlorophyll-a, and they created a considerable challenge for the prediction. The above factors would probably result in the underestimation or overestimation of the values in the data sets.

Although there are some underestimation problems, Fig. 6 shows good agreement of the total response between all the observed and modeled chlorophyll-a values. The coefficient of


Fig. 5. Comparison of the modeled and observed concentration of chlorophyll-a in the testing data set: (a) Response of the testing data set; (b) Regression of the testing data set.

determination is 0.87. Moreover, Moriasi et al. (2007) stated that a model simulation can be considered satisfactory if RSR ≤ 0.7. In this study, the RSR value of the testing data set is 0.53, which indicates that the developed BPANN model is applicable to chlorophyll-a prediction in Lake Champlain.

Regarding the determination of model input, Cho et al. (2014) stated that the accuracy of the model could be improved by using a relatively small number of input variables instead of a wide range of environmental factors. This is also supported by the research results of Lee et al. (2003). In contrast, in this study all seven variables were selected as the optimal model input rather than a relatively small number of input variables, and similar results were noted by He et al. (2011). These contradictions are probably due to the different properties of the water bodies and the study areas, e.g. the hydraulic conditions in the water body and the meteorological conditions in the study area.

The shallow bays of Lake Champlain, especially Missisquoi Bay, have experienced increasingly severe algal blooms over recent decades. Many studies have been conducted to monitor Chl-a concentration and cyanobacteria density in Missisquoi Bay and, in turn, to reveal the dynamic internal drivers of cyanobacteria bloom

Fig. 6. Comparison of the modeled and observed values of chlorophyll-a for all data: (a) Response of all data; (b) Regression of all data.

and to establish a framework for assessing the public health risk caused by cyanobacterial toxins (Isles et al., 2015; Watzin et al., 2006; Tarczynska et al., 2001; Wheeler et al., 2012). Consistent with the results reported in these studies, the model results in this study showed high chlorophyll-a concentrations. Chl-a is a key parameter for lake managers and is used to determine whether and how much action should be taken to reduce the external nutrient loading into the lake. Therefore, the developed BPANN model can provide advance information on the Chl-a concentration; in turn, managers can make a preliminary assessment of the public health risk and take measures to minimize or eliminate the adverse effects of algal blooms on the drinking water supply and recreational activities (Paerl et al., 2012).

5. Conclusions

The ability of a BPANN model to predict chlorophyll-a concentration in Lake Champlain was investigated in this study. Input variables used in the BPANN models were selected on the basis of linear correlation analysis results and domain knowledge. Seven scenarios including different combinations of water

F. Lu et al. / Ecological Engineering 95 (2016) 422–429

quality variables were carried out to build different models and to obtain the optimal set of input variables. The performances of the BPANN models were examined against observed chlorophylla concentration. The topology with seven neurons in the input layer, twelve neurons in the hidden layer, and one neuron in the output layer provided the best performance in the prediction of chlorophyll-a concentration among all the scenarios. The R2 values of the best-performed model in the training set, validation set, testing set, and all-year data were 0.82, 0.93, 0.81, and 0.87, respectively. The corresponding RSR values of the three data sets and all-year data were 0.62, 0.38, 0.53, and 0.48, respectively. The results indicate that the developed BPANN model could provide satisfactory results in predicting the chlorophyll-a concentration in Lake Champlain. The model can be used to provide a quick modeling assessment of chlorophyll-a variation for water environment managers and, in turn, to facilitate preliminary public health risk assessment. The developed BPANN model has great potential in further work to estimate interpolated data between two samplings and to estimate the chlorophyll-a concentration when the monitoring data cannot be acquired. Acknowledgments The authors acknowledge the financial support of the Scientific Research Fund of the First Institute of Oceanography, SOA [No. 2015T01], as well as the help of Tim Clear from the Vermont Department of Environmental Conservation in searching the data, and Mr. Tao Lu from Qingdao University of Technology for the assistance of organizing data. References Bowden, G.J., Nixon, J.B., Dandy, G.C., Maier, H.R., Holmes, M., 2006. Forecasting chlorine residuals in a water distribution system using a general regression neural network. Math. Comput. Modell. 44, 469–484. Chibole, O.K., 2013. Modeling River Sosiani’s water quality to assess human impact on water resources at the catchment scale. Ecohydrol. Hydrobiol. 13, 241–245. 
Cho, S., Lim, B., Jung, J., Kim, S., Chae, H., Park, J., Park, S., Park, J.K., 2014. Factors affecting algal blooms in a man-made lake and prediction using an artificial neural network. Measurement 53, 224–233.
Dogan, E., Sengorur, B., Koklu, R., 2009. Modeling biological oxygen demand of the Melen River in Turkey using an artificial neural network technique. J. Environ. Manag. 90, 1229–1235.
He, B., Oki, T., Sun, F., Komori, D., Kanae, S., Wang, Y., Kim, H., Yamazaki, D., 2011. Estimating monthly total nitrogen concentration in streams by using artificial neural network. J. Environ. Manag. 92, 172–177.
Isles, P.D.F., Giles, C.D., Gearhart, T.A., Xu, Y., Druschel, G.K., Schroth, A.W., 2015. Dynamic internal drivers of a historically severe cyanobacteria bloom in Lake Champlain revealed through comprehensive monitoring. J. Great Lakes Res. 41, 818–829.
Kalaji, H.M., Sytar, O., Brestic, M., Samborska, I.A., Cetner, M.D., Carpentier, C., 2016. Risk assessment of urban lake water quality based on in-situ cyanobacterial and total chlorophyll-a monitoring. Pol. J. Environ. Stud. 25, 45–56.
Lake Champlain long-term water quality and biological monitoring project database, 2014. Lake Champlain long-term water quality and biological monitoring project database (accessed 26.05.14) http://www.watershedmanagement.vt.gov/lakes/htm/lp longterm.htm.
Latif, Z., Tasneem, M.A., Javed, T., Butt, S., Fazil, M., Ali, M., Sajjad, M.I., 2003. Evaluation of water-quality by chlorophyll and dissolved oxygen. Water Resources in the South: Present Scenario and Future Prospects, 122.
Lee, J.H., Huang, Y., Dickman, M., Jayawardena, A., 2003. Neural network modelling of coastal algal blooms. Ecol. Modell. 159, 179–201.
Li, H., Hou, G., Dakui, F., Xiao, B., Song, L., Liu, Y., 2007. Prediction and elucidation of the population dynamics of Microcystis spp. in Lake Dianchi (China) by means of artificial neural networks. Ecol. Inf. 2, 184–192.
Li, J., Cheng, J.-H., Shi, J.-Y., Huang, F., 2012.
Brief Introduction of Back Propagation (BP) Neural Network Algorithm and Its Improvement. Advances in Computer Science and Information Engineering. Springer.
Liu, W., Chen, W., 2012. Prediction of water temperature in a subtropical subalpine lake using an artificial neural network and three-dimensional circulation models. Comput. Geosci. 45, 13–25.
Maier, H.R., Dandy, G.C., 2000. Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environ. Model. Softw. 15, 101–124.
Maier, H.R., Jain, A., Dandy, G.C., Sudheer, K.P., 2010. Methods used for the development of neural networks for the prediction of water resource variables in river systems: current status and future directions. Environ. Model. Softw. 25, 891–909.
May, R.J., Maier, H.R., Dandy, G.C., Fernando, T.M.K.G., 2008. Non-linear variable selection for artificial neural networks using partial mutual information. Environ. Model. Softw. 23, 1312–1326.
McQuaid, N., Zamyadi, A., Prevost, M., Bird, D.F., Dorner, S., 2011. Use of in vivo phycocyanin fluorescence to monitor potential microcystin-producing cyanobacterial biovolume in a drinking water source. J. Environ. Monit. 13, 455–463.
Mendelsohn, D., Isaji, T., Rines, H., 1996. Hydrodynamic and Water Quality Modeling of Lake Champlain. Applied Science Associates, Inc.
Moriasi, D., Arnold, J., Van Liew, M., Bingner, R., Harmel, R., Veith, T., 2007. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE 50, 885–900.
Paerl, H.W., Paul, V.J., 2012. Climate change: links to global expansion of harmful cyanobacteria. Water Res. 46 (5), 1349–1356.
Palani, S., Liong, S.-Y., Tkalich, P., 2008. An ANN application for water quality forecasting. Mar. Pollut. Bull. 56, 1586–1597.
Panda, R.K., Pramanik, N., Bala, B., 2010. Simulation of river stage using artificial neural network and MIKE 11 hydrodynamic model. Comput. Geosci. 36, 735–745.
Patil, S.L., Tantau, H.J., Salokhe, V.M., 2008. Modelling of tropical greenhouse temperature by auto regressive and neural network models. Biosyst. Eng. 99, 423–431.
Piotrowski, A.P., Osuch, M., Napiorkowski, M.J., Rowinski, P.M., Napiorkowski, J.J., 2014. Comparing large number of metaheuristics for artificial neural networks training to predict water temperature in a natural river. Comput. Geosci. 64, 136–151.
Sahoo, G.B., Ray, C., 2006. Flow forecasting for a Hawaii stream using rating curves and neural networks. J. Hydrol. 317, 63–80.
Sahoo, G.B., Schladow, S.G., Reuter, J.E., 2009.
Forecasting stream water temperature using regression analysis, artificial neural network, and chaotic non-linear dynamic models. J. Hydrol. 378, 325–342.
Sedki, A., Ouazar, D., El Mazoudi, E., 2009. Evolving neural network using real coded genetic algorithm for daily rainfall–runoff forecasting. Expert Syst. Appl. 36, 4523–4527.
Singh, K.P., Basant, A., Malik, A., Jain, G., 2009. Artificial neural network modeling of the river water quality: a case study. Ecol. Modell. 220, 888–895.
Singh, A., Imtiyaz, M., Isaac, R., Denis, D., 2012. Comparison of soil and water assessment tool (SWAT) and multilayer perceptron (MLP) artificial neural network for predicting sediment yield in the Nagwa agricultural watershed in Jharkhand, India. Agric. Water Manag. 104, 113–120.
Smeltzer, E., Quinn, S., 1996. A phosphorus budget, model, and load reduction strategy for Lake Champlain. Lake Reserv. Manag. 12, 381–393.
Tarczynska, M., Romanowska-Duda, Z., Jurczak, T., Zalewski, M., 2001. Toxic cyanobacterial blooms in drinking water reservoir: causes, consequences and management strategy. Water Sci. Technol. Water Supply 1 (2), 237–247. in Applied Phycology 4, DOI 10.1007/978-90-481-9268-7 11, © Springer Science+Business Media B.V.
The Canadian Encyclopedia, 2014. Lake Champlain (accessed 25.08.14) http://www.thecanadianencyclopedia.ca/en/article/lake-champlain/.
The Lake Champlain Basin, 2009. The Lake Champlain Basin Waterbody Inventory and Priority Waterbodies List. Bureau of Watershed Assessment and Management, Division of Water, NYS Department of Environmental Conservation.
United States Environmental Protection Agency, 2014. Chapter 4: Eutrophication (accessed 24.08.14) http://www.epa.gov/emap2/maia/html/docs/Est4.pdf.
Vermont DEC, New York State DEC, 1997. A phosphorus budget, model, and load reduction strategy for Lake Champlain (Lake Champlain Diagnostic-Feasibility Study final report). Waterbury, VT and Albany, NY.
Vermont Department of Environmental Conservation Water Quality Division, New York State Department of Environmental Conservation, 2012. Lake Champlain Long-Term Water Quality and Biological Monitoring Program Description. In: Vermont Department of Environmental Conservation Water Quality Division, New York State Department of Environmental Conservation (Eds.).
Watzin, M.C., Miller, E.B., Shambaugh, A.D., Kreider, M.A., 2006. Application of the WHO alert level framework to cyanobacterial monitoring of Lake Champlain, Vermont. Environ. Toxicol. 21, 278–288.
Wheeler, S.M., Morrissey, L.A., Levine, S.N., Livingston, G.P., Vincent, W.F., 2012. Mapping cyanobacterial blooms in Lake Champlain’s Missisquoi Bay using QuickBird and MERIS satellite data. J. Great Lakes Res. 38 (Suppl. 1), 68–75.
Wu, G., Xu, Z., 2011. Prediction of algal blooming using EFDC model: case study in the Daoxiang Lake. Ecol. Modell. 222, 1245–1252.
Zhao, L., Zhang, X., Liu, Y., He, B., Zhu, X., Zou, R., Zhu, Y., 2012. Three-dimensional hydrodynamic and water quality model for TMDL development of Lake Fu.
explorable.com, 2014. Statistical Correlation (accessed 30.08.14) https://explorable.com/statistical-correlation.