Science of the Total Environment 628–629 (2018) 1178–1190
Comparative predictive modelling of the occurrence of faecal indicator bacteria in a drinking water source in Norway
Hadi Mohammed a,⁎, Ibrahim A. Hameed b, Razak Seidu a
a Water and Environmental Engineering Group, Institute of Marine Operations and Civil Engineering, Norway
b Dept. of ICT and Natural Sciences, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology (NTNU) in Ålesund, Larsgårdsvegen 2, 6009 Ålesund, Norway
Highlights
• Zero-inflated (ZI) regression accommodates typical variations in faecal indicator bacteria observed in water.
• ZI models predict zero counts of faecal indicator bacteria better than high positive counts.
• Adaptive neuro-fuzzy inference system may not be very efficient for predicting count variables such as indicator bacteria.
• Random Forest and ZI regression models have high potential for real-time prediction of indicator bacteria in water supply.
Article info
Article history: Received 13 October 2017; received in revised form 8 February 2018; accepted 12 February 2018; available online xxxx
Editor: H.M. Solo-Gabriele
Keywords: Faecal indicator bacteria; Zero-inflated models; Random forest; Adaptive neuro-fuzzy inference system
Abstract
Presently, concentrations of fecal indicator bacteria (FIB) in raw water sources are not known before water undergoes treatment, since analysis takes approximately 24 h to produce results. Using data on water quality and environmental variables, models can be used to predict real-time concentrations of FIB in raw water. This study evaluates the potential of zero-inflated regression models (ZI), the Random Forest regression model (RF) and the adaptive neuro-fuzzy inference system (ANFIS) to predict the concentration of FIB in the raw water source of a water treatment plant in Norway. The ZI, RF and ANFIS faecal indicator bacteria predictive models were built using physico-chemical (pH, temperature, electrical conductivity, turbidity, color, and alkalinity) and catchment precipitation data from 2009 to 2015. The study revealed that pH, temperature, turbidity, and electrical conductivity in the raw water were the most significant factors associated with the concentration of FIB in the raw water source. Compared to the other models, the ANFIS model was superior (Mean Square Error = 39.49, 0.35, 0.09, 0.23 CFU/100 ml respectively for coliform bacteria, E. coli, Intestinal enterococci and Clostridium perfringens) in predicting the variations of FIB in the raw water during model testing. However, the model was not capable of predicting low counts of FIB during both the training and testing stages. The ZI and RF models were more consistent when applied to testing data, and they predicted FIB concentrations that characterized the observed FIB concentrations. While these models might need further improvement, the results of this study indicate that ZI and RF regression models have high prospects as tools for the real-time prediction of FIB in raw water sources for proactive microbial risk management in water treatment plants.
© 2018 Elsevier B.V. All rights reserved.
⁎ Corresponding author. E-mail address: [email protected] (H. Mohammed).
https://doi.org/10.1016/j.scitotenv.2018.02.140 0048-9697/© 2018 Elsevier B.V. All rights reserved.
H. Mohammed et al. / Science of the Total Environment 628–629 (2018) 1178–1190
1. Introduction
Worldwide, microbial contamination of raw water sources and the associated waterborne disease outbreaks remain a major challenge. Surface water bodies are particularly vulnerable to microbial contamination from natural and anthropogenic activities, mostly within their watersheds and catchments (Pandey et al., 2014; Wang et al., 2015). Although microbial organisms can enter drinking water supply systems through contamination of storage facilities and distribution networks, the quality of treated water largely depends on the quality of the raw water (WHO, 2011). Less advanced treatment may be required for less contaminated raw water sources (Davies and Mazumder, 2003). Protection of source water is therefore critical in the provision of safe drinking water (Allen et al., 2000).
Managers of drinking water supply systems routinely assess the levels of microbial organisms in raw water sources. Such routine monitoring exercises mostly target faecal indicator bacteria (FIB) such as coliform bacteria, total coliforms, fecal coliforms and Escherichia coli (E. coli), since their presence indicates deteriorating microbiological quality of water. For instance, E. coli in water provides conclusive evidence of direct fecal contamination from warm-blooded animals, whereas organisms such as Clostridium perfringens, when found in water, suggest potential contamination by sewage (Myers et al., 2007). The ease and relatively low cost of identifying these FIB in water make their presence or absence a fundamental basis for establishing the microbial quality of water (Myers et al., 2007). However, existing methods for the detection of FIB are still fraught with challenges including sensitivity, specificity, speed, cost and false negatives (Kostic et al., 2011; Law et al., 2015). Results of laboratory analysis of raw water often become available only after the raw water has undergone treatment and reached consumers.
In cases of waterborne outbreaks, the levels of microbial organisms in drinking water are only investigated after the incidents. To achieve the ultimate goal of protecting public health, early detection of microbial organisms in raw water is necessary for the development of proactive risk management strategies. Besides enhancing microbial detection tools, mathematical models can be employed to reliably predict the concentration of microbial organisms in raw water. Studies have shown that the growth and survival of microbial organisms in raw water sources as well as their resilience to treatment are significantly controlled by physical and chemical water quality parameters (e.g. turbidity, colour, pH, temperature and electrical conductivity) (Delpla et al., 2009; Lage Filho, 2010; Khan et al., 2013; Tang et al., 2006; Barry et al., 2016). These water quality variables are typically measured in real time before raw water undergoes treatment. Parameters such as turbidity may give an indication of particulate concentrations in the raw water, which nourish microorganisms (Sinclair et al., 2012), while others such as pH and temperature affect the effectiveness of treatment processes including coagulation and flocculation (Crittenden et al., 2012). Therefore, managers of water treatment plants can take advantage of real time measurements of physical and chemical parameters to predict the concentrations of FIB in raw water before treatment. In relation to the application of mathematical models, it is important that a suite of models are applied and compared, and the best one selected based on sound statistical parameters. Microbial data are typically non-negative integers that are skewed with the possibility of large number of zeros. This skewness and zero-inflated data have generally been accounted for by approximation to normal distribution in multiple linear regression models (Eregno et al., 2014a; Herrig et al., 2015). 
However, studies have shown that such transformations do not entirely eliminate certain violations of traditional regression models such as heteroscedasticity, and could produce biased parameter estimates (Coxe et al., 2009; Beaujean and Morgan, 2016). Both Poisson and negative binomial models are often regarded as the reliable methods for analyzing count data in general (Zuur et al., 2009), yet for the Poisson model, the conditional restriction on the equality of mean and variance makes it inadequate for application in
count data analysis (Dohoo et al., 2009; Cameron and Trivedi, 2013). While the negative binomial model is able to address the limitations of the Poisson model for overdispersed count data (due to the additional parameter that accounts for extra variance), some studies suggest that the model may not be suitable when excessive zeros are present in count data (Rose et al., 2006; Phang and Loh, 2014; Hüls et al., 2017). If the variance is misspecified, the precision of the estimated parameters may still be affected, potentially resulting in misleading inferences and conclusions. Excessive zeros characterize typical FIB observed in water. Accordingly, it is necessary to apply flexible models that incorporate the excessive zeros in the modeling process. Methods such as zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) are two-part models that model both zero and non-zero counts. These methods model the zero outcomes in count data as the probability of an observation being always zero, plus the probability of an observation not being always zero multiplied by the probability that the count process yields a zero (Hilbe and Green, 2007). Relative to standard forms of Poisson and negative binomial models, zero-inflated models have been shown to be more appropriate in modeling count data that contain excessive zeros and overdispersion (Sae-Lim et al., 2017; Yang et al., 2017).
Studies have used more advanced modelling approaches such as Artificial Neural Network (ANN) to predict the occurrence of FIB in water (Mas and Ahlfeld, 2007; Zhang et al., 2015). However, ANN is a black-box model that is not able to explain the individual contributions of the predictor variables in the model. Other highly efficient artificial intelligence (AI) models such as Random Forest (RF) and Adaptive Neuro-fuzzy Inference System (ANFIS) models are capable of explaining the relationships and importance of model input variables to the outputs.
Despite successful applications of these highly efficient models in other aspects of water research (Wu and Lo, 2008; Salahi et al., 2015; Walsh et al., 2017), the techniques are rarely applied in the prediction of FIB in drinking water sources. This study applies and compares non-linear zero-inflated models (ZIP and ZINB) as well as RF and ANFIS models to predict the concentrations of FIB in raw water sources using data from the raw water intake of the Oset Water Treatment Plant (WTP) in Oslo, Norway. The specific aim is to predict the concentrations of coliform bacteria, E. coli, Intestinal enterococci and C. perfringens in the raw water using physical and chemical water quality and environmental parameters as predictors. These models account for deficiencies in traditional multiple linear and non-linear regression models as well as the black-box elements in ANN. The comparative assessment of zero-inflated regression and AI models in this study will contribute to a further understanding of the selection and applicability of appropriate models in the prediction of microbial organisms in raw water sources.
2. Materials and methods
2.1. Study site
Data used in this study were obtained from the raw water source of the Oset WTP in Oslo, Norway (Fig. 1). The WTP has a capacity of 390,000 m³/day and is the largest municipal water treatment plant in Scandinavia, providing safe drinking water to about 90% of the inhabitants of Oslo. The main raw water source of the Oset WTP is Maridalen Lake, which has a surface area of 3.83 km² within a catchment of 252 km². The lake has an average annual flow of 184 million m³/year of water from two primary inflows (Skajærsjølva and Dausjølva) from the north, and drains into the Akerselva River to the south (Oslo Kommune Vann og Avløpsetaten, 2012). The raw water intake of the WTP is located at a depth of 32 m in the lake.
Currently, no microbial source tracking in the catchment has been carried out to establish the sources of faecal contamination to the lake. However, the land uses and activities within the lake's catchment area provide many potential sources of faecal contamination. There are nearly 1500 people living
within the catchment area, with about 100 horses and a few farm animals (Tryland et al., 2015). Management of wastewater from households is primarily through containment of blackwater in closed tanks and treatment of greywater on-site. Incidents such as leakages from the closed blackwater tanks, combined with poor greywater treatment, contribute to microbial contamination of the water source and remain an ongoing challenge for the Oset water treatment plant operators. Wild animals and birds are also predominant in the catchment and contribute to microbial contamination of the water source. In addition, the catchment is a popular area for recreational activities, although bathing and camping are not allowed (Oslo Kommune Vann og Avløpsetaten, 2012).
2.2. Data set
Raw water samples are routinely collected on a weekly basis at the raw water source of the Oset WTP and analyzed for physical, chemical and microbial parameters using standard laboratory methods. The data used in this study consist of weekly samples taken from the raw water source of the WTP from January 2009 to December 2015 for the physical, chemical and FIB parameters, constituting 364 data points for each parameter. The physical and chemical parameters were pH, temperature (°C), electrical conductivity (μS/cm), turbidity (NTU), color (mg Pt/l) and alkalinity (mmol/l). The FIB parameters accounted for were coliform bacteria, E. coli, Intestinal enterococci and C. perfringens. In addition, we averaged daily precipitation data taken from three meteorological stations within the catchment of the drinking water source (Maridalsvannet Lake). All the physical and chemical analyses were
carried out by the Water and Wastewater Department of the Oslo municipality, while the ALS Laboratory in Oslo carried out the FIB analysis. The ALS laboratory uses the International Organization for Standardization (ISO) standard procedures. All the ISO standard procedures used involved concentration through membrane filtration (MF), as well as detection/enumeration using selective growth media. The MF method largely produces microbial count data that are non-censored and is more precise and accurate than the classical most probable number method, which produces censored microbial count data. The ISO 9308-1 method with Coliform Chromogenic Agar (CCA), a solid, selective and differential culture medium, was used to detect and enumerate E. coli and coliform bacteria by membrane filtration. The ISO 7899-2 method with Bile Esculin Azide Agar as selective medium was used for the isolation and presumptive identification of Intestinal enterococci after membrane filtration, while ISO 14189:2013 with m-CP agar selective medium was used for enumerating vegetative cells and spores of C. perfringens after membrane filtration.
The membrane filtration involved filtering 100 mL of raw water through microporous membranes (typically 47 mm diameter, 0.22 to 0.45 ± 0.02 μm pore size) to recover cells of the FIB. After the filtration, the membranes with the recovered cells were placed on the surfaces of selective growth media such that no air bubbles were entrapped between the membrane and the surface of the media. Subsequently, the recovered cells on the membrane were grown by incubation for 24 h in the agar petri dishes with nutrient media (Oshiro, 2002). The incubation temperature and duration depended on the target FIB. For instance, in ISO 9308-1, the membrane was incubated for 18–24 h at 36 ± 2 °C, while a temperature of 44 ± 1 °C was used for ISO 14189:2013 for 21 ± 3 h for C. perfringens. The ISO 7899-2 for Intestinal enterococci involved incubation at 36 ± 2 °C for 44 ± 4 h. Salmon-rose to red colonies were counted as coliform bacteria, while all deep blue to violet colonies were counted as E. coli. Typical colonies of Intestinal enterococci appear red, maroon or pink. For C. perfringens, the opaque yellow colonies that turn pink or red when exposed to ammonium hydroxide were counted. To ensure differential counting of well-separated target faecal indicator organisms, a 10 to 15 power microscope or illuminated magnifier was used. The results for the FIB were reported as colony forming units (CFU) per 100 ml.
As a rule, the number of FIB colonies present on a single membrane filter should not exceed 200 (WHO, 2003). Accordingly, during the analysis of each sample, a minimum of three dilutions are carried out to ensure a countable plate. The dilutions enable much higher concentrations in a given sample to be enumerated. While the method may not be sufficiently specific, and may not identify stressed/injured microorganisms, its simplicity enables large volumes of water to be analysed, resulting in good sensitivity and reliability (Bartram and Rees, 1999; Rompré et al., 2002). Detailed descriptions of these methods can be found in ISO 9308-1 (2000), ISO-H (2000), and Lušić et al. (2014).
Fig. 1. Study area map showing the location of the Oset Water Treatment Plant and Maridalsvannet Lake in Oslo, Norway.
The seven-year data set was randomly split into two parts: 70% for training each of the models and the remaining 30% for testing the models.
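The random 70/30 partition described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `train_test_split`, the fixed seed and the placeholder list of 364 weekly records are hypothetical and are not taken from the study, which does not report how the split was implemented.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Randomly partition records into a training and a testing subset."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    indices = list(range(len(records)))
    rng.shuffle(indices)               # random order, no replacement
    n_train = round(train_fraction * len(records))
    train = [records[i] for i in indices[:n_train]]
    test = [records[i] for i in indices[n_train:]]
    return train, test


# Placeholder for the 364 weekly observations (hypothetical stand-in data).
weekly_samples = list(range(364))
train, test = train_test_split(weekly_samples)
print(len(train), len(test))
```

Every record ends up in exactly one of the two subsets, so the testing data are never seen during training.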
2.3. Descriptive statistics
Descriptive analysis of the raw data was undertaken to extract pertinent information from the data. The key statistical features extracted were the minimum and maximum values, mean, standard deviation, variance, skewness and kurtosis. This provided useful insight into statistical properties such as the probability distribution, particularly of the observed FIB in the raw water, and therefore into the selection of an appropriate distribution to fit the data. Water quality variables, particularly FIB, often contain data points that are numerically distant from the rest of the observations. Such observations can introduce bias in the estimates of correlation and regression model parameters. However, in this study, all data points were included in building the models to ensure that the actual characteristics of the data are captured. Pearson correlation coefficients were calculated to quantify the strengths of linear relations amongst the data variables. For each pair of variables, the coefficient was determined using:
$$r = \frac{SS_{XY}}{\sqrt{(SS_{XX})(SS_{YY})}} \tag{1}$$
where $SS_{XX}$ and $SS_{YY}$ are the sums of squares of variables x and y respectively and $SS_{XY}$ is the sum of the cross products of the variables. The value of r lies between −1 and +1, with values close to +1 and −1 indicating strong positive and strong negative correlations respectively. Following the descriptive analysis, the following models were applied to predict the concentrations of coliform bacteria, E. coli, Intestinal enterococci and C. perfringens using pH, temperature, electrical conductivity (EC), turbidity, color and alkalinity as predictors: a) zero-inflated Poisson, b) zero-inflated negative binomial, c) Random Forest, and d) ANFIS. The R programming software (R-3.4.1 for Windows) was used in performing the descriptive analysis and building the regression models. The Random Forest model was built using custom code written in MATLAB R2016b, while the ANFIS model was built with the Neurofuzzy Designer toolbox in MATLAB. The application of the models is presented as follows:
2.4. Zero-inflated models
The basic Poisson regression model links the logarithm of the response count variable to a linear function of the explanatory variables such that the probability of the count of the FIB ($y_i$), given a vector of water quality variables $x_i$, follows a Poisson distribution given by Zuur et al. (2009):
$$\Pr(y_i; \lambda_i \mid y_i \ge 0) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \quad \text{for } y_i = 0, 1, 2, \ldots \tag{2}$$
The Poisson model restricts the conditional variance Var($Y_i$) to equal the conditional mean E($Y_i$) (equi-dispersion), thus the distribution is described by one parameter, the conditional mean ($\lambda_i$). That is, E($Y_i$) = Var($Y_i$) = $\lambda_i$. An alternative approach to accounting for the excess variance or heterogeneity is the negative binomial model. A detailed explanation of the negative binomial distribution can be found in Hilbe (2008). For a count variable ($Y_i \mid \lambda_i$) following a Poisson distribution with conditional mean E($Y_i \mid \lambda_i$) = $\lambda_i$, if the conditional mean $\lambda_i$ is assumed to follow a gamma distribution with mean E($\lambda_i$) = $\mu_i$ and variance Var ($\mu_i$) = $\mu_i(1 + \kappa_i)$, then using a combination of Poisson and gamma distributions, it can be shown that the unconditional distribution of $Y_i$ follows a negative binomial distribution with probability function:
$$\Pr(y_i; \kappa, \lambda_i \mid y_i \ge 0) = \frac{\Gamma(\kappa^{-1} + y_i)}{\Gamma(1 + y_i)\,\Gamma(\kappa^{-1})} \left(\frac{\kappa\lambda_i}{1 + \kappa\lambda_i}\right)^{y_i} \left(\frac{1}{1 + \kappa\lambda_i}\right)^{1/\kappa}, \quad y_i = 0, 1, 2, \ldots \tag{3}$$
where $\kappa$ is the dispersion parameter estimated by maximum likelihood. Thus, as $\kappa$ approaches zero, the mean and the variance become equal, resulting in a Poisson distribution. However, when $\kappa$ is greater than zero, the variance exceeds the mean, and the distribution allows for over-dispersion. Due to the additional dispersion parameter $\kappa$, the negative binomial model has an advantage over the Poisson model. However, the excessive zeros in FIB typically observed in water may still pose a challenge to this standard model. To address these challenges, it is necessary to apply models that can adequately incorporate the excess zeros without bias in the estimation of the coefficients of the predictors. Zero-inflated Poisson (ZIP) and negative binomial (ZINB) models account for the excess zeros as well as over-dispersion. Zuur et al. (2009) showed that the probability of zero counts in data can be expressed as:
$$\Pr(Y_i = 0) = \Pr(\text{False zeros}) + \bigl(1 - \Pr(\text{False zeros})\bigr) \times \Pr(\text{Count process gives zero}) \tag{4}$$
If the probability that $Y_i$ is a false zero is assumed to be binomially distributed with probability $\pi_i$, then the probability that $Y_i$ is not a false zero is $1 - \pi_i$. Therefore, Eq. (4) can be written as:
$$\Pr(Y_i = 0) = \pi_i + (1 - \pi_i) \times \Pr(\text{Count process gives zero}) \tag{5}$$
This mixture of probabilities is what mainly differentiates the zero-inflated Poisson and negative binomial models from their standard forms. Thus, if the count process is assumed to follow a Poisson distribution with expectation $\lambda_i$, then from Eq. (2), the probability of zero counts in the data is given by:
$$\Pr(0; \lambda_i \mid y_i \ge 0) = \frac{e^{-\lambda_i}\lambda_i^{0}}{0!} = e^{-\lambda_i} \tag{6}$$
and Eq. (5) becomes:
$$\Pr(Y_i = 0) = \pi_i + (1 - \pi_i) \times e^{-\lambda_i} \tag{7}$$
This means that the probability of observing a zero outcome is the probability of a false zero plus the probability of not being a false zero multiplied by the probability that the count process yields a zero. Similarly, the probability of a non-zero count can be expressed as (Zuur et al., 2009):
$$\Pr(Y_i = y_i \mid y_i > 0) = (1 - \pi_i) \times \Pr(\text{Count process}) \tag{8}$$
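As a numerical illustration of Eqs. (2), (3) and (5), the sketch below evaluates the Poisson and negative binomial probability functions and checks that the negative binomial variance exceeds its mean while the Poisson variance equals it. It is a minimal sketch, not the fitting code used in the study (which was done in R); the function names and the values of `lam`, `kappa` and `pi` are hypothetical choices made only for the example.

```python
import math


def poisson_pmf(y, lam):
    """Eq. (2): Poisson probability of count y with mean lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def negbin_pmf(y, lam, kappa):
    """Eq. (3): negative binomial probability, mean lam, dispersion kappa > 0."""
    inv = 1.0 / kappa
    log_coef = math.lgamma(inv + y) - math.lgamma(1 + y) - math.lgamma(inv)
    log_prob = (y * math.log(kappa * lam / (1 + kappa * lam))
                - inv * math.log(1 + kappa * lam))
    return math.exp(log_coef + log_prob)


def zero_probability(pi, count_zero_prob):
    """Eq. (5): false zeros plus zeros produced by the count process."""
    return pi + (1 - pi) * count_zero_prob


def mean_and_variance(pmf, upper=400):
    """Moments computed by direct summation over a truncated support."""
    mean = sum(y * pmf(y) for y in range(upper))
    second = sum(y * y * pmf(y) for y in range(upper))
    return mean, second - mean ** 2


lam, kappa, pi = 2.0, 0.5, 0.3  # illustrative values only
pois_mean, pois_var = mean_and_variance(lambda y: poisson_pmf(y, lam))
nb_mean, nb_var = mean_and_variance(lambda y: negbin_pmf(y, lam, kappa))
zip_zero = zero_probability(pi, poisson_pmf(0, lam))

print(f"Poisson: mean={pois_mean:.3f}, variance={pois_var:.3f}")
print(f"Negative binomial: mean={nb_mean:.3f}, variance={nb_var:.3f}")
print(f"Zero probability: Poisson={poisson_pmf(0, lam):.3f}, zero-inflated={zip_zero:.3f}")
```

With these values the negative binomial keeps the same mean as the Poisson but has twice the variance, and adding a false-zero probability raises the share of zeros well above what the Poisson count process alone would produce.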
If the binary part (false zeros versus all other types of data) is assumed to be binomially distributed, then for Poisson-distributed count data,
$$\Pr(Y_i = y_i \mid y_i > 0) = (1 - \pi_i)\,\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \tag{8}$$
The probability distribution for a ZIP model can therefore be written as:
$$\Pr(Y_i = y_i) = \begin{cases} \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, & \text{if } y_i = 0 \\[4pt] (1 - \pi_i)\,\dfrac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}, & \text{if } y_i > 0 \end{cases} \tag{9}$$
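The ZIP distribution in Eq. (9) can be checked numerically. The sketch below, with hypothetical values for `lam` and `pi`, confirms that the probabilities sum to one and compares the mean and variance obtained by direct summation with the standard closed forms λ(1 − π) and λ(1 − π)(1 + πλ). It illustrates the distribution only and is not the estimation procedure used in the study.

```python
import math


def zip_pmf(y, lam, pi):
    """Eq. (9): zero-inflated Poisson probability of count y."""
    count_prob = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return pi + (1 - pi) * count_prob   # false zeros + Poisson zeros
    return (1 - pi) * count_prob            # positive counts come from the Poisson part


lam, pi = 3.0, 0.4   # illustrative values only
support = range(200)  # truncation point; the tail beyond it is negligible here

total = sum(zip_pmf(y, lam, pi) for y in support)
mean = sum(y * zip_pmf(y, lam, pi) for y in support)
variance = sum(y * y * zip_pmf(y, lam, pi) for y in support) - mean ** 2

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.4f} (closed form {lam * (1 - pi):.4f})")
print(f"variance = {variance:.4f} (closed form {lam * (1 - pi) * (1 + pi * lam):.4f})")
```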
In Eq. (9), $0 \le \pi_i \le 1$, the mean is E($Y_i$) = $\lambda_i(1 - \pi_i)$ and the variance is Var($Y_i$) = $\lambda_i(1 - \pi_i)(1 + \pi_i\lambda_i)$. Therefore, the model incorporates extra zeros, differentiating it from the standard Poisson form, for which $\pi_i = 0$. Finally, to obtain the ZINB model, the Poisson distribution is replaced with a negative binomial distribution. This can be expressed as:
$$\Pr(Y_i = y_i) = \begin{cases} \pi_i + (1 - \pi_i)\left(\dfrac{1}{1 + \kappa\lambda_i}\right)^{1/\kappa}, & \text{if } y_i = 0 \\[6pt] (1 - \pi_i)\, f_{NB}(y_i), & \text{if } y_i > 0 \end{cases} \tag{10}$$
where $f_{NB}(y)$ is the negative binomial distribution function defined in Eq. (3) above. Thus, the mean is E($Y_i$) = $\lambda_i(1 - \pi_i)$ and the variance is Var($Y_i$) = $\lambda_i(1 - \pi_i)(1 + \kappa\lambda_i + \pi_i\lambda_i)$. The parameters $\lambda_i$ and $\pi_i$ are obtained from the covariates and $\kappa \ge 0$ is a scalar. In this study, ZIP and ZINB models were built for each of the four FIB using the seven water quality and environmental parameters as inputs.
2.5. Random forest (RF)
Random Forest (Breiman, 2001) is an ensemble machine-learning technique based on the aggregation of a large set of decision trees. A random set of predictor variables and data samples is picked from the training data set to train each tree. Random Forest generates multiple decision trees through random bootstrap sampling of the data. Subsequently, either a classification or a regression tree is fitted to each bootstrap sample such that the output variable can be predicted as an average of the decisions of all the trees (Cutler et al., 2011). The algorithm applied in growing each tree involves two main steps:
(1) A random selection of approximately two-thirds of the training data set with replacement to formulate the decision tree, each time testing the accuracy of the tree using the remaining one-third of the training data set, known as “out-of-bag” (oob) samples.
(2) Determining the split conditions of each node in the tree using a random subset of the predictor variables.
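The two steps above can be sketched in a few lines. To stay short, the example replaces full regression trees with single-split trees (stumps), so it is a deliberate simplification and not the MATLAB implementation used in the study; the synthetic data, function names and parameter values are all hypothetical. It shows that each bootstrap sample contains roughly two-thirds of the distinct training rows and that the rows left out of a sample can be used to estimate the prediction error.

```python
import random


def fit_stump(X, y, feature_ids):
    """Best single-split regression stump over the candidate features (step 2)."""
    best = None
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            mean_left = sum(left) / len(left)
            mean_right = sum(right) / len(right)
            sse = (sum((t - mean_left) ** 2 for t in left)
                   + sum((t - mean_right) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, mean_left, mean_right)
    if best is None:  # every candidate feature is constant in this sample
        mean_all = sum(y) / len(y)
        return (feature_ids[0], float("inf"), mean_all, mean_all)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, mean_left, mean_right = stump
    return mean_left if row[j] <= threshold else mean_right


def bagged_oob(X, y, n_trees=100, n_candidate_features=1, seed=0):
    """Return the out-of-bag MSE and the mean share of distinct in-bag rows."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    pred_sum, pred_count, in_bag_share = [0.0] * n, [0] * n, 0.0
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # step 1: sample with replacement
        in_bag = set(boot)
        in_bag_share += len(in_bag) / n
        features = rng.sample(range(p), n_candidate_features)
        stump = fit_stump([X[i] for i in boot], [y[i] for i in boot], features)
        for i in range(n):
            if i not in in_bag:                       # out-of-bag rows only
                pred_sum[i] += predict_stump(stump, X[i])
                pred_count[i] += 1
    errors = [(y[i] - pred_sum[i] / pred_count[i]) ** 2
              for i in range(n) if pred_count[i] > 0]
    return sum(errors) / len(errors), in_bag_share / n_trees


# Synthetic data (hypothetical): the response depends on the first feature only.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(80)]
y = [5.0 * (row[0] > 0.5) + data_rng.gauss(0, 0.5) for row in X]

mse_oob, in_bag_fraction = bagged_oob(X, y)
mean_y = sum(y) / len(y)
variance_y = sum((t - mean_y) ** 2 for t in y) / len(y)
print(f"distinct in-bag share = {in_bag_fraction:.3f}")
print(f"out-of-bag MSE = {mse_oob:.3f}, variance of y = {variance_y:.3f}")
```

Because the out-of-bag rows were not used to fit the tree that predicts them, the resulting error behaves like a built-in validation estimate.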
In a typical regression problem, the generalization error is estimated using the out-of-bag mean square error (MSE) expressed as:
$$MSE_{oob} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}_{oob}(x_i)\right)^2 \tag{11}$$
Generally, the $MSE_{oob}$ helps to prevent overfitting the model, and therefore the use of a testing data set to evaluate the predictive ability of the model is often not necessary. However, since the model performance is compared to the outputs of other models in this study, we applied the testing data set (30% of the entire data) to the model and compared the calculated MSE from the results to those of the other models. In addition, after training the models using all the seven water quality and environmental variables, we built reduced versions of the models using only the four most important variables as provided by the full model. In each case, the $MSE_{oob}$ value of the reduced model is compared with that of the full version. The RF technique has a high accuracy of predictions, and is capable of handling correlated variables without necessarily affecting the prediction accuracy. The method also provides useful information about the importance of the predictor variables.
2.6. ANFIS models
ANFIS is a hybrid model that harnesses the adaptive capability of ANNs and the fuzzy logic approach, overcoming their limitations in order to achieve optimum performance in learning from data. The technique analyzes mapping relationships between input and output parameters to optimize the distribution of membership functions using a back-propagation gradient-descent algorithm either alone or combined with the least-squares method (Bisht and Jangid, 2011; Jang, 1993; Lou and Loparo, 2004). ANFIS has been applied mostly in modelling water treatment processes such as coagulant dosage in water treatment (Heddam et al., 2012), prediction of disinfection by-products (Chowdhury et al., 2009) and other water quality indices (Sahu et al., 2011), but has not been used in the prediction of FIB (Mohammed et al., 2017). In this paper, ANFIS models were trained on the measured water quality parameters to predict the counts of the FIB in raw water. The proposed ANFIS has seven inputs (pH, temperature, EC, turbidity, color, alkalinity, and precipitation) with one output for each model (FIB). Each input is represented by two fuzzy sets (two membership functions per input variable) and the output by a first-order polynomial of the inputs, resulting in 128 rules (i.e., $2^7$ rules). The ANFIS implements n rules as follows:
$$\begin{aligned} R_1 &: \text{If } x_1 \text{ is } A_1 \text{ AND } x_2 \text{ is } B_1 \text{ AND} \ldots \text{AND } x_7 \text{ is } G_1, \text{ then } y = f_1 = k_{10} + k_{11}x_1 + k_{12}x_2 + \ldots + k_{17}x_7 \\ &\;\;\vdots \\ R_n &: \text{If } x_1 \text{ is } A_2 \text{ AND } x_2 \text{ is } B_2 \text{ AND} \ldots \text{AND } x_7 \text{ is } G_2, \text{ then } y = f_n = k_{n0} + k_{n1}x_1 + k_{n2}x_2 + \ldots + k_{n7}x_7 \end{aligned} \tag{12}$$
where $x_1, x_2, \ldots, x_7$ are the input variables of the model, $A_1$ and $A_2$ are the fuzzy sets on the universe of discourse of $x_1$, $B_1$ and $B_2$ are the fuzzy sets on the universe of discourse of $x_2$, etc., y is the output variable (FIB), and $k_{i0}, k_{i1}, \ldots, k_{i7}$ are the set of parameters specified for rule $R_i$. To avoid overfitting of the models, different numbers of epochs (the number of times the algorithm sees the entire data set) were tried and the number of epochs with the least MSE was selected. ANFIS models are capable of interpreting the relationships between the input and output variables by implementing fuzzy rules and can handle more complex nonlinear relationships. However, in this study, we present and discuss only the predictive ability of the technique.
2.7. Model selection
Given the nature of the observed counts of FIB in the raw water, the principal basis for comparing the performances of the modelling approaches applied in this study is the models’ ability to estimate counts that are close to observed values as well as to predict periods of high counts. Moreover, the consistency and accuracy of each model when subjected to a new data set is assessed to determine which of the approaches demonstrates predictive superiority for this type of data. For the ZIP and ZINB models, their predictive accuracies were examined by first comparing each with the intercept-only (null) model using Pearson’s chi-square test. In addition, we calculated values of the Akaike Information Criterion (AIC) for each model. By the AIC statistic, models are
ranked using the score determined by:
$$AIC_i = -2\log L_i + 2P_i \tag{13}$$
where $L_i$ is the maximized likelihood of model i, and $P_i$ is the number of estimable parameters in the model. Thus, the best model is the one with the lowest AIC score.
Table 1 Statistical description of raw water quality parameters.
In addition, the zero-inflated models were compared with their standard Poisson and negative binomial versions using Vuong's non-nested test (Vuong, 1989). The test states that under the null hypothesis that two non-nested models A and B are indistinguishable, a scaled form of the likelihood ratio test statistic converges asymptotically to a standard normal distribution, N(0, 1), as the sample size approaches infinity. The statistic can be expressed as:
$$Q_v = \frac{Q - \theta}{\sqrt{V(Q)\,n}} \tag{14}$$
where V(Q) is the variance of the log-likelihood test statistic, θ is a bias correction term and n is the sample size. To compare the performance of one model with another, $Q_v$ is compared with the quantiles of a standard normal distribution. Thus, a large, positive value of $Q_v$ indicates the superiority of model A over model B, while a large, negative value of $Q_v$ suggests that B is superior to A. Subsequently, the suitability of each model for estimating the levels of the FIB in raw water based on the physical and chemical water quality and environmental parameters is assessed by applying the trained models to the testing data set. Total correct predictions (percentages of FIB predicted within one standard deviation of each FIB) in the testing stage are determined. In addition, the MSE of predictions are computed. Based on these and the ability of the models to explain the relative contribution of each predictor input to the variation of the counts of the FIB, conclusions are drawn on the best method for application in the drinking water treatment facility.
3. Results and discussion
The results are structured as follows: we initially present the descriptive features of the data set, followed by the output of each model. Finally, the performances of the various models in the prediction of FIB in the raw water are compared based on the MSE values in both the model training and testing stages.
3.1. Water quality features
Table 1 presents a summary of the descriptive statistics of the raw data taken from 357 samples. Water quality parameters including temperature and color showed high variabilities, respectively ranging from 0 to 13.7 °C and 19 to 57 mg Pt/l. Both the water quality parameters and the observed counts of FIB exhibited distinct periodic variations and seasonality across the years, though some variabilities were more pronounced.
Higher counts of FIB in the raw water intake were observed in the autumn of each year, with the highest count of Coliform bacteria (300 CFU/100 ml) occurring in the autumn of 2014. In addition, the coliform bacteria in the raw water showed a very large variance relative to the mean concentration. This distinct variance was, however, not present in the other FIB, which had lower maximum concentrations. Compared to reported concentrations in other surface water sources in Norway (Eregno et al., 2014), the concentrations of FIB in this water source are much lower. This could be due to robust risk management measures within the catchment area of the water source. Results of the Pearson's correlation analysis among the water quality variables for the seven-year period are shown in Table 2. Almost all the physical and chemical variables showed positive correlations with each of the FIB at p-values < 0.05. However, both pH and temperature had very low r-values with the FIB. A related study that analyzed
Min. Max. Mean Std. Variance Skewness Kurtosis Dev. pH Temperature (°C) EC (μS/cm) Turbidity (NTU) Color (mgPt/l) Alkalinity (mmol/l) Precipitation (mm/day) Coliform bacteria (CFU/100 ml) E. coli (CFU/100 ml) Intestinal enterococci (CFU/100 ml) C. perfringens (CFU/100 ml)
−0.1 1.7 -0.9 2.8 2.2 0.7 1.3
347.8 1.3 1.3 17.9 5.4 1.8 1.5
24.7 609.1
7.2
67.3
1.2 0.3
15.9 0.5 0.6 0.4
3.5 3.9
16.7 20.6
0.4
0.8
0.2
4.8
6.3 2.8 2.1 0.2 19 0.1 0
7 13.7 3.1 2.3 57 0.2 11.6
6.5 1.2 2.6 0.5 27.2 0.2 2.6
0.2 3.3 0.12 0.3 7.2 0.1 2.5
0
300
7.2
0 0
6 5
0
4
0.2 10.5 0.1 0.1 51.6 0 6.3
0.6
water quality variables in Norway, however, reported much stronger correlations between these parameters and the FIB in raw water (Eregno, 2014). With the exception of Intestinal enterococci, strong positive correlations were generally observed between the FIB in the raw water and turbidity (from r = 0.43 to 0.58). While these strong correlations are consistent with results of studies carried out elsewhere (Ortega et al., 2009; Liang et al., 2015; King, 2016), weak negative correlations between turbidity and FIB were found in the study of Eregno (2014). The water color similarly showed strong positive correlations with Coliform bacteria (r = 0.56), E. coli (r = 0.49), and Intestinal enterococci (r = 0.35), while showing hardly any correlation with C. perfringens. The positive correlations of alkalinity and precipitation with the FIB were even more distinct. For instance, with the exception of C. perfringens, the strongest correlation coefficients were observed between alkalinity and the other FIBs (from r = 0.65 to r = 0.89). Alkalinity in natural waters is mostly due to hydroxides, carbonates and bicarbonates, and these may increase in surface waters during high rainfall events, as sources in soil and bedrock are transported. The associated excess dissolved solids may affect the color, but other indirect effects on parameters such as pH may cause the high correlations with the FIBs. As for precipitation, significant positive correlations were observed with all the FIBs, particularly for E. coli (r = 0.78) and Intestinal enterococci (r = 0.76). It is also worth noting that all the variables that had strong positive correlations with the majority of the FIBs (turbidity, color and alkalinity) equally showed significant positive relations with precipitation. This clearly indicates the significance of the impact of precipitation on the microbial quality of surface water resources, mostly through the transport of sources and microbe-bearing particles in
Table 2
Pearson's correlation coefficients among water quality and environmental variables between 2009 and 2015.
           pH     Temp   EC     Turb   Col    Alk    Prec   Colif  E. coli  I. ent
Temp       0
EC         0      0
Turb       0.02   0.79   0.91
Col        0      0.03   0      0.01
Alk        0      0.32   0      0      0
Prec       0.03   0.01   0.07   0.33   0.69   0.72
Colif      0.03   0      0.31   0.53   0.55   0.85   0.24
E. coli    0.25   0.01   0.07   0.43   0.49   0.65   0.79   0
I. ent     0.12   0.01   0      0.02   0.35   0.89   0.76   0.14   0
C. perf    0.09   0.06   0      0.58   0.01   0.04   0.23   0.09   0.01     0
*Temp = Temperature, EC = Electrical conductivity, Turb = Turbidity, Col = Color, Alk = Alkalinity, Prec = Precipitation, Colif = Coliform bacteria, I. ent = Intestinal enterococci, C. perf = Clostridium perfringens.
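The coefficients in Table 2 are Pearson's r values. As a minimal sketch of how one such coefficient is obtained, the snippet below computes r for two short series. The function name `pearson_r` and all numbers are illustrative assumptions, not data or code from this study.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares about the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined for a constant series")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical values, for illustration only
turbidity_ntu = [0.3, 0.4, 0.5, 0.9, 1.6, 2.1]
coliform_cfu = [0, 0, 2, 5, 20, 41]
r = pearson_r(turbidity_ntu, coliform_cfu)
```

Because count data such as FIB are strongly skewed, a single large count can dominate r, which is one reason the coefficients are read here alongside the count models rather than on their own.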
catchments into the water through runoff. The fraction of all potential pathogen sources in the catchment that reach drinking water sources largely depends on the intensity of precipitation, which can result in increased pathogen loads in runoff and incidence of diseases (Jokinen et al., 2012). For further examination of the effects of precipitation and temperature on the occurrence of the FIBs in the raw water, these variables, together with the FIB observations, were decomposed into the four main seasons of Norway. Fig. 2 shows the average seasonal catchment precipitation, water temperature and the frequencies of detection of each of the FIBs in each season. While some FIBs were observed throughout all seasons, the highest number of detections distinctly occurred in the autumn of most years. However, average precipitation was highest in the summer. This high average precipitation in summer is due to high precipitation in the area occurring in August, just prior to the autumn season that begins in September. High precipitation in the catchment occurring at the end of summer and throughout the autumn season significantly elevated contaminant concentrations in the raw water, potentially resulting in high FIB concentrations. The water temperature was also higher in the summer and autumn seasons, as shown in Fig. 2. However, very low correlations were obtained with the concentrations of the FIBs.
3.2. Zero-inflated model outputs
Results of the count models for the four FIBs are presented in Table 3. Only the statistically significant estimates (p-values < 0.05) are included in the table. The table comprises two parts: the upper part (count) in each model represents the standard Poisson and negative binomial models, while the lower part (zero-inflated) shows the zero-inflation component. The ZIP model retained all the water quality and environmental variables for Coliform bacteria. The ZINB model for the same FIB
however did not establish any statistically significant relations with color, alkalinity and precipitation. Both models are consistent on the effects of pH, temperature, and turbidity on positive counts of the FIB observed in the raw water. This suggests that while higher counts of Coliform bacteria in the raw water occur with increasing temperature and turbidity, higher pH results in lower counts of the FIB in the raw water. Similarly, both models indicate that temperature and alkalinity significantly influence the odds of observing zero counts of Coliform bacteria in the raw water. With the exception of color and precipitation, observed positive counts of E. coli in the raw water were significantly associated with all the predictor variables. Both models retained the same predictors with similar coefficients and standard errors, particularly for temperature, EC, turbidity, and alkalinity. The negative coefficients indicate that lower counts of E. coli in the raw water occurred when temperature, electrical conductivity, turbidity and alkalinity were high, while higher counts were observed when the pH was high. These directly contrast the results of the Coliform bacteria models. In addition, both models statistically associated the odds of observing zero counts of E. coli in the raw water with temperature, electrical conductivity, turbidity, and alkalinity. However, only the ZINB model included the effect of pH in this respect. For the positive counts of Intestinal enterococci in the raw water, both models retained temperature, electrical conductivity, turbidity, and color, with alkalinity significant only in the ZIP model. In addition, while the ZIP model established significant associations between the zero counts of Intestinal enterococci and pH, turbidity and alkalinity, the ZINB model rejected all the predictor variables. Finally, both models (ZIP and ZINB) indicate that, amongst the predictor variables, only precipitation is significantly associated with the observed counts of C. perfringens in the raw water. Precipitation, in addition to temperature and EC, is significantly associated with the odds of
Fig. 2. Seasonal variations of precipitation and air temperature in the catchment area of the Lake and the frequencies of FIB observations in the raw water during each season during the seven-year period.
Table 3
Estimated parameters of the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models.
                              ZIP                              ZINB
Model / parameter             Estimate   Std. err   p-Value    Estimate   Std. err   p-Value
Coliform bacteria
  Count
    Intercept                 22.53      2.16       ***        47.35      9.59       ***
    pH                        −4.37      0.27       ***        −5.69      1.22       ***
    Temperature               0.49       0.02       ***        0.33       0.07       ***
    Conductivity              1.56       0.27       ***        −3.91      1.33       **
    Turbidity                 0.86       0.11       ***        1.51       0.83       *
    Color                     −0.03      0.01       ***
    Alkalinity                19.36      3.05       ***
    Precipitation             −0.08      0.01       ***
    Log(theta)                                                 −1.04      0.16       ***
  Zero-inflated
    Temperature               −0.63      0.12       ***        −0.85      0.27       **
    Alkalinity                −45.59     20.86      *          −223.85    80.11      **
E. coli
  Count
    pH                        3.95       1.72       *          11.20      5.65       *
    Temperature               −0.19      0.08       *          −0.17      0.09       *
    Conductivity              −6.75      2.05       ***        −6.23      2.27       **
    Turbidity                 −1.56      0.70       **         −1.41      0.76       *
    Alkalinity                −69.09     24.82      **         −66.38     26.88      *
  Zero-inflated
    pH                                                         11.21      5.65       *
    Temperature               −1.25      0.33       ***        −1.39      0.38       ***
    Conductivity              −13.64     5.54       *          −15.09     6.42       *
    Turbidity                 −10.64     3.97       **         −12.41     4.52       *
    Alkalinity                −269.83    98.79      **         −312.38    105.24     *
I. enterococci
  Count
    Temperature               0.16       0.07       *          0.18       0.09       *
    Conductivity              −6.69      1.56       *          −4.48      2.04       *
    Turbidity                 2.51       0.81       **         1.54       0.71       *
    Color                     −0.13      0.03       ***        −0.07      0.04       *
    Alkalinity                46.15      23.27      *
  Zero-inflated
    Intercept                 −400.88    203.53     *
    pH                        35.25      17.88      *
    Turbidity                 31.55      13.98      *
    Alkalinity                917.18     412.98     *
C. perfringens
  Count
    Precipitation             −0.13      0.06       *          −0.13      0.06       *
  Zero-inflated
    Temperature                                                −0.51      0.31       *
    Conductivity              10.43      4.76       *          11.58      6.06       *
    Precipitation             −0.46      0.21       *          −0.47      0.23       *
p-Value codes: ***: < 0.001; **: < 0.01; *: < 0.05.
observing zero counts of the FIB in the raw water. The literature is replete with evidence of strong positive correlations between waterborne pathogens and variables such as temperature, precipitation and turbidity (Tryland et al., 2011; Schijven et al., 2013; Abia et al., 2015). While results of the models for Coliform bacteria, Intestinal enterococci, and C. perfringens generally support these findings, the estimates in the E. coli model show the opposite. In order to check whether the ZIP and ZINB models adequately handle the overdispersion caused by excessive zeros in the FIB data, standard forms of Poisson and negative binomial models were fit for each FIB, and the percentages of zeros estimated by these models were calculated using the likelihood ratio test. We compared the results with the outputs of the ZIP and ZINB models. The observed proportions of zeros in Coliform bacteria, E. coli, Intestinal enterococci, and C. perfringens in the dataset were 53%, 81%, 85%, and 72%, respectively. The zero-inflated models and their respective standard versions estimated zero counts close to the proportions observed for E. coli (Poisson: 75%, NB: 75%, ZIP: 80%, ZINB: 79%), Intestinal enterococci (Poisson: 83%, NB: 75%, ZIP: 84%, ZINB: 83%), and C. perfringens (Poisson: 67%, NB: 75%, ZIP: 71%, ZINB: 72%). However, for Coliform bacteria, only the ZIP model accurately predicted the zero counts in the data (53.1%). The other models performed poorly (Poisson: 21%, NB: 14%, ZINB: 31%), compared to 53% of zeros in the data.
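To illustrate why the share of zeros is a useful check, the sketch below compares the zero probability implied by a plain Poisson distribution with that of a zero-inflated Poisson mixture. It is a simplification under stated assumptions: there are no covariates, the mean (7.21) and observed zero share (0.53) are the Coliform bacteria values quoted in the text, the structural-zero probability is simply set to the observed zero share, and the function names are hypothetical. It does not reproduce the regression models fitted in this study.

```python
import math


def poisson_zero_prob(lam):
    """P(Y = 0) for a Poisson distribution with mean lam."""
    return math.exp(-lam)


def zip_zero_prob(lam, pi):
    """P(Y = 0) for a zero-inflated Poisson: structural zeros plus Poisson zeros."""
    return pi + (1.0 - pi) * math.exp(-lam)


def zip_mean(lam, pi):
    """Expected count of a zero-inflated Poisson."""
    return (1.0 - pi) * lam


observed_mean = 7.21        # Coliform bacteria mean quoted in the text
observed_zero_share = 0.53  # Coliform bacteria zero proportion quoted in the text

# Plain Poisson matched to the mean only
p0_poisson = poisson_zero_prob(observed_mean)

# ZIP with pi set to the observed zero share (an illustrative assumption)
# and the Poisson mean chosen so that the overall mean is preserved
pi = observed_zero_share
lam = observed_mean / (1.0 - pi)
p0_zip = zip_zero_prob(lam, pi)
```

With these inputs the plain Poisson implies almost no zeros, whereas the mixture can match the mean and the zero share at the same time, which is the gap the zero-inflation component is meant to close.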
Pearson's chi-squared tests done to compare the models with intercept-only models yielded highly significant p-values (p < 0.001), indicating that the overall zero-inflated models are statistically significant. Results of the performance indices (AIC and Vuong statistic) calculated for the ZIP and ZINB models are presented in Table 4. The ZINB model for Coliform bacteria produced a lower AIC value (1179) compared with its corresponding ZIP model (3712). This indicates that the ZINB model fits the Coliform bacteria data better than the ZIP model. However, by this index, no distinct disparities in the goodness of fit were observed between the two models in the cases of E. coli, Intestinal enterococci, and C. perfringens. The Vuong test returned highly significant (p < 0.001) negative values for all the ZIP and ZINB models for the various FIBs. This indicates that the zero-inflated models achieved substantial improvement in model fitting relative to their respective standard versions. Further, performances of these models have been compared with results of the AI models based on the MSE values of predictions in both the training (fitting) and testing data sets. These results are presented in the comparison section.
3.3. RF model outputs
For each FIB, two RF models were built: a full and a reduced model. The full models with all the input variables and 100 trees generally performed well. Their respective reduced versions with the four most
Table 4
Performance indices of the ZIP and ZINB models.
Model                      AIC                    Vuong statistic
                           ZIP        ZINB        ZIP      ZINB
Coliform bacteria          3712.27    1179        −2.08    −3.43
E. coli                    335.64     335.14      −2.65    −2.93
Intestinal enterococci     277.55     272.13      −1.88    −3.27
Clostridium perfringens    414.49     412.32      −2.36    −1.97
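The two indices in Table 4 can be sketched as follows. This is a minimal illustration, assuming the standard definition AIC = 2P − 2L and a Vuong-type statistic computed from per-observation log-likelihoods with the bias correction θ set to zero by default. The function names and the log-likelihood values are hypothetical and are not taken from the fitted models.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2.0 * n_params - 2.0 * log_likelihood


def vuong_statistic(loglik_a, loglik_b, correction=0.0):
    """Vuong-type z statistic for models A and B.

    loglik_a, loglik_b: per-observation log-likelihoods of the two models.
    correction: bias correction term (theta); 0.0 gives the uncorrected form.
    Positive values favour model A, negative values favour model B.
    """
    n = len(loglik_a)
    if n != len(loglik_b) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    q = sum(diffs)                                           # log-likelihood ratio Q
    mean_diff = q / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n  # V(Q)
    if var_diff == 0:
        raise ValueError("the statistic is undefined when all differences are equal")
    return (q - correction) / math.sqrt(var_diff * n)


# Hypothetical per-observation log-likelihoods, for illustration only
loglik_standard = [-1.5, -2.2, -2.5, -0.6]
loglik_zero_inflated = [-1.0, -2.0, -1.5, -0.5]
z = vuong_statistic(loglik_standard, loglik_zero_inflated)
```

Here the standard model is passed as model A, so a negative statistic points towards the zero-inflated model, matching the sign convention under which the negative values in Table 4 are read.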
important variables were equally good in predicting the intense variations in the data. However, the reduced models yielded considerably higher out-of-bag error (MSEoob) values during model training. Accordingly, since the purpose of the reduced models was to determine if improved prediction accuracy could be achieved with only the most important variables, we rejected the reduced models due to their larger MSEoob values. Fig. 3 shows comparisons of the outputs of the RF models with the observed counts of FIBs. The trained RF model generally captured the variations in the data sets, particularly where periods of zero counts distinctly separate periods of observed positive counts, as in the Coliform bacteria data (Fig. 3A). However, the model gets strained if variations in the FIB counts are rapid (Fig. 3D). Moreover, although the model predicts zero counts of the FIBs with considerable precision, it could not predict peak counts in the data. For instance, with observed maximum counts of 200 CFU/100 ml for Coliform bacteria and 6 CFU/100 ml for E. coli in the raw water (training data), the model estimated 70 CFU/100 ml and 1.4 CFU/100 ml respectively. A model's ability to predict such extreme counts of FIB in the raw water based on real-time measurements of water quality variables is vital for the water supply industry, since it provides a basis for improving the performance of treatment processes meant for removing microorganisms. Further, we compared the means and variances of the counts of the FIBs predicted by the RF to their corresponding observed values. We observed that, except for the E. coli model, the predicted means are close to the values estimated in the observed data (for example, Coliform bacteria: 7.21 observed, 7.51 predicted). The variances are, however, much lower than those in the observed data (for example, Coliform bacteria: 609 observed, 136 predicted).
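The out-of-bag error (MSEoob) referred to above is computed, for each tree, on the training observations that were not drawn into that tree's bootstrap sample. The following is a minimal sketch, not the RF implementation used in the study, of how large that left-out share is. The training size of 250 (357 samples minus the 107 reserved for testing) and the helper name `oob_fraction` are assumptions made for illustration.

```python
import random


def oob_fraction(n_samples, n_trees, seed=0):
    """Average share of observations left out of a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Draw n_samples indices with replacement, as done for one tree
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total += 1.0 - len(in_bag) / n_samples
    return total / n_trees


# Assumed training size and the 100 trees mentioned in the text
frac = oob_fraction(n_samples=250, n_trees=100)
```

The expected left-out share is (1 − 1/n)^n, which approaches e^(−1) ≈ 0.368 for large n, so roughly a third of the training observations are available to evaluate each tree without a separate validation set.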
3.3.1. Variable importance and training error (MSEoob)
The predictor variables generally showed varying significance in predicting the various FIBs modeled in this study. In each of the subplots in Fig. 4, the plots of MSEoob during model training as well as variable importance (bar graph) are shown. The bars numbered 1 to 7 respectively represent pH, temperature, EC, turbidity, color, alkalinity, and precipitation. The raw water pH, temperature, EC and turbidity were the four most important variables in the Coliform bacteria prediction model (Fig. 4A), with alkalinity being the least significant variable. For the E. coli model (Fig. 4B), pH, temperature, and electrical conductivity show higher levels of significance than the other variables. Variable importance for the Intestinal enterococci and C. perfringens models is also shown in the figure (C and D respectively). It is interesting to note that the temperature of the raw water was ranked highest in all the models. This suggests the control that temperature has on the overall quality of water, since it affects almost all other water quality variables, including FIBs (Delpla et al., 2009; Whitehead et al., 2009). In growing each decision tree, a third of the training data is used by the RF algorithm to check the accuracy of the decision. This automatic way of assessing prediction accuracy in the RF model obviates the need for testing the model on a testing data set. However, for the purpose of comparison with the other models applied in this study, we applied a separate testing data set to all the models and calculated the prediction MSE. The MSEoob decreases as the number of grown regression trees increases. The mean training errors achieved for the 100 trees were 351, 0.43, 0.39, and 0.56 (CFU/100 ml) for the Coliform bacteria, E. coli, Intestinal enterococci, and C. perfringens models respectively.
3.4. ANFIS model outputs
To achieve ANFIS models with high performance, optimal model parameters were obtained through several iterations, each time assessing the model's accuracy graphically and based on the MSE of the predictions. Key among these model parameters was the number of epochs. In this study, the best model performances were achieved after 200 epochs for the Coliform bacteria model, and 250 epochs for the other FIBs. Fig. 5 shows results of the model training outputs compared with the observed data. Clearly, the ANFIS models predicted most of the FIB counts with high precision. The models also reproduced periods of high counts
Fig. 3. Observed counts of Coliform bacteria (A), E. coli (B), Intestinal enterococci (C), and C. perfringens (D) compared with the predictions of RF models at the training stage.
Fig. 4. RF model training error and input variable importance.
in the data. However, as can be seen in these figures, the models were generally poor in predicting the zero counts, and yielded negative values in most cases. The MSE values of the predictions during training were 30, 0.27, 0.09, and 0.16 (CFU/100 ml) for the Coliform bacteria (Fig. 5A), E. coli (B), Intestinal enterococci (C), and C. perfringens (D) models, respectively.
3.5. Comparison of model performances
The various modelling approaches applied in predicting the FIBs showed varying performances when subjected to a different dataset during the model testing stage. From the perspective of managing the FIB quality of raw water, the applicability of a model depends on its accuracy in predicting large proportions of the concentrations of the
FIBs in the raw water when it is subjected to “unseen” data. Prediction rates were calculated for the 107 data points (30% of the raw data) used in testing the models. Average rates and MSE values of the various models as applied to the testing dataset are presented in Tables 5 and 6, respectively. The ANFIS models clearly predicted larger proportions of the testing dataset (87%) compared to the other models. At the training stage, the fitted ZIP model for Coliform bacteria produced the lowest MSE value (MSE = 17.86 CFU/100 ml) compared to the other models. The ZINB and ANFIS models for the same FIB had almost the same MSE values (28.82 and 30.04 CFU/100 ml respectively), which were considerably larger than the value of the ZIP model. In the model testing stage, however, the ZINB model performed best, with an MSE value lower than the one it achieved with the training data (MSE = 14.39 CFU/100 ml). The errors in the other three models rather increased when applied to the testing
Fig. 5. Observed counts of the four FIBs compared with the predictions of ANFIS models at the training stage.
data set. Although the AI models (RF and ANFIS) produced lower MSE values in the E. coli models during testing (0.56 and 0.35 CFU/100 ml respectively), the zero-inflated models had marginal reductions in MSEs relative to values achieved at the training stage. The AI models similarly achieved lower MSEs in testing the models for the other FIBs, as shown in Table 6. The mean counts of the FIBs predicted by the models were close to the observed mean values. For instance, compared to observed mean values of 7.21 and 0.23 (CFU/100 ml) for Coliform bacteria and Intestinal enterococci respectively, the corresponding values predicted by the ZIP and the ZINB models were 7.5 and 0.28 (CFU/100 ml). However, the variances in the predictions were far lower than observed. The variances of the predicted Coliform bacteria counts were 248.7 (ZIP) and 112.3 (ZINB), while a variance of 609 was estimated in the observed data. It is accordingly difficult to judge between the performances of the two models (ZIP and ZINB). However, the results generally show that the ZIP models distinctly exaggerate the significance of the regression parameters, yielding lower standard error values. Further, the AIC statistics for the two suggest that the ZINB models perform better than the ZIP models. Results of related studies that compared the appropriateness of zero-inflated models in predicting count data indicate that the ZINB model offers more flexibility, since it is capable of accounting for overdispersion from excess zeros and unobserved heterogeneities, and generally produces more accurate fits for count data than the ZIP model (Lewsey and Thomson, 2004; Rose et al., 2006). For the RF models, the MSE values of the predictions at the training stage were considerably high relative to the mean counts of the FIB observed. For instance, MSE values of 163 CFU/100 ml and 0.23 CFU/100 ml were achieved for the Coliform bacteria and E. coli models respectively.
Although this modeling technique is rarely applied in waterborne FIB prediction, studies that have applied the method, mostly in solving classification problems, have reported high accuracies. For instance, RF has been reported to outperform ANN, support vector machines (SVM) and linear regression models in the prediction of lake water level (Li et al., 2016). Similarly, a study that compared the performances of RF and ANN to investigate soil infiltration rate reported prediction errors of ±25% in the testing data set (Singh et al., 2017). While large MSE values were achieved in the RF models in this study during both training and testing, a graphical inspection of the model predictions (Fig. 3) indicates that the performance of this model is consistent with accuracies reported in the literature. Similarly, the prediction accuracy of the ANFIS model achieved in this study is consistent with results of other studies that applied the technique in water quality forecasting. In the work of Najah et al. (2014), ANFIS models trained to predict concentrations of dissolved oxygen in water achieved a correlation coefficient of up to 0.96 between observed and predicted concentrations. Similarly high prediction accuracy was reported by Mohammed et al. (2017) when the technique was applied in predicting concentrations of Norovirus in water based on physical and chemical water quality variables. Despite the lower overall prediction rates of the zero-inflated models, the ZINB model in particular consistently maintains its prediction accuracy (in terms of MSE) with a different dataset. This model also predicts the zeros in the FIB data with remarkable accuracy, as presented in Section 3.2 of this study. The main challenge of the zero-inflated models was their inability to accurately predict the extremely high counts that characterize the FIB data. Yet, compared to most AI-based models that show higher prediction accuracies, the models offer considerable explanation of the effect of each
Table 5
Summary of model prediction rates during the testing stage.
Model    Testing samples    Number approximately estimated    Estimation rate (%)
ZIP      107                82                                76
ZINB     107                83                                77
RF       107                82                                76
ANFIS    107                93                                87
Table 6
MSE of predictions of the models.
Model    Coliform bacteria      E. coli                Intestinal enterococci    C. perfringens
         Training   Testing     Training   Testing     Training   Testing        Training   Testing
ZIP      17.86      33.36       0.68       0.63        0.48       0.79           0.71       0.91
ZINB     28.82      14.39       0.68       0.63        0.62       0.61           0.70       0.92
RF       163.24     576.42      0.23       0.56        0.21       0.45           0.26       0.33
ANFIS    30.04      39.49       0.27       0.35        0.09       0.09           0.16       0.23
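The two measures reported in Tables 5 and 6 can be sketched as below. The sketch assumes one reading of the criterion in the text: a prediction counts as approximately estimated when it lies within one standard deviation (of the observed series) of its observed value. The function names and the data are hypothetical.

```python
import statistics


def mse(observed, predicted):
    """Mean squared error between observed and predicted counts."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def estimation_rate(observed, predicted):
    """Percentage of predictions within one standard deviation of the observation."""
    sd = statistics.stdev(observed)  # sample standard deviation of the observed counts
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= sd)
    return 100.0 * hits / len(observed)


# Hypothetical counts (CFU/100 ml), for illustration only
observed = [0, 0, 3, 0, 12, 0, 1, 40]
predicted = [0.2, 0.0, 2.1, 0.5, 9.0, 0.1, 1.4, 15.0]

rate = estimation_rate(observed, predicted)
error = mse(observed, predicted)
```

In this toy series a single missed peak leaves the rate high while dominating the MSE, which is why the two measures are reported side by side.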
water quality parameter. The zero-inflated models have great potential for application in the water industry, since they are better suited to the nature of the FIB data than the multiple regression models that are often applied in this respect. Owing to its ability to explain the contributions of each input variable to the counts of the FIBs, the RF model is a more reliable alternative to the zero-inflated model for application in the industry than the ANFIS model. In fact, understanding the influences of the various water quality variables on the concentrations of FIBs in the raw water is equally important in managing the microbial quality of the raw water. The ANFIS model, by virtue of its fuzzy rules, is also capable of interpreting the relationships between the input variables and the outputs, yet it is strained by intense variations in the FIB data, predicting negative values in some cases. The suite of models considered and built in this study can be implemented in a single interface such that the real-time FIB quality of water can be readily predicted when all the input variables are available. While this study found these models useful for the water supply industry, it may be necessary to apply them to different WTPs to assess their replicability. As mentioned in Section 2.2, the characteristics of the catchment area of the Lake present many potential sources of FIB that may result in significant health risks. Although FIBs may have poor relationships with the occurrence of actual pathogenic organisms of public health concern, measuring these organisms provides the necessary information about potential concentrations and variabilities of microbial pathogens in raw water sources (Payment and Locas, 2011; Petterson et al., 2016).
Moreover, within the risk-based framework of managing waterborne disease outbreaks, comprehensive health risk assessment requires monitoring the raw water quality of drinking water supply systems, particularly for FIBs (WHO, 2011). Monitoring of FIBs in raw water sources can provide early warning information to the water utility operators prior to the treatment of water, such that treatment processes can be optimized where necessary. For this reason, public health burdens associated with waterborne disease outbreaks occurring after events such as extreme precipitation can be reduced if FIBs in the raw water are estimated on a “real time” basis before treatment. Through such predictive modelling as used in this study, not only can the presence/absence of the FIBs in the raw water be established prior to treatment, but reliable estimates of the actual concentrations may also be achievable. In addition, while parameters such as turbidity, rainfall and temperature may be the key surrogates for the occurrence of FIBs in the water sources, as well as associated risks of acute gastrointestinal illness (AGII) (Beaudeau et al., 2014), it is necessary to evaluate the direct/indirect effects of other important parameters such as pH, which can impact water treatment processes. In this study, the concentration of FIBs in the raw water showed clear seasonal variations (Fig. 2) with potential health impacts on the population connected to the water supply system. However, the severity of the health impacts (e.g. acute gastrointestinal illnesses), which will largely depend on the efficacy of existing water treatment processes, was not assessed as it was outside the scope of this study. Nonetheless, it will be vital to examine this aspect in further studies.
4. Conclusions
This study illustrates and evaluates the performances of four different models (ZIP, ZINB, RF, ANFIS) in predicting concentrations of four FIBs (Coliform bacteria, E. coli, Intestinal enterococci, and C. perfringens) in the raw water source of Oset WTP in Oslo, Norway. In terms of capturing extreme counts of FIBs in the raw water, the ANFIS model performs best. This model, however, predicts negative values, which do not exist in such data in reality. Based on a model's ability to predict concentrations of FIB that reflect the typical character of actual observations, model consistency, and the ability to explain the significance of the water quality variables in the control of FIB in the raw water, the zero-inflated (ZIP, ZINB) and RF models appear to be the best for application in the WTP. Although all the water quality and environmental variables influence the concentrations of FIBs in the raw water, the significance of pH, temperature, turbidity, and electrical conductivity is found to be paramount. Precipitation within the catchment area of the water source did not appear to have a distinct effect on the microbial concentrations. While comprehensive weekly sampling and analysis of water quality variables are carried out at the WTP, zero-inflated and RF models have high potential for providing reliable estimates of FIB in the raw water to augment existing analysis. Thus, we believe both models can reliably be used as part of routine water quality monitoring exercises in drinking water treatment plants in Norway.
Acknowledgements
The authors wish to show gratitude to the management of the Oset drinking water treatment plant in Oslo, Norway, particularly Marie Fossum, for providing the water quality data used in this study.
Funding
This work is part of the KLIMAFORSK project - On the Association between climate change and Waterborne Diseases funded by Research Council of Norway (Project No: 244147/E10).
References
Abia, A.L.K., Ubomba-Jaswa, E., Momba, M.N.B., 2015. Impact of seasonal variation on Escherichia coli concentrations in the riverbed sediments in the Apies River, South Africa. Sci. Total Environ. 537, 462–469.
Allen, M.J., Clancy, J.L., Rice, E.W., 2000. The plain, hard truth about pathogen monitoring. J. Am. Water Works Assoc. 92 (9), 64.
Barry, M., Chao-An, C., Westerhoff, P., 2016. Severe weather effects on water quality in Central Arizona. Am. Water Works Assoc. 108 (4), E221–E231.
Bartram, J., Rees, G., 1999. Monitoring Bathing Waters: A Practical Guide to the Design and Implementation of Assessments and Monitoring Programmes. CRC Press.
Beaudeau, P., Schwartz, J., Levin, R., 2014. Drinking water quality and hospital admissions of elderly people for gastrointestinal illness in Eastern Massachusetts, 1998–2008. Water research 52, 188–198.
Beaujean, A.A., Morgan, G.B., 2016. Tutorial on using regression models with count outcomes using R. Pract. Assess. Res. Eval. 21 (2).
Bisht, D.C., Jangid, A., 2011. Discharge modelling using adaptive neuro-fuzzy inference system. Int. J. Adv. Sci. Technol. 31, 99–114.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data. Vol. 53. Cambridge University Press.
Chowdhury, S., Champagne, P., McLellan, P.J., 2009. Models for predicting disinfection byproduct (DBP) formation in drinking waters: a chronological review. Sci. Total Environ. 407 (14), 4189–4206.
Coxe, S., West, S.G., Aiken, L.S., 2009. The analysis of count data: A gentle introduction to Poisson regression and its alternatives. J. Personal. Assess. 91 (2), 121–136.
Crittenden, J.C., Trussell, R.R., Hand, D.W., Howe, K.J., Tchobanoglous, G., 2012. MWH's Water Treatment: Principles and Design. John Wiley & Sons.
Cutler, A., Cutler, D.R., Stevens, J.R., 2011. Random forests, chapter in machine learning. In: Zhang, Cha, Ma, Yunqian (Eds.), In Book: Ensemble Machine Learning: Methods and Applications, Chapter: 5. 45(1). Springer, pp. 157–176. https://doi.org/10.1007/9781-4419-9326-7_5 (Source: DBLP).
Davies, J.M., Mazumder, A., 2003. Health and environmental policy issues in Canada: the role of watershed management in sustaining clean drinking water quality at surface sources. J. Environ. Manag. 68 (3), 273–286.
1189
Delpla, I., Jung, A.V., Baures, E., Clement, M., Thomas, O., 2009. Impacts of climate change on surface water quality in relation to drinking water production. Environ. Int. 35 (8), 1225–1233. Dohoo, I.R., Martin, W., Stryhn, H., 2009. Veterinary Epidemiologic Research. 2nd Ed. VER, Inc., Prince Edward Island, Canada, pp. 91–127. Eregno, F.E., 2014. Multiple linear regression models for estimating microbial load in a drinking water source case from the Glomma River, Norway. VANN 03 (335-350). Eregno, F.E., Nilsen, V., Seidu, R., Heistad, A., 2014. Evaluating the trend and extreme values of faecal indicator organisms in a raw water source: a potential approach for watershed management and optimizing water treatment practice. Environ. Process. 1 (3), 287–309. Heddam, S., Bermad, A., Dechemi, N., 2012. ANFIS-based modelling for coagulant dosage in drinking water treatment plant: a case study. Environ. Monit. Assess. 184 (4), 1953–1971. Herrig, I.M., Böer, S.I., Brennholt, N., Manz, W., 2015. Development of multiple linear regression models as predictive tools for fecal indicator concentrations in a stretch of the lower Lahn River, Germany. Water research 85, 148–157. Hilbe, J.M., 2008. Negative Binomial Regression. Cambridge. Hilbe, J.M., Green, W., 2007. Count response regression models. In: Rao, C., Miller, J., Rao, D. (Eds.), Epidermiology and Medical Statistics, Elsevier Handbook of Statistics Series. Elsevier, London. Hüls, A., Frömke, C., Ickstadt, K., Hille, K., Hering, J., von Münchhausen, C., ... Kreienbrock, L., 2017. Antibiotic resistances in livestock: A comparative approach to identify an appropriate regression model for count data. Frontiers in Veterinary Science 4. ISO 9308-1, 2000. Water Quality - Enumeration of Escherichia coli and Coliform Bacteria - Part 1: Membrane Filtration Method. International Standards Organization, Geneva. ISO, H. 7899-2, 2000. 
Water Quality–Detection and Enumeration of Intestinal Enterococci, Part 2: Membrane Filtration Method. Organization for Standardization, Geneva. Jang, J.S., 1993. ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23 (3), 665–685. Jokinen, C.C., Edge, T.A., Koning, W., Laing, C.R., Lapen, D.R., Miller, J., ... Topp, E., 2012. Spatial and temporal drivers of zoonotic pathogen contamination of an agricultural watershed. J. Environ. Qual. 41 (1), 242–252. Khan, F.A., Ali, J., Ullah, R., Ayaz, S., 2013. Bacteriological quality assessment of drinking water available at the flood affected areas of Peshawar. Toxicol. Environ. Chem. 95 (8), 1448–1454. King, A.M., 2016. Relationships between fecal indicator bacteria and environmental factors at Edgewater Beach. Honors Research Projects. 331. http://ideaexchange.uakron.edu/ honors_research_projects/331, Accessed date: 22 September 2017. Kostic, T., Stessl, B., Wagner, M., Sessitsch, A., 2011. Microarray analysis reveals the actual specificity of enrichment media used for food safety assessment. J. Food Prot. 74 (6), 1030–1034. Lage Filho, F.A., 2010. Ozone application in water sources: effects of operational parameters and water quality variables on ozone residual profiles and decay rates. Braz. J. Chem. Eng. 27 (4), 545–554. Law, J.W.F., Ab Mutalib, N.S., Chan, K.G., Lee, L.H., 2015. Rapid methods for the detection of foodborne bacterial pathogens: principles, applications, advantages and limitations. Front. Microbiol. 5, 770. Lewsey, J.D., Thomson, W.M., 2004. The utility of the zero-inflated Poisson and zeroinflated negative binomial models: a case study of cross-sectional and longitudinal DMF data examining the effect of socio-economic status. Commun. Dent. Oral Epidemiol. 32 (3), 183–189. Li, B., Yang, G., Wan, R., Dai, X., Zhang, Y., 2016. Comparison of random forests and other statistical methods for the prediction of lake water level: a case study of the Poyang Lake in China. Hydrol. 
Res. 47 (S1), 69–83. Liang, L., Goh, S.G., Vergara, G.G.R.V., Fang, H.M., Rezaeinejad, S., Chang, S.Y., ... Gin, K.Y.H., 2015. Alternative fecal indicators and their empirical relationships with enteric viruses, Salmonella enterica, and Pseudomonas aeruginosa in surface waters of a tropical urban catchment. Appl. Environ. Microbiol. 81 (3), 850–860. Lou, X., Loparo, K.A., 2004. Bearing fault diagnosis based on wavelet transform and fuzzy inference. Mech. Syst. Signal Process. 18 (5), 1077–1095. Lušić, D.V., Cenov, A., Glad, M., Žarkovac, I.S., Lušić, D., Mićović, V., 2014. Methods for the isolation and enumeration of sulphite-reducing clostridia and Clostridium perfringens in water by membrane filtration. XVIII znanstveno-stručni skup “Voda i javna vodoopskrba”. Mas, D.M.L., Ahlfeld, D.P., 2007. Comparing artificial neural networks and regression models for predicting faecal coliform concentrations. Hydrol. Sci. J. 52 (4), 713–731. Mohammed, H., Hameed, I.A., Seidu, R., 2017. Adaptive Neuro-Fuzzy Inference System for Predicting Norovirus in Drinking Water Supply. IEEE International Conference on Informatics, Health & Technology (ICIHT), Riyadh, KSA. Myers, D.N., Stoeckel, D.M., Bushon, R.N., Francy, D.S., Brady, A.M.G., 2007. Fecal indicator bacteria (ver. 2.0). US Geological Survey Techniques of Water-Resources Investigations. 9 A7, section, 7(3)). Najah, A., El-Shafie, A., Karim, O.A., El-Shafie, A.H., 2014. Performance of ANFIS versus MLP-NN dissolved oxygen prediction models in water quality monitoring. Environ. Sci. Pollut. Res. 21 (3), 1658–1670. Ortega, C., Solo-Gabriele, H.M., Abdelzaher, A., Wright, M., Deng, Y., Stark, L.M., 2009. Correlations between microbial indicators, pathogens, and environmental factors in a subtropical estuary. Mar. Pollut. Bull. 58 (9), 1374–1381. Oshiro, R., 2002. Method 1604: Total Coliforms and Escherichia coli in water by membrane filtration using a simultaneous detection technique (MI Medium). 
US Environmental Protection Agency, Washington, DC.
1190
H. Mohammed et al. / Science of the Total Environment 628–629 (2018) 1178–1190
Oslo Municipality Water and Sewer Authority (Oslo Kommune, Vann og Avløpsetaten), 2012. Municipal Water and Waste Department. Retrieved online. http://www. vannavlopsetaten.oslo.kommune.no/vannet_vart/drikkevann/vannkilder/ restriksjoner/, Accessed date: 15 November 2016. Pandey, P.K., Philip, H.K., Michelle, L.S., Sagor, B., Vijay, P., 2014. Contamination of water resources by pathogenic bacteria. AMB Express 4, 51. Payment, P., Locas, A., 2011. Pathogens in water: value and limits of correlation with microbial indicators. Groundwater 49 (1), 4–11. Petterson, S.R., Stenström, T.A., Ottoson, J., 2016. A theoretical approach to using faecal indicator data to model norovirus concentration in surface water for QMRA: Glomma River, Norway. Water Res. 91, 31–37. Phang, Y.N., Loh, E.F., 2014. Statistical analysis for overdispersed medical count data. World Academy of Science, Engineering and Technology. Int. J. Math. Comput. Phys. Electr. Comput. Eng. 8 (2), 292–294. Rompré, A., Servais, P., Baudart, J., De-Roubin, M.R., Laurent, P., 2002. Detection and enumeration of coliforms in drinking water: current methods and emerging approaches. J. Microbiol. Methods 49 (1), 31–54. Rose, C.E., Martin, S.W., Wannemuehler, K.A., Plikaytis, B.D., 2006. On the use of zeroinflated and hurdle models for modeling vaccine adverse event count data. Journal of Biopharm. Stat. 16 (4), 463–481. Sae-Lim, P., Grøva, L., Olesen, I., Varona, L., 2017. A comparison of nonlinear mixed models and response to selection of tick-infestation on lambs. PloS One 12 (3), e0172711. Sahu, M., Mahapatra, S.S., Sahu, H.B., Patel, R.K., 2011. Prediction of water quality index using neuro fuzzy inference system. Water Qual. Expo. Health 3 (3-4), 175–191. Salahi, A., Mohammadi, T., Behbahani, R.M., Hemmati, M., 2015. A symmetric polyethersulfone ultrafiltration membranes for oily wastewater treatment: synthesis, characterization, ANFIS modeling, and performance. J. Environ. Chem. Eng. 3 (1), 170–178. 
Schijven, J., Bouwknegt, M., Husman, A.M.D., Rutjes, S., Sudre, B., Suk, J.E., Semenza, J.C., 2013. A decision support tool to compare waterborne and foodborne infection and/ or illness risks associated with climate change. Risk Anal. 33 (12), 2154–2167. Sinclair, R.G., Rose, J.B., Hashsham, S.A., Gerba, C.P., Haas, C.N., 2012. Criteria for selection of surrogates used to study the fate and control of pathogens in the environment. Appl. Environ. Microbiol 78 (6), 1969–1977. Singh, B., Sihag, P., Singh, K., 2017. Modelling of impact of water quality on infiltration rate of soil by random forest regression. Model. Earth Syst. Environ. 3 (3), 999–1004.
Tang, Z., Butkus, M.A., Xie, Y.F., 2006. Crumb rubber filtration: a potential technology for ballast water treatment. Mar. Environ. Res. 61 (4), 410–423. Tryland, I., Robertson, L., Blankenberg, A.G.B., Lindholm, M., Rohrlack, T., Liltved, H., 2011. Impact of rainfall on microbial contamination of surface water. Int. J. Clim. Chang. Strat. Manag. 3 (4), 361–373. Tryland, I., Eregno, F.E., Braathen, H., Khalaf, G., Sjølander, I., Fossum, M., 2015. On-line monitoring of Escherichia coli in raw water at Oset drinking water treatment plant, Oslo (Norway). Int. J. Environ. Res. Publ. Health 12 (2), 1788–1802. Vuong, Q.H., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 307–333. Walsh, E.S., Kreakie, B.J., Cantwell, M.G., Nacci, D., 2017. A Random forest approach to predict the spatial distribution of sediment pollution in an estuarine system. PloS One 12 (7), e0179473. Wang, Y., Chen, Y., Zheng, X., Gui, C., Wei, Y., 2015. Spatio-temporal distribution of fecal indicators in three rivers of the Haihe River Basin, China. Environ. Sci. Pollut. Res. https://doi.org/10.1007/s11356-015-5907-3. Whitehead, P.G., Wilby, R.L., Battarbee, R.W., Kernan, M., Wade, A.J., 2009. A review of the potential impacts of climate change on surface water quality. Hydrol. Sci. J. 54 (1): 101–123. https://doi.org/10.1623/hysj.54.1.10. WHO, 2003. Assessing Microbial Safety of Drinking Water Improving Approaches and Methods: Improving Approaches and Methods. OECD Publishing. WHO, 2011. Guidelines for Drinking-Water Quality. 4th Edn. 38. World Health Organization Chronicle, pp. 104–108. Wu, G.D., Lo, S.L., 2008. Predicting real-time coagulant dosage in water treatment by artificial neural networks and adaptive network-based fuzzy inference system. Eng. Appl. Artif. Intel. 21 (8), 1189–1195. Yang, S., Harlow, L.I., Puggioni, G., Redding, C.A., 2017. A comparison of different methods of zero-inflated data analysis and an application in health surveys. J. Mod. 
Appl. Stat. Methods 16 (1):518–543. https://doi.org/10.22237/jmasm/1493598600. Zhang, Z., Deng, Z., Rusch, K.A., 2015. Modeling fecal coliform bacteria levels at Gulf Coast Beaches. Water Qual. Expo. Health 7 (3), 255–263. Zuur, A.F., Ieno, E.N., Walker, N.J., Saveliev, A.A., Smith, G.M., 2009. Mixed Effects Models and Extensions in Ecology with R, Statistics for Biology and Health, Statistics for Biology and Health. Springer, New York https://doi.org/10.1007/978-0-387-87458-6.