Accepted Manuscript Water quality assessment with emphasis in parameter optimisation using pattern recognition methods and genetic algorithm Gonzalo Sotomayor, Henrietta Hampel, Raúl F. Vázquez PII:
S0043-1354(17)31005-9
DOI:
10.1016/j.watres.2017.12.010
Reference:
WR 13408
To appear in:
Water Research
Received Date: 8 August 2017 Revised Date:
6 December 2017
Accepted Date: 7 December 2017
Please cite this article as: Sotomayor, G., Hampel, H., Vázquez, Raú.F., Water quality assessment with emphasis in parameter optimisation using pattern recognition methods and genetic algorithm, Water Research (2018), doi: 10.1016/j.watres.2017.12.010. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
study site (period 2008-2013)
INPUT: Study site (Paute basin, Ecuador) + WQ monitoring network
ACCEPTED MANUSCRIPT
RI PT
Quito
SC
Guayaquil Peru
EP C1 - low pollution C2 - high pollution
OUTPUT: Lar
x11
x12
x21 X= ... ... xn1
x22 ... ...
INPUT: Large
AC C
vegetation native vegetation t cover/urbanised
PROCESS: Data mining
TE D
Q classes + key WQ parameters explaining WQ i.e. reduction of parameter space dimension)
M AN U
Pacific Ocean
Colombia
- Pattern recognition - Machine learning - Genetic Algorithm
ACCEPTED MANUSCRIPT 1
Water quality assessment with emphasis in parameter
2
optimisation using pattern recognition methods and genetic
3
algorithm
4 Gonzalo Sotomayor1a, Henrietta Hampel2,1, Raúl F. Vázquez3,1
RI PT
5 6
1
Laboratorio de Ecología Acuática, Departamento de Recursos Hídricos y Ciencias
9 10
E-mail:
[email protected] 2
Facultad de Ciencias Químicas, Universidad de Cuenca, Av. 12 de Abril S/N, Cuenca, Ecuador.
11 12
E-mail:
[email protected]
13
3
14
E-mail:
[email protected]
15
a
Facultad de Ingeniería, Universidad de Cuenca, Av. 12 de Abril S/N, Cuenca, Ecuador.
EP
18
ABSTRACT
TE D
Corresponding Author
16 17
SC
Ambientales, Universidad de Cuenca, Av. 12 de Abril S/N, Cuenca, Ecuador.
8
M AN U
7
A non-supervised (k-means) and a supervised (k-Nearest Neighbour in combination
20
with genetic algorithm optimisation, k-NN/GA) pattern recognition algorithms were
21
applied for evaluating and interpreting a large complex matrix of water quality (WQ)
22
data collected during five years (2008, 2010-2013) in the Paute river basin (southern
23
Ecuador). 21 physical, chemical and microbiological parameters collected at 80
24
different WQ sampling stations were examined. At first, the k-means algorithm was
25
carried out to identify classes of sampling stations regarding their associated WQ
26
status by considering three internal validation indexes, i.e., Silhouette coefficient,
27
Davies-Bouldin and Caliński-Harabasz. As a result, two WQ classes were identified,
28
representing low (C1) and high (C2) pollution. The k-NN/GA algorithm was applied on
AC C
19
1
ACCEPTED MANUSCRIPT the available data to construct a classification model with the two WQ classes,
2
previously defined by the k-means algorithm, as the dependent variables and the 21
3
physical, chemical and microbiological parameters being the independent ones. This
4
algorithm led to a significant reduction of the multidimensional space of independent
5
variables to only nine, which are likely to explain most of the structure of the two
6
identified WQ classes. These parameters are, namely, electric conductivity, faecal
7
coliforms, dissolved oxygen, chlorides, total hardness, nitrate, total alkalinity,
8
biochemical oxygen demand and turbidity. Further, the land use cover of the study
9
basin revealed a very good agreement with the WQ spatial distribution suggested by
10
the k-means algorithm, confirming the credibility of the main results of the used WQ
11
data mining approach.
M AN U
SC
RI PT
1
12
EP
TE D
Key Words: Water quality, Pattern recognition, Genetic algorithm, Land cover.
AC C
13
2
ACCEPTED MANUSCRIPT 1. INTRODUCTION
2
Present-day changes to fluvial systems include a great variety of direct and indirect
3
anthropogenic activities, which in many cases result in a drastic deterioration of the
4
water quality (Barnett et al., 2008; Harper et al., 2008). A variety of contaminants, in
5
addition to a multitude of imprudent water management practices and destructive land
6
uses, are currently threatening aquatic systems on a world-wide scale (Sanderson et
7
al., 2002). Moreover, water of good quality is a crucial component for sustainable
8
socio-economic development in any region of the world (Bartram and Balance, 1996).
9
In response to this, and owing to spatial-temporal variations in water chemistry, a
10
monitoring program that provides a representative and reliable estimation of the water
11
quality (WQ) is necessary (Simeonov et al., 2003). The large number of samples that
12
need to be considered in appropriate WQ assessments and the number of constituents
13
that must be considered per sample gives rise to data sets of enormous size and
14
complexity.
M AN U
SC
RI PT
1
Furthermore, the relationships investigated in these data sets usually cannot be
16
expressed in quantitative terms and they are better expressed in terms of similarity or
17
dissimilarity among groups of multivariate data (Lavine and Rayens, 2009). Herein, the
18
application of different supervised and non-supervised pattern recognition techniques
19
has the potential of facilitating the interpretation of complex data matrices and, as such,
20
of understanding the WQ status of the studied systems without losing important
21
information; moreover, they constitute valuable tools for reliable management of water
22
resources (Shrestha and Kazama, 2007; Juahir et al., 2010; Bücker et al., 2010).
EP
AC C
23
TE D
15
In Ecuador, governmental efforts have been carried out in the past to establish
24
various WQ monitoring programs in some particular locations of the country and, more
25
recently, to a national level (SENAGUA, 2016). However, the resulting observed data
26
are not really public available, and as such the data can hardly be used for research or
27
management purposes. Nevertheless, the Ecuadorian National Secretary of Water
28
(SENAGUA) - Santiago River Hydrographic Demarcation (DHS) generated, and made 3
ACCEPTED MANUSCRIPT 1
available to the current study, an extensive database derived from 80 monitoring
2
stations located in the Paute river basin, which is one of the most important
3
hydrographic systems of southern Ecuador. As such, in the present study, a large and complex database (including data on
5
14427 observations), collected throughout a 5-year monitoring program (year 2008 and
6
period 2010-2013), was used in the context of supervised and non-supervised pattern
7
recognition techniques that were applied with the aim of handling the complexity of the
8
database and mining relevant information about: (i) the WQ
9
dissimilarities between the 80 monitoring stations based on the measured river WQ
10
parameters; (ii) the most significant WQ related parameters that explain the WQ
11
classes identified through the previous objective; and (iii) the correspondence among
12
the identified WQ classes, the most significant WQ related parameters and the land
13
cover of the study river basin.
RI PT
4
14
M AN U
SC
similarities or
2. STUDY AREA AND METHODS
16
2.1. Study area and water quality (WQ) monitoring sites
17
The study area, the Paute river basin, is located in the south of Ecuador (Fig. 1). It has
18
a surface of about 6442 km2 and its main reach length is approximately 120.4 Km. This
19
is one of the most important hydrographic systems of Ecuador due to the exploitation of
20
part of its enormous hydroelectric potential, which, currently, enables covering about 40
21
% of the energetic demand of the country (CONELEC, 2011). The annual average
22
discharge of the river is 136.3 m3 s-1 (in its lower part); its temporal variation is strongly
23
influenced by the presence of some hydroelectric generation subsystems such as
24
Mazar, Amaluza and Sopladora, that are conforming the Integrated Paute (generation)
25
System (CONELEC, 2009). The Paute river provides water resources to rural,
26
agricultural, urban and industrial areas of an important part of the southern Andes of
27
Ecuador (Da Ros, 1995) and discharges into the Upano river, which belongs to the
28
Amazon river system. Two important cities are located inside the river basin,
AC C
EP
TE D
15
4
ACCEPTED MANUSCRIPT 1
respectively Cuenca and Azogues with approximately 500000 and 33850 inhabitants.
2
The main pollutant loads, from both point and non-point sources, include domestic
3
wastewaters, agricultural runoff, animal husbandry and industrial effluents (Da Ros,
4
1995). The geological features of the basin are very complex, primarily due to the
6
subduction of the Nazca plate towards the South American plate, which causes that
7
this basin is still in the process of rising, sharpening its slope, resulting in large
8
amounts of suspended sediments that are transported by the river to the Amazon basin
9
(Astudillo et al., 2010). The elevation ranges approximately between 500 and 4250 m
10
above the sea level (a.s.l.), with the majority (61.3 %) of the basin in the elevation
11
range 2550-3575 m, 4.3 % in the range 500-1525 m, 13.7 % in the elevation band
12
1525-2550 m, and 20.7 % of the basin is situated above 3575 m. In average terms,
13
slopes vary between 25 % and 50 %; in the upper part of the basin (west) a
14
mountainous relief is dominant whilst a gentler relief is representative in the central and
15
lower parts (east).
TE D
M AN U
SC
RI PT
5
Multi-year average temperature varies between 4.4 °C and 18.6 °C. The lower
17
temperatures correspond to the western Andes range with an average of 6 °C
18
(Páramo), while the warmest areas are situated in the valleys and subtropical zones
19
(Amazonia regions), with an average temperature fluctuating between 22 °C and 26
20
°C. Due to the wide elevation range, rainfall oscil lates in intensity and duration, with
21
maximum annual averages between 2500 mm and 3000 mm at higher elevations and
22
minimum annual averages between 600 mm and 800 mm in the valleys.
AC C
23
EP
16
The sampling design was planned to cover a wide range of parameters (21 in
24
total) at key monitoring sites (80 in total; Fig. 1), aiming at representing as accurately
25
as possible the WQ distribution in the study basin (SENAGUA, 2016).
26 27
2.2. Sampled WQ parameters
5
ACCEPTED MANUSCRIPT The studied 21 WQ parameters include: aluminium (Al), ammonia (N-NH4), 5-day
2
biochemical oxygen demand (BOD5), chloride (Cl), dissolved oxygen (DO), electric
3
conductivity (EC), faecal coliforms (FC), fluoride (F), iron (Fe), nickel (Ni), nitrate (N-
4
NO3), pH, phosphates (P-PO4), potassium (K), sodium (Na), total alkalinity (TALK),
5
total hardness (TH), total phosphorus (P-tot), total solids (TS), turbidity (TU) and water
6
temperature (WT). On average, the monitoring stations were visited five times per year,
7
except in 2008, when they were only sampled three times. Some stations were
8
sampled more often, because they were located either in highly polluted sites or, on the
9
contrary, in unaltered environmental (i.e., reference) locations. As a result of the
10
applied monitoring protocol, a large and complex WQ database was established for the
11
nsp = 80 monitoring stations, upon nrep = 687 sampling replicates and the surveying of
12
the nv = 21 WQ parameters for replicate and monitoring stations, resulting in a total of
13
nobs = nrep x nv = 14427 observations, that are represented by xi,j, with i = 1, 2, …., nv
14
and j = 1, 2, …, nrep (Fig. 2). Unfortunately, despite the fact that this is a very interesting
15
and rich data set, there was never a scientifically driven monitoring plan to collect it but,
16
rather, its structure responds to political/personal initiatives that varied throughout time.
17
Table 1 lists the basic statistics for each of the studied WQ parameters.
SC
M AN U
TE D
AC C
EP
18
RI PT
1
19
6
ACCEPTED MANUSCRIPT 1
Figure 1. Location of the Paute river basin in the context of continental Ecuador and associated
2
distribution of the 80 water quality monitoring stations. UTM coordinate system (WGS84).
3
TE D
M AN U
SC
RI PT
4
5
Figure 2. Flowchart of the modelling protocol that was implemented in the current study and that combines
7
a non-supervised (k-means) and a supervised (k-NN/GA) pattern recognition methods.
9 10
AC C
8
EP
6
2.3. Processing land coverage data for assessing the spatial congruency of WQ classification
11
The correspondence of the spatial distribution of the WQ classes was assessed by
12
comparing it with the land cover distribution of the Paute river basin, using as well
13
auxiliary topographical information. Land cover raster data was available for the year
14
2001, covering the whole extent of the study catchment. This is the most recent dataset
15
that is available publicly. Geographical Information Systems (GIS) algorithms were
16
applied on the original land cover data so that it was reclassified to a more convenient 7
ACCEPTED MANUSCRIPT 1
form, which enabled a direct association between land cover and the spatial distribution
2
of the WQ classes.
3 Table 1. Main statistics associated to the water quality (WQ) parameters that were monitored throughout
5
year 2008 and period 2010-2013 in the Paute river basin (Fig. 1).
Mean
Median
STD
Range
0.62
0.06
1.50
0.00 ─ 23.00
0.06
0.00
0.22
0.00 ─ 1.59
0.50
0.00
7.2
2.3
4.26
0.82
-
TALK (mg L ¹) -
Al (mg L ¹) -
N-NH₄ (mg L ¹)
SC
WQ Parameter
RI PT
4
0.00 ─ 11.35
11.9
0.0 ─ 141.0
16.63
0.00 ─ 263.06
6.79
0.80
3.30 ─ 9.75
76.4
178.0
0.16 ─ 1810.00
1700.0
6668.5
0.0 ─ 16000.0
0.40
4.51
0.0 ─ 54.6
36.8
55.8
0.0 ─ 657.0
0.63
0.00
2.28
0.00 ─ 31.00
0.01
0.00
0.09
0.00 ─ 1.51
0.57
0.06
2.16
0.00 ─ 23.47
7.53
7.58
0.68
4.17 ─ 9.43
1.09
0.41
1.63
0.00 ─ 7.72
0.58
0.19
0.92
0.00 ─ 4.76
K (mg L ¹)
2.64
0.47
13.06
0.00 ─ 227.06
-
6.57
3.48
12.38
0.00 ─ 112.89
-
12.47
0.01
76.36
0.00 ─ 1160.00
TU (NTU)
23.3
2.9
80.9
0.0 ─ 1190.5
WT (⁰C)
15.2
14.6
3.5
8.1 ─ 23.8
-
BOD₅ (mg L ¹) -
CL (mg L ¹) -
DO (mg L ¹)
6.77
-1
EC (µS cm )
130.3 -1
-1
FC (bacteria 100 ml )
5675.2
-
1.03
-
47.8
FL (mg L ¹)
-
Ni (mg L ¹) -
N-NO₃ (mg L ¹) pH -
P-PO₄ (mg L ¹) -
AC C
P-tot (mg L ¹)
EP
-
Fe (mg L ¹)
TE D
TH (mg L ¹)
-
Na (mg L ¹) TS (mg L ¹)
M AN U
1.25
6
Legend: STD = standard deviation; TALK = total alkalinity; Al = Aluminium; N-NH4 = Ammonia, BOD5 = 5-
7
day biochemical oxygen demand; Cl = chlorides; DO = dissolved oxygen; EC = electric conductivity; FC =
8
faecal coliforms; F = fluorides; Fe = iron; Ni = nickel; N-NO3 = nitrate; P-PO4 = phosphates; K = potassium;
8
ACCEPTED MANUSCRIPT 1
Na = sodium; TH = total hardness; P-tot = total phosphorus; TS = total solids; TU = turbidity; WT = water
2
temperature.
3 The considered land cover classes were: (i) altered vegetation; (ii) woody native
5
vegetation; (iii) without cover/urbanised; and (iv) Páramo (unaltered). Additionally, a
6
digital elevation (raster) model of the whole catchment was available with a resolution
7
of 50x50 m2. The referred congruency assessment was based on the visual inspection
8
of the geographical distribution of the WQ classes, the land cover and the topography.
9
ArcGis® was used for all GIS analyses, including the latter visual inspection.
SC
RI PT
4
10
2.4. Data processing and multivariate statistical assessment
12
Figure 2 depicts the multivariate statistical protocol applied in this study. All the
13
statistical analyses involved in it were implemented with MATLAB® (Hanselman and
14
Littlefield, 2012) version 2014. As a first step, the Shapiro-Wilk test (Shapiro and Wilk,
15
1965) was used to evaluate whether the distributions of the WQ parameters were
16
normal. This test showed with a 95% confidence level that none of the WQ parameters
17
are normally distributed (p probability value = 0.0001 lower than the significance level α
18
= 0.05), exhibiting in all cases a positive skew (Table 1).
TE D
M AN U
11
Therefore, it was decided to continue with the statistical assessment by using a
20
central tendency value (i.e. aggregated value) of the distribution of every WQ
21
parameter, for a given monitoring station, instead of using the whole set of available
22
observations. Further, this aggregated parameter was standardised through range
23
scaling so that a more normal associated distribution could be obtained; this kind of
24
standardisation procedure is recommended in a cluster analysis, particularly for the k-
25
means method (Steinley, 2004).
AC C
EP
19
26
Herein, the applied aggregation procedure reduced the dimension of the analysis
27
to nmed = nv x nsp = 1680 median observations (xi,m, with i = 1, 2, …., nv and m = 1, 2, …,
28
nsp). Given the significant skewness of the WQ parameters distributions, the use of 9
ACCEPTED MANUSCRIPT 1
median rather than of mean values was preferred since the median is expected to be a
2
better central tendency measure (Anderson and Finn, 1996; Helsel and Hirsch, 2002).
3
Standardisation of xi,m was achieved by means of:
4
im
=
x im − L U
m
−L
m
m
for i = 1, 2, …., n and m = 1, 2, …, n v sp
(1)
RI PT
Z
5
, where Lm and Um are the minimum and maximum limits of the parameter range, so
6
that zi,m varies between 0 and 1 (Frank and Todeschini, 1994).
Then, the aggregated dataset of nmed = 1680 zi,m values was used in the context
8
of the k-means algorithm (MacQueen, 1967), a non-hierarchical cluster analysis (CA)
9
and a non-supervised pattern recognition method, were carried out with the intention of
10
defining k groups of monitoring stations with common WQ characteristics. As a prior
11
process, validation indices (Wang et al., 2009) were calculated to determine the
12
number of WQ clusters, k, for the application of the k-means method.
M AN U
SC
7
Once the k WQ groups of monitoring stations were defined through the k-means
14
process (Fig. 2), for a given monitoring station, the respective identifier (or “label”) of
15
the corresponding WQ (k-means) group (or class) was assigned to the original
16
observations of the station. After following this procedure for all of the sampling
17
stations, on the resulting new data matrix of nobs observations (xi,j), the k-Nearest
18
Neighbour (k-NN), a supervised pattern recognition method, in combination with a
19
genetic algorithm (GA) for optimisation, was applied.
EP
AC C
20
TE D
13
It is worth noticing that the WQ related parameters frequently are of different
21
nature or have different units. Many pattern recognition techniques, like the ones used
22
in this study, are very sensitive to these issues (Todeschini et al., 2015). Therefore,
23
range scaling (Eq. 1) was used to eliminate this dependence on the referred issues,
24
regardless of whether the applied recognition techniques were parametric or non-
25
parametric.
26 27
2.4.1. The non-supervised k-means agglomerative method 10
ACCEPTED MANUSCRIPT The goal of CA is to determine the intrinsic grouping in a set of unlabelled (i.e.
2
unclassified) data, upon the similar characteristics that their members possess (Frank
3
and Todeschini, 1994). In this context non-hierarchical clustering techniques, such as
4
k-means, have been widely used in different applications, among them the
5
investigation of the hydro-chemical processes in-stream water quality (Güler et al.,
6
2002; Caccia and Boyer, 2005). Cluster membership is determined by calculating the
7
centroid for each group and assigning each object to the group with the closest
8
centroid, which minimises the overall within-cluster dispersion by iterative reallocation
9
of cluster members (Hartigan and Wong, 1979). In this study, this method was applied
10
using the Euclidean distance (expressed in the units of measure of v) as the measure
11
of similarity between objects (Chen et al., 2002):
M AN U
SC
RI PT
1
Di, j =
12
v=n v
∑ (z v=1
− z j,v )
2
i ,v
(2)
, where zi,v and zj,v are the values of the normalised parameter v for object i and j,
14
respectively, where i is different from j and both are < nsp.
TE D
13
Further, the evaluation of clustering solutions is fundamental in CA. Validity
16
indices, both external and internal, were used for this purpose (Wang et al., 2009). An
17
external index measures the agreement between a priori known clustering structure
18
and the result from a current clustering procedure (Dudoit and Fridlyand, 2002), whilst
19
an internal index measures the appropriateness (or goodness) of a clustering partition
20
without external information, using quantities and features inherent in the data
21
(Thalamuthu et al., 2006). Internal validity indices were applied in the current analysis,
22
namely, the Silhouette Coefficient (SC), the Davies-Bouldin (DB) index and the
23
Caliński-Harabasz (CH) index. In this context, the Cluster Validity Analysis Platform
24
(CVAP) was used for this purpose (Wang et al., 2009). These were adopted herein
25
since they are widely used for k estimation and clustering quality evaluation (Wang et
26
al., 2009).
AC C
EP
15
11
ACCEPTED MANUSCRIPT SC (Kaufman and Rousseeuw, 1990) is a dimensionless measure that evaluates
2
the quality of compactness and separation of clusters; with an upper bound equal to 1,
3
the optimum k value corresponds to its largest average (Chen et al., 2002). DB (Davies
4
and Bouldin, 1979) is a function of the ratio of the sum of within-cluster scatter to
5
between-cluster separation; as such, the objective is to minimise it, which implies
6
minimising the within-cluster scatter and maximising the between-cluster separation
7
(Ray and Turi, 1999). CH (Caliński and Harabasz, 1974), or variance ratio criterion, is a
8
measure of the between-cluster isolation and the within-cluster coherence; the
9
objective is to maximise it (Kryszczuk and Hurley, 2010).
SC
RI PT
1
Each of the nsp = 80 monitoring stations was assigned to either of the identified k
11
WQ-clusters, as the result of the application of the k-means methodology. For a given
12
WQ monitoring station, the identifier (ID) of the associated WQ-cluster was designated
13
to all the parameters and temporal replicates sampled at this station (Fig. 2). This was
14
done for all 80 monitoring stations and 687 temporal replicates. In this way, the whole
15
matrix of original data had associated k WQ-clusters and was ready for the k-NN/GA
16
supervised pattern recognition analysis (Fig. 2).
19
TE D
18
2.4.2. The supervised k-Nearest Neighbour (k-NN) classification algorithm in
EP
17
M AN U
10
combination with a genetic algorithm (GA) Supervised pattern recognition methods identify models (or classification rules), which
21
can carry out a cluster partitioning using local information provided by descriptive
22
parameters (Lavine and Rayens, 2009). In the current study, the k-NN method (Fix and
23
Hodges, 1951) was used in combination with a GA to: (i) verify the allocation of
24
monitoring stations (or objects or samples) into the classes determined previously by
25
the k-means method, according to the descriptive parameters of the stations (pattern
26
recognition); and (ii) identify the descriptive parameters that explain most of the internal
27
structure of the WQ classes, parameters that should be monitored in future, reducing in
AC C
20
12
ACCEPTED MANUSCRIPT 1
this way the complexity of the original database since the other monitored parameters
2
would not really be necessary for this explanation. The k-NN is a non-parametric algorithm through which a sample point is
4
classified according to the majority vote of its k nearest neighbouring points (Frank and
5
Todeschini, 1994). Thus, for a given sample point and for defining its nearest
6
neighbours (NN), distances (Euclidian, herein) are evaluated from the sample point to
7
every other point in the data set. Based on the class label (defined previously in this
8
study through the k-means) of most of the NN in a sample point, the sample point is
9
assigned that class; when the newly assigned class matches the previously assigned
10
class of the sample point (defined by the k-means method), the k-NN algorithm is
11
considered successful (Lavine and Rayens, 2009). Thus, the k−NN classification has
12
two steps, namely, (i) the definition of the neighbouring points (NN-points) of the
13
sample point; and (ii) the subsequent determination of the class of the sample point,
14
using the information provided by those NN-points.
M AN U
SC
RI PT
3
The k-NN algorithm performance was measured through the non-error rate
16
(NER), which is defined as the percentage of monitoring stations correctly assigned by
17
the k-NN algorithm to the classes determined previously (Fig. 2) by the k-means
18
method (classification accuracy criteria; Hand, 2012). In this study, NER was estimated
19
using a testing data set (i.e., 40% of total data). Sample points were classified
20
according to the WQ parameters (physicochemical and microbiological) that are known
21
to be indirectly related to WQ, usually through some undetermined mathematical
22
relationships. A training data set (i.e., 60% of the total data) for which the property of
23
interest (WQ) and the indirectly related descriptors (in the current case the
24
physicochemical and microbiological parameters measured at every WQ monitoring
25
station) are known was used for developing a classification rule that was in turn utilised
26
to predict the WQ of data samples that were not part of the training set (Lavine and
27
Rayens, 2009).
AC C
EP
TE D
15
13
ACCEPTED MANUSCRIPT Besides the fact that, in general, the more the number of measured parameters
2
the more a classification algorithm gets confused, for the particular case of the k-NN
3
algorithm, there are three main limitations, namely (Frank and Todeschini, 1994;
4
Suguna and Thanushkodi, 2010): (i) calculation complexity due to the fact that all of the
5
similarities between the training sample points must be calculated for classification; (ii)
6
the performance of the classification algorithm is over-dependent on the training set;
7
and (iii) there is no weight difference between the training sample points, despite the
8
implicit differences in terms of available data for every one of those points.
RI PT
1
Thus, the combination of the k-NN algorithm with a GA was applied to overcome
10
these limitations by: (i) generating a profile about the internal data structure of every
11
class; and (ii) getting the contribution of individual related parameters (measured at
12
every WQ monitoring station) to the classification of the main property of interest (in the
13
current case, WQ). In this context, GAs demonstrated to be robust searching
14
techniques that in most cases outperform traditional optimisation methods in water
15
resources applications (Mulligan and Brown, 1998; Ng and Perera, 2003; Liu et al.,
16
2007), through the emulation of evolution by natural selection and genetic inheritance,
17
so that a population of competing solutions evolve over time to converge to a single
18
optimal one (Holland, 1975).
EP
TE D
M AN U
SC
9
Thus, every model parameter (every WQ related parameter, in this study) is a
20
gene, while a complete set of genes (21 parameters) is a chromosome. Every gene
21
adopts the value of the respective WQ related parameter encoded as a variable-length
22
binary number. The length of this binary number depends on the magnitude of the WQ
23
parameter value. For example, if a gene represents pH (4.17 < pH < 9.43) the length of
24
the respective binary number will be shorter than the corresponding length when a
25
given gene represents, say, faecal coliforms (varying between 0 and 16000), owing to
26
the magnitude difference of the values adopted by pH and faecal coliforms. The
27
advantage of using this binary system is that very small value variations of a given WQ
28
parameter can be taken into account in the analysis.
AC C
19
14
ACCEPTED MANUSCRIPT The optimisation process is carried out by using three biological operators
2
(Sivanandam and Deepa, 2008; Liu et al., 2007), namely: (i) selection; (ii) cross-over
3
(reproduction); and (iii) mutation. Selection makes sure that only the best
4
chromosomes (solutions) could cross-over or mutate. During successive iterations
5
(generations), initial chromosomes evolves into stronger ones by reproduction among
6
members of the previous generation. Each GA run consists of several generations with
7
constant population size of chromosomes. The strength of each chromosome is
8
evaluated by means of a fitness function. In the current study, a pattern recognition
9
function through a classification model, NER, was used to characterise the
10
performance of the k-NN/GA method regarding the prior classification produced by the
11
unsupervised k-means method. So, the highest NER values were obtained for a cross-
12
over probability of 60%, a mutation probability of 3% and a constant population of 21
13
chromosomes that were evolved over 100 generations. Thus, subsequent generations
14
are formed by combining (stronger) chromosomes, with associated higher fitness
15
values, from the previous (or parent) population. Further, a random change of the value
16
of a gene to form a new chromosome is achieved through mutation, on a bit-by-bit
17
basis (from 0 to 1 or in reverse order).
TE D
M AN U
SC
RI PT
1
The combination of the k-NN and GA that was used in the current study follows to
19
a great extent the approach presented by Suguna and Thanushkodi (2010);
20
alternatives are for instance Chang and Lippmann (1990) and Raymer et al. (2000).
21
Hereafter, the approach involved the following steps: (i) data processing in general,
22
including the encoding of the genes that form the chromosomes; (ii) selecting the
23
distance (or similarity) measure among the test stations whose category are being
24
predicted and the respective neighbouring training stations; (iii) choosing the k-number
25
of samples (or objects) from the training set to generate the initial population on which
26
genetic operators are applied during optimisation; (iv) calculating the distance
27
measures among testing and training samples through a goodness of fit (or similarity)
28
index; (v) choosing the chromosome with the highest goodness of fit value and storing
AC C
EP
18
15
ACCEPTED MANUSCRIPT it as the global maximum; and (vi) refining the global maximum. The latter step is
2
performed through an iterative procedure that is based on the application of genetic
3
operators, namely, reproduction, crossing-over and mutation to get a new population
4
upon which a newer evaluation of the goodness of fit measure takes place to define a
5
local maximum; only if the newer goodness of fit measure (local maximum) is higher
6
than the one associated to the older global maximum, the chromosome that
7
corresponds to the newer local maximum is the newer global maximum. The iterative
8
process is over once the total number of generations (100 in this study) are taken into
9
account. The chromosome that corresponds to the global maximum has the optimum
11
SC
k-neighbours and the respective classification results.
M AN U
10
RI PT
1
3. RESULTS
13
3.1. Land cover processing
14
Based on the adopted classes, the land cover percentages within the study basin are
15
respectively altered vegetation (37.12 %), woody native vegetation (34.38 %), without
16
cover/urbanised (3.54 %), and Páramo (unaltered vegetation) (24.96 %). Figure 3
17
illustrates, in addition to the distribution of land cover, the WQ class, C1 or C2, of each
18
monitoring station and the bounds of the 18 sub-basins in which the Paute river basin
19
was sub-divided. The altered vegetation is present particularly in the mid corridor of the
20
basin; both, the higher western and lower eastern sides of the basin still have a
21
significant presence of woody native vegetation and Páramo. Woody native vegetation
22
is present for instance in the sub-basins Negro (77.8 %), Juval (86.4 %), Pulpito (77.3
23
%) and Machángara (76.4 %). Páramo (unaltered) is present in the sub-basins
24
Yanuncay (60.1 %), Pulpito (59.5 %), Machángara (56.7 %) and Tomebamba (48.3 %).
25
Further, the Paute basin includes important extents of protected areas (Fig. 1). The
26
most renowned are the Cajas National Park (PNC, located in the western, higher,
27
extreme of the basin; Mosquera et al., 2017), which is a Ramsar-Convention
28
(RAMSAR) wetland site, and the Sangay National Park (PNS, located at the north-east
AC C
EP
TE D
12
16
ACCEPTED MANUSCRIPT extreme); both recognised by the the United Nations Educational, Scientific and
2
Cultural Organization (UNESCO) as World Heritage Sites. Nearly 41% of the Paute
3
basin area is subjected to management and conservation programs.
4 5 6
Figure 3. Spatial distribution in the Paute river basin of (i) the soil cover (year 2001); (ii) the 18 sub-basins;
7
and (iii) the water quality (WQ) monitoring stations, classified according to the k-NN/GA algorithm as
8
having associated low (C1) or high (C2) pollution. Sub-basins are: 1 = Sidcay, 2 = Collay, 3 = Cuenca, 4 =
9
Jadan, 5 = Paute, 6 = Machangara, 7 = Magdalena, 8 = Mazar, 9 = Juval, 10 = Pindilig, 11 = Pulpito, 12 =
10
Sta. Bárbara, 13 = Burgay, 14 = Tarqui, 15 = Tomebamba, 16 = Yanuncay, 17 = Paute bajo, and 18 =
11
Negro.
TE D
EP
12
M AN U
SC
RI PT
1
3.2. k-means algorithm: spatial similarity and monitoring stations grouping
14
The maximum SC value (SCmax), the minimum DB value (DBmin), and the maximum CH
15
value (CHmax) were all obtained for k = 2 (Table 2), that is, implying that there are two
16
statistically significant clusters. As for the results of the k-means method, these two
17
groups of sites in the Paute river basin coincide with prior WQ studies in the study
18
region (Da Ros, 1995; Pauta-Calle and Chang-Gómez, 2014); cluster 1 (C1)
19
corresponds to relatively less polluted sites (67 monitoring stations) while cluster 2 (C2)
20
matches highly polluted sites (13 monitoring stations).
AC C
13
17
ACCEPTED MANUSCRIPT Figure 3 highlights, as it might be expected, that most WQ monitoring stations
2
belonging to C1 are situated at the upper lands and in areas with a high level of
3
conservation and good land management. On the other hand, the monitoring stations
4
classified as C2 are associated to sites that receive pollution from, both, point and non-
5
point sources such as agricultural fields, orchard plantations and municipal and
6
industrial wastewater effluents, particularly those stations located in the Burgay sub-
7
basin, near the Azogues city. The rest of the WQ monitoring stations belonging to C2
8
(13 stations) are located in the Magdalena and Pindilig sub-basins, which are in the
9
vicinity of the Burgay and the Santa Bárbara sub-basins. All these sub-basins are
SC
located in the middle elevation part of the Paute river basin.
11
M AN U
10
RI PT
1
12
Table 2. Values adopted by the internal validation indices as a function of the number of clusters k
13
(application of the non-supervised k-means algorithm for classification).
Inspected number of clusters k
Internal 2
SC
0.60
DB CH
14
3
4
5
6
0.32
0.25
0.28
0.28
TE D
Indexes
1.31
1.65
1.66
1.54
1.50
28.22
20.75
19.54
17.55
16.77
Legend: SC = Silhouette coefficient; DB = Davies-Bouldin; CH = Caliński-Harabasz.
EP
15
3.3. k-NN/GA: pattern recognition performance and significant WQ parameters
17
In this analysis, for a given monitoring station, the respective (k-means) WQ class was
18
the (dependent) grouping variable, whilst the corresponding measured parameters
19
constituted the independent variables. A good classification accuracy was obtained
20
with the k-NN/GA algorithm, since the average NER for 100 runs (Fig. 4a) was about
21
0.87 (± STD = ± 0.006), implying that about 87 % of the WQ monitoring stations were
22
correctly assigned with respect to the previously, k-means derived WQ classes (Fig.
23
4a).
AC C
16
24
18
ACCEPTED MANUSCRIPT 0.88
0.88 NER (-)
(b) 0.89
NER (-)
(a) 0.89
0.87 0.86
0.87 0.86 0.85
0.86
0.84 25 50 75 Run number (-)
100
0 5 10 15 20 Number of WQ parameters (-)
RI PT
0
80 60 40 20 0 EC
FC
DO
TH
CL
SC
Frequency (%)
(c) 100
N - NO₃₃ BOD₅₅
TALK
TU
Significant WQ parameters
M AN U
1
Figure 4. (a) Evolution of the classification model NER as a function of the run number throughout the
3
classification process; (b) the number of WQ variables (i.e. measured parameters) included in the final
4
stepwise selection analysis; and (c) the frequency of k-NN/GA variable (i.e. measured parameter) selection
5
throughout the classification process (EC = electric conductivity; FC = faecal coliforms; DO = dissolved
6
oxygen; TH = total hardness; CL = chlorides; N–NO3 = nitrates; BOD5 = 5-day biochemical oxygen
7
demand; TALK = total alkalinity; and TU = turbidity). The k-value that was used through the application of
8
the k-NN/GA process is 2.
9
TE D
2
With the objective of getting the most important independent variables (i.e.
11
measured parameters) that explain the structure of the two WQ classes, the final
12
stepwise selection with the best NER indicated that nine parameters are the optimum
13
to interpret the two WQ classes (Fig. 4b), namely, EC, FC, DO, TH, CL, N-NO3, BOD5,
14
TALK and TU, that account for most of the expected spatial variations of the WQ in the
15
Paute river basin (Fig. 4c). Further, the frequency of the k-NN/GA variable selection
16
throughout the classification process (Fig. 4c), illustrates that the EC parameter is the
17
most significant by the classification model, whilst the remaining 8 WQ parameters
18
seem to have a similar (lower) significance.
AC C
EP
10
19
In the following, the sub-index C1 indicates that a given parameter or
20
characteristic is associated to the WQ class C1; whilst the sub-index C2 refers to WQ 19
ACCEPTED MANUSCRIPT class C2. Figure 5 shows the Box plots of the 9 significant WQ parameters as a
2
function of the two WQ classes C1 (less polluted) and C2 (highly polluted). In general,
3
the values adopted by EC (Fig. 5a), FC (Fig. 5b), TH (Fig. 5d), CL (Fig. 5e), N–NO3
4
(Fig. 5f) and TU (Fig. 5i) are higher for the sampling points belonging to C2 than for the
5
ones belonging to C1 (average values are ECC1 = 87.7 µS cm-1, ECC2 = 280.3 µS cm-1;
6
FCC1 = 3936.7 bacteria 100-1 ml-1, FCC2 = 11794.3 bacteria 100-1 ml-1; THC1 = 37.3 mg
7
L-1, THC2 = 84.5 mg L-1; CLC1 = 1.7 mg L-1, CLC2 = 13.2 mg L-1; (N–NO3)C1 = 0.3 mg L-1,
8
(N–NO3)C2 = 1.5 mg L-1; and TUC1 = 14.7 mg L-1, TUC2 = 53.5 mg L-1). For the DO (Fig.
9
5c) the inverse condition was observed, that is, lower values for waters in C2 (average
10
6.32 mg L-¹) than for the waters in C1 (average 6.90 mg L-¹). For TALK (Fig. 5g) and
11
BOD5 (Fig. 5h) no considerable differences between waters belonging to C1 and the
12
ones belonging to C2 were observed (average TALKC1 = 0.62 mg L-1, TALKC2 = 0.61
13
mg L-1 and (BOD5)C1 = 6.43 mg L-1, (BOD5)C2 = 9.92 mg L-1); however, the dispersion is
14
slightly lower for the peak values of waters in C2 than for the waters in C1, similarly to
15
what is observed for the DO. It is important to observe that most of the peak values
16
(Fig. 5) were observed at the Burgay sub-basin (Fig. 3); thus, they do not seem to be
17
outliers, but they are simply characterising the extreme WQ conditions taking place
18
within such sub-basin.
SC
M AN U
TE D
EP
19
RI PT
1
4. DISCUSSION
21
Because the pre-selection of the number of clusters (k) at the start of the k-means
22
method strongly influences its outcome (Güler et al., 2002), the reliability of the two
23
clusters was satisfactorily tested through the internal validation indexes SC, DB and
24
CH. In the current case a SC value of 0.6 corresponds to “a reasonable structure has
25
been found” category (Kaufman and Rousseeuw, 1990). DB and CH are not valued
26
with a categorization likewise SC; however, the fact that both suggest two WQ classes
27
emphasises the similarity with the results of the k-means algorithm that partitioned the
28
monitoring stations into two homogeneous groups.
AC C
20
20
ACCEPTED MANUSCRIPT
TE D
M AN U
SC
RI PT
1
EP
2
Figure 5. Box plots of the WQ significant parameters as a function of the two WQ classes C1 (less
4
polluted) and C2 (highly polluted): (a) electric conductivity (EC); (b) faecal coliforms (FC); (c) dissolved
5
oxygen (DO); (d) total hardness (TH); (e) chlorides (CL); (f) nitrates (N–NO3); (g) total alkalinity (TALK); (h)
6
5-day biochemical oxygen demand (BOD5); and (i) turbidity (TU).
7 8
AC C
3
9
Further, the current application of the k-means algorithm was based on a non-
10
conventional approximation. Namely, due to the significant skewness present in the
11
original WQ parameters, a simplified data matrix of the median values of the WQ
12
parameters,
that
were
previously
standardised,
was
used
(Ouyang,
2005). 21
ACCEPTED MANUSCRIPT Standardisation of the WQ parameters was achieved by considering their ranges of
2
variation rather than by using the z-scale procedure, as conventionally done in most
3
WQ studies (Shrestha and Kazama, 2007; Kannel et al., 2007; Singh et al., 2004). In
4
this context, Steinley (2004) concludes that standardisation of the studied parameters,
5
by considering their range of variation, was the most effective approach while using the
6
k-means method.
RI PT
1
In classification models, parametric methods, such as the linear discriminant
8
analysis, widely used in surface water resources (Hajigholizadeh & Melesse, 2017;
9
Gholizadeh et al., 2016; Kovács et al., 2014), require a priori knowledge of the
10
probability density functions of the classes. However, in most real-world applications
11
these statistical properties are very hard to be known in advance (Lavine and Rayens,
12
2009) and/or classes substantially depart from normality (Hirsch and Alexander, 1991).
13
Consequently, parametric methods are no longer adequate (Huberty and Olejnik, 2006)
14
and non-parametric methods, such as the k-NN algorithm, are a more appropriate
15
alternative (Hirsch et al., 1982; Towler et al., 2009), despite of which the k-NN
16
approach is normally not chosen as a primary classification method in surface water
17
studies.
TE D
M AN U
SC
7
The k-NN/GA method gave relevant results in selecting appropriate explanatory
19
variables representative of the WQ status in the study basin. Hence, the method
20
yielded an important data reduction as it identified only nine (EC, FC, DO, TH, CL, N-
21
NO3, BOD5, TALK and TU) out of the 21 WQ measured parameters, that are explaining
22
about 86 % of the cluster assignations in supervised pattern recognition, reflecting
23
most of the WQ spatial variation in the study basin.
AC C
EP
18
24
Hence, the average ECC2 (280.3 µS cm-1) was almost three times higher than the
25
respective value for ECC1 (87.7 µS cm-1). Because EC increases nearly linearly with
26
increasing ion concentration, it implies that inorganic inputs are high in waters
27
belonging to C2. Most of C2 associated monitoring stations belong to the Burgay sub-
28
basin (Fig. 3), where wastewater effluents often contain high amounts of dissolved 22
ACCEPTED MANUSCRIPT salts from inorganic pollution sources such as domestic sewage, municipal storm water
2
drainage and industrial effluent discharges (Da Ros, 1995; Pauta-Calle and Chang-
3
Gómez, 2014). Further, other variables related to inorganic pollution sources such as
4
TH, CL, TU, N-NO3 and TALK have similar evolution to the one of EC. In this regard,
5
Olajire and Imeokparia (2001), Daniel et al. (2002) and Begum et al. (2009) observed
6
in similar studies that in water with high EC values nitrate and chloride ions and
7
calcium or magnesium salts are predominant.
RI PT
1
The high TH values associated to C2 might be linked to the use of inorganic
9
fertilisers (WHO, 1998). Hereafter, high concentrations of the nitrate ions (inorganic
10
nutrient) associated to C2 might be attributed to the runoff of fertiliser residues from
11
agricultural lands, mainly urea (CON2H4). Further, FC, DO, and BOD5 normally reflect
12
contamination by microbial communities and human activities (Liou et al., 2004).
13
Hereafter, the trends observed for these variables suggest an important organic
14
pollution from anthropogenic source associated to the WQ monitoring sites belonging
15
to C2, resulting in high oxygen-demand consumption, characterised by average BOD5
16
values higher in C2 (9.92 mg L-¹) than in C1 (6.43 mg L-¹).
TE D
M AN U
SC
8
A similar trend was found in other studies such as Vega et al. (1998), Singh et al.
18
(2004) and Kannel et al. (2007), which reported that high levels of dissolved organic
19
matter (DOM, i.e., carbohydrates, proteins, lipids, etc.) consume large amounts of
20
oxygen, which deploys the amount of available DO. The latter is in line with the results
21
of the current study. However, on the contrary to what is addressed in those studies, in
22
our study basin, the DO did not drop to a minimum level and, as such did not induce
23
anaerobic fermentation leading to development of ammonia and organic acids. This is
24
emphasised by the evolution of the pH distributions associated to the C1 and C2
25
monitoring sites, which shows that the acidity of the C2-associated waters is not
26
systematically higher than the one recorded in the C1-associated waters, except for
27
very few pH data near the lower, i.e. acidic, extreme of the pHC2 range of variation
28
(pHC1 range = 5.3 - 9.43, pHC2 range = 4.17 - 8.95).
AC C
EP
17
23
ACCEPTED MANUSCRIPT Several authors emphasised the importance of finding land use/cover attributes
2
that explain the spatial variation of WQ (Bücker et al., 2010; Bu et al., 2014; Cunha et
3
al., 2016). Examination of the spatial distribution of both, the C1 and C2 stations, as
4
well as the land cover, suggest that most of the WQC2 monitoring stations (Fig. 3) are
5
not directly exposed to, nor have a significant influence from, native forest and
6
unaltered vegetation cover, which is particularly the case in the Burgay sub-basin,
7
where most of the WQC2 monitoring stations are located and where 55.3 % of the
8
surface has altered native forest and vegetation (resulting from the introduction of
9
exotic species such as eucalyptus, or replacement of native vegetation by pastures).
10
Further, besides the Azogues city, which does not possess a water treatment system
11
prior to final in-river disposal, this sub-basin has an important number of small villages
12
that produces a significant contribution of untreated sewage (Da Ros, 1995). This
13
problem is even accentuated by the industrial production of cement (Da Ros, 1995;
14
Pauta-Calle and Chang-Gómez, 2014). Similarly, the neighbouring Magdalena sub-
15
basin has a very high proportion (70.7 %) of its surface covered by altered vegetation
16
and a limited native forests cover (24.2 %), which explains the presence of a WQC1
17
station located in the native forest area and a WQC2 station by the outlet of the sub-
18
basin. Both, the Burgay and the Magdalena sub-basins, do not have adequate land
19
management programs.
EP
TE D
M AN U
SC
RI PT
1
In the other sub-basins, with the presence of WQC2 stations (Paute, Pindilig and
21
Sta. Bárbara), the proportion of surface with altered vegetation cover is significant,
22
even though the majority of the WQ sampling stations are classified as C1. All WQ
23
sampling stations of the Paute sub-basin (Fig. 3), except one, are classified as C1 and
24
are located in areas where no significant human settlements are present; only the WQ
25
station located at the outlet of the sub-basin is classified as C2, which can be explained
26
not only by the direct influence of human settlements located immediately upstream
27
along the river course, but also by the direct contribution from the Magdalena sub-basin
28
into the lower part of the Paute sub-basin (Fig. 2 and Fig. 3).
AC C
20
24
ACCEPTED MANUSCRIPT For the Pindilig and Santa Bárbara sub-basins that contain WQC2 stations, the
2
location of the WQC1 stations is congruent with its location near a native forest and
3
unaltered vegetation. Further, a significant proportion of their surface (50.2 % for the
4
Santa Bárbara sub-basin and 32.6 % for the Pindilig sub-basin) is covered by altered
5
vegetation and there is only a very scarce presence of human settlements. In this
6
regard, once again, the land cover distribution suggests point source pollution from
7
focalised human settlements, which is accentuated in the Santa Bárbara sub-basin by
8
the fact that the average FCC2 (13573.1 bacteria 100-1 ml-1) doubles the respective
9
average FCC1 (6272.4 bacteria 100-1 ml-1). For the Pindilig sub-basin, this is confirmed
10
by the average FCC2 value (12621.4 bacteria 100-1 ml-1) that nearly doubles the FCC1
11
value (6909.8 bacteria 100-1 ml-1), as well as, by the fact that the average (BOD₅)C2
12
value (22.6 mg L-1) triplicates the respective (BOD₅)C1 value (6.1 mg L-1).
M AN U
SC
RI PT
1
Hereafter, point source pollution locations seem to have a low influence on the
14
downstream WQ, as the presence of human settlements nearly vanishes, there exists
15
an increment of discharge and the presence of native and altered vegetation prevails.
16
Particularly the presence of native vegetation has the potential of determining the
17
existence of riparian ecosystems in several spots along the river courses forming buffer
18
systems for the enhancement of riparian ecosystems and, finally, of the downstream
19
WQ. Formation of these riparian buffers should be encouraged in the study basin, for
20
instance for the removal of nitrates (Lowrance et al., 1997; Sweeney and Newbold,
21
2014; Connolly et al., 2015), that in WQC2 sites reaches an average value of 1.51 mg L-
22
¹, which is about 5 times the average in the WQC1 sites (0.31 mg L-¹).
EP
AC C
23
TE D
13
On the other hand, the sampling sites located by Cuenca city (Tomebamba and
24
Yanuncay sub-basins) are classified as belonging to the C1 group, despite of being
25
located at the most urbanised area of the Paute river basin. This is logic and explained
26
by the fact that in the city of Cuenca an acceptable treatment of most of the wastewater
27
takes place, with favourable consequences for the WQ of the Paute river. In the
28
remaining sub-basins of the Paute river basin the presence of native forest and 25
ACCEPTED MANUSCRIPT 1
unaltered vegetation (such as páramo) is significant and the level of anthropisation is
2
low, which emphasises the congruency of the current WQ classification results.
3 5. CONCLUSIONS
5
The study illustrated the usefulness of the utilised non-supervised (k-means) and
6
supervised (k-NN/GA) pattern recognition algorithms for the analysis and interpretation
7
of complex data sets, in the context of river WQ assessment, the identification of
8
pollution sources/factors and the understanding of spatial variations in river WQ,
9
offering the potential of contributing positively to effective river WQ management in the
10
study basin, such as, reducing the number of parameters to be monitored in the future,
11
which shall result in a significant reduction of the monitoring costs, without surrendering
12
on accuracy. Hence, the study revealed as well the worth of combining data mining
13
techniques and common GIS tools to cross-check on congruency of results.
14
M AN U
SC
RI PT
4
6. ACKNOWLEDGEMENTS
16
This study was conducted in the context of the Master's thesis (National University of
17
La Plata, Argentine) carried out by the first author and co-directed by the second
18
author. Additional work took place within the frame of the fellowship awarded to the
19
second author by the Ecuadorian Secretary for Higher Education, Science, Technology
20
and Innovation (SENESCYT) through its “PROMETEO Viejos Sabios” Program. The
21
authors would like to express their gratitude to SENAGUA for making the row data
22
accessible to the current study. Further, the authors are thankful to Jan Feyen for going
23
through the first draft of this manuscript and to the anonymous reviewers for their
24
important remarks.
AC C
EP
TE D
15
25
26
ACCEPTED MANUSCRIPT 1
REFERENCES
2
Anderson, T., Finn, J., 1996. The New Statistical Analysis of Data. Springer-Verlag
3
New York, Inc. Astudillo, S., Astudillo, P., Cisneros, P., Coello, C., García, J., González, C., Pacheco,
5
E., Rengel, A., Stoop, B., Van Noten, S., Wijffels, A., Zúñiga, A., 2010. Atlas de la
6
cuenca del río Paute, PROMAS - Universidad de Cuenca. ed. Consejo de gestión
7
de aguas de la cuenca del Paute, Cuenca - Ecuador.
RI PT
4
Barnett, T., Pierce, D., Hidalgo, H., Bonfils, C., Santer, B., Das, T., Bala, G., Wood, A.,
9
Nozawa, T., Mirin, A., Cayan, D., Dettinger, M., 2008. Human-induced changes in
10
the hydrology of the western United States. Science 319, 1080–1083.
11
doi:10.1126/science.1152538
M AN U
SC
8
Bartram, J., Ballance, R., 1996. Water Quality Monitoring - A Practical Guide to the
13
Design and Implementation of Freshwater Quality Studies and Monitoring
14
Programmes, First Edit. ed. United Nations Environment Programme / World
15
Health Organization. doi:10.1159/000170272
TE D
12
Begum, A., Ramaiah, M., Harikrishna, Khan, I., Veena, K., 2009. Heavy Metal Pollution
17
and Chemical Profile of Cauvery River Water. E-Journal of Chemistry 6, 47–52.
18
doi:10.1155/2009/154610
20 21
Bu, H., Meng, W., Zhang, Y., Wan, J., 2014. Relationships between land use patterns
AC C
19
EP
16
and water quality in the Taizi River basin, China. Ecological Indicators 41, 187–
197. doi:10.1016/j.ecolind.2014.02.003
22
Bücker, A., Crespo, P., Frede, H.-G., Vaché, K., Cisneros, F., Breuer, L., 2010.
23
Identifying Controls on Water Chemistry of Tropical Cloud Forest Catchments:
24
Combining Descriptive. Aquat Geochem 127–149. doi:10.1007/s10498-009-9073-
25
4
26 27
ACCEPTED MANUSCRIPT 1
Caccia, V.G., Boyer, J.N., 2005. Spatial patterning of water quality in Biscayne Bay,
2
Florida as a function of land use and water management. Marine Pollution Bulletin
3
50, 1416–1429. doi:10.1016/j.marpolbul.2005.08.002
5 6 7
Caliński, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communications in Statistics 3, 1–27. doi:10.1080/03610927408827101
RI PT
4
Chang, E.I., Lippmann, R.P., 1990. Using Genetic Algorithms to Improve Pattern Classification Performance. NIPS 1990 797–803.
Chen, G., Jaradat, S.A., Banerjee, N., Tanaka, T.S., Ko, M.S., Zhang, M.Q., 2002.
9
Evaluation and comparison of clustering algorithms in analyzing ES cell gene expression data. Statistica Sinica 12, 241–262.
M AN U
10
SC
8
CONELEC, 2011. Estadística del sector eléctrico ecuatoriano - Folleto resumen.
12
CONELEC, 2009. Plan maestro de electrificación del Ecuador.
13
Connolly, N.M., Pearson, R.G., Loong, D., Maughan, M., Brodie, J., 2015. Water
14
quality variation along streams with similar agricultural development but
15
contrasting riparian vegetation. Agriculture, Ecosystems and Environment 213,
16
11–20. doi:10.1016/j.agee.2015.07.007
EP
TE D
11
Cunha, D.G.F., Sabogal-Paz, L.P., Dodds, W.K., 2016. Land use influence on raw
18
surface water quality and treatment costs for drinking supply in São Paulo State
19
AC C
17
(Brazil). Ecological Engineering 94, 516–524. doi:10.1016/j.ecoleng.2016.06.063
20
Da Ros, G., 1995. La contaminación de aguas en Ecuador: una aproximación
21
económica. Instituto de Investigaciones Económicas, Pontificia Universidad
22
Católica del Ecuador.
23
Daniel, M.H., Montebelo, A.A., Bernardes, M.C., Ometto, J.P., De Camargo, P.B.,
24
Krusche, A. V., Martinelli, L.A., Ballester, R.L., Martinelli, L.A., 2002. Effects of
25
urban sewage on dissolved oxygen, dissolved inorganic and organic carbon, and 28
ACCEPTED MANUSCRIPT 1
electrical conductivity of small streams along a gradient of urbanization in the
2
Piracicaba river basin. Water, Air, and Soil Pollution 136, 189–206.
3 4
Davies, D.L., Bouldin, D.W., 1979. A Cluster Separation Measure. IEEE Transactions o Pattern Analysis and Machine Intelligence PAM1-1, 224–227. Dudoit, S., Fridlyand, J., 2002. A prediction-based resampling method for estimating
6
the number of clusters in a dataset. Genome Biology 3, 1–21. doi:10.1186/gb-
7
2002-3-7-research0036
RI PT
5
Fix, E., Hodges, J.L., 1951. Discriminatory Analysis - Nonparametric discrimination
9
consistency properties. Usaf school of aviation medicine Randolph Field, Texas
11 12
1–21.
M AN U
10
SC
8
Frank, I.E., Todeschini, R., 1994. The Data Analysis Handbook. B.V., Elsevier Science, pp. 1-364. doi:10.1016/S0922-3487(08)70048-0
Gholizadeh, M.H., Melesse, A.M., Reddi, L., 2016. Discriminant analysis application in
14
spatiotemporal evaluation of water quality in South Florida. Journal of
15
Hydroinformatics 18, 1019–1032. doi:10.2166/hydro.2016.023
TE D
13
Güler, C., Thyne, G.D., McCray, J.E., Turner, A.K., 2002. Evaluation of graphical and
17
multivariate statistical methods for classification of water chemistry data.
18
Hydrogeology Journal 10, 455–474. doi:10.1007/s10040-002-0196-6
AC C
EP
16
19
Hajigholizadeh, M., Melesse, A.M., 2017. Assortment and spatiotemporal analysis of
20
surface water quality using cluster and discriminant analyses. Catena 151, 247–
21 22 23
258. doi:10.1016/j.catena.2016.12.018
Hand, D.J., 2012. Assessing the Performance of Classification Methods. International Statistical Review 80, 400–414. doi:10.1111/j.1751-5823.2012.00183.x
24
Hanselman, D., Littlefield, B., 2012. Mastering MATLAB®. Pearson Education Limited.
25
Harper, D., Zalewski, M., Pacini, N., 2008. Ecohydrology: processes, models and case 29
ACCEPTED MANUSCRIPT 1
studies - An approach to the sustainable management of water resources.
2
International, CAB. doi:10.1111/j.1365-2427.2010.02451.x
3 4
Hartigan, A., Wong, M.A., 1979. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society 28, 100–108. doi:10.2307/2346830 Helsel, D.R., Hirsch, R.M., 2002. Statistical methods in water resources, in: Techniques
6
of Water-Resources Investigations of the United States Geological Survey. U.S.
7
Geological Survey, pp. 1–510.
RI PT
5
Hirsch, R.M., Alexander, R.B., 1991. Selection of Methods for Detection and Estimation
9
of Trends in Water Quality Data. Water Resources Research 27, 803–813.
11
doi:10.1029/91WR00259
M AN U
10
SC
8
Hirsch, R.M., Slack, J.R., Smith, R.A., 1982. Techniques of Trend Analysis for Monthly
12
Water-Quality
13
doi:10.1029/WR018i001p00107
17 18 19 20
Research
18,
107–121.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. The MIT Press.
TE D
16
Resources
doi:10.1137/1018105
Huberty, C.J., Olejnik, S., 2006. Applied MANOVA and Discriminant Analysis, Second
EP
15
Water
Edi. ed. John Wiley & Sons, Inc. Juahir, H., Zain, S.M., Aris, A.Z., Yusoff, M.K., Mokhtar, M. Bin, 2010. Spatial
AC C
14
Data.
assessment of Langat River water quality using chemometrics. Journal of
environmental monitoring 12, 287–295. doi:10.1039/b907306j
21
Kannel, P.R., Lee, S., Kanel, S.R., Khan, S.P., 2007. Chemometric application in
22
classification and assessment of monitoring locations of an urban river system.
23
Analytica Chimica Acta 582, 390–399. doi:10.1016/j.aca.2006.09.006
24 25
Kaufman, L., Rousseeuw, P.J., 1990. Finding groups in data - An intorduction to cluster analysis. John Wiley & Sons, Inc. 30
ACCEPTED MANUSCRIPT 1
Kovács, J., Kovács, S., Magyar, N., Tanos, P., Hatvani, I.G., Anda, A., 2014.
2
Classification into homogeneous groups using combined cluster and discriminant
3
analysis.
4
doi:10.1016/j.envsoft.2014.01.010
Environmental
Modelling
and
Software
57,
52–59.
Kryszczuk, K., Hurley, P., 2010. Estimation of the number of clusters using multiple
6
clustering validity indices. International Workshop on Multiple Classifier Systems
7
114–123. doi:10.1007/978-3-642-12127-2_12
RI PT
5
Lavine, B.K., Rayens, W.S., 2009. Classification: Basic Concepts, in: Brown, S.D.,
9
Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics. Elsevier B.V., pp.
11
507–515.
M AN U
10
SC
8
Liou, S.M., Lo, S.L., Wang, S.H., 2004. A generalized water quality index for Taiwan.
12
Environmental
Monitoring
and
13
doi:10.1023/B:EMAS.0000031715.83752.a1
Assessment
96,
35–52.
Liu, S., Butler, D., Brazier, R., Heathwaite, L., Khu, S.T., 2007. Using genetic
15
algorithms to calibrate a water quality model. Science of the Total Environment
16
374, 260–272. doi:10.1016/j.scitotenv.2006.12.042
TE D
14
Lowrance, R., Altier, L.S., Newbold, J.D., Schnabel, R.R., Groffman, P.M., Denver,
18
J.M., Correll, D.L., Gilliam, J.W., Robinson, J.L., Brinsfield, R.B., Staver, K.W.,
19
Lucas, W., Todd, A.H., 1997. Water quality functions of riparian forest buffers in
21
AC C
20
EP
17
Chesapeake bay
watersheds.
Environmental
Management
21,
687–712.
doi:10.1007/s002679900060
22
MacQueen, J., 1967. Some methods for classification and analysis of multivariate
23
observations. Proceedings of the Fifth Berkeley Symposium on Mathematical
24
Statistics and Probability 1, 281–297. doi:citeulike-article-id:6083430
25
31
ACCEPTED MANUSCRIPT 1
Mosquera, P.V., Hampel, H., Vázquez, R.F., Alonso, M., Catalan, J., 2017. Abundance
2
and morphometry changes across the high mountain lake-size gradient in the
3
tropical Andes of Southern Ecuador. Water Resources Research 53(8), 7269-
4
7280. doi:10.1002/2017WR020902
6
Mulligan, A., Brown, L., 1998. Genetic Algorithms for Calibrating Water Quality Models. Journal of Environmental Engineering 124, 202–211.
RI PT
5
Ng, A.W.M., Perera, B.J.C., 2003. Selection of genetic algorithm operators for river
8
water quality model calibration. Engineering Applications of Artificial Intelligence
9
16, 529–541. doi:10.1016/j.engappai.2003.09.001
SC
7
Olajire, A.A., Imeokparia, F.E., 2001. Water quality assessment of Osun river: Studies
11
on inorganic nutrients. Environmental Monitoring and Assessment 69, 17–28.
12
doi:10.1023/A:1010796410829
13
M AN U
10
Ouyang, Y., 2005. Evaluation of river water quality monitoring stations by principal component
analysis.
Water
15
doi:10.1016/j.watres.2005.04.024
Research
39,
2621–2635.
TE D
14
Pauta-Calle, G., Chang-Gómez, J., 2014. Indices de calidad del agua de fuentes
17
superficiales y aspectos toxicológicos, evaluación del Río Burgay. MASKANA,
18
I+D+ingeniería 165–176.
AC C
EP
16
19
Ray, S., Turi, R.H., 1999. Determination of number of clusters in k-means clustering
20
and application in colour image segmentation. Proceedings of the 4th international
21
conference on advances in pattern recognition and digital techniques 137–143.
22
Raymer, M.L., Punch, W.F., Goodman, E.D., Kuhn, L.A., Jain, A.K., 2000.
23
Dimensionality reduction using genetic algorithms. IEEE Transactions on
24
Evolutionary Computation 4, 164–171. doi:10.1109/4235.850656
25
Sanderson, E.W., Jaiteh, M., Levy, M. a., Redford, K.H., Wannebo, A. V., Woolmer, G.,
32
ACCEPTED MANUSCRIPT 1
2002. The Human Footprint and the Last of the Wild. BioScience 52, 891–904.
2
doi:10.1641/0006-3568(2002)052[0891:THAT]2.0.CO;2 SENAGUA, 2016. Plan de monitoreo de calidad del agua de los sistemas de agua de
4
abastecimiento público de Portoviejo, Manta, Chone, Pedernales, Jama, Bahia de
5
Caráquez, San Vicente, Canoa, Calceta, Junín, Tosagua, Flavio Alfaro y Muisne.
6 7
RI PT
3
Shapiro, S.S., Wilk, M.B., 1965. An Analysis of Variance Test for Normality (Complete Samples). Biometrika Trust 52, 591–611. doi:10.1093/biomet/52.3-4.591
Shrestha, S., Kazama, F., 2007. Assessment of surface water quality using multivariate
9
statistical techniques: A case study of the Fuji river basin, Japan. Environmental Modelling & Software 22, 464–475. doi:10.1016/j.envsoft.2006.02.001
M AN U
10
SC
8
11
Simeonov, V., Stratis, J.A., Samara, C., Zachariadis, G., Voutsa, D., Anthemidis, A.,
12
Sofoniou, M., Kouimtzis, T., 2003. Assessment of the surface water quality in
13
Northern
14
1354(03)00398-1
Water
Research
TE D
Greece.
37,
4119–4124.
doi:10.1016/S0043-
15
Singh, K.P., Malik, A., Mohan, D., Sinha, S., 2004. Multivariate statistical techniques for
16
the evaluation of spatial and temporal variations in water quality of Gomti River
17
(India)
18
doi:10.1016/j.watres.2004.06.011
20
case
EP
a
study.
Water
Research
38,
3980–3992.
AC C
19
-
Sivanandam, S.N., Deepa, S.N., 2008. Introduction to Genetic Algorithm. SpringerVerlag Berlin Heidelberg. doi:10.1007/978-3-540-73190-0
21
Steinley, D., 2004. Standardizing variables in K-means clustering, in: Banks, D.,
22
House, L., McMorris, F.R., Arabie, P., Gaul, W.A. (Eds.), Classification, Clustering,
23
and Data Mining Applications. Springer-Verlag Berlin, pp. 53–60.
24 25 33
ACCEPTED MANUSCRIPT 1
Suguna, N., Thanushkodi, K., 2010. An Improved k-Nearest Neighbor Classification
2
Using Genetic Algorithm. International Journal of Computer Science Issues 7, 18–
3
21. Sweeney, B.W., Newbold, J.D., 2014. Streamside forest buffer width needed to protect
5
stream water quality, habitat, and organisms: A literature review. Journal of the
6
American Water Resources Association 50, 560–584. doi:10.1111/jawr.12203
7
Thalamuthu, A., Mukhopadhyay, I., Zheng, X., Tseng, G., 2006. Evaluation and
8
comparison of gene clustering methods in microarray analysis. Bioinformatics 22,
9
2405–2412. doi:10.1093/bioinformatics/btl406
SC
Todeschini, R., Ballabio, D., Consonni, V., 2015. Distances and Other Dissimilarity in
M AN U
10
RI PT
4
11
Measures
Chemometrics.
12
doi:10.1002/9780470027318.A9438
Encyclopedia
of
Analytical
Chemistry.
Towler, E., Rajagopalan, B., Seidel, C., Summers, R.S., 2009. Simulating ensembles of
14
source water quality using a K-nearest neighbor resampling approach.
15
Environmental Science and Technology 43, 1407–1411. doi:10.1021/es8021182
TE D
13
Vega, M., Pardo, R., Barrado, E., Debán, L., 1998. Assessment of seasonal and
17
polluting effects on the quality of river water by exploratory data analysis. Water
18
Research 32, 3581–3592. doi:10.1016/S0043-1354(98)00138-9
20 21 22
AC C
19
EP
16
Wang, K., Wang, B., Peng, L., 2009. CVAP: Validation for Cluster Analyses. Data Science Journal 8, 88–93. doi:10.2481/dsj.007-020
WHO, 1998. Guidelines for Drinking-Water Quality, in: Health Criteria and Other Supporting Information. World Health Organization, pp. 1–283.
23 24
34
ACCEPTED MANUSCRIPT
HIGHLIGHTS • K-means was useful for classifying the water quality (WQ) in the Paute river basin.
RI PT
• Impacted and less impacted WQ sites were identified in the study basin. • k-NN/GA identified 9 key WQ parameters, out of 21, explaining the WQ classification.
SC
• k-NN/GA suggested a more optimal and, as such, economical WQ sampling strategy.
M AN U
• Predictions of applied data mining methods had congruency with GIS-based
AC C
EP
TE D
analyses.