Water quality assessment with emphasis in parameter optimisation using pattern recognition methods and genetic algorithm


Accepted Manuscript

Gonzalo Sotomayor, Henrietta Hampel, Raúl F. Vázquez

PII: S0043-1354(17)31005-9
DOI: 10.1016/j.watres.2017.12.010
Reference: WR 13408
To appear in: Water Research
Received Date: 8 August 2017
Revised Date: 6 December 2017
Accepted Date: 7 December 2017

Please cite this article as: Sotomayor, G., Hampel, H., Vázquez, R.F., Water quality assessment with emphasis in parameter optimisation using pattern recognition methods and genetic algorithm, Water Research (2018), doi: 10.1016/j.watres.2017.12.010. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

[Graphical abstract. INPUT: study site (Paute basin, Ecuador) and WQ monitoring network; large data matrix X for the study site (period 2008-2013). PROCESS: data mining (pattern recognition, machine learning, genetic algorithm). OUTPUT: WQ classes (C1 - low pollution; C2 - high pollution) and key WQ parameters explaining WQ (i.e. reduction of parameter space dimension).]

Gonzalo Sotomayor 1,a, Henrietta Hampel 2,1, Raúl F. Vázquez 3,1

1 Laboratorio de Ecología Acuática, Departamento de Recursos Hídricos y Ciencias Ambientales, Universidad de Cuenca, Av. 12 de Abril S/N, Cuenca, Ecuador. E-mail: [email protected]

2 Facultad de Ciencias Químicas, Universidad de Cuenca, Av. 12 de Abril S/N, Cuenca, Ecuador. E-mail: [email protected]

3 Facultad de Ingeniería, Universidad de Cuenca, Av. 12 de Abril S/N, Cuenca, Ecuador. E-mail: [email protected]

a Corresponding Author

ABSTRACT

A non-supervised (k-means) and a supervised (k-Nearest Neighbour in combination with genetic algorithm optimisation, k-NN/GA) pattern recognition algorithm were applied to evaluate and interpret a large complex matrix of water quality (WQ) data collected during five years (2008, 2010-2013) in the Paute river basin (southern Ecuador). Twenty-one physical, chemical and microbiological parameters collected at 80 different WQ sampling stations were examined. At first, the k-means algorithm was run to identify classes of sampling stations regarding their associated WQ status by considering three internal validation indexes, i.e., Silhouette coefficient, Davies-Bouldin and Caliński-Harabasz. As a result, two WQ classes were identified, representing low (C1) and high (C2) pollution. The k-NN/GA algorithm was applied to the available data to construct a classification model with the two WQ classes, previously defined by the k-means algorithm, as the dependent variable and the 21 physical, chemical and microbiological parameters as the independent ones. This algorithm led to a significant reduction of the multidimensional space of independent variables to only nine, which are likely to explain most of the structure of the two identified WQ classes. These parameters are electric conductivity, faecal coliforms, dissolved oxygen, chlorides, total hardness, nitrate, total alkalinity, biochemical oxygen demand and turbidity. Further, the land cover of the study basin showed a very good agreement with the WQ spatial distribution suggested by the k-means algorithm, supporting the credibility of the main results of the WQ data mining approach.

Key Words: Water quality, Pattern recognition, Genetic algorithm, Land cover.

1. INTRODUCTION

Present-day changes to fluvial systems include a great variety of direct and indirect anthropogenic activities, which in many cases result in a drastic deterioration of the water quality (Barnett et al., 2008; Harper et al., 2008). A variety of contaminants, in addition to a multitude of imprudent water management practices and destructive land uses, are currently threatening aquatic systems on a world-wide scale (Sanderson et al., 2002). Moreover, water of good quality is a crucial component for sustainable socio-economic development in any region of the world (Bartram and Balance, 1996). In response to this, and owing to spatial-temporal variations in water chemistry, a monitoring program that provides a representative and reliable estimation of the water quality (WQ) is necessary (Simeonov et al., 2003). The large number of samples that need to be considered in appropriate WQ assessments and the number of constituents that must be considered per sample give rise to data sets of enormous size and complexity.

Furthermore, the relationships investigated in these data sets usually cannot be expressed in quantitative terms and they are better expressed in terms of similarity or dissimilarity among groups of multivariate data (Lavine and Rayens, 2009). Herein, the application of different supervised and non-supervised pattern recognition techniques has the potential of facilitating the interpretation of complex data matrices and, as such, of understanding the WQ status of the studied systems without losing important information; moreover, they constitute valuable tools for reliable management of water resources (Shrestha and Kazama, 2007; Juahir et al., 2010; Bücker et al., 2010).

In Ecuador, governmental efforts have been carried out in the past to establish various WQ monitoring programs in some particular locations of the country and, more recently, at a national level (SENAGUA, 2016). However, the resulting observed data are not publicly available, and as such the data can hardly be used for research or management purposes. Nevertheless, the Ecuadorian National Secretary of Water (SENAGUA) - Santiago River Hydrographic Demarcation (DHS) generated, and made available to the current study, an extensive database derived from 80 monitoring stations located in the Paute river basin, which is one of the most important hydrographic systems of southern Ecuador.

As such, in the present study, a large and complex database (comprising 14427 observations), collected throughout a 5-year monitoring program (year 2008 and period 2010-2013), was used in the context of supervised and non-supervised pattern recognition techniques that were applied with the aim of handling the complexity of the database and mining relevant information about: (i) the WQ similarities or dissimilarities between the 80 monitoring stations based on the measured river WQ parameters; (ii) the most significant WQ related parameters that explain the WQ classes identified through the previous objective; and (iii) the correspondence among the identified WQ classes, the most significant WQ related parameters and the land cover of the study river basin.

2. STUDY AREA AND METHODS

2.1. Study area and water quality (WQ) monitoring sites

The study area, the Paute river basin, is located in the south of Ecuador (Fig. 1). It has a surface of about 6442 km² and its main reach length is approximately 120.4 km. This is one of the most important hydrographic systems of Ecuador due to the exploitation of part of its enormous hydroelectric potential, which currently covers about 40 % of the energy demand of the country (CONELEC, 2011). The annual average discharge of the river is 136.3 m³ s⁻¹ (in its lower part); its temporal variation is strongly influenced by the presence of some hydroelectric generation subsystems such as Mazar, Amaluza and Sopladora, which make up the Integrated Paute (generation) System (CONELEC, 2009). The Paute river provides water resources to rural, agricultural, urban and industrial areas of an important part of the southern Andes of Ecuador (Da Ros, 1995) and discharges into the Upano river, which belongs to the Amazon river system. Two important cities are located inside the river basin, Cuenca and Azogues, with approximately 500000 and 33850 inhabitants, respectively. The main pollutant loads, from both point and non-point sources, include domestic wastewaters, agricultural runoff, animal husbandry and industrial effluents (Da Ros, 1995).

The geological features of the basin are very complex, primarily due to the subduction of the Nazca plate beneath the South American plate. As a result, the basin is still rising and its slopes are steepening, so that large amounts of suspended sediments are transported by the river to the Amazon basin (Astudillo et al., 2010). The elevation ranges approximately between 500 and 4250 m above sea level (a.s.l.), with the majority (61.3 %) of the basin in the elevation range 2550-3575 m, 4.3 % in the range 500-1525 m, 13.7 % in the elevation band 1525-2550 m, and 20.7 % of the basin situated above 3575 m. On average, slopes vary between 25 % and 50 %; in the upper part of the basin (west) a mountainous relief is dominant whilst a gentler relief is representative of the central and lower parts (east).

Multi-year average temperature varies between 4.4 °C and 18.6 °C. The lower temperatures correspond to the western Andes range with an average of 6 °C (Páramo), while the warmest areas are situated in the valleys and subtropical zones (Amazonia regions), with an average temperature fluctuating between 22 °C and 26 °C. Due to the wide elevation range, rainfall oscillates in intensity and duration, with maximum annual averages between 2500 mm and 3000 mm at higher elevations and minimum annual averages between 600 mm and 800 mm in the valleys.

The sampling design was planned to cover a wide range of parameters (21 in total) at key monitoring sites (80 in total; Fig. 1), aiming at representing as accurately as possible the WQ distribution in the study basin (SENAGUA, 2016).

2.2. Sampled WQ parameters

The studied 21 WQ parameters include: aluminium (Al), ammonia (N-NH4), 5-day biochemical oxygen demand (BOD5), chloride (Cl), dissolved oxygen (DO), electric conductivity (EC), faecal coliforms (FC), fluoride (F), iron (Fe), nickel (Ni), nitrate (N-NO3), pH, phosphates (P-PO4), potassium (K), sodium (Na), total alkalinity (TALK), total hardness (TH), total phosphorus (P-tot), total solids (TS), turbidity (TU) and water temperature (WT). On average, the monitoring stations were visited five times per year, except in 2008, when they were only sampled three times. Some stations were sampled more often, because they were located either in highly polluted sites or, on the contrary, in unaltered environmental (i.e., reference) locations. As a result of the applied monitoring protocol, a large and complex WQ database was established for the n_sp = 80 monitoring stations, based on n_rep = 687 sampling replicates and the surveying of the n_v = 21 WQ parameters for every replicate and monitoring station, resulting in a total of n_obs = n_rep × n_v = 14427 observations, which are represented by x_{i,j}, with i = 1, 2, …, n_v and j = 1, 2, …, n_rep (Fig. 2). Unfortunately, despite the fact that this is a very interesting and rich data set, there was never a scientifically driven monitoring plan to collect it; rather, its structure responds to political/personal initiatives that varied throughout time. Table 1 lists the basic statistics for each of the studied WQ parameters.
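As a quick check of the database dimensions given in Section 2.2, the worked arithmetic below uses only the figures stated in the text (687 replicates, 21 parameters, 80 stations). The mean number of replicates per station is a derived value shown for orientation; it is not a figure reported by the authors.

```latex
\begin{align*}
n_{obs} &= n_{rep} \times n_v = 687 \times 21 = 14427 \\
\frac{n_{rep}}{n_{sp}} &= \frac{687}{80} \approx 8.6 \ \text{replicates per station on average}
\end{align*}
```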

Figure 1. Location of the Paute river basin in the context of continental Ecuador and associated distribution of the 80 water quality monitoring stations. UTM coordinate system (WGS84).

Figure 2. Flowchart of the modelling protocol that was implemented in the current study and that combines a non-supervised (k-means) and a supervised (k-NN/GA) pattern recognition method.

2.3. Processing land coverage data for assessing the spatial congruency of WQ classification

The correspondence of the spatial distribution of the WQ classes was assessed by comparing it with the land cover distribution of the Paute river basin, also using auxiliary topographical information. Land cover raster data were available for the year 2001, covering the whole extent of the study catchment. This is the most recent dataset that is publicly available. Geographical Information Systems (GIS) algorithms were applied to the original land cover data so that it was reclassified to a more convenient form, which enabled a direct association between land cover and the spatial distribution of the WQ classes.

Table 1. Main statistics associated with the water quality (WQ) parameters that were monitored throughout year 2008 and period 2010-2013 in the Paute river basin (Fig. 1).

Mean

Median

STD

Range

0.62

0.06

1.50

0.00 ─ 23.00

0.06

0.00

0.22

0.00 ─ 1.59

0.50

0.00

7.2

2.3

4.26

0.82

-

TALK (mg L ¹) -

Al (mg L ¹) -

N-NH₄ (mg L ¹)


WQ Parameter


0.00 ─ 11.35

11.9

0.0 ─ 141.0

16.63

0.00 ─ 263.06

6.79

0.80

3.30 ─ 9.75

76.4

178.0

0.16 ─ 1810.00

1700.0

6668.5

0.0 ─ 16000.0

0.40

4.51

0.0 ─ 54.6

36.8

55.8

0.0 ─ 657.0

0.63

0.00

2.28

0.00 ─ 31.00

0.01

0.00

0.09

0.00 ─ 1.51

0.57

0.06

2.16

0.00 ─ 23.47

7.53

7.58

0.68

4.17 ─ 9.43

1.09

0.41

1.63

0.00 ─ 7.72

0.58

0.19

0.92

0.00 ─ 4.76

K (mg L ¹)

2.64

0.47

13.06

0.00 ─ 227.06

-

6.57

3.48

12.38

0.00 ─ 112.89

-

12.47

0.01

76.36

0.00 ─ 1160.00

TU (NTU)

23.3

2.9

80.9

0.0 ─ 1190.5

WT (⁰C)

15.2

14.6

3.5

8.1 ─ 23.8

-

BOD₅ (mg L ¹) -

CL (mg L ¹) -

DO (mg L ¹)

6.77

-1

EC (µS cm )

130.3 -1

-1

FC (bacteria 100 ml )

5675.2

-

1.03

-

47.8

FL (mg L ¹)

-

Ni (mg L ¹) -

N-NO₃ (mg L ¹) pH -

P-PO₄ (mg L ¹) -


P-tot (mg L ¹)


-

Fe (mg L ¹)


TH (mg L ¹)

-

Na (mg L ¹) TS (mg L ¹)


1.25

Legend: STD = standard deviation; TALK = total alkalinity; Al = aluminium; N-NH4 = ammonia; BOD5 = 5-day biochemical oxygen demand; Cl = chlorides; DO = dissolved oxygen; EC = electric conductivity; FC = faecal coliforms; F = fluorides; Fe = iron; Ni = nickel; N-NO3 = nitrate; P-PO4 = phosphates; K = potassium; Na = sodium; TH = total hardness; P-tot = total phosphorus; TS = total solids; TU = turbidity; WT = water temperature.

The considered land cover classes were: (i) altered vegetation; (ii) woody native vegetation; (iii) without cover/urbanised; and (iv) Páramo (unaltered). Additionally, a digital elevation (raster) model of the whole catchment was available with a resolution of 50 m × 50 m. The congruency assessment was based on the visual inspection of the geographical distribution of the WQ classes, the land cover and the topography. ArcGIS® was used for all GIS analyses, including this visual inspection.

2.4. Data processing and multivariate statistical assessment

Figure 2 depicts the multivariate statistical protocol applied in this study. All the statistical analyses involved in it were implemented with MATLAB® (Hanselman and Littlefield, 2012) version 2014. As a first step, the Shapiro-Wilk test (Shapiro and Wilk, 1965) was used to evaluate whether the distributions of the WQ parameters were normal. At the significance level α = 0.05, the test rejected normality for all of the WQ parameters (p-value = 0.0001 < α); in all cases the distributions exhibited a positive skew (Table 1).

Therefore, it was decided to continue with the statistical assessment by using a central tendency value (i.e. aggregated value) of the distribution of every WQ parameter, for a given monitoring station, instead of using the whole set of available observations. Further, this aggregated parameter was standardised through range scaling so that a more normal associated distribution could be obtained; this kind of standardisation procedure is recommended in a cluster analysis, particularly for the k-means method (Steinley, 2004).

Herein, the applied aggregation procedure reduced the dimension of the analysis to n_med = n_v × n_sp = 1680 median observations (x_{i,m}, with i = 1, 2, …, n_v and m = 1, 2, …, n_sp). Given the significant skewness of the WQ parameter distributions, the use of median rather than of mean values was preferred since the median is expected to be a better central tendency measure (Anderson and Finn, 1996; Helsel and Hirsch, 2002). Standardisation of x_{i,m} was achieved by means of:

$$z_{i,m} = \frac{x_{i,m} - L_m}{U_m - L_m} \quad \text{for } i = 1, 2, \ldots, n_v \text{ and } m = 1, 2, \ldots, n_{sp} \qquad (1)$$

where L_m and U_m are the minimum and maximum limits of the parameter range, so that z_{i,m} varies between 0 and 1 (Frank and Todeschini, 1994).
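The aggregation and range scaling described above can be illustrated with a minimal pure-Python sketch. It is not the authors' MATLAB implementation: the function names, the toy stations and their values are hypothetical, and the limits of Eq. 1 are assumed here to be the observed minimum and maximum of each parameter over the station medians.

```python
from statistics import median

def station_medians(observations):
    """Aggregate replicates per station: one median per WQ parameter.

    observations: dict mapping station id -> list of replicate rows,
    each row being a list with one value per parameter.
    """
    return {
        station: [median(column) for column in zip(*rows)]
        for station, rows in observations.items()
    }

def range_scale(medians):
    """Range scaling as in Eq. 1, applied to the station medians.

    Assumption: the limits are the observed min and max of each
    parameter across stations, so scaled values fall in [0, 1].
    """
    stations = list(medians)
    columns = list(zip(*(medians[s] for s in stations)))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        s: [
            (x - lo) / (hi - lo) if hi > lo else 0.0  # guard constant columns
            for x, lo, hi in zip(medians[s], lows, highs)
        ]
        for s in stations
    }

# Hypothetical toy data: three stations, two parameters (say pH and EC).
toy = {
    "S1": [[7.0, 50.0], [7.2, 60.0], [7.4, 400.0]],  # one EC outlier
    "S2": [[6.5, 120.0], [6.7, 130.0]],
    "S3": [[8.0, 300.0], [8.2, 310.0], [8.1, 305.0]],
}
med = station_medians(toy)
z = range_scale(med)
```

In this toy case the EC outlier at station S1 (400.0) does not affect the station median (60.0), which is the robustness argument made in the text for preferring the median over the mean.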

Then, the aggregated dataset of n_med = 1680 z_{i,m} values was used as input to the k-means algorithm (MacQueen, 1967), a non-hierarchical cluster analysis (CA) and non-supervised pattern recognition method, which was carried out with the intention of defining k groups of monitoring stations with common WQ characteristics. As a prior process, validation indices (Wang et al., 2009) were calculated to determine the number of WQ clusters, k, for the application of the k-means method.

Once the k WQ groups of monitoring stations were defined through the k-means process (Fig. 2), for a given monitoring station, the respective identifier (or “label”) of the corresponding WQ (k-means) group (or class) was assigned to the original observations of the station. After following this procedure for all of the sampling stations, on the resulting new data matrix of n_obs observations (x_{i,j}), the k-Nearest Neighbour (k-NN), a supervised pattern recognition method, in combination with a genetic algorithm (GA) for optimisation, was applied.

It is worth noting that the WQ related parameters frequently are of different nature or have different units. Many pattern recognition techniques, like the ones used in this study, are very sensitive to these issues (Todeschini et al., 2015). Therefore, range scaling (Eq. 1) was used to eliminate this dependence, regardless of whether the applied recognition techniques were parametric or non-parametric.

2.4.1. The non-supervised k-means agglomerative method

The goal of CA is to determine the intrinsic grouping in a set of unlabelled (i.e. unclassified) data, upon the similar characteristics that their members possess (Frank and Todeschini, 1994). In this context non-hierarchical clustering techniques, such as k-means, have been widely used in different applications, among them the investigation of hydro-chemical processes in stream water quality (Güler et al., 2002; Caccia and Boyer, 2005). Cluster membership is determined by calculating the centroid for each group and assigning each object to the group with the closest centroid, which minimises the overall within-cluster dispersion by iterative reallocation of cluster members (Hartigan and Wong, 1979). In this study, this method was applied using the Euclidean distance (expressed in the units of measure of v) as the measure of similarity between objects (Chen et al., 2002):

$$D_{i,j} = \sqrt{\sum_{v=1}^{n_v} \left(z_{i,v} - z_{j,v}\right)^2} \qquad (2)$$

where z_{i,v} and z_{j,v} are the values of the normalised parameter v for objects i and j, respectively, where i is different from j and both are < n_sp.
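The centroid-based reallocation and the Euclidean distance of Eq. 2 can be sketched as follows. This is an illustrative Lloyd-style k-means in pure Python, not the MATLAB routine used in the study; the farthest-first initialisation, the function names and the toy points are assumptions introduced only to keep the example deterministic.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two range-scaled objects (Eq. 2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iter=100):
    """Assign each object to its closest centroid and iterate.

    Assumption: centroids start from a deterministic farthest-first
    choice; the source does not state how centroids were initialised.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: closest centroid for every object.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # memberships stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical scaled stations forming two well separated groups.
stations = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
            [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
labels, centroids = k_means(stations, k=2)
```

With these toy points the first three stations end up in one cluster and the last three in the other, mirroring the two-class (C1/C2) outcome reported in the abstract.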

Further, the evaluation of clustering solutions is fundamental in CA. Validity indices, both external and internal, are used for this purpose (Wang et al., 2009). An external index measures the agreement between an a priori known clustering structure and the result from a current clustering procedure (Dudoit and Fridlyand, 2002), whilst an internal index measures the appropriateness (or goodness) of a clustering partition without external information, using quantities and features inherent in the data (Thalamuthu et al., 2006). Internal validity indices were applied in the current analysis, namely, the Silhouette Coefficient (SC), the Davies-Bouldin (DB) index and the Caliński-Harabasz (CH) index. These were adopted herein since they are widely used for k estimation and clustering quality evaluation (Wang et al., 2009). The Cluster Validity Analysis Platform (CVAP) was used to compute them (Wang et al., 2009).

SC (Kaufman and Rousseeuw, 1990) is a dimensionless measure that evaluates the quality of compactness and separation of clusters; with an upper bound equal to 1, the optimum k value corresponds to its largest average (Chen et al., 2002). DB (Davies and Bouldin, 1979) is a function of the ratio of the sum of within-cluster scatter to between-cluster separation; as such, the objective is to minimise it, which implies minimising the within-cluster scatter and maximising the between-cluster separation (Ray and Turi, 1999). CH (Caliński and Harabasz, 1974), or variance ratio criterion, is a measure of the between-cluster isolation and the within-cluster coherence; the objective is to maximise it (Kryszczuk and Hurley, 2010).
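To make the role of an internal validity index concrete, the sketch below computes the average Silhouette Coefficient for a given partition. It is a plain-Python illustration under stated assumptions, not the CVAP code used in the study: the function names and toy points are hypothetical, singleton clusters are given a score of 0 by convention, and the DB and CH indices are not shown.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Average silhouette over all objects; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the zero distance of p to itself is excluded by len(own) - 1).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical scaled stations: two compact, well separated groups.
stations = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
            [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
good = mean_silhouette(stations, [0, 0, 0, 1, 1, 1])
poor = mean_silhouette(stations, [0, 1, 0, 1, 0, 1])
```

A partition that respects the two groups scores close to the upper bound of 1, while a partition that mixes them scores much lower. Selecting k by the largest average SC, as described in the text, amounts to repeating this computation for each candidate k.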

Each of the n_sp = 80 monitoring stations was assigned to one of the identified k WQ-clusters, as the result of the application of the k-means methodology. For a given WQ monitoring station, the identifier (ID) of the associated WQ-cluster was assigned to all the parameters and temporal replicates sampled at this station (Fig. 2). This was done for all 80 monitoring stations and 687 temporal replicates. In this way, the whole matrix of original data was labelled with the k WQ-clusters and was ready for the k-NN/GA supervised pattern recognition analysis (Fig. 2).

2.4.2. The supervised k-Nearest Neighbour (k-NN) classification algorithm in combination with a genetic algorithm (GA)

Supervised pattern recognition methods identify models (or classification rules), which can carry out a cluster partitioning using local information provided by descriptive parameters (Lavine and Rayens, 2009). In the current study, the k-NN method (Fix and Hodges, 1951) was used in combination with a GA to: (i) verify the allocation of monitoring stations (or objects or samples) into the classes determined previously by the k-means method, according to the descriptive parameters of the stations (pattern recognition); and (ii) identify the descriptive parameters that explain most of the internal structure of the WQ classes, i.e. the parameters that should be monitored in future, reducing in this way the complexity of the original database since the other monitored parameters would not really be necessary for this explanation.

The k-NN is a non-parametric algorithm through which a sample point is classified according to the majority vote of its k nearest neighbouring points (Frank and Todeschini, 1994). Thus, for a given sample point and for defining its nearest neighbours (NN), distances (Euclidean, herein) are evaluated from the sample point to every other point in the data set. The sample point is assigned the class label (defined previously in this study through the k-means) held by most of its NN; when the newly assigned class matches the previously assigned class of the sample point (defined by the k-means method), the k-NN algorithm is considered successful (Lavine and Rayens, 2009). Thus, the k-NN classification has two steps, namely, (i) the definition of the neighbouring points (NN-points) of the sample point; and (ii) the subsequent determination of the class of the sample point, using the information provided by those NN-points.
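The two k-NN steps described above (finding the nearest neighbours, then taking a majority vote) can be sketched in a few lines. This is an illustrative pure-Python version with hypothetical names and toy data; the value k = 3 is an assumption, since in the study the number of neighbours is obtained through the GA.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step (i): rank training points by distance to the query point.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    # Step (ii): majority vote over the class labels of the k nearest points.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical scaled observations labelled with k-means classes.
train_x = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
train_y = ["C1", "C1", "C1", "C2", "C2", "C2"]

low = knn_predict(train_x, train_y, [0.2, 0.2])   # nearest to the C1 group
high = knn_predict(train_x, train_y, [0.8, 0.8])  # nearest to the C2 group
```

The classification is counted as successful when the voted class equals the k-means label already attached to the sample point, which is the comparison used to score the method.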

The k-NN algorithm performance was measured through the non-error rate (NER), which is defined as the percentage of monitoring stations correctly assigned by the k-NN algorithm to the classes determined previously (Fig. 2) by the k-means method (classification accuracy criteria; Hand, 2012). In this study, NER was estimated using a testing data set (i.e., 40% of total data). Sample points were classified according to the WQ parameters (physicochemical and microbiological) that are known to be indirectly related to WQ, usually through some undetermined mathematical relationships. A training data set (i.e., 60% of the total data) for which the property of interest (WQ) and the indirectly related descriptors (in the current case the physicochemical and microbiological parameters measured at every WQ monitoring station) are known was used for developing a classification rule that was in turn utilised to predict the WQ of data samples that were not part of the training set (Lavine and Rayens, 2009).
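The NER evaluation on a 60%/40% training/testing split can be sketched as below. The split proportions and the NER definition (percentage of correctly assigned objects) follow the text; everything else is an assumption made for illustration, including the random shuffling with a fixed seed, k = 3, the function names and the synthetic data.

```python
import math
import random
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def split_60_40(xs, ys, seed=0):
    """Shuffle indices and keep 60% for training, 40% for testing."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = round(0.6 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])

def non_error_rate(train_x, train_y, test_x, test_y, k=3):
    """NER: percentage of test objects assigned to their reference class."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y
        for x, y in zip(test_x, test_y)
    )
    return 100.0 * hits / len(test_y)

# Hypothetical data: 10 objects per class, clearly separated.
xs = ([[0.01 * i, 0.01 * i] for i in range(10)]
      + [[1 - 0.01 * i, 1 - 0.01 * i] for i in range(10)])
ys = ["C1"] * 10 + ["C2"] * 10

train_x, train_y, test_x, test_y = split_60_40(xs, ys)
ner = non_error_rate(train_x, train_y, test_x, test_y)
```

Because the two synthetic classes are far apart, every test object is recovered and the NER reaches 100%; on real WQ data the value depends on the parameters retained and on k, which is what the GA is used to optimise.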

In general, the larger the number of measured parameters, the more easily a classification algorithm is confounded. In addition, for the particular case of the k-NN algorithm, there are three main limitations, namely (Frank and Todeschini, 1994; Suguna and Thanushkodi, 2010): (i) calculation complexity due to the fact that all of the similarities between the training sample points must be calculated for classification; (ii) the performance of the classification algorithm is over-dependent on the training set; and (iii) there is no weight difference between the training sample points, despite the implicit differences in terms of available data for every one of those points.

Thus, the combination of the k-NN algorithm with a GA was applied to overcome these limitations by: (i) generating a profile of the internal data structure of every class; and (ii) obtaining the contribution of individual related parameters (measured at every WQ monitoring station) to the classification of the main property of interest (in the current case, WQ). In this context, GAs have been shown to be robust search techniques that in most cases outperform traditional optimisation methods in water resources applications (Mulligan and Brown, 1998; Ng and Perera, 2003; Liu et al., 2007), through the emulation of evolution by natural selection and genetic inheritance, so that a population of competing solutions evolves over time to converge to a single optimal one (Holland, 1975).

Thus, every model parameter (every WQ related parameter, in this study) is a gene, while a complete set of genes (21 parameters) is a chromosome. Every gene adopts the value of the respective WQ related parameter encoded as a variable-length binary number. The length of this binary number depends on the magnitude of the WQ parameter value. For example, if a gene represents pH (4.17 < pH < 9.43) the length of the respective binary number will be shorter than the corresponding length when a given gene represents, say, faecal coliforms (varying between 0 and 16000), owing to the magnitude difference of the values adopted by pH and faecal coliforms. The advantage of using this binary system is that very small value variations of a given WQ parameter can be taken into account in the analysis.
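The dependence of gene length on the magnitude of a parameter can be illustrated with a fixed-point binary encoding. This is only a sketch: the source does not specify the encoding scheme, so the resolution steps (0.01 for pH, 1 for faecal coliforms) and the helper names are assumptions; the ranges are those quoted in the text.

```python
def bits_needed(low, high, step):
    """Number of bits required to index every value in [low, high] at `step`."""
    levels = round((high - low) / step) + 1
    return max(1, (levels - 1).bit_length())

def encode(value, low, step, n_bits):
    """Encode a parameter value as a zero-padded binary string."""
    return format(round((value - low) / step), f"0{n_bits}b")

def decode(bits, low, step):
    """Recover the parameter value from its binary string."""
    return low + int(bits, 2) * step

# Ranges taken from the text; resolutions are assumed for illustration.
ph_bits = bits_needed(4.17, 9.43, 0.01)   # pH at 0.01 resolution
fc_bits = bits_needed(0, 16000, 1)        # faecal coliforms at unit resolution

gene = encode(7.53, 4.17, 0.01, ph_bits)  # mean pH reported in Table 1
value = decode(gene, 4.17, 0.01)
```

Under these assumptions the pH gene needs 10 bits and the faecal coliform gene 14 bits, consistent with the statement that parameters with larger magnitudes require longer binary numbers. Flipping the last bit of the pH gene changes the decoded value by only 0.01, which illustrates how small variations are captured.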

The optimisation process is carried out by using three biological operators (Sivanandam and Deepa, 2008; Liu et al., 2007), namely: (i) selection; (ii) cross-over (reproduction); and (iii) mutation. Selection makes sure that only the best chromosomes (solutions) can cross over or mutate. During successive iterations (generations), initial chromosomes evolve into stronger ones by reproduction among members of the previous generation. Each GA run consists of several generations with a constant population size of chromosomes. The strength of each chromosome is evaluated by means of a fitness function. In the current study, a pattern recognition function through a classification model, NER, was used to characterise the performance of the k-NN/GA method regarding the prior classification produced by the unsupervised k-means method. The highest NER values were obtained for a cross-over probability of 60%, a mutation probability of 3% and a constant population of 21 chromosomes that were evolved over 100 generations. Thus, subsequent generations are formed by combining (stronger) chromosomes, with associated higher fitness values, from the previous (or parent) population. Further, a random change of the value of a gene to form a new chromosome is achieved through mutation, on a bit-by-bit basis (from 0 to 1 or in reverse order).

The combination of the k-NN and GA that was used in the current study follows to a great extent the approach presented by Suguna and Thanushkodi (2010); alternatives are, for instance, Chang and Lippmann (1990) and Raymer et al. (2000). The approach involved the following steps: (i) data processing in general, including the encoding of the genes that form the chromosomes; (ii) selecting the distance (or similarity) measure between the test stations whose category is being predicted and the respective neighbouring training stations; (iii) choosing the k-number of samples (or objects) from the training set to generate the initial population on which genetic operators are applied during optimisation; (iv) calculating the distance measures among testing and training samples through a goodness of fit (or similarity) index; (v) choosing the chromosome with the highest goodness of fit value and storing it as the global maximum; and (vi) refining the global maximum. The latter step is performed through an iterative procedure based on the application of genetic operators, namely reproduction, crossing-over and mutation, to get a new population upon which a newer evaluation of the goodness of fit measure takes place to define a local maximum; only if the newer goodness of fit measure (local maximum) is higher than the one associated with the older global maximum does the chromosome that corresponds to the newer local maximum become the newer global maximum. The iterative process ends once the total number of generations (100 in this study) has been completed. The chromosome that corresponds to the global maximum has the optimum k-neighbours and the respective classification results.
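Steps (v) and (vi) above, that is, keeping a global maximum and replacing it only when a generation yields a higher local maximum, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function `evolve`, the binary tournament used for reproduction, the single-point cross-over and the toy fitness function are all assumptions. In the study the fitness would be the NER of the k-NN classifier rather than the hypothetical `toy_fitness` used here.

```python
import random


def evolve(fitness, n_genes, pop_size=21, n_generations=100,
           p_cross=0.60, p_mut=0.03, seed=0):
    """Return (best_chromosome, best_fitness, history of the global maximum)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    global_best = max(population, key=fitness)       # step (v)
    global_fit = fitness(global_best)
    history = [global_fit]

    def pick():
        # Binary tournament selection (an assumed reproduction scheme).
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(n_generations):                   # step (vi)
        offspring = []
        while len(offspring) < pop_size:
            mum, dad = pick()[:], pick()[:]
            if rng.random() < p_cross:               # single-point cross-over
                cut = rng.randint(1, n_genes - 1)
                mum, dad = mum[:cut] + dad[cut:], dad[:cut] + mum[cut:]
            for child in (mum, dad):
                child = [1 - g if rng.random() < p_mut else g for g in child]
                offspring.append(child)
        population = offspring[:pop_size]            # constant population size
        local_best = max(population, key=fitness)    # local maximum
        local_fit = fitness(local_best)
        if local_fit > global_fit:                   # replace only if higher
            global_best, global_fit = local_best, local_fit
        history.append(global_fit)
    return global_best, global_fit, history


# Toy fitness (hypothetical): fraction of genes matching a hidden target mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]


def toy_fitness(chromosome):
    return sum(g == t for g, t in zip(chromosome, TARGET)) / len(TARGET)


best, best_fit, history = evolve(toy_fitness, n_genes=len(TARGET))
```

Because the global maximum is only ever replaced by a strictly better chromosome, `history` can never decrease from one generation to the next, which mirrors the refinement rule stated in step (vi).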


3. RESULTS


3.1. Land cover processing

Based on the adopted classes, the land cover percentages within the study basin are: altered vegetation (37.12 %), woody native vegetation (34.38 %), without cover/urbanised (3.54 %), and Páramo (unaltered vegetation) (24.96 %). Figure 3 illustrates, in addition to the distribution of land cover, the WQ class, C1 or C2, of each monitoring station and the bounds of the 18 sub-basins into which the Paute river basin was sub-divided. The altered vegetation is present particularly in the mid corridor of the basin; both the higher western and lower eastern sides of the basin still have a significant presence of woody native vegetation and Páramo. Woody native vegetation is present, for instance, in the sub-basins Negro (77.8 %), Juval (86.4 %), Pulpito (77.3 %) and Machángara (76.4 %). Páramo (unaltered) is present in the sub-basins Yanuncay (60.1 %), Pulpito (59.5 %), Machángara (56.7 %) and Tomebamba (48.3 %). Further, the Paute basin includes important extents of protected areas (Fig. 1). The most renowned are the Cajas National Park (PNC, located in the western, higher, extreme of the basin; Mosquera et al., 2017), which is a Ramsar-Convention (RAMSAR) wetland site, and the Sangay National Park (PNS, located at the north-east extreme); both are recognised by the United Nations Educational, Scientific and Cultural Organization (UNESCO) as World Heritage Sites. Nearly 41 % of the Paute basin area is subject to management and conservation programs.

Figure 3. Spatial distribution in the Paute river basin of (i) the soil cover (year 2001); (ii) the 18 sub-basins; and (iii) the water quality (WQ) monitoring stations, classified according to the k-NN/GA algorithm as having associated low (C1) or high (C2) pollution. Sub-basins are: 1 = Sidcay, 2 = Collay, 3 = Cuenca, 4 = Jadan, 5 = Paute, 6 = Machángara, 7 = Magdalena, 8 = Mazar, 9 = Juval, 10 = Pindilig, 11 = Pulpito, 12 = Sta. Bárbara, 13 = Burgay, 14 = Tarqui, 15 = Tomebamba, 16 = Yanuncay, 17 = Paute bajo, and 18 = Negro.

3.2. k-means algorithm: spatial similarity and monitoring stations grouping

The maximum SC value (SC_max), the minimum DB value (DB_min), and the maximum CH value (CH_max) were all obtained for k = 2 (Table 2), implying that there are two statistically significant clusters. The two groups of sites in the Paute river basin produced by the k-means method coincide with prior WQ studies in the study region (Da Ros, 1995; Pauta-Calle and Chang-Gómez, 2014); cluster 1 (C1) corresponds to relatively less polluted sites (67 monitoring stations) while cluster 2 (C2) matches highly polluted sites (13 monitoring stations).

Figure 3 highlights, as might be expected, that most WQ monitoring stations belonging to C1 are situated in the uplands and in areas with a high level of conservation and good land management. On the other hand, the monitoring stations classified as C2 are associated with sites that receive pollution from both point and non-point sources such as agricultural fields, orchard plantations and municipal and industrial wastewater effluents, particularly those stations located in the Burgay sub-basin, near the city of Azogues. The rest of the 13 WQ monitoring stations belonging to C2 are located in the Magdalena and Pindilig sub-basins, which are in the vicinity of the Burgay and the Santa Bárbara sub-basins. All these sub-basins are located in the middle-elevation part of the Paute river basin.

Table 2. Values adopted by the internal validation indices as a function of the number of clusters k (application of the non-supervised k-means algorithm for classification).

Internal index    Inspected number of clusters k
                  2        3        4        5        6
SC                0.60     0.32     0.25     0.28     0.28
DB                1.31     1.65     1.66     1.54     1.50
CH                28.22    20.75    19.54    17.55    16.77

Legend: SC = Silhouette coefficient; DB = Davies-Bouldin; CH = Caliński-Harabasz.
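The selection of k from the three indices can be reproduced directly from the values in Table 2. The short Python sketch below is illustrative only; the variable names and the simple majority vote among the indices are assumptions and not part of the original analysis. It relies on the usual reading of the indices: higher SC and CH values, and lower DB values, indicate a better partition.

```python
# Internal validation indices from Table 2, keyed by the number of clusters k.
sc = {2: 0.60, 3: 0.32, 4: 0.25, 5: 0.28, 6: 0.28}       # higher is better
db = {2: 1.31, 3: 1.65, 4: 1.66, 5: 1.54, 6: 1.50}       # lower is better
ch = {2: 28.22, 3: 20.75, 4: 19.54, 5: 17.55, 6: 16.77}  # higher is better

k_sc = max(sc, key=sc.get)
k_db = min(db, key=db.get)
k_ch = max(ch, key=ch.get)
votes = [k_sc, k_db, k_ch]
k_selected = max(set(votes), key=votes.count)            # majority vote (assumed rule)
print(k_sc, k_db, k_ch, k_selected)  # 2 2 2 2
```

All three indices point to k = 2, so no tie-breaking is needed in this case.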


3.3. k-NN/GA: pattern recognition performance and significant WQ parameters

In this analysis, for a given monitoring station, the respective (k-means) WQ class was the (dependent) grouping variable, whilst the corresponding measured parameters constituted the independent variables. A good classification accuracy was obtained with the k-NN/GA algorithm, since the average NER for 100 runs (Fig. 4a) was about 0.87 (± STD = ± 0.006), implying that about 87 % of the WQ monitoring stations were correctly assigned with respect to the WQ classes previously derived with k-means.
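As an illustration of how a figure such as the NER can be computed, the following Python sketch assumes that NER denotes the non-error rate, taken as the mean of the per-class sensitivities, which is a common definition in chemometrics; the definition actually used in the study is given in its methods section and may differ. The confusion counts are hypothetical and are not results from the study; only the class sizes (67 stations in C1 and 13 in C2) come from the text.

```python
# Hypothetical confusion counts, NOT results from the study; only the class
# sizes (67 stations in C1, 13 in C2) come from the text.
confusion = {
    "C1": {"C1": 62, "C2": 5},   # true C1 stations: predicted C1 / C2
    "C2": {"C1": 3, "C2": 10},   # true C2 stations: predicted C1 / C2
}

# Sensitivity of a class: share of its stations assigned to the right class.
sensitivity = {c: row[c] / sum(row.values()) for c, row in confusion.items()}

# Non-error rate under the assumed definition: mean of class sensitivities.
ner = sum(sensitivity.values()) / len(sensitivity)

# Plain accuracy, shown for comparison: share of all stations correctly assigned.
total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in confusion) / total
```

With unbalanced classes such as these, the mean of the class sensitivities and the plain accuracy generally differ, so it matters which of the two a reported value refers to.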

[Figure 4 panels: (a) NER (-) versus run number (-); (b) NER (-) versus number of WQ parameters (-); (c) selection frequency (%) of the significant WQ parameters EC, FC, DO, TH, CL, N–NO3, BOD5, TALK and TU.]

Figure 4. (a) Evolution of the classification model NER as a function of the run number throughout the classification process; (b) the number of WQ variables (i.e. measured parameters) included in the final stepwise selection analysis; and (c) the frequency of k-NN/GA variable (i.e. measured parameter) selection throughout the classification process (EC = electric conductivity; FC = faecal coliforms; DO = dissolved oxygen; TH = total hardness; CL = chlorides; N–NO3 = nitrates; BOD5 = 5-day biochemical oxygen demand; TALK = total alkalinity; and TU = turbidity). The k-value that was used through the application of the k-NN/GA process is 2.

With the objective of identifying the most important independent variables (i.e. measured parameters) that explain the structure of the two WQ classes, the final stepwise selection with the best NER indicated that nine parameters are the optimum to interpret the two WQ classes (Fig. 4b), namely EC, FC, DO, TH, CL, N–NO3, BOD5, TALK and TU, which account for most of the expected spatial variations of the WQ in the Paute river basin (Fig. 4c). Further, the frequency of the k-NN/GA variable selection throughout the classification process (Fig. 4c) illustrates that EC is the most significant parameter for the classification model, whilst the remaining 8 WQ parameters seem to have a similar (lower) significance.

In the following, the sub-index C1 indicates that a given parameter or characteristic is associated with the WQ class C1, whilst the sub-index C2 refers to WQ class C2. Figure 5 shows the box plots of the 9 significant WQ parameters as a function of the two WQ classes C1 (less polluted) and C2 (highly polluted). In general, the values adopted by EC (Fig. 5a), FC (Fig. 5b), TH (Fig. 5d), CL (Fig. 5e), N–NO3 (Fig. 5f) and TU (Fig. 5i) are higher for the sampling points belonging to C2 than for the ones belonging to C1 (average values are EC_C1 = 87.7 µS cm⁻¹, EC_C2 = 280.3 µS cm⁻¹; FC_C1 = 3936.7 bacteria per 100 mL, FC_C2 = 11794.3 bacteria per 100 mL; TH_C1 = 37.3 mg L⁻¹, TH_C2 = 84.5 mg L⁻¹; CL_C1 = 1.7 mg L⁻¹, CL_C2 = 13.2 mg L⁻¹; (N–NO3)_C1 = 0.3 mg L⁻¹, (N–NO3)_C2 = 1.5 mg L⁻¹; and TU_C1 = 14.7 mg L⁻¹, TU_C2 = 53.5 mg L⁻¹). For the DO (Fig. 5c) the inverse condition was observed, that is, lower values for waters in C2 (average 6.32 mg L⁻¹) than for the waters in C1 (average 6.90 mg L⁻¹). For TALK (Fig. 5g) and BOD5 (Fig. 5h) no considerable differences between waters belonging to C1 and the ones belonging to C2 were observed (average TALK_C1 = 0.62 mg L⁻¹, TALK_C2 = 0.61 mg L⁻¹ and (BOD5)_C1 = 6.43 mg L⁻¹, (BOD5)_C2 = 9.92 mg L⁻¹); however, the dispersion is slightly lower for the peak values of waters in C2 than for the waters in C1, similarly to what is observed for the DO. It is important to note that most of the peak values (Fig. 5) were observed in the Burgay sub-basin (Fig. 3); thus, they do not seem to be outliers, but simply characterise the extreme WQ conditions taking place within this sub-basin.
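The contrast between the two classes can be summarised as C2/C1 ratios of the class averages quoted in the paragraph above. The following Python sketch simply performs that arithmetic; the dictionary layout and key names are illustrative choices, while the numbers are the averages reported in the text.

```python
# Class averages (C1, C2) as reported in the text.
averages = {
    "EC (uS/cm)": (87.7, 280.3),
    "FC (bacteria/100 mL)": (3936.7, 11794.3),
    "TH (mg/L)": (37.3, 84.5),
    "CL (mg/L)": (1.7, 13.2),
    "N-NO3 (mg/L)": (0.3, 1.5),
    "TU (mg/L)": (14.7, 53.5),
    "DO (mg/L)": (6.90, 6.32),
    "TALK (mg/L)": (0.62, 0.61),
    "BOD5 (mg/L)": (6.43, 9.92),
}

# Ratio above 1: higher average in C2; ratio below 1: higher average in C1.
ratios = {name: c2 / c1 for name, (c1, c2) in averages.items()}
for name, ratio in ratios.items():
    print(f"{name:22s} C2/C1 = {ratio:.2f}")
```

The EC ratio, for example, is 280.3 / 87.7, or roughly 3.2, whereas the DO ratio falls below 1, consistent with the inverse behaviour described for dissolved oxygen.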


4. DISCUSSION

Because the pre-selection of the number of clusters (k) at the start of the k-means method strongly influences its outcome (Güler et al., 2002), the reliability of the two clusters was tested, with satisfactory results, through the internal validation indexes SC, DB and CH. In the current case, an SC value of 0.6 corresponds to the “a reasonable structure has been found” category (Kaufman and Rousseeuw, 1990). DB and CH are not rated against categories as SC is; however, the fact that both suggest two WQ classes supports the result of the k-means algorithm, which partitioned the monitoring stations into two homogeneous groups.

Figure 5. Box plots of the WQ significant parameters as a function of the two WQ classes C1 (less polluted) and C2 (highly polluted): (a) electric conductivity (EC); (b) faecal coliforms (FC); (c) dissolved oxygen (DO); (d) total hardness (TH); (e) chlorides (CL); (f) nitrates (N–NO3); (g) total alkalinity (TALK); (h) 5-day biochemical oxygen demand (BOD5); and (i) turbidity (TU).

Further, the current application of the k-means algorithm was based on a non-conventional approximation. Namely, due to the significant skewness present in the original WQ parameters, a simplified data matrix of the median values of the previously standardised WQ parameters was used (Ouyang, 2005). Standardisation of the WQ parameters was achieved by considering their ranges of variation rather than by using the z-scale procedure, as conventionally done in most WQ studies (Shrestha and Kazama, 2007; Kannel et al., 2007; Singh et al., 2004). In this context, Steinley (2004) concludes that standardisation of the studied parameters by their range of variation is the most effective approach when using the k-means method.
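The pre-processing described above can be sketched in Python as follows. The sketch assumes one common form of range standardisation, (x − min) / (max − min), applied per parameter before taking the median per station; the exact formula used in the study is not stated in this section. The station names and readings are hypothetical.

```python
from statistics import median

# Hypothetical raw observations: station -> list of EC readings (uS/cm).
raw = {
    "station_A": [80.0, 95.0, 410.0],
    "station_B": [250.0, 300.0, 290.0],
    "station_C": [60.0, 70.0, 65.0],
}

values = [v for series in raw.values() for v in series]
lo, hi = min(values), max(values)


def range_scale(x):
    """Range standardisation: maps the observed minimum to 0, maximum to 1."""
    return (x - lo) / (hi - lo)


# One robust summary per station: the median of the standardised readings.
station_medians = {name: median(range_scale(v) for v in series)
                   for name, series in raw.items()}
```

The median is far less sensitive than the mean to a single extreme reading, such as the 410.0 value of the hypothetical station_A, which is the usual motivation for using it with skewed data.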


In classification models, parametric methods, such as the linear discriminant analysis, widely used in surface water resources (Hajigholizadeh and Melesse, 2017; Gholizadeh et al., 2016; Kovács et al., 2014), require a priori knowledge of the probability density functions of the classes. However, in most real-world applications these statistical properties are very hard to know in advance (Lavine and Rayens, 2009) and/or classes substantially depart from normality (Hirsch and Alexander, 1991). Consequently, parametric methods are no longer adequate (Huberty and Olejnik, 2006) and non-parametric methods, such as the k-NN algorithm, are a more appropriate alternative (Hirsch et al., 1982; Towler et al., 2009). Nevertheless, the k-NN approach is normally not chosen as a primary classification method in surface water studies.

The k-NN/GA method gave relevant results in selecting appropriate explanatory variables representative of the WQ status in the study basin. The method yielded an important data reduction, as it identified only nine (EC, FC, DO, TH, CL, N–NO3, BOD5, TALK and TU) out of the 21 measured WQ parameters, which explain about 86 % of the cluster assignations in supervised pattern recognition, reflecting most of the WQ spatial variation in the study basin.

The average EC_C2 (280.3 µS cm⁻¹) was roughly three times the respective value for EC_C1 (87.7 µS cm⁻¹). Because EC increases nearly linearly with increasing ion concentration, this implies that inorganic inputs are high in waters belonging to C2. Most of the C2-associated monitoring stations belong to the Burgay sub-basin (Fig. 3), where wastewater effluents often contain high amounts of dissolved salts from inorganic pollution sources such as domestic sewage, municipal storm water drainage and industrial effluent discharges (Da Ros, 1995; Pauta-Calle and Chang-Gómez, 2014). Further, other variables related to inorganic pollution sources, such as TH, CL, TU, N–NO3 and TALK, have a similar evolution to that of EC. In this regard, Olajire and Imeokparia (2001), Daniel et al. (2002) and Begum et al. (2009) observed in similar studies that nitrate and chloride ions and calcium or magnesium salts are predominant in water with high EC values.

The high TH values associated with C2 might be linked to the use of inorganic fertilisers (WHO, 1998). Likewise, high concentrations of the nitrate ions (inorganic nutrient) associated with C2 might be attributed to the runoff of fertiliser residues from agricultural lands, mainly urea (CON2H4). Further, FC, DO, and BOD5 normally reflect contamination by microbial communities and human activities (Liou et al., 2004). The trends observed for these variables therefore suggest an important organic pollution from anthropogenic sources associated with the WQ monitoring sites belonging to C2, resulting in a high oxygen demand, characterised by average BOD5 values higher in C2 (9.92 mg L⁻¹) than in C1 (6.43 mg L⁻¹).

A similar trend was found in other studies such as Vega et al. (1998), Singh et al. (2004) and Kannel et al. (2007), which reported that high levels of dissolved organic matter (DOM, i.e., carbohydrates, proteins, lipids, etc.) consume large amounts of oxygen, which depletes the amount of available DO. The latter is in line with the results of the current study. However, contrary to what is reported in those studies, in our study basin the DO did not drop to a minimum level and, as such, did not induce anaerobic fermentation leading to the development of ammonia and organic acids. This is emphasised by the evolution of the pH distributions associated with the C1 and C2 monitoring sites, which shows that the acidity of the C2-associated waters is not systematically higher than the one recorded in the C1-associated waters, except for very few pH data near the lower, i.e. acidic, extreme of the pH_C2 range of variation (pH_C1 range = 5.3–9.43, pH_C2 range = 4.17–8.95).

Several authors emphasised the importance of finding land use/cover attributes that explain the spatial variation of WQ (Bücker et al., 2010; Bu et al., 2014; Cunha et al., 2016). Examination of the spatial distribution of both the C1 and C2 stations, as well as the land cover, suggests that most of the WQ_C2 monitoring stations (Fig. 3) are not directly exposed to, nor significantly influenced by, native forest and unaltered vegetation cover. This is particularly the case in the Burgay sub-basin, where most of the WQ_C2 monitoring stations are located and where 55.3 % of the surface has altered native forest and vegetation (resulting from the introduction of exotic species such as eucalyptus, or replacement of native vegetation by pastures). Further, besides the city of Azogues, which does not possess a water treatment system prior to final in-river disposal, this sub-basin has an important number of small villages that produce a significant contribution of untreated sewage (Da Ros, 1995). This problem is further accentuated by the industrial production of cement (Da Ros, 1995; Pauta-Calle and Chang-Gómez, 2014). Similarly, the neighbouring Magdalena sub-basin has a very high proportion (70.7 %) of its surface covered by altered vegetation and a limited native forest cover (24.2 %), which explains the presence of a WQ_C1 station located in the native forest area and a WQ_C2 station by the outlet of the sub-basin. Neither the Burgay nor the Magdalena sub-basin has adequate land management programs.

In the other sub-basins with WQ_C2 stations (Paute, Pindilig and Sta. Bárbara), the proportion of surface with altered vegetation cover is significant, even though the majority of the WQ sampling stations are classified as C1. All WQ sampling stations of the Paute sub-basin (Fig. 3), except one, are classified as C1 and are located in areas where no significant human settlements are present; only the WQ station located at the outlet of the sub-basin is classified as C2, which can be explained not only by the direct influence of human settlements located immediately upstream along the river course, but also by the direct contribution from the Magdalena sub-basin into the lower part of the Paute sub-basin (Fig. 2 and Fig. 3).

For the Pindilig and Santa Bárbara sub-basins, which contain WQ_C2 stations, the location of the WQ_C1 stations is congruent with their proximity to native forest and unaltered vegetation. Further, a significant proportion of their surface (50.2 % for the Santa Bárbara sub-basin and 32.6 % for the Pindilig sub-basin) is covered by altered vegetation and there is only a very scarce presence of human settlements. In this regard, once again, the land cover distribution suggests point source pollution from localised human settlements, which is accentuated in the Santa Bárbara sub-basin by the fact that the average FC_C2 (13573.1 bacteria per 100 mL) is about double the respective average FC_C1 (6272.4 bacteria per 100 mL). For the Pindilig sub-basin, this is confirmed by the average FC_C2 value (12621.4 bacteria per 100 mL), which is nearly double the FC_C1 value (6909.8 bacteria per 100 mL), as well as by the fact that the average (BOD5)_C2 value (22.6 mg L⁻¹) is more than three times the respective (BOD5)_C1 value (6.1 mg L⁻¹).

Point source pollution locations seem to have a low influence on the downstream WQ because, further downstream, human settlements nearly vanish, discharge increases and native and altered vegetation prevails. In particular, the presence of native vegetation has the potential of sustaining riparian ecosystems in several spots along the river courses, forming buffer systems that enhance the riparian ecosystems and, finally, the downstream WQ. Formation of these riparian buffers should be encouraged in the study basin, for instance for the removal of nitrates (Lowrance et al., 1997; Sweeney and Newbold, 2014; Connolly et al., 2015), which in WQ_C2 sites reach an average value of 1.51 mg L⁻¹, about 5 times the average in the WQ_C1 sites (0.31 mg L⁻¹).

On the other hand, the sampling sites located by Cuenca city (Tomebamba and Yanuncay sub-basins) are classified as belonging to the C1 group, despite being located in the most urbanised area of the Paute river basin. This is logical and is explained by the fact that in the city of Cuenca an acceptable treatment of most of the wastewater takes place, with favourable consequences for the WQ of the Paute river. In the remaining sub-basins of the Paute river basin the presence of native forest and unaltered vegetation (such as páramo) is significant and the level of anthropisation is low, which emphasises the congruency of the current WQ classification results.

5. CONCLUSIONS

The study illustrated the usefulness of the non-supervised (k-means) and supervised (k-NN/GA) pattern recognition algorithms for the analysis and interpretation of complex data sets in the context of river WQ assessment, the identification of pollution sources/factors and the understanding of spatial variations in river WQ. These methods offer the potential of contributing positively to effective river WQ management in the study basin, for instance by reducing the number of parameters to be monitored in the future, which should result in a significant reduction of the monitoring costs without sacrificing accuracy. The study also revealed the worth of combining data mining techniques and common GIS tools to cross-check the congruency of results.

6. ACKNOWLEDGEMENTS

This study was conducted in the context of the Master's thesis (National University of La Plata, Argentina) carried out by the first author and co-directed by the second author. Additional work took place within the frame of the fellowship awarded to the second author by the Ecuadorian Secretary for Higher Education, Science, Technology and Innovation (SENESCYT) through its “PROMETEO Viejos Sabios” Program. The authors would like to express their gratitude to SENAGUA for making the raw data accessible to the current study. Further, the authors are thankful to Jan Feyen for going through the first draft of this manuscript and to the anonymous reviewers for their important remarks.

REFERENCES

Anderson, T., Finn, J., 1996. The New Statistical Analysis of Data. Springer-Verlag New York, Inc.

Astudillo, S., Astudillo, P., Cisneros, P., Coello, C., García, J., González, C., Pacheco, E., Rengel, A., Stoop, B., Van Noten, S., Wijffels, A., Zúñiga, A., 2010. Atlas de la cuenca del río Paute, PROMAS - Universidad de Cuenca. ed. Consejo de gestión de aguas de la cuenca del Paute, Cuenca - Ecuador.

Barnett, T., Pierce, D., Hidalgo, H., Bonfils, C., Santer, B., Das, T., Bala, G., Wood, A., Nozawa, T., Mirin, A., Cayan, D., Dettinger, M., 2008. Human-induced changes in the hydrology of the western United States. Science 319, 1080–1083. doi:10.1126/science.1152538

Bartram, J., Ballance, R., 1996. Water Quality Monitoring - A Practical Guide to the Design and Implementation of Freshwater Quality Studies and Monitoring Programmes, First ed. United Nations Environment Programme / World Health Organization. doi:10.1159/000170272

Begum, A., Ramaiah, M., Harikrishna, Khan, I., Veena, K., 2009. Heavy Metal Pollution and Chemical Profile of Cauvery River Water. E-Journal of Chemistry 6, 47–52. doi:10.1155/2009/154610

Bu, H., Meng, W., Zhang, Y., Wan, J., 2014. Relationships between land use patterns and water quality in the Taizi River basin, China. Ecological Indicators 41, 187–197. doi:10.1016/j.ecolind.2014.02.003

Bücker, A., Crespo, P., Frede, H.-G., Vaché, K., Cisneros, F., Breuer, L., 2010. Identifying Controls on Water Chemistry of Tropical Cloud Forest Catchments: Combining Descriptive. Aquat Geochem 127–149. doi:10.1007/s10498-009-9073-4

Caccia, V.G., Boyer, J.N., 2005. Spatial patterning of water quality in Biscayne Bay, Florida as a function of land use and water management. Marine Pollution Bulletin 50, 1416–1429. doi:10.1016/j.marpolbul.2005.08.002

Caliński, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communications in Statistics 3, 1–27. doi:10.1080/03610927408827101

Chang, E.I., Lippmann, R.P., 1990. Using Genetic Algorithms to Improve Pattern Classification Performance. NIPS 1990 797–803.

Chen, G., Jaradat, S.A., Banerjee, N., Tanaka, T.S., Ko, M.S., Zhang, M.Q., 2002. Evaluation and comparison of clustering algorithms in analyzing ES cell gene expression data. Statistica Sinica 12, 241–262.

CONELEC, 2011. Estadística del sector eléctrico ecuatoriano - Folleto resumen.

CONELEC, 2009. Plan maestro de electrificación del Ecuador.

Connolly, N.M., Pearson, R.G., Loong, D., Maughan, M., Brodie, J., 2015. Water quality variation along streams with similar agricultural development but contrasting riparian vegetation. Agriculture, Ecosystems and Environment 213, 11–20. doi:10.1016/j.agee.2015.07.007

Cunha, D.G.F., Sabogal-Paz, L.P., Dodds, W.K., 2016. Land use influence on raw surface water quality and treatment costs for drinking supply in São Paulo State (Brazil). Ecological Engineering 94, 516–524. doi:10.1016/j.ecoleng.2016.06.063

Da Ros, G., 1995. La contaminación de aguas en Ecuador: una aproximación económica. Instituto de Investigaciones Económicas, Pontificia Universidad Católica del Ecuador.

Daniel, M.H., Montebelo, A.A., Bernardes, M.C., Ometto, J.P., De Camargo, P.B., Krusche, A. V., Martinelli, L.A., Ballester, R.L., Martinelli, L.A., 2002. Effects of urban sewage on dissolved oxygen, dissolved inorganic and organic carbon, and electrical conductivity of small streams along a gradient of urbanization in the Piracicaba river basin. Water, Air, and Soil Pollution 136, 189–206.

Davies, D.L., Bouldin, D.W., 1979. A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1, 224–227.

Dudoit, S., Fridlyand, J., 2002. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3, 1–21. doi:10.1186/gb-2002-3-7-research0036

Fix, E., Hodges, J.L., 1951. Discriminatory Analysis - Nonparametric discrimination consistency properties. USAF School of Aviation Medicine, Randolph Field, Texas, 1–21.

Frank, I.E., Todeschini, R., 1994. The Data Analysis Handbook. Elsevier Science B.V., pp. 1–364. doi:10.1016/S0922-3487(08)70048-0

Gholizadeh, M.H., Melesse, A.M., Reddi, L., 2016. Discriminant analysis application in spatiotemporal evaluation of water quality in South Florida. Journal of Hydroinformatics 18, 1019–1032. doi:10.2166/hydro.2016.023

Güler, C., Thyne, G.D., McCray, J.E., Turner, A.K., 2002. Evaluation of graphical and multivariate statistical methods for classification of water chemistry data. Hydrogeology Journal 10, 455–474. doi:10.1007/s10040-002-0196-6

Hajigholizadeh, M., Melesse, A.M., 2017. Assortment and spatiotemporal analysis of surface water quality using cluster and discriminant analyses. Catena 151, 247–258. doi:10.1016/j.catena.2016.12.018

Hand, D.J., 2012. Assessing the Performance of Classification Methods. International Statistical Review 80, 400–414. doi:10.1111/j.1751-5823.2012.00183.x

Hanselman, D., Littlefield, B., 2012. Mastering MATLAB®. Pearson Education Limited.

Harper, D., Zalewski, M., Pacini, N., 2008. Ecohydrology: processes, models and case studies - An approach to the sustainable management of water resources. CAB International. doi:10.1111/j.1365-2427.2010.02451.x

Hartigan, A., Wong, M.A., 1979. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society 28, 100–108. doi:10.2307/2346830

Helsel, D.R., Hirsch, R.M., 2002. Statistical methods in water resources, in: Techniques of Water-Resources Investigations of the United States Geological Survey. U.S. Geological Survey, pp. 1–510.

Hirsch, R.M., Alexander, R.B., 1991. Selection of Methods for Detection and Estimation of Trends in Water Quality Data. Water Resources Research 27, 803–813. doi:10.1029/91WR00259

Hirsch, R.M., Slack, J.R., Smith, R.A., 1982. Techniques of Trend Analysis for Monthly Water-Quality Data. Water Resources Research 18, 107–121. doi:10.1029/WR018i001p00107

Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. The MIT Press. doi:10.1137/1018105

Huberty, C.J., Olejnik, S., 2006. Applied MANOVA and Discriminant Analysis, Second ed. John Wiley & Sons, Inc.

Juahir, H., Zain, S.M., Aris, A.Z., Yusoff, M.K., Mokhtar, M. Bin, 2010. Spatial assessment of Langat River water quality using chemometrics. Journal of environmental monitoring 12, 287–295. doi:10.1039/b907306j

Kannel, P.R., Lee, S., Kanel, S.R., Khan, S.P., 2007. Chemometric application in classification and assessment of monitoring locations of an urban river system. Analytica Chimica Acta 582, 390–399. doi:10.1016/j.aca.2006.09.006

Kaufman, L., Rousseeuw, P.J., 1990. Finding groups in data - An introduction to cluster analysis. John Wiley & Sons, Inc.

Kovács, J., Kovács, S., Magyar, N., Tanos, P., Hatvani, I.G., Anda, A., 2014. Classification into homogeneous groups using combined cluster and discriminant analysis. Environmental Modelling and Software 57, 52–59. doi:10.1016/j.envsoft.2014.01.010

Kryszczuk, K., Hurley, P., 2010. Estimation of the number of clusters using multiple clustering validity indices. International Workshop on Multiple Classifier Systems 114–123. doi:10.1007/978-3-642-12127-2_12

Lavine, B.K., Rayens, W.S., 2009. Classification: Basic Concepts, in: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics. Elsevier B.V., pp. 507–515.

Liou, S.M., Lo, S.L., Wang, S.H., 2004. A generalized water quality index for Taiwan. Environmental Monitoring and Assessment 96, 35–52. doi:10.1023/B:EMAS.0000031715.83752.a1

Liu, S., Butler, D., Brazier, R., Heathwaite, L., Khu, S.T., 2007. Using genetic algorithms to calibrate a water quality model. Science of the Total Environment 374, 260–272. doi:10.1016/j.scitotenv.2006.12.042

Lowrance, R., Altier, L.S., Newbold, J.D., Schnabel, R.R., Groffman, P.M., Denver, J.M., Correll, D.L., Gilliam, J.W., Robinson, J.L., Brinsfield, R.B., Staver, K.W., Lucas, W., Todd, A.H., 1997. Water quality functions of riparian forest buffers in Chesapeake bay watersheds. Environmental Management 21, 687–712. doi:10.1007/s002679900060

MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 281–297. doi:citeulike-article-id:6083430

ACCEPTED MANUSCRIPT 1

Mosquera, P.V., Hampel, H., Vázquez, R.F., Alonso, M., Catalan, J., 2017. Abundance

2

and morphometry changes across the high mountain lake-size gradient in the

3

tropical Andes of Southern Ecuador. Water Resources Research 53(8), 7269-

4

7280. doi:10.1002/2017WR020902

6

Mulligan, A., Brown, L., 1998. Genetic Algorithms for Calibrating Water Quality Models. Journal of Environmental Engineering 124, 202–211.

Ng, A.W.M., Perera, B.J.C., 2003. Selection of genetic algorithm operators for river water quality model calibration. Engineering Applications of Artificial Intelligence 16, 529–541. doi:10.1016/j.engappai.2003.09.001

Olajire, A.A., Imeokparia, F.E., 2001. Water quality assessment of Osun river: Studies on inorganic nutrients. Environmental Monitoring and Assessment 69, 17–28. doi:10.1023/A:1010796410829

Ouyang, Y., 2005. Evaluation of river water quality monitoring stations by principal component analysis. Water Research 39, 2621–2635. doi:10.1016/j.watres.2005.04.024

Pauta-Calle, G., Chang-Gómez, J., 2014. Indices de calidad del agua de fuentes superficiales y aspectos toxicológicos, evaluación del Río Burgay. MASKANA, I+D+ingeniería 165–176.

Ray, S., Turi, R.H., 1999. Determination of number of clusters in k-means clustering and application in colour image segmentation. Proceedings of the 4th international conference on advances in pattern recognition and digital techniques 137–143.

Raymer, M.L., Punch, W.F., Goodman, E.D., Kuhn, L.A., Jain, A.K., 2000. Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation 4, 164–171. doi:10.1109/4235.850656

Sanderson, E.W., Jaiteh, M., Levy, M.A., Redford, K.H., Wannebo, A.V., Woolmer, G., 2002. The Human Footprint and the Last of the Wild. BioScience 52, 891–904. doi:10.1641/0006-3568(2002)052[0891:THAT]2.0.CO;2

SENAGUA, 2016. Plan de monitoreo de calidad del agua de los sistemas de agua de abastecimiento público de Portoviejo, Manta, Chone, Pedernales, Jama, Bahia de Caráquez, San Vicente, Canoa, Calceta, Junín, Tosagua, Flavio Alfaro y Muisne.

Shapiro, S.S., Wilk, M.B., 1965. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591–611. doi:10.1093/biomet/52.3-4.591

Shrestha, S., Kazama, F., 2007. Assessment of surface water quality using multivariate statistical techniques: A case study of the Fuji river basin, Japan. Environmental Modelling & Software 22, 464–475. doi:10.1016/j.envsoft.2006.02.001

Simeonov, V., Stratis, J.A., Samara, C., Zachariadis, G., Voutsa, D., Anthemidis, A., Sofoniou, M., Kouimtzis, T., 2003. Assessment of the surface water quality in Northern Greece. Water Research 37, 4119–4124. doi:10.1016/S0043-1354(03)00398-1

Singh, K.P., Malik, A., Mohan, D., Sinha, S., 2004. Multivariate statistical techniques for the evaluation of spatial and temporal variations in water quality of Gomti River (India) - a case study. Water Research 38, 3980–3992. doi:10.1016/j.watres.2004.06.011

Sivanandam, S.N., Deepa, S.N., 2008. Introduction to Genetic Algorithms. Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-540-73190-0

Steinley, D., 2004. Standardizing variables in K-means clustering, in: Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W.A. (Eds.), Classification, Clustering, and Data Mining Applications. Springer-Verlag Berlin, pp. 53–60.

Suguna, N., Thanushkodi, K., 2010. An Improved k-Nearest Neighbor Classification Using Genetic Algorithm. International Journal of Computer Science Issues 7, 18–21.

Sweeney, B.W., Newbold, J.D., 2014. Streamside forest buffer width needed to protect stream water quality, habitat, and organisms: A literature review. Journal of the American Water Resources Association 50, 560–584. doi:10.1111/jawr.12203

Thalamuthu, A., Mukhopadhyay, I., Zheng, X., Tseng, G., 2006. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22, 2405–2412. doi:10.1093/bioinformatics/btl406

Todeschini, R., Ballabio, D., Consonni, V., 2015. Distances and Other Dissimilarity Measures in Chemometrics. Encyclopedia of Analytical Chemistry. doi:10.1002/9780470027318.A9438

Towler, E., Rajagopalan, B., Seidel, C., Summers, R.S., 2009. Simulating ensembles of source water quality using a K-nearest neighbor resampling approach. Environmental Science and Technology 43, 1407–1411. doi:10.1021/es8021182

Vega, M., Pardo, R., Barrado, E., Debán, L., 1998. Assessment of seasonal and polluting effects on the quality of river water by exploratory data analysis. Water Research 32, 3581–3592. doi:10.1016/S0043-1354(98)00138-9

Wang, K., Wang, B., Peng, L., 2009. CVAP: Validation for Cluster Analyses. Data Science Journal 8, 88–93. doi:10.2481/dsj.007-020

WHO, 1998. Guidelines for Drinking-Water Quality, in: Health Criteria and Other Supporting Information. World Health Organization, pp. 1–283.

HIGHLIGHTS

• K-means was useful for classifying the water quality (WQ) in the Paute river basin.

• Impacted and less impacted WQ sites were identified in the study basin.

• k-NN/GA identified 9 key WQ parameters, out of 21, explaining the WQ classification.

• k-NN/GA suggested a more efficient and, as such, more economical WQ sampling strategy.

• Predictions of the applied data mining methods were congruent with GIS-based analyses.