13th IFAC Workshop on Intelligent Manufacturing Systems
August 12-14, 2019. Oshawa, Canada
Available online at www.sciencedirect.com
ScienceDirect
IFAC PapersOnLine 52-10 (2019) 340–345
Groundwater quality assessment combining supervised and unsupervised methods

R. Ratolojanahary ∗, R. Houé Ngouna ∗, K. Medjaher ∗, F. Dauriac ∗∗, M. Sebilo ∗∗∗

∗ Laboratoire Génie de Production, École Nationale d'Ingénieurs de Tarbes, France (e-mail: [email protected]).
∗∗ Chambre d'Agriculture des Hautes Pyrénées, Tarbes, France
∗∗∗ Sorbonne Université - Faculté de Sciences - iEES, Paris, France

Abstract: Human practices and industrial activities are increasingly becoming critical sources of water contamination, leading to groundwater quality deterioration. Unlike most studies in the literature, which focus on chemical analyses and use prior knowledge of the deterioration phenomenon and predefined contaminating parameters, this work aims at defining a general method for water contamination detection. It will then allow decision makers in charge of water resource management to predict quality deterioration and anticipate any consequence on human health. To this end, Prognostics and Health Management, which is usually applied to industrial systems, has been transposed to water in order to continuously monitor it and to detect and predict anomalous cases. Due to a high rate of missing values, a model selection method for multiple imputation has been implemented in order to reduce the induced uncertainty of the observations, while a dimensionality reduction method has been applied to mitigate the noise induced by the large number of parameters of interest (more than fifty). To perform the detection, unsupervised and supervised learning methods have been combined, which made it possible to determine four classes of water quality and to provide relevant insights on the anomalous cases. Finally, the homogeneity of the resulting classes has been evaluated using a Support Vector Machine classifier. The proposed method does not depend on a specific context and can therefore be generalized to provide accurate rules for water quality classification of any source of data.

Keywords: water resource monitoring; groundwater quality assessment; groundwater contamination detection; prognostics and health management; machine learning; Linear Discriminant Analysis; Hierarchical Cluster Analysis; Support Vector Machine.

2405-8963 © 2019, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2019.10.054

1. INTRODUCTION

Access to water, particularly drinking water, is a global public health issue due to the various risks of contamination to which it is exposed, whether intentional (e.g. terrorist attack) or accidental (industrial or household waste). Among these contaminating factors, the use of fertilizers in intensive agriculture appears to be a major concern to the public authorities in charge of water resource management. It is therefore necessary to provide them with decision support, allowing them to detect the degradation of water quality early, and then anticipate any consequence on human health and biodiversity.

In the literature, detection approaches based on physical models have demonstrated their ability to formalize the phenomenology of water contamination, taking into account the continuous nature of the underlying phenomenon (Das et al., 2017). However, in practice, their implementation can be complex due to the high number of critical contaminating factors.

In contrast, detection approaches based on data analysis can formalize complex relationships between a large number of parameters that may be of great interest to decision makers. The underlying models can also benefit from the ability of organizations to acquire large amounts of data (through technological advances in sensor development) as well as recent advances in learning methods that can provide real-time monitoring, based on intelligent systems. Prognostics and Health Management (PHM), generally implemented for technological products, is a promising approach that can optimize the health of a system through efficient and continuous monitoring (Atamuradov et al., 2017). The key challenge of this work is therefore to transpose it into the context of a natural system (i.e. a water resource) by considering water and its environment as the system and its quality parameters as the critical indicators of the system health.

The notion of water quality is defined by the World Health Organization (WHO) as the "suitability of water to sustain various uses or processes" (Bartram et al., 1996). Unlike most research studies in the literature, which generally rely on a few predefined parameters to analyze water quality, this work focuses on more than fifty parameters of interest. This may generate noise and would not provide relevant knowledge to generalize the causal analysis of the contamination phenomenon. In addition, the acquired data are characterized by a high rate of missing values, which causes uncertainty on the input data, while no prior knowledge on
the formalization of water quality is available. Moreover, unlike common PHM implementations on industrial systems, where universal health indicators (obtained through acoustic, vibration or electrical signal analysis, for example) are used to study the system health, the field of water quality does not provide any. Although recent works have shown the increasing interest of researchers in the physical analysis of water, using for instance spectrophotometry to determine water quality deterioration (Arnon et al., 2019), such indicators are not yet mature and not widely applied.

To address these issues, it is proposed to combine supervised and unsupervised methods in order to pretreat the data and to identify different water quality classes. For brevity, the other steps of PHM are not treated in the present paper.

The rest of the paper is organized as follows: in the next section, related work concerning water quality assessment is presented. In section 3, the chosen approach is described, and in section 4, the method is applied to the dataset and the results are discussed. Finally, section 5 concludes the work and proposes directions for further improvements.

2. RELATED WORK

PHM is a well-known approach that enables predictive maintenance by providing robust tools and methods for detecting failures along with their corresponding causes. It has been widely and successfully applied in the context of industrial systems. Thanks to its ability to also predict the Remaining Useful Life (RUL) of technological systems, it is considered by industrial practitioners as one of the most powerful approaches that support continuous improvement and allow decision makers to act ahead of failures. Thus, applying PHM can considerably reduce maintenance costs, while extending the operational time of the service (Gouriveau et al., 2017). Several review papers on PHM design for different applications have been published in the literature: rotary machinery systems (Lee et al., 2014), batteries (Rezvanizaniani et al., 2014), train bogies (?)... To our knowledge, however, PHM has never been used to control water quality at the source (at the pumping well).

A basic PHM is implemented according to the following main steps: (i) data acquisition, (ii) data preprocessing, (iii) detection, (iv) diagnostics, (v) prognostics and (vi) decision-making. In the present study, it can be applied in the following manner (see Fig. 1): first, the data are collected using sensors and laboratory analyses. Secondly, the data are made exploitable and of good quality by handling missingness, heterogeneity and high dimensionality. Next, water quality classes are determined and a rule is inferred for classifying new data. If the water is found to be of poor quality, the causes are investigated. Then, the evolution of water quality over time is predicted. Finally, a recommendation system is used to improve decision-making.

Fig. 1. PHM applied to water quality.

In order to identify water quality classes, two main approaches can be used: choosing a water quality indicator or clustering the data to find water quality classes.

2.1 Health indicators for water quality monitoring.
In practice, the value of each parameter is compared to its quality limit defined by the water quality standards. In the literature, the study can be focused on a specific parameter, namely nitrates (Nolan et al., 2015) or chlorine (Liu et al., 2016), depending on the needs and goals of the end-user. In this case, interaction between parameters is not taken into account. Since many physico-chemical parameters are obtained through laboratory analyses, not all parameters are measured at the same rate, and water quality datasets are often sparse. That is why some studies only focus on a few parameters that can be monitored continuously, such as turbidity and conductivity (Leigh et al., in press 2019).

Another way of assessing water quality is to use a Water Quality Index (WQI) (Brown et al., 1970), or a Groundwater Quality Index (GQI) (Machiwal and Jha, 2015), depending on the case. A WQI is a weighted mean of a few water quality parameters; the weights and the parameters vary according to the type of water and the region (Camejo et al., 2013). A GQI gives a higher importance to hydrochemical parameters such as calcium or magnesium. There are usually five classes of water quality, ranging from very polluted to very good. However, many configurations can lead to the same WQI value. Moreover, only a limited number of parameters is usually considered, and the index generally neither identifies the underlying cause of the pollution nor accounts for the interaction between parameters.

An alternative is to identify observations with similar characteristics and group them into classes, each corresponding to a certain category of water quality. Clustering algorithms are designed for that purpose.

2.2 Unsupervised methods for contamination detection.

K-means is one of the most widely used clustering methods. It takes as input a number K of desired classes, a distance d, a set of K observations to initialize the clusters,
and data with no labels. It was used to produce three water quality categories related to the contamination cause (mineralization, domestic and industrial intrusion, punctual industrial discharges) in (Celestino et al., 2018). K-means requires tuning of K, d and the K initial centroids. The clustering can also be fuzzy, meaning that the algorithm gives a degree of membership to each cluster instead of a definite answer (Fu-cheng and Xue-zhao, 2013).

Hierarchical Cluster Analysis (HCA) is another common clustering method. It computes the distance matrix between all observations and gradually merges the closest ones into clusters, until all the observations are gathered into one cluster. It requires less tuning than K-means, but is more computationally demanding. Two clusters were obtained in (Machiwal and Jha, 2015). They are distinguished by their geographical location and the trends in parameters. A different cause of pollution was found for each cluster: natural and anthropogenic. In that study, the clusters lacked homogeneity and only 15 parameters were used.

In this paper, the aim is to identify relevant water quality classes in a more generalizable way than a water quality index, that is, with any number of parameters and any type of water, while taking water quality standards into account. Indeed, there are norms concerning those parameters that depend on the geographic situation and the designated use (human consumption, agricultural use or industrial use, for example). Such standards provide useful information, among others, on the threshold that a given critical parameter should not exceed for the water to be declared drinkable, resulting in two groups of water quality. To that end, a label (compliant or non-compliant) will be added to each sample at the preprocessing step (step (ii) of the PHM process), then a clustering method will be applied to obtain subclasses.
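The labeling-then-clustering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names, the quality limits in `LIMITS`, the toy samples, the choice of average linkage and the number of subclasses per group are all assumptions made for illustration, and the features are left unscaled for simplicity.

```python
from itertools import combinations
from math import dist

# Hypothetical quality limits (illustrative values, not taken from any standard).
LIMITS = {"nitrates": 50.0, "atrazine": 0.1}


def is_compliant(sample, limits=LIMITS):
    """A sample is compliant when no parameter exceeds its quality limit."""
    return all(sample[name] <= limit for name, limit in limits.items())


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def hca(points, n_clusters):
    """Agglomerative clustering: merge the two closest clusters repeatedly."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Toy observations (assumed values, for illustration only).
samples = [
    {"nitrates": 12.0, "atrazine": 0.02},
    {"nitrates": 14.0, "atrazine": 0.03},
    {"nitrates": 41.0, "atrazine": 0.08},
    {"nitrates": 58.0, "atrazine": 0.05},
    {"nitrates": 61.0, "atrazine": 0.06},
    {"nitrates": 95.0, "atrazine": 0.40},
]

# Step 1: add the compliant / non-compliant label to each sample.
labels = [is_compliant(s) for s in samples]

# Step 2: cluster each group separately to obtain subclasses.
subclasses = {}
for flag in (True, False):
    idx = [i for i, ok in enumerate(labels) if ok == flag]
    pts = [(samples[i]["nitrates"], samples[i]["atrazine"]) for i in idx]
    groups = hca(pts, n_clusters=2)
    subclasses[flag] = [sorted(idx[k] for k in g) for g in groups]

print(labels)
print(subclasses)
```

With two subclasses requested in each group, this toy run yields four classes in total. In practice the number of subclasses would be chosen from the data (for example by inspecting the merge distances) rather than fixed in advance.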
3. THE PROPOSED METHOD

3.1 Context of the study and previous work.

The studied well is located at Oursbelille, in the Southwest of France. The dataset consists of 133 observations of 414 water quality parameters from 2001 to 2018. These parameters can be grouped into four categories: organoleptic (colour, smell, etc.), physico-chemical (sodium, temperature, etc.), pesticides (atrazine, metolachlor, etc.) and microbiological parameters (enterococcus, Escherichia coli, etc.). The raw data that have been used contain a high rate of missing values. A model selection for data imputation has been performed to solve this issue, as described in (Ratolojanahary et al., in press 2019). All the learning methods have been chosen among competing candidates, through a selection process. To cope with overfitting, their respective hyper-parameters have been optimized accordingly. A combination of Multivariate Imputations by Chained Equations (MICE) and Support Vector Regression (SVR) has been chosen. Furthermore, the number of parameters has been reduced by dropping those that have too many unreliable measurements. For brevity, in the following, clean and complete data are considered, composed of 133 observations and 56 parameters.

3.2 The main steps of the method.

The main steps of the proposed method are depicted in Fig. 2. European standards (SANP0720201A) are introduced to create a categorical variable specifying whether a given sample complies with the norm or not (see step 1 in the figure). These standards associate each parameter with its quality threshold value. The resulting labeled data are then used for dimensionality reduction (see step 2 in the figure). In step 3, the resulting reduced dataset is split into two sub-datasets according to compliance with the norms. Then, a clustering method is applied to each of these sub-datasets to provide the desired intermediate water quality classes (step 4). In this study, Linear Discriminant Analysis (LDA) has been chosen as a supervised dimensionality reduction method in order to find the subspace in which compliant and non-compliant observations are best separated. Then, since the number of observations is relatively small (133), HCA is used to find the subclasses.

3.3 Theoretical background.

The theoretical background of the main methods implemented in this paper is presented in the following.

Linear Discriminant Analysis. LDA is a supervised method that aims to reduce dimension by finding new axes for which separation between classes is maximal. It proceeds as follows:
(1) Compute µ, µ1 and µ2, respectively the mean vectors of the entire dataset, of the observations belonging to class 1 and of those belonging to class 2;
(2) Compute the within-class and between-class scatter matrices. The within-class matrix $S_w$ accumulates the squared deviations of each observation from the mean of its class, as defined in equation (1):

$$S_w = \sum_{j=1}^{c} \sum_{i=1}^{N_j} (x_i^j - \mu_j)(x_i^j - \mu_j)^T \qquad (1)$$
where c is the number of classes and $N_j$ the number of observations in class j. The between-class matrix represents the distance between the mean of each class and the mean of the entire dataset, and is formally defined in equation (2):

$$S_b = \sum_{j=1}^{c} (\mu_j - \mu)(\mu_j - \mu)^T \qquad (2)$$

The goal is to find an axis that minimizes the distance within a class and maximizes the distance between the two classes, which is equivalent to maximizing $\det(S_b)/\det(S_w)$, which in turn amounts to computing the eigenvalues of $S_w^{-1} S_b$ (Fisher, 1938);
(3) Compute the eigenvalues and the eigenvectors, and order them by decreasing eigenvalue. The larger the eigenvalue, the more information is retained by the corresponding axis;
(4) Choose a number of linear discriminants;
(5) Project the data into the new subspace.

To choose the number of linear discriminants, the cumulative percentage of explained variance is calculated.
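The five LDA steps above can be sketched with NumPy. This is an illustrative sketch on synthetic two-class data, not the implementation used in the study; the function name `lda_fit` and the use of a pseudo-inverse (to guard against a singular within-class matrix) are assumptions of the example.

```python
import numpy as np


def lda_fit(X, y):
    """Eigenvalues and eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue."""
    mu = X.mean(axis=0)                      # step 1: global mean
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu_c = Xc.mean(axis=0)               # step 1: class mean
        centered = Xc - mu_c
        Sw += centered.T @ centered          # step 2: within-class scatter, eq. (1)
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += diff @ diff.T                  # step 2: between-class scatter, eq. (2)
    # Step 3: eigen-decomposition; pinv is an assumption to handle a singular Sw.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvals.real[order], eigvecs.real[:, order]


# Synthetic stand-in data: two classes of 40 observations with 3 parameters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 3)), rng.normal(3.0, 1.0, (40, 3))])
y = np.array([0] * 40 + [1] * 40)

eigvals, eigvecs = lda_fit(X, y)
explained = eigvals / eigvals.sum()          # share of between-class variance per axis
Z = X @ eigvecs[:, :1]                       # steps 4-5: keep one axis and project
```

With two classes, the between-class matrix has rank one, so a single discriminant carries essentially all of the between-class variance.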
Fig. 2. Main steps of the proposed method.

Hierarchical Cluster Analysis. HCA (Rokach and Maimon) is an unsupervised method which enumerates n optimal clustering configurations, n being the number of observations. Each configuration corresponds to a number of clusters, this number varying from n (every observation is in its own cluster) to 1 (all the observations are in the same cluster). The algorithm goes as follows:
(1) Put each observation in its own cluster;
(2) Choose a distance between observations, and a distance between clusters;
(3) Compute the distance matrix between clusters;
(4) Group together the two closest clusters;
(5) Repeat steps 3 and 4 until there is one cluster.

The whole process can be visualized with a dendrogram. In this study, the Euclidean distance is chosen for computing the distance between observations, and the Ward criterion is used for calculating the distance between clusters. It maximizes the distance between two clusters and minimizes the distance within a cluster. To select the ideal number of clusters, the silhouette score is computed for a range of cluster numbers (from 2 to 20 in this study), and the number with the highest score is chosen. This result can be confirmed by observing the dendrogram, which shows the distance between clusters. In order to evaluate the relevance of the obtained clusters, parameters exceeding the quality limit or having a large variance will be identified. Then, cross-validation using a Support Vector Machine (Cortes and Vapnik, 1995) with a linear kernel is performed, and the precision scores are provided. That same classifier can be used to classify new data into one of the identified classes.

4. RESULTS AND DISCUSSION

Since we did not have prior knowledge about water phenomena, a first strategy was to use unsupervised dimension reduction, Principal Component Analysis (PCA), and clustering, HCA. Then we integrated Water Quality Standards (WQS) before clustering.
Finally, a third alternative, which is the subject of the present work, was to integrate them in the dimension reduction phase. In the following, the three methods are respectively called PCA, PCA+WQS and LDA+WQS.

4.1 Extracting relevant features.

100% of the variance between the two classes (compliant and non-compliant) is explained by the first linear discriminant. For visualization purposes, however, three axes are retained. In the case of PCA, nine components are retained to capture 72% of the total variance. The representation of the data on the first three components for LDA and for PCA is shown in Fig. 3. As PCA discriminates between observations, and LDA discriminates between classes, the representation is better in the linear discriminant subspace.

4.2 Clustering for identifying water quality classes.

First, the transformed data are divided into two subsets, compliant and non-compliant, then HCA is applied to each subset. The silhouette score and the dendrogram are used to determine the number of clusters. The compliant category is divided into two subclasses, and the non-compliant category into three subclasses. For example, the dendrogram for the compliant samples is given in Fig. ?? and Fig. 4. The distribution of a few parameters causing the anomaly or having a large variance is represented in Fig. 5. These are desethyl atrazine (ADETD), metolachlor ESA (ESAMTC), alachlor ESA (ESALCL), acetochlor ESA (ESACETC), total pesticides (PESTOT), nitrates (NO3), calcium (CA) and pH. ADETD, ESAMTC, ESALCL and ESACETC are pesticides. The quality limit for the first six parameters is plotted in red, and the lower reference quality for pH is plotted in blue. The number of boxplots corresponds to the number of classes obtained with each method. Some boxplots are empty because the corresponding parameter was not filled in in the original data (the dataset with missing values).
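The selection of the number of clusters by silhouette score can be sketched as follows with scikit-learn. The data are synthetic (three well-separated groups standing in for one reduced sub-dataset), and the candidate range is shortened to 2 to 10 because the example contains only 60 points; both choices are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for one reduced sub-dataset: three well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc, 0.3, size=(20, 2))
    for loc in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])

# Ward linkage on Euclidean distances, one clustering per candidate cluster count.
scores = {}
for k in range(2, 11):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

# Keep the number of clusters with the highest silhouette score.
best_k = max(scores, key=scores.get)
```

In practice the chosen value would still be checked against the dendrogram, as done in the study.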
Fig. 3. Data representation after dimensionality reduction.

Table 1. Cross-validation scores

Method     Number of classes   Precision (%)
LDA+WQS    5                   90
PCA        3                   86
PCA+WQS    6                   72
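Scores of the kind reported in Table 1 could be computed along the following lines with scikit-learn. This is a sketch on synthetic cluster labels; the number of folds (5) and the macro-averaged precision are assumptions, since the fold count and averaging scheme are not stated in the text.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: three clusters of 30 observations, labeled 0, 1 and 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0.0, 4.0, 8.0)])
y = np.repeat([0, 1, 2], 30)

# Linear-kernel SVM evaluated by cross-validated precision (assumed: 5 folds, macro average).
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5, scoring="precision_macro")
mean_precision = scores.mean()

# The same classifier, once fitted, assigns a new observation to one of the classes.
clf.fit(X, y)
predicted = int(clf.predict([[4.1, 3.9]])[0])
```

A high precision indicates that the clusters are well separated in the reduced space, which is how the homogeneity of the clusters is compared across methods.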
Fig. 4. Dendrogram for non-compliant samples.

When water quality standards are not taken into account, the algorithm fails to isolate the group which could be considered as good quality water. When using PCA and clustering only, three classes are obtained but all of them contain both compliant and non-compliant data. Indeed, PCA focuses on parameters with high variance (for example calcium), even if the values are far away from the quality limit. Such variation may be interesting information, but it does not affect the water quality. That is why it is important to consider standards. In total, five clusters are obtained with LDA+WQS.
• Cluster 1 (13 observations) groups compliant data with a low pesticide concentration, and a significantly lower calcium concentration compared to the other clusters.
• Cluster 2 (47 observations) groups compliant data with higher levels of pesticides and nitrates.
• Cluster 3 (36 observations) groups non-compliant data with high levels of pesticides (atrazine, metolachlor ESA, alachlor ESA and acetochlor ESA), nitrates and calcium, and with a low pH value.
• Cluster 4 (36 observations) represents non-compliant data with a high metolachlor ESA concentration. Nitrate values are mostly under the quality limit, but are still high on average.
• Finally, Cluster 5 contains only one observation. It differs from the others because of a low nitrate value and a high concentration of metolachlor. It is also the only observation for which the concentration of desethyl deisopropyl atrazine (not pictured here) exceeds the quality limit. It is considered as an outlier, since a single observation does not make it possible to generalize the assignment of a new observation to this cluster. Therefore, only the first four clusters will be retained.
Cluster 1 may refer to good quality water, and cluster 2 to water that is drinkable but not advised for persons at risk such as infants. Clusters 3 and 4 depict different degrees of pollution by nitrates and pesticides (potentially due to agricultural practices). Cross-validation using a Support Vector Machine classifier shows that the proposed method, LDA+WQS, gives the most homogeneous clusters (Table 1). This highlights the interest of integrating water quality standards during the reduction phase and of using them to guide the clustering.

5. CONCLUSION

In this paper, a method for determining water quality classes has been proposed as a first step for transposing PHM to water quality monitoring, in order to track the evolution of water quality through time. To this end, supervised and unsupervised methods have been combined: the first one (LDA) to create two water quality classes and to reduce dimensionality, the second one (HCA) to obtain intermediate clusters. Finally, four different water quality classes have been identified, together with accurate rules that characterize them. The proposed method does not depend on a specific context regarding predetermined physico-chemical analyses, and is therefore generalizable. Future work consists in (i) integrating environmental data such as climatic or agronomic data, and studying their impact on the determined water quality classes, and (ii) considering the temporal dimension of the studied dataset for prediction.

ACKNOWLEDGEMENTS

This work is co-funded by the region Occitanie and the “Agence de l’eau Adour-Garonne”.
Fig. 5. Boxplot of the most relevant parameters per cluster.

REFERENCES

Arnon, T.A., Ezra, S., and Fishbain, B. (2019). Water characterization and early contamination detection in highly varying stochastic background water, based on machine learning methodology for processing real-time UV-spectrophotometry. Water Research, 155, 333–342. doi:10.1016/j.watres.2019.02.027.
Atamuradov, V., Medjaher, K., Dersin, P., Lamoureux, B., and Zerhouni, N. (2017). Prognostics and health management for maintenance practitioners - review, implementation and tools evaluation. International Journal of Prognostics and Health Management, 8(060), 1–31.
Bartram, J., Ballance, R., Organization, W.H., and Programme, U.N.E. (1996). Water Quality Monitoring. Routledge.
Brown, R.M., McClelland, N.I., Deininger, R.A., and Tozer, R.G. (1970). A water quality index - do we dare? Water Sew. Works, 117, 339–343.
Camejo, J., Pacheco, O., and Guevara, M. (2013). Classifier for drinking water quality in real time. In 2013 International Conference on Computer Applications Technology (ICCAT). IEEE. doi:10.1109/iccat.2013.6521975.
Celestino, A.M., Cruz, D.M., Sánchez, E.O., Reyes, F.G., and Soto, D.V. (2018). Groundwater quality assessment: An improved approach to k-means clustering, principal component analysis and spatial analysis: A case study. Water, 10(4), 437. doi:10.3390/w10040437.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. doi:10.1007/bf00994018.
Das, P., Begam, S., and Singh, M.K. (2017). Mathematical modeling of groundwater contamination with varying velocity field. Journal of Hydrology and Hydromechanics, 65(2). doi:10.1515/johh-2017-0013.
Fisher, R.A. (1938). The statistical utilization of multiple measurements. Annals of Eugenics, 8(4), 376–386. doi:10.1111/j.1469-1809.1938.tb02189.x.
Fu-cheng, L. and Xue-zhao, H. (2013). Application of fuzzy c-means clustering for assessing rural surface water quality in Lianyungang city. In 2013 Fifth International Conference on Measuring Technology and Mechatronics Automation. IEEE. doi:10.1109/icmtma.2013.75.
Gouriveau, R., Medjaher, K., and Zerhouni, N. (2017). Du concept de PHM à la maintenance prédictive 1. ISTE.
Lee, J., Wu, F., Zhao, W., Ghaffari, M., Liao, L., and Siegel, D. (2014). Prognostics and health management design for rotary machinery systems—reviews, methodology and applications. Mechanical Systems and Signal Processing, 42(1-2), 314–334. doi:10.1016/j.ymssp.2013.06.004.
Leigh, C., Alsibai, O., Hyndman, R.J., Kandanaarachchi, S., King, O.C., McGree, J.M., Neelamraju, C., Strauss, J., Talagala, P.D., Turner, R.D., Mengersen, K., and Peterson, E.E. (in press 2019). A framework for automated anomaly detection in high frequency water-quality data from in situ sensors. Science of The Total Environment, 664, 885–898. doi:10.1016/j.scitotenv.2019.02.085.
Liu, Y., Zheng, Y., Liang, Y., Liu, S., and Rosenblum, D.S. (2016). Urban water quality prediction based on multi-task multi-view learning. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2576–2582.
Machiwal, D. and Jha, M.K. (2015). Identifying sources of groundwater contamination in a hard-rock aquifer system using multivariate statistical analyses and GIS-based geostatistical modeling techniques. Journal of Hydrology: Regional Studies, 4, 80–110. doi:10.1016/j.ejrh.2014.11.005.
Nolan, B.T., Fienen, M.N., and Lorenz, D.L. (2015). A statistical learning framework for groundwater nitrate models of the Central Valley, California, USA. Journal of Hydrology, 531, 902–911. doi:10.1016/j.jhydrol.2015.10.025.
Ratolojanahary, R., Houé, R.N., Medjaher, K., Junca-Bourié, J., Dauriac, F., and Sebilo, M. (in press 2019). Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. Expert Systems with Applications. doi:10.1016/j.eswa.2019.04.049.
Rezvanizaniani, S.M., Liu, Z., Chen, Y., and Lee, J. (2014). Review and recent advances in battery health monitoring and prognostics technologies for electric vehicle (EV) safety and mobility. Journal of Power Sources, 256, 110–124. doi:10.1016/j.jpowsour.2014.01.085.
Rokach, L. and Maimon, O. (????). Clustering methods. In Data Mining and Knowledge Discovery Handbook, 321–352. Springer-Verlag.