Support vector machines in water quality management

Kunwar P. Singh a,∗, Nikita Basant b,1, Shikha Gupta a

a Environmental Chemistry Division, CSIR-Indian Institute of Toxicology Research (Council of Scientific & Industrial Research), Post Box 80, Mahatma Gandhi Marg, Lucknow 226001, India
b Laboratory of Chemometrics and Drug Design, School of Pharmacy, Howard University, Washington, DC, USA

Analytica Chimica Acta 703 (2011) 152–162

Article info

Article history: Received 8 May 2011; Received in revised form 11 July 2011; Accepted 16 July 2011; Available online 23 July 2011.

Keywords: Support vector classification; Support vector regression; Kernel discriminant analysis; Kernel partial least squares; Water quality; Biochemical oxygen demand.

Abstract

Support vector classification (SVC) and regression (SVR) models were constructed and applied to surface water quality data in order to optimize the monitoring program. The data set comprised 1500 water samples representing 10 different sites monitored for 15 years. The objectives of the study were to classify the sampling sites (spatial) and months (temporal) so as to group similar ones in terms of water quality, with a view to reducing their number, and to develop a suitable SVR model for predicting the biochemical oxygen demand (BOD) of water using a set of variables. The spatial and temporal SVC models grouped the 10 monitoring sites and the 12 sampling months into three clusters each, with misclassification rates of 12.39% and 17.61% in training, 17.70% and 26.38% in validation, and 14.86% and 31.41% in test sets, respectively. The SVR model predicted water BOD values in the training, validation, and test sets with reasonably high correlation (0.952, 0.909, and 0.907) with the measured values, and low root mean squared errors of 1.53, 1.44, and 1.32, respectively. The values of the performance criteria parameters suggested the adequacy of the constructed models and their good predictive capabilities. The SVC models achieved a data reduction of 92.5% for redesigning the future monitoring program, and the SVR model provided a tool for predicting water BOD from a set of a few measurable variables. The performance of the nonlinear models (SVM, KDA, KPLS) was comparable, and these performed relatively better than the corresponding linear methods (DA, PLS) of classification and regression modeling. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

Surface water bodies in general, and rivers in particular, are among the aquatic systems most vulnerable to contamination owing to their easy accessibility for the disposal of wastes. The hydrochemistry of river systems is largely determined by a number of factors, such as climatic conditions, soil and rock types, and anthropogenic activities in the basin [1,2]. In India, almost all surface water bodies, including the streams and river systems, are grossly polluted, and the regulatory agencies have been making every effort to restore the health and ecology of these aquatic ecosystems. Consequently, a number of monitoring programs have been initiated to generate baseline databases for different lakes and river systems throughout the country, with a view to developing and implementing appropriate pollution prevention, control and restoration strategies [3]. Water quality monitoring programs covering a large number of water bodies at several strategic sites for

∗ Corresponding author. Tel.: +91 522 2476091; fax: +91 522 2628227. E-mail addresses: [email protected], kpsingh [email protected] (K.P. Singh).
1 Current address.
© 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2011.07.027

various characteristic variables are very expensive, time- and manpower-intensive and, hence, difficult to sustain over long periods. Moreover, the determination of the biochemical oxygen demand (BOD) is very tedious and cost-intensive. BOD measures the approximate amount of biodegradable organic matter present in the water and serves as an indicator of the extent of water pollution. The BOD of an aquatic system is the foremost parameter needed for assessment of the water quality as well as for development of management strategies for the protection of water resources [4]. This warrants the need for a foolproof method for its determination. The currently available method for BOD determination is tedious and prone to measurement errors. The method is subject to various complicating factors, such as the oxygen demand resulting from the respiration of algae in the sample and the possible oxidation of ammonia. The presence of toxic substances in the sample may also affect microbial activity, leading to a reduction in the measured BOD value [5]. The laboratory conditions for BOD determination usually differ from those in aquatic systems; therefore, interpretation of BOD results and their implications may be associated with large variations. Moreover, BOD is determined over a period of 5 days at a constant temperature (20 °C) maintained throughout the duration, which is difficult to achieve in developing countries due to frequent power interruptions. Overall, the BOD measurement result is


associated with large uncertainties, thus making its estimate largely unreliable [4,6]. Therefore, there is a need to modify and redefine the water quality monitoring programs with a view to minimizing the number of sampling sites and sampling frequencies under the various networks, while still generating meaningful data, without compromising quality and interpretability [3]. In addition, it is economically justified to develop methods which are capable of estimating a tedious parameter, such as BOD, based on the knowledge of other, already measured variables. Pattern classification methods may help to classify the sites and months on the basis of similarity and dissimilarity in the observed water quality, thus grouping sites and months of identical water quality under a monitoring program. This may provide guidelines for selecting a single representative sampling site and month from each of the clusters. Similarly, regression modeling allows relationships to be developed between the independent and dependent variables, and the difficult variable(s) to be predicted using the simpler ones as input. Several linear (discriminant analysis, partial least squares) and nonlinear (artificial neural networks, kernel discriminant analysis, kernel partial least squares, support vector machines) modeling methods are now available for classification and regression problems [3,4,7–11]. Linear discriminant analysis (DA) and partial least-squares (PLS) regression capture only linear relationships, and artificial neural networks (ANNs) have some problems inherent to their architecture, such as overtraining, overfitting, network optimization, and poor reproducibility of results due to random initialization of the networks and variation of stopping criteria [11]. Kernel-based techniques are becoming more popular because, in contrast to ANNs, they allow interpretation of the calibration models. In kernel-based methods the calibration is carried out in the space of nonlinearly transformed input data, the so-called feature space, without actually carrying out the transformation; the feature space is defined by the kernel function [12]. The kernel versions of the discriminant analysis (KDA) and partial least squares (KPLS) algorithms have been described in the literature [8–10,13]. Support vector machines (SVMs), essentially a kernel-based procedure, are a relatively new machine learning method based on Vapnik–Chervonenkis (VC) theory [14] that has recently emerged as one of the leading techniques for pattern classification and function approximation [15]. SVM can simultaneously minimize estimation errors and model dimensions. It has good generalization ability and is less prone to over-fitting. SVMs adopt kernel functions which make the original inputs linearly separable in the mapped high-dimensional feature space [16]. SVMs have been applied successfully to classification and regression problems in various research fields [17–22]. Here, we have considered a large data set pertaining to the surface water quality of the Gomti river (India), monitored for thirteen different variables each month over a period of fifteen years at ten different sites spread over a distance of about 500 km.
The objectives of this study were to develop the SVM models for (1) classification of the sampling sites with a view to identify similar ones in the monitoring network for reducing their number for the future water quality monitoring; (2) classification of the sampling months into the groups of seasons for reducing the annual sampling frequency; and (3) to predict the BOD of the river water using simple measurable water quality variables. Further, the SVM classification (temporal and spatial) and SVM regression modeling results were compared with those obtained through the linear (DA, PLS) and the corresponding nonlinear kernel-based (KDA, KPLS) modeling approaches. This study has shown that the application of SVM classification and regression methods to the large water quality data set successfully helped in optimization of the water quality monitoring program through minimizing the number of the sampling sites, frequency, and parameters.


2. Materials and methods

2.1. Problem statement

For a given set of water samples (W), repeatedly collected over a period of time (monthly) from a certain number of monitoring sites spread within a geographical area, and characterized for a set of properties (N), the task is to develop a function, using the known set of properties, which could group the sampling sites as well as the sampling months on the basis of similarities and dissimilarities. This would help to reduce the number of sampling sites as well as the sampling frequency for the future monitoring program. Consider a set of water samples W representing the spatial and temporal water quality in terms of the measured water quality variables (N), such that W = {x | x ∈ RN}. Each water sample is represented as an N-dimensional real vector, and every particular coordinate xi represents the value of some measured water quality variable; the dimension N is equal to the number of measured variables used to describe each water sample. Let C = {c1, c2, . . ., cI} be the set of I classes that correspond to some predefined water type, such as spatial or temporal. The function fc: W → C is called a classification if for each xi ∈ W it holds that fc(xi) = cj if xi belongs to the class cj. In real situations, a limited set of labeled water samples (xi, yi), xi ∈ RN, yi ∈ C, i = 1, . . ., m, is available through the monitoring studies, and these form the training set for the classification problem [17]. The machine learning approach, using the labeled cases (water samples) from the training set, aims to find a function f̄c which is a good approximation of the real, unknown function fc. Another relevant task is to estimate the value of an unknown property for a particular water sample using any other set of properties available and known for the same sample. Here we have a training set with m water samples (xi, yi), xi ∈ RN, yi ∈ R, i = 1, . . ., m, where yi is the known real value of the target property one aims to estimate for water samples in W not included in the training set. The function fr: W → R is called a regression if it estimates the value of the target property given the values of the other properties for any sample x ∈ W [17]. Here, the machine learning approach aims to find f̄r using the training cases and a specific learning method. The basic aim of this study is to find the most accurate possible classification f̄c (spatial and temporal) and regression f̄r from the training data pertaining to the surface water quality monitored over a long period of time (1995–2010) at ten different sites, using SVMs.
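As a concrete illustration of the two learning tasks defined above, the following minimal sketch (not part of the original study) fits an approximate classification function f̄c and regression function f̄r with scikit-learn on a synthetic stand-in for the water-quality matrix W; the array shapes, labels and target construction are hypothetical.

```python
# Minimal sketch of Section 2.1's two tasks on synthetic data (not the paper's data set).
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
W = rng.normal(size=(120, 7))            # 120 water samples x 7 measured variables
site_class = rng.integers(0, 3, 120)     # spatial labels c1..c3 (hypothetical)
bod = W @ rng.normal(size=7) + rng.normal(scale=0.5, size=120)  # target property

f_c = SVC(kernel="rbf").fit(W, site_class)   # approximates the classification f_c
f_r = SVR(kernel="rbf").fit(W, bod)          # approximates the regression f_r

x_new = rng.normal(size=(1, 7))              # an unseen sample x in W
print(f_c.predict(x_new), f_r.predict(x_new))
```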

2.2. Brief theory of support vector machines

A detailed description of SVMs may be found elsewhere [14,23,24]; a brief account is given here. SVMs, originally developed for binary classification problems, make use of hyperplanes to define decision boundaries between the data points of different classes [25]. With the introduction of the ε-insensitive loss function, the SVM has been extended to solve regression problems [15]. In the SVM approach, the original data points from the input space are mapped into a high-dimensional, or even infinite-dimensional, feature space using a suitable kernel function, where a maximal separating plane (SP) is constructed. Two parallel hyperplanes are constructed, one on each side of the SP that separates the data; the SP maximizes the distance between the two parallel hyperplanes. As a special property, SVM simultaneously minimizes the empirical classification error and maximizes the geometric margin. It is assumed that the larger the margin, or distance between these parallel hyperplanes, the better the generalization error of the classifier. In the case of SVM regression, the goal is to find the optimal hyperplane from which the distance to all the data points is minimum [15,16,26].
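A toy example of the separating-hyperplane idea: a linear SVC fitted to two synthetic point clouds, from which the weight vector w, the bias b and the margin width 2/||w|| discussed below can be read off. This is only an illustration, not the paper's model.

```python
# Linear SVC on separable synthetic data: recover w, b and the margin 2/||w||.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(+2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("number of support vectors:", clf.n_support_.sum())
```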


2.2.1. SVM classification (SVC)

The SVM method deals with the binary classification model, which assumes that there are just two classes (C = {c1, c2}) and that an object (say, a water sample) belongs to only one of them. For a set of N data vectors {xi, yi}, i = 1, . . ., N, yi ∈ {−1, 1}, xi ∈ Rn, where xi is the ith data vector that belongs to the binary class yi, if the data are linearly separable there exists an SP in the input space, which can be expressed [16] as:

$$f(x) = w^{T}x + b = \sum_{j=1}^{n} w_{j}x_{j} + b = 0 \qquad (1)$$

where w ∈ Rn is a weight vector, b, the "bias" term, is a scalar, and T stands for the transpose operator. The parameters w and b define the location of the SP and are determined during the training process. In SVM, the training data points satisfying the constraints f(xi) = 1 if yi = 1 and f(xi) = −1 if yi = −1 are called support vectors (SVs). The other training data points satisfy the inequalities f(xi) > 1 if yi = 1 and f(xi) < −1 if yi = −1 [27]. Therefore, a complete form of the constraints for all training data can be expressed [16] as:

$$y_{i}f(x) = y_{i}\left(w^{T}x_{i} + b\right) \geq 1, \quad i = 1, 2, \ldots, N \qquad (2)$$

The distance between the planes crossing the points denoting the SVs (wTx + b = −1 and wTx + b = 1) is called the margin, and it can be expressed as 2/||w||. The SP, the plane passing through the middle of these two (wTx + b = 0), provides the largest margin value. The optimum SP may be achieved through maximization of the margin and minimization of the noise by introducing the slack variables ξi, as [16,27]:

$$\min \; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}\xi_{i} \qquad (3)$$

$$\text{subject to } y_{i}\left(w^{T}x_{i} + b\right) \geq 1 - \xi_{i}, \; \xi_{i} \geq 0, \quad i = 1, 2, \ldots, N$$

where C is a positive constant and ξi represents the distance between a data point xi lying on the false class side and the margin of its virtual class. The optimization problem can be solved by introducing the Lagrange multipliers αi and βi as [28]:

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}\xi_{i} - \sum_{i=1}^{N}\alpha_{i}\left(y_{i}(w^{T}x_{i} + b) - 1 + \xi_{i}\right) - \sum_{i=1}^{N}\beta_{i}\xi_{i} \qquad (4)$$

where α = (α1, . . ., αN)T, β = (β1, . . ., βN)T, and ξ = (ξ1, . . ., ξN)T. An optimal solution of Eq. (4) may be achieved by setting the derivatives of the Lagrange function with respect to w, b and ξ to zero, which yields the dual problem:

$$L(\alpha) = \sum_{i=1}^{N}\alpha_{i} - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j} \qquad (5)$$

under the constraints:

$$\sum_{i=1}^{N} y_{i}\alpha_{i} = 0, \qquad C \geq \alpha_{i} \geq 0, \quad i = 1, \ldots, N \qquad (6)$$

The solution yields the coefficients αi which are required to express w. Now,

$$\alpha_{i}\left(y_{i}(w^{T}x_{i} + b) - 1\right) = 0, \quad i = 1, \ldots, N \qquad (7)$$

Thus,

$$w = \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}, \qquad b = \frac{1}{N_{sv}}\sum_{j=1}^{N_{sv}}\left(y_{j} - w^{T}x_{j}\right) \qquad (8)$$

where Nsv is the number of SVs. From w and b, the linear decision function can be expressed as:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}\left(x_{i}^{T}x\right) + b\right) \qquad (9)$$

Here, if f(x) is positive, the new input data point x belongs to class 1 (yi = 1), and if f(x) is negative, x belongs to class 2 (yi = −1). However, in the case of nonlinear separation the original input data are projected into a high-dimensional feature space, ϕ: Rn → Rd, n ≪ d, i.e. x → ϕ(x), in which the input data can be linearly separated. In such a space, the dot product in Eq. (9) is transformed into ϕ(xi)·ϕ(x), and the nonlinear function can be expressed as:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}\left(\varphi^{T}(x_{i})\cdot\varphi(x)\right) + b\right) \qquad (10)$$

A number of kernel functions are available for which it holds that K(xi, xj) = ϕT(xi)·ϕ(xj); the kernel returns the dot product of the feature-space mappings of the original inputs. The nonlinear decision function can then be written as:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}K(x_{i}, x) + b\right) \qquad (11)$$

The selection of the kernel function depends on the distribution of the data; in practice the kernel function is generally selected through a "trial and error" test [29]. There are four common choices of kernel function: linear, polynomial, sigmoid, and radial basis function (RBF). The RBF is among the most commonly employed kernel functions and has the form [30]:

$$K(x, x_{i}) = \exp\left(-\gamma\|x - x_{i}\|^{2}\right) \qquad (12)$$

where the parameter γ controls the smoothness of the decision boundary in the feature space. Here, we applied the SVM classification model with the RBF kernel to the surface water quality data with a view to achieving the spatial and temporal classifications.

2.2.2. SVM regression (SVR)

For a given regression problem, the goal of SVM is to find the optimal hyperplane from which the distance to all the data points is minimum. Consider a training data set (xi, yi), xi ∈ Rn, yi ∈ R, i = 1, . . ., m, where yi denotes the target property of an already known ith case. During the learning phase, the aim is to find the linear function f(x) = w·x + b, w, x ∈ Rn, b ∈ R, for which the difference between the actual measured value yi and the estimated value f(xi) is at most equal to ε, i.e. |yi − f(xi)| ≤ ε, where ε defines the ε-insensitive loss function [15,31]. If the error of estimation is taken into account by introducing the slack variables ξ and ξ*, as well as the penalty parameter C, the corresponding problem can be expressed [32,33] as:

$$\min \; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}\left(\xi_{i} + \xi_{i}^{*}\right) \qquad (13)$$

$$\text{subject to } y_{i} - w^{T}x_{i} - b \leq \varepsilon + \xi_{i}, \quad w^{T}x_{i} + b - y_{i} \leq \varepsilon + \xi_{i}^{*}, \quad \xi_{i}, \xi_{i}^{*} \geq 0$$

Transforming this quadratic programming problem into its corresponding dual optimization problem and introducing the kernel function in order to achieve the nonlinearity yields the optimal regression function [31,32] as:

$$f(x) = \sum_{i=1}^{N}\left(\alpha_{i} - \alpha_{i}^{*}\right)K(x_{i}, x) + b \qquad (14)$$

where C ≥ αi, αi* ≥ 0, i = 1, . . ., N. Here αi and αi* (with 0 ≤ αi, αi* ≤ C) are the Lagrange multipliers and K(xi, x) represents the kernel function. In Eq. (14), the data points with nonzero αi and αi* values are the SVs.
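The RBF kernel of Eq. (12) can be checked numerically; the following sketch computes the kernel matrix explicitly and compares it with scikit-learn's implementation (the value of γ is arbitrary here).

```python
# Numerical check of Eq. (12): K(x, xi) = exp(-gamma * ||x - xi||^2).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 7))            # five samples, seven variables
gamma = 0.5

# explicit pairwise computation
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
K_manual = np.exp(-gamma * d2)

K_sklearn = rbf_kernel(X, X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))   # True
```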

The performance of SVMs for classification and regression depends on the combination of several factors, such as the kernel function type and its corresponding parameters, the capacity parameter C, and the ε-insensitive loss function [15]. For the RBF kernel, the most important parameter is the width γ of the RBF, which controls the amplitude of the kernel function and, hence, the generalization ability of the SVM [34]. C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, insufficient stress will be placed on fitting the training data; if C is too large, the algorithm will overfit the training data [35]. Here, we applied the SVM regression model with the RBF kernel function to the surface water quality data with a view to predicting the target variable, the BOD of the water samples, using a set of independent water quality variables.

2.3. Data set

The data set used here pertains to the surface water quality monitored each month at eight different sites during the first ten years (1995–2005) and at ten different sites during the last five years (2005–2010). The first three sites (S1–S3) are located in the area of relatively low pollution (LP), upstream of Lucknow city. The next three sites (S4–S6) are in the region of gross pollution (GP), as 26 wastewater drains and two highly polluted tributaries empty into the river in this stretch. The last four sites (S7–S10) are in the region of moderate pollution (MP), as the river recovers considerably over this course [4]. The water quality parameters measured include pH, total alkalinity (T-Alk, mg L−1), total hardness (T-Hard, mg L−1), total solids (TS, mg L−1), ammoniacal nitrogen (NH4–N, mg L−1), nitrate nitrogen (NO3–N, mg L−1), chloride (Cl, mg L−1), phosphate (PO4, mg L−1), potassium (K, mg L−1), sodium (Na, mg L−1), dissolved oxygen (DO, mg L−1), chemical oxygen demand (COD, mg L−1), and biochemical oxygen demand (BOD, mg L−1). Details of the analytical procedures are available elsewhere [3]. The water quality data set has the dimensions of 1500 samples × 13 variables.

2.4. Data processing

Data pertaining to surface water quality monitored over a long duration at sampling sites spread over a large geographical area may be contaminated by human and measurement errors. Such erroneous data may behave as outliers. An outlier is a data point that is numerically distant from the majority of the data [16,36]. Using such a contaminated data set directly for pattern classification and predictive modeling will lead to unreliable results; hence, identifying outliers and conducting data cleaning become important [16]. Moreover, some of the features in the original data set may have insignificant or no relevance to the response variable, rendering them useless in pattern classification and predictive modeling; hence, an initial feature selection is also necessary [26,36]. Therefore, removal of outliers and initial feature selection were implemented here.


The data were partitioned into three subsets, training, validation, and test, using the Kennard–Stone (K–S) approach. The K–S algorithm designs the model set in such a way that the objects are scattered uniformly around the training domain; thus, all sources of the data variance are included in the training model [37,38]. In the present study, the complete data set (1500 samples × 13 variables) was partitioned into a training set (900 samples × 13 variables) and validation and test sets of 300 samples × 13 variables each. The training, validation and test sets thus comprised 60%, 20% and 20% of the samples, respectively.
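A compact sketch of Kennard–Stone selection as described above is given below. It assumes plain Euclidean distances on the measured variables, which the text does not specify, and is meant only to illustrate the idea of picking samples that cover the training domain uniformly.

```python
# Kennard-Stone selection sketch: start from the two mutually most distant samples,
# then repeatedly add the sample farthest from the already selected set.
import numpy as np

def kennard_stone(X, n_train):
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [i, j]
    while len(selected) < n_train:
        remaining = [k for k in range(len(X)) if k not in selected]
        # distance of each remaining sample to its nearest selected sample
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(d_min))])
    return np.array(selected)

X = np.random.default_rng(3).normal(size=(50, 7))
train_idx = kennard_stone(X, 30)        # e.g. 60% of the samples for training
print(train_idx[:10])
```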

2.4.1. Data cleaning

Outliers in the training or validation subset affect both the classification and the regression modeling results. Outliers in the training set may cause the SP to be falsely determined, which leads to misclassification of validation data, while outliers in the validation set are themselves misclassified, as they are located on the opposite class side [16]. Similarly, in the case of regression, outliers may cause higher prediction errors. Since outliers are located far away from the normal data, they will rarely be classified or predicted correctly. Therefore, outliers may be detected by implementing the classification or regression process several times with changed training and validation sets and recording the misclassified or mispredicted cases for every single run; data misclassified or mispredicted most of the time may be taken as candidate outliers [16]. Here, we repeated the classification and regression processes several times with changed training and validation data sets and identified the samples that were misrepresented most of the time. The final outliers were determined based on the impact of removing the candidate outliers on the misclassification rate (MR) and mean squared error (MSE) values: data points whose removal resulted in a reduction of the MR and MSE values were treated as final outliers and were permanently removed from the data set before performing the classification and regression modeling. A total of 53 samples were identified as outliers and permanently removed from the final data set.
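One way to implement the repeated-split screening described above is sketched below. This is an approximation of the procedure, not the authors' code; the number of runs and the flagging threshold are arbitrary choices.

```python
# Candidate-outlier screening: flag samples misclassified in most repeated splits.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit

def candidate_outliers(X, y, n_runs=20, flag_fraction=0.8):
    miscounts = np.zeros(len(y))
    appearances = np.zeros(len(y))
    splitter = StratifiedShuffleSplit(n_splits=n_runs, test_size=0.25, random_state=0)
    for train_idx, val_idx in splitter.split(X, y):
        model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        wrong = val_idx[model.predict(X[val_idx]) != y[val_idx]]
        miscounts[wrong] += 1
        appearances[val_idx] += 1
    rate = np.divide(miscounts, appearances, out=np.zeros_like(miscounts),
                     where=appearances > 0)
    return np.where(rate >= flag_fraction)[0]   # candidates for removal

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 7)); y = rng.integers(0, 3, 200)
print(candidate_outliers(X, y))
```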

2.4.2. Initial feature selection

Removal of irrelevant features from the data set is required prior to classification or regression modeling; the presence of such variables would lead to misfit of the models and a loss of predictive ability [36]. Here, the initial feature selection was performed using the multiple linear regression (MLR) method. Variables exhibiting a significant relationship with the target variable were retained while the others were dropped. The insignificant variables were dropped one by one, and the misclassification rate (MR) and the prediction error in regression were recorded. Finally, pH, T-Alk, Cl, PO4, COD, DO and BOD were retained for modeling of the water quality data set (Table 1).
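A possible reading of the MLR-based screening is sketched below, using ordinary least squares p-values to drop non-significant predictors; the column names, data and significance level are illustrative assumptions, not the paper's exact procedure.

```python
# MLR significance screening sketch on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
cols = ["pH", "T-Alk", "Cl", "PO4", "COD", "DO", "NO3-N"]
X = pd.DataFrame(rng.normal(size=(300, len(cols))), columns=cols)
bod = 0.6 * X["COD"] - 0.4 * X["DO"] + rng.normal(scale=0.3, size=300)

ols = sm.OLS(bod, sm.add_constant(X)).fit()
pvals = ols.pvalues.drop("const")
keep = pvals[pvals < 0.05].index.tolist()   # retain significant predictors only
print("retained variables:", keep)
```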

Table 1 Basic statistics of the selected measured water quality variables in surface water, India (n = 1500).

Variable  Unit     Min    Max     Median  Mean    SD(a)  CV(b)
pH        –        6.02   9.03    8.27    8.23    0.34   4.14
T-Alk     mg L−1   42.67  366.67  216.7   204.55  53.91  26.36
Cl        mg L−1   0.21   53.94   8.33    9.75    6.34   65.01
PO4       mg L−1   0.00   9.90    0.26    0.51    0.81   159.21
COD       mg L−1   1.20   57.39   14.4    16.76   8.72   52.06
BOD       mg L−1   0.12   31.67   4.17    6.11    4.71   77.09
DO        mg L−1   0.00   9.97    6.37    5.72    2.52   44.12

a Standard deviation. b Coefficient of variation.


Table 2a Classification matrix for spatial classification of surface water by DA, KDA and SVC models.

Training set (actual class sizes: LP 291, GP 304, MP 268; total 863)
Model  Actual  Pred LP  Pred GP  Pred MP  Correct  Mis-classified (%)  Sensitivity (%)  Specificity (%)  Accuracy (%)
DA     LP      246      5        40       246      15.46               73.65            91.49            84.59
DA     GP      15       261      28       261      14.14               96.67            92.75            93.97
DA     MP      73       4        191      191      28.73               73.75            87.25            83.20
DA     overall mean MR: 19.11%
KDA    LP      265      2        25       265      9.28                91.40            88.81            89.68
KDA    GP      16       271      17       271      10.86               89.14            98.75            95.37
KDA    MP      48       5        215      215      19.78               80.22            92.95            89.00
KDA    overall mean MR: 13.30%
SVC    LP      252      5        34       252      13.40               85.13            93.12            90.38
SVC    GP      5        287      12       287      5.59                94.40            96.95            96.06
SVC    MP      39       12       217      217      19.03               82.50            91.50            88.76
SVC    overall mean MR: 12.39%

Validation set (actual class sizes: LP 97, GP 101, MP 90; total 288)
DA     LP      84       1        12       84       13.40               62.22            91.50            77.78
DA     GP      16       81       4        81       19.80               96.43            90.20            92.01
DA     MP      35       2        53       53       41.11               76.81            83.11            81.60
DA     overall mean MR: 24.30%
KDA    LP      84       3        13       84       14.43               84.00            84.74            84.48
KDA    GP      8        83       10       83       17.82               82.18            97.88            92.41
KDA    MP      21       2        67       67       25.56               74.44            88.56            84.19
KDA    overall mean MR: 19.27%
SVC    LP      88       1        8        88       9.27                73.94            94.67            86.11
SVC    GP      7        87       7        87       13.86               94.56            92.85            93.40
SVC    MP      24       4        62       62       31.11               80.51            86.72            85.06
SVC    overall mean MR: 17.70%

Test set (actual class sizes: LP 134, GP 67, MP 95; total 296)
DA     LP      123      0        11       123      8.20                86.27            92.31            86.15
DA     GP      6        54       7        54       19.40               96.43            94.58            94.93
DA     MP      24       2        69       69       27.36               79.31            87.56            85.14
DA     overall mean MR: 16.89%
KDA    LP      116      4        14       116      13.43               86.57            88.27            87.50
KDA    GP      6        58       3        58       13.43               86.67            97.38            94.93
KDA    MP      13       2        80       80       15.79               84.21            91.54            89.19
KDA    overall mean MR: 14.22%
SVC    LP      119      0        15       119      11.19               83.21            90.19            86.82
SVC    GP      4        59       4        59       11.94               98.33            96.61            96.95
SVC    MP      20       1        74       74       22.10               79.56            89.65            86.48
SVC    overall mean MR: 14.86%

2.5. Modeling and performance criteria

The main aim of this study was to build SVM models for the classification and regression problems pertaining to the surface water quality, with a view to achieving a reduction in the number of sampling sites and the sampling frequency and thus minimizing the monitoring effort, as well as to develop a tool for predicting water BOD levels using simple, directly measurable water quality parameters as input.

2.5.1. Optimization of SVM parameters

Similar to other multivariate calibration methods, the generalization performance of SVM models depends on a proper setting of several parameters. These include the capacity parameter C, the ε-insensitive loss function, and the kernel-dependent parameter in the SVM classification and regression models [39]. The RBF is the most commonly used kernel in SVM, and the RBF width parameter (γ) reflects the distribution/range of the x-values of the training data [32]. The parameter C determines the trade-off between the smoothness of the regression function and the amount up to which deviations larger than ε are tolerated. Since the highest αi and αi* values are by definition equal to C, the robustness of the regression model depends on the choice of the latter; therefore, the choice of the C value influences the significance of the individual data points in the training set [40]. Hence, a proper choice of C in combination with ε might result in a well performing and robust regression

model, which is also insensitive to the presence of possible outliers [39]. Here, the optimum value of C was determined through a grid search over the space 0.01–50,000 with a step size of 10−1. The parameter ε regulates the radius of the ε-tube around the regression function and, thus, the number of SVs that will finally be selected to construct the regression function (leading to a sparse solution). A too-large value of ε results in fewer SVs (more data points fit within the ε-tube) and, consequently, the resulting regression model may yield large prediction errors on unseen future data. Since ε is related to the amplitude of the noise present in the training set, and the exact contribution of the noise to the real information in a data set is usually unknown, ε was optimized in the range of 0.001–0.2 [39]. A good combination of the two parameters (C and ε) also prevents overtraining; to achieve this, an internal cross-validation was performed during construction of the SVR models. The kernel function is used to map the input data into a high-dimensional feature space, where linear regression becomes possible [41]. The mapping depends on the intrinsic structure of the data, implying that the kernel type and parameters need to be optimized to approximate the ideal mapping [23,24]. In this work, the RBF kernel was used. Unlike the linear kernel, the RBF kernel can handle cases in which the relation between the class labels and the attributes is nonlinear; besides, the linear kernel is a special case of the RBF [42]. The RBF kernel has fewer tuning parameters than the polynomial and sigmoid kernels [43], and it tends to give good

performance under general smoothness assumptions [44]. The γ value is important in the RBF model and can lead to under- or over-fitting in prediction. A very large value of γ may lead to over-fitting, with all the support vector distances taken into account, while with a very small γ the machine will ignore most of the support vectors, leading to failure of the trained point prediction (under-fitting) [45]. Here, the optimum value of the RBF kernel parameter (γ) was determined through a grid search over the space 0.001–20. Since the model parameters interact, they need to be optimized simultaneously. Here the parameters were optimized using grid and pattern searches over a wide space employing v-fold cross-validation [46]. The grid search takes samples from the space of the independent variables; for each sample, the model prediction is computed and compared with the best value found in the previous iterations, and if the newly found value is better than the previous one, the new result is stored. This process is repeated until the end of the iterations is reached. The grid search technique is an unguided algorithm based on brute computing power; hence, it can be computationally more expensive than other optimization techniques. The accuracy of grid search optimization depends on the parameter range in combination with the chosen interval size; a higher accuracy in the optimal solution may be achieved by increasing the parameter range and decreasing the step size [35]. Values of the parameters first optimized over large ranges were repeatedly re-optimized over closer ranges with smaller step sizes to achieve finely tuned values. In v-fold cross-validation, the data in the training set are divided into v subsets of equal size; each subset is then tested in turn using the model trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once. The cross-validation procedure prevents the over-fitting problem [46].

2.6. Linear and kernel discriminant analysis

Linear DA finds a linear projection of the data onto a one-dimensional subspace such that the classes are well separated according to a certain measure of separability. Since DA constructs a linear function, it is beyond its capabilities to separate data with nonlinear structures. Such a problem of nonlinearities in the data may be overcome by use of the kernel trick in KDA, which maps the input data into the kernel feature space, yielding a nonlinear discriminant function in the input space. The procedural details of the DA and KDA techniques are available elsewhere [4,7,9,10]. Here, DA and KDA were performed for the spatial and temporal classifications using the same training, validation and test sets of the water quality data as employed for the SVC modeling.

2.7. Linear and kernel partial least squares regression

PLS, a linear multivariate method for relating the process variables (X) to the responses (Y), can analyze strongly collinear and noisy data. It maximizes the covariance between X and Y. PLS reduces the dimension of the predictor variables by extracting latent variables (LVs). In PLS, the scaled X and Y matrices are decomposed into their corresponding scores and loadings vectors. In an inner relation, the scores vector of the predictor is linearly regressed against the scores vector of the dependent variables.
However, it is crucial to determine the optimal number of the LVs, and cross-validation is a reliable approach to test the predictive significance of the selected model [13,47]. The KPLS is a nonlinear extension of linear PLS in which training samples are transformed into a feature space via a nonlinear mapping through the kernel trick, and the PLS algorithm is then implemented in the feature space. Nonlinear data structure in the original space is most likely to be linear after high-dimensional nonlinear mapping. Therefore, KPLS can efficiently compute LVs


in the feature space by means of integral operators and a nonlinear kernel function. The procedural details of the PLS and KPLS regression methods are available elsewhere [9,13,37,48].

2.8. Model performance criteria

The performance of the classification models (DA, KDA, and SVC) was assessed in terms of the misclassification rate (MR) and the sensitivity, specificity and accuracy of prediction, computed as below [22]:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (15)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (16)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (17)$$

where TP denotes the number of true positives, FP the number of false positives, FN the number of false negatives, and TN the number of true negatives. The performance of the regression (PLS, KPLS and SVR) models was evaluated in terms of the mean square error (MSE), root mean square error (RMSE), bias, correlation coefficient (R), accuracy factor (Af), and the Nash–Sutcliffe coefficient of efficiency (Ef), computed for all three (training, validation and test) sets. The mean square error (MSE), used as the target error goal, is defined as [49]:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_{pred,i} - y_{meas,i}\right)^{2} \qquad (18)$$

where ypred,i and ymeas,i represent the predicted and measured values of the ith variable, and N represents the number of observations.
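For a single class treated as positive against the rest, Eqs. (15)–(17) can be evaluated directly from a 2 × 2 confusion matrix, as in the following toy illustration (the counts are invented, not taken from the study).

```python
# Worked illustration of Eqs. (15)-(17) from a binary confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                      # Eq. (15)
specificity = tn / (tn + fp)                      # Eq. (16)
accuracy = (tp + tn) / (tp + tn + fp + fn)        # Eq. (17)
print(sensitivity, specificity, accuracy)
```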

Table 2b Classification error (%) in spatial classification of surface water by DA, KDA and SVC models.

Training set (LP 291, GP 304, MP 268 cases)
Model  Class  Accuracy error  Sensitivity error  Specificity error
DA     LP     15.41           26.35              8.51
DA     GP     6.03            3.33               7.25
DA     MP     16.8            26.25              12.75
KDA    LP     10.32           8.60               11.19
KDA    GP     4.63            10.86              1.25
KDA    MP     11.0            19.78              7.05
SVC    LP     9.62            14.87              6.88
SVC    GP     3.94            5.60               3.05
SVC    MP     11.24           17.50              8.50

Validation set (LP 97, GP 101, MP 90 cases)
DA     LP     22.22           37.78              8.50
DA     GP     7.99            3.57               9.80
DA     MP     18.4            23.19              16.89
KDA    LP     15.52           16.00              15.26
KDA    GP     7.59            17.82              2.12
KDA    MP     15.81           25.56              11.44
SVC    LP     13.89           26.06              5.33
SVC    GP     6.60            5.44               7.15
SVC    MP     14.94           19.49              13.28

Test set (LP 134, GP 67, MP 95 cases)
DA     LP     13.85           13.73              7.69
DA     GP     5.07            3.57               5.42
DA     MP     14.86           20.69              12.44
KDA    LP     12.50           13.43              11.73
KDA    GP     5.07            13.43              2.62
KDA    MP     10.81           15.79              8.46
SVC    LP     13.18           16.79              9.81
SVC    GP     3.05            1.67               3.39
SVC    MP     13.52           20.44              10.35


Table 3a Classification matrix for temporal classification of surface water by DA, KDA and SVC models.

Training set (actual class sizes: Summer 275, Monsoon 287, Winter 301; total 863)
Model  Actual   Pred Summer  Pred Monsoon  Pred Winter  Correct  Mis-classified (%)  Sensitivity (%)  Specificity (%)  Accuracy (%)
DA     Summer   176          79            20           176      36.00               53.66            81.50            70.92
DA     Monsoon  118          146           23           146      49.12               61.34            77.44            73.00
DA     Winter   34           13            254          254      15.61               85.52            91.70            89.57
DA     overall mean MR: 33.25%
KDA    Summer   215          47            13           215      21.82               78.18            80.94            79.95
KDA    Monsoon  73           199           15           199      30.66               69.34            88.54            82.16
KDA    Winter   20           19            262          262      12.96               87.04            95.03            92.25
KDA    overall mean MR: 21.81%
SVC    Summer   233          29            13           233      15.27               73.50            92.30            85.39
SVC    Monsoon  75           194           18           194      32.40               83.98            85.28            84.93
SVC    Winter   9            8             284          284      5.64                90.15            96.95            94.50
SVC    overall mean MR: 17.61%

Validation set (actual class sizes: Summer 92, Monsoon 96, Winter 100; total 288)
DA     Summer   47           37            8            47       48.91               46.08            75.81            65.28
DA     Monsoon  39           50            7            50       47.91               54.95            76.65            69.79
DA     Winter   16           4             80           80       20.00               84.21            89.64            87.85
DA     overall mean MR: 38.54%
KDA    Summer   68           21            3            68       26.09               73.91            84.26            80.97
KDA    Monsoon  26           63            7            63       34.38               65.63            83.42            75.51
KDA    Winter   5            11            85           85       16.0                84.16            94.68            91.00
KDA    overall mean MR: 25.49%
SVC    Summer   69           16            7            69       25.00               63.88            87.22            78.47
SVC    Monsoon  32           58            6            58       39.58               70.73            81.55            78.47
SVC    Winter   7            8             85           85       15.00               86.73            92.10            90.27
SVC    overall mean MR: 26.38%

Test set (actual class sizes: Summer 121, Monsoon 85, Winter 90; total 296)
DA     Summer   82           31            8            82       32.23               57.75            74.68            66.55
DA     Monsoon  50           27            8            27       68.23               42.19            75.00            67.91
DA     Winter   10           6             74           74       17.77               82.22            92.23            89.19
DA     overall mean MR: 38.17%
KDA    Summer   87           28            6            87       28.10               71.90            85.14            79.73
KDA    Monsoon  21           58            6            58       31.76               68.24            84.83            80.07
KDA    Winter   5            4             81           81       11.11               90.00            94.17            92.91
KDA    overall mean MR: 23.66%
SVC    Summer   92           18            11           92       23.96               63.44            80.79            72.29
SVC    Monsoon  48           31            6            31       63.52               57.40            77.68            73.98
SVC    Winter   5            5             80           80       11.11               82.47            94.97            90.87
SVC    overall mean MR: 31.41%

The RMSE represents the error associated with the model and was computed as:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{N}\left(y_{pred,i} - y_{meas,i}\right)^{2}}{N}} \qquad (19)$$

where ypred,i and ymeas,i represent the model-computed and measured values of the variable, and N represents the number of observations. The RMSE, a measure of the goodness-of-fit, best describes an average measure of the error in predicting the dependent variable; however, it does not provide any information on phase differences. The bias, or average value of the residuals (non-explained difference) between the measured and predicted values of the dependent variable, represents the mean of all the individual errors and indicates whether the model overestimates or underestimates the dependent variable. It is calculated as:

$$Bias = \frac{1}{N}\sum_{i=1}^{N}\left(y_{pred,i} - y_{meas,i}\right) \qquad (20)$$

The correlation coefficient (R) represents the percentage of variability that can be explained by the model and is calculated as:

$$R = \frac{\sum_{i=1}^{N} y_{meas,i}\,y_{pred,i} - \frac{1}{N}\sum_{i=1}^{N} y_{meas,i}\sum_{i=1}^{N} y_{pred,i}}{\sqrt{\left(\sum_{i=1}^{N} y_{meas,i}^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} y_{meas,i}\right)^{2}\right)\left(\sum_{i=1}^{N} y_{pred,i}^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} y_{pred,i}\right)^{2}\right)}} \qquad (21)$$

The accuracy factor (Af), a simple multiplicative factor indicating the spread of results about the prediction, is computed as [50]:

$$A_{f} = 10^{\frac{1}{N}\sum_{i=1}^{N}\left|\log\left(y_{pred,i}/y_{meas,i}\right)\right|} \qquad (22)$$

The larger the value of Af, the less accurate is the average estimate; a value of one indicates perfect agreement between all the predicted and the measured values. Each performance criterion described above conveys specific information regarding the predictive performance/efficiency of a specific model. The goodness-of-fit of the selected models was also checked through analysis of the residuals. The Nash–Sutcliffe coefficient of efficiency (Ef), an indicator of the model fit, is computed as [51]:

$$E_{f} = 1 - \frac{\sum_{i=1}^{N}\left(y_{pred,i} - y_{meas,i}\right)^{2}}{\sum_{i=1}^{N}\left(y_{meas,i} - \bar{y}_{meas}\right)^{2}} \qquad (23)$$

where ȳmeas is the mean of the measured values. The Ef is a normalized measure (−∞ to 1) that compares the mean square error generated by a particular model simulation to the variance of the target output sequence. An Ef value of 1 indicates perfect model performance (the model perfectly simulates the target output), an Ef value of zero indicates that the model is, on average, performing only as well as the use of the mean target value as prediction, and an Ef value < 0 indicates an altogether questionable choice of model [52].

Table 3b Classification error (%) in temporal classification of surface water by DA, KDA and SVC models.

Training set (Summer 275, Monsoon 287, Winter 301 cases)
Model  Class    Accuracy error  Sensitivity error  Specificity error
DA     Summer   29.08           46.34              18.50
DA     Monsoon  27.00           38.66              22.56
DA     Winter   10.43           14.48              8.30
KDA    Summer   20.05           21.82              9.06
KDA    Monsoon  17.84           30.66              11.46
KDA    Winter   7.75            12.96              4.97
SVC    Summer   14.61           26.50              7.70
SVC    Monsoon  15.07           16.02              14.72
SVC    Winter   5.50            9.85               3.05

Validation set (Summer 92, Monsoon 96, Winter 100 cases)
DA     Summer   34.72           53.92              24.19
DA     Monsoon  30.21           45.05              23.35
DA     Winter   12.15           15.79              10.36
KDA    Summer   19.03           26.09              15.74
KDA    Monsoon  24.49           34.37              16.58
KDA    Winter   9.00            15.84              5.32
SVC    Summer   21.53           36.12              12.78
SVC    Monsoon  21.53           29.27              18.45
SVC    Winter   9.73            13.27              7.90

Test set (Summer 121, Monsoon 85, Winter 90 cases)
DA     Summer   33.45           42.25              25.32
DA     Monsoon  32.09           57.81              25.00
DA     Winter   10.81           17.78              7.77
KDA    Summer   20.27           28.10              14.86
KDA    Monsoon  19.93           31.76              15.17
KDA    Winter   7.09            10.00              5.83
SVC    Summer   20.71           36.56              19.21
SVC    Monsoon  26.02           42.60              22.32
SVC    Winter   9.13            17.53              5.03
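The regression criteria of Eqs. (18)–(23) can be collected into one small helper, as sketched below; Eq. (21) is evaluated through the equivalent Pearson correlation, and Eq. (22) assumes strictly positive measured and predicted values so that the logarithm is defined.

```python
# Sketch of the regression performance criteria of Section 2.8 (Eqs. 18-23).
import numpy as np

def performance(y_meas, y_pred):
    y_meas, y_pred = np.asarray(y_meas, float), np.asarray(y_pred, float)
    mse = np.mean((y_pred - y_meas) ** 2)                                   # Eq. (18)
    rmse = np.sqrt(mse)                                                     # Eq. (19)
    bias = np.mean(y_pred - y_meas)                                         # Eq. (20)
    r = np.corrcoef(y_meas, y_pred)[0, 1]                                   # Eq. (21), Pearson form
    af = 10 ** np.mean(np.abs(np.log10(y_pred / y_meas)))                   # Eq. (22)
    ef = 1 - np.sum((y_pred - y_meas) ** 2) / np.sum((y_meas - y_meas.mean()) ** 2)  # Eq. (23)
    return dict(MSE=mse, RMSE=rmse, Bias=bias, R=r, Af=af, Ef=ef)

y_meas = np.array([4.2, 6.1, 3.3, 8.0, 5.5])   # toy BOD values
y_pred = np.array([4.0, 6.4, 3.7, 7.6, 5.2])
print(performance(y_meas, y_pred))
```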

159

3. Results and discussion

3.1. SVM classification

The SVC approach was used for the spatial and temporal classification of the surface water quality data. The complete water quality data set was divided into three subsets (training, validation, and test). Both the spatial and temporal SVCs were performed on data comprising seven variables (pH, T-Alk, Cl, PO4, COD, DO, BOD). Among the linear, polynomial, sigmoid, and RBF kernel functions, the latter was finally selected for the SVC model, as it yielded the lowest MSE. In the spatial classification, SVC was used for differentiation between the three classes of sampling sites, viz. low-polluted (LP), moderately polluted (MP) and grossly polluted (GP), in the water quality monitoring network. The site was the category (dependent) variable, while the other seven water quality variables constituted the independent set of variables. A two-step grid search method [53] with a 25% random validation was used to derive the SVC model parameters: first a coarse grid search was used to determine the best region of these three-dimensional grids, and then a finer grid search was conducted to find the optimal parameters. The optimal values of the spatial SVC model parameters C, ε and the kernel-dependent parameter (γ) were found to be 12.88, 0.001, and 7.74, respectively, and the number of SVs was 551. The spatial classification matrices (CMs) for the training, validation and test sets are presented in Table 2a. The spatial SVC rendered mean MRs of 12.39, 17.70, and 14.86% in the training, validation and test sets. Further, the results showed that the specificity of the classification was mostly higher than the sensitivity. The classification error ranged between 14% and 3% (Table 2b).

Similarly, the temporal SVC was used for differentiation among the three distinct groups of months (summer, monsoon and winter seasons). The season was the category (dependent) variable, while the other seven water quality variables constituted the independent set of variables. The optimal values of the temporal SVC model parameters C, ε and the kernel-dependent parameter (γ) were found to be 3.29, 0.001, and 4.17, respectively, and the number of SVs was 362. The temporal classification matrices (CMs) for the training, validation and test sets are presented in Table 3a. The temporal SVC rendered mean MRs of 17.61, 26.38, and 31.41% in the training, validation and test sets. The results (Table 3a) showed that the specificity of the classification was always higher than the sensitivity. The classification error ranged between 23% and 5% (Table 3b).

Fig. 1 shows the importance of the variables in the spatial (Fig. 1a) and temporal (Fig. 1b) classifications using the SVC models. It may be noted that T-Alk followed by DO were the most important variables in the seasonal classification, whereas DO followed by chloride and BOD were the most important variables in the spatial classification of the surface water quality. This may be because the seasonal variations in surface water quality in a region are largely driven by alkalinity, DO and chloride ions, whereas anthropogenic pollution parameters such as DO, Cl, and BOD largely determine the water quality at different sites in a geographical area and account for the spatial variation in water quality. Further, it is evident that the results of the spatial classification are more precise than those of the temporal classification. This may be attributed to the fact that the seasonal variations over a long period of time are more prominent, and seasonal fluctuations over the study years and the consequent overlapping of months across different seasons may have a significant influence on the resulting water quality in the study region.

Fig. 1. Plot showing importance of the input variables in (a) spatial SVC model and (b) temporal SVC model.
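The paper does not state how the importance scores plotted in Fig. 1 were derived; one analogous way to rank the inputs of a fitted SVC is permutation importance, sketched below on synthetic data. The variable names and the reported spatial-SVC parameters (C = 12.88, γ = 7.74) are used only for illustration.

```python
# Permutation-importance ranking for an RBF SVC (illustrative, not the paper's method).
import numpy as np
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
names = ["pH", "T-Alk", "Cl", "PO4", "COD", "DO", "BOD"]
X = rng.normal(size=(400, 7))
site = (X[:, 5] - 0.8 * X[:, 2] + rng.normal(scale=0.5, size=400) > 0).astype(int)

clf = SVC(kernel="rbf", C=12.88, gamma=7.74).fit(X, site)
imp = permutation_importance(clf, X, site, n_repeats=20, random_state=0)
for name, score in sorted(zip(names, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:6s} {score:.3f}")
```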


However, the spatial SVM classification modeling successfully grouped the ten sampling sites into three groups, the low pollution (S1–S3), moderate pollution (S7–S10) and gross pollution (S4–S6) sites, which may serve the water quality monitoring purpose in the study area, thus achieving a data reduction of 70%. The temporal SVC grouped the twelve monitoring months into three seasons (summer, monsoon and winter) for monitoring, and hence achieved a data reduction of 75%. The spatial and temporal SVC thus achieved an overall data reduction of 92.5%.

3.2. Linear and kernel discriminant analysis

DA and KDA were performed on the water quality data for the spatial and temporal classifications using the same training, validation and test sets as employed for SVC. As in SVC, the site (spatial) and the season (temporal) were the grouping (dependent) variables, while the selected water quality parameters constituted the independent variables. In the case of KDA, the RBF was used as the kernel function. An optimum value of the kernel function parameter (γ) was determined by generating several sets of classification errors for the training set; the best value of γ thus achieved was 2.0, and it was then used to predict the classification of the validation and test subsets. The temporal and spatial classification results (Tables 2 and 3) achieved through the DA, KDA and SVC techniques for the training, validation, and test sets suggested that both the SVC and KDA models performed relatively better than DA, indicating that the variables exhibit nonlinear relationships. However, the performances of the SVC and KDA methods were comparable in the spatial and temporal classifications of the water quality. This may be attributed to the fact that both methods employ the kernel trick for mapping the input data, thus capturing the nonlinearities.

3.3. SVM regression

The SVR approach was used for predicting the BOD of the surface water using a set of simple and directly measurable water quality variables. The complete water quality data set was divided into three subsets (training, validation, and test). In SVR, BOD was the dependent variable, whereas the remaining six variables (pH, T-Alk, Cl, PO4, COD, DO) constituted the set of independent variables. Among the linear, polynomial, sigmoid, and RBF kernel functions, the latter was finally selected for the SVR models, as it yielded the lowest MSE; moreover, RBF kernels tend to give good performance under general smoothness assumptions [43,44]. A two-step grid search method [53] with ten-fold cross-validation was used to derive the SVR model parameters. The optimal values of the SVR model parameters C, ε and the kernel-dependent parameter (γ) were determined as 51.08, 0.001, and 0.995, respectively, and the number of SVs was 624. The values of the model performance criteria parameters (MSE, RMSE, bias, R, Af, Ef) computed for the training, validation and test sets are presented in Table 4. Fig. 2 shows the plots between the measured and the model-predicted values of BOD in the training, validation and test sets, respectively. For the BOD values predicted by the model, the correlation coefficient (R) values (p < 0.001) for the training, validation and test sets were 0.952, 0.909, and 0.907, respectively. The SVR predictions are precise if the R values are close to unity [4].
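The two-step (coarse, then fine) grid search with cross-validation used to derive the SVR parameters can be sketched as follows; the grids, data and scoring choice are illustrative assumptions and not the exact settings of the study.

```python
# Coarse-then-fine grid search with 10-fold CV over C, gamma and epsilon for an RBF SVR.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=300)

coarse = {"C": [0.01, 1, 100, 10000], "gamma": [0.001, 0.1, 1, 10],
          "epsilon": [0.001, 0.01, 0.1, 0.2]}
search = GridSearchCV(SVR(kernel="rbf"), coarse, cv=10,
                      scoring="neg_mean_squared_error").fit(X, y)
best = search.best_params_

# refine around the best coarse values with smaller steps
fine = {"C": np.linspace(best["C"] / 2, best["C"] * 2, 5),
        "gamma": np.linspace(best["gamma"] / 2, best["gamma"] * 2, 5),
        "epsilon": [best["epsilon"]]}
search = GridSearchCV(SVR(kernel="rbf"), fine, cv=10,
                      scoring="neg_mean_squared_error").fit(X, y)
print(search.best_params_, -search.best_score_)
```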
The respective values of RMSE and bias for the three data sets are 1.53 and −0.06 for training, 1.44 and −0.05 for validation, and 1.32 and −0.10 for the test set. The closely followed pattern of variation in the measured and model-predicted BOD values (Fig. 2), the considerably high correlation, and the low values of MSE, RMSE and bias, along with values of Af and Ef close to unity, suggested a good fit of the

Fig. 2. Plot of measured versus SVR model predicted values of BOD in surface water (a) training, (b) validation, and (c) test sets.

model to the data set and its predictive ability for new future samples [48]. Moreover, the model-predicted BOD values and the residuals corresponding to the training, validation and test sets show almost complete independence and random distribution (Fig. 3), with negligibly small correlations between them (Table 4). Residual-versus-predicted-value plots can be highly informative regarding model fit to a data set: if the residuals appear to behave randomly (low correlation), it suggests that the model fits the data well, whereas if a non-random distribution is evident in the residuals, the model does not fit the data adequately [48]. Further, all six input variables participated in the SVR model for BOD prediction (figure not shown for brevity); COD had the highest contribution, followed by DO. Although the SVM does not necessarily represent physical meaning through its weights, this suggests that all the input variables have direct relevance to the dependent


Table 4 Values of the performance criteria parameters for PLS, KPLS and SVR models.

Model  Sub-set     MSE (mg L−1)  RMSE (mg L−1)  Bias (mg L−1)  Ef    Af    R      R*
PLS    Training    3.53          1.88           0.00           0.86  1.02  0.926  0.38
PLS    Validation  1.45          1.21           0.06           0.76  1.01  0.871  0.39
PLS    Test        2.29          1.51           0.09           0.78  1.03  0.884  0.46
KPLS   Training    3.02          1.74           0.00           0.88  1.30  0.937  0.00
KPLS   Validation  2.21          1.49           0.10           0.82  1.28  0.905  0.09
KPLS   Test        1.98          1.41           −0.11          0.80  1.28  0.894  −0.09
SVR    Training    2.33          1.53           −0.06          0.90  1.03  0.952  −0.01
SVR    Validation  2.09          1.44           −0.05          0.83  1.02  0.909  −0.03
SVR    Test        1.74          1.32           −0.10          0.80  1.02  0.907  0.11

R: correlation between predicted and measured values; R*: correlation between predicted values and residuals.

variable (BOD) in water and play a role in determining the BOD levels. The relatively higher contribution of COD in the BOD model may be attributed to the fact that the COD of water has a direct bearing on the BOD levels in water [4].

3.4. Linear and kernel PLS

First, a linear PLS model was built between the predictor variables (X) and the response variable (y). On the basis of the cross-validation results, five latent variables were included in the model, and it was then applied to the validation and test sets. A KPLS model was developed with the RBF; the kernel function was selected on the basis of the RMSE value of the validation set. The kernel function parameter (γ) and the number of LVs in the feature space were determined on the basis of the minimum cross-validation error value. The optimum values of γ and the number of LVs were 0.9 and 5, respectively, and these values were used as the optimal parameters in the KPLS model. The results pertaining to the performance criteria parameters for the PLS, KPLS and SVR modeling approaches (Table 4) suggested that both the KPLS and SVR models performed relatively better than PLS in predicting the water BOD levels, indicating a nonlinear dependence of BOD on the independent variables. However, the performances of the KPLS and SVR models were comparable. Since both models use a kernel function for mapping the input data into the feature space, they are capable of capturing the data nonlinearities and yield relatively better predictions.

Fig. 3. Plot of the SVR model predicted BOD values and residuals: (a) training, (b) validation, and (c) test sets.

4. Conclusion

The surface water quality pertaining to a large geographical area covering the low, moderate and high pollution regions, monitored over a long duration of 15 years with several seasonal variations and disturbances, can be modeled as a function of selected water quality variables using the SVM approach, which performed relatively better than the linear (DA and PLS) models for classification and regression. The relatively high number of SVs, both in the classification (spatial and temporal) and in the regression modeling, suggests that the constructed SVC and SVR models actually used most of the data points. It is also concluded that the SVC and SVR models can make future predictions possible. SVC achieved a data reduction of 92.5%, and the future water quality monitoring program may be redesigned accordingly without compromising the output quality. Further, the predictive tool for water BOD envisaged in SVR, using a selected number of simple and directly measurable variables, may be used to follow future trends in water quality. Thus, the SVM-based approaches helped in optimization of the water quality monitoring program through a reduction in the number of sampling sites, the sampling frequency, and the water quality parameters, and hence in water quality management in a large geographical area with wide seasonal variations.

Acknowledgement

The authors thank the Director, Indian Institute of Toxicology Research, Lucknow for his keen interest in this work.

References

[1] K.P. Singh, A. Malik, V.K. Singh, N. Basant, S. Sinha, Anal. Chim. Acta 571 (2006) 248–259. [2] K.P. Singh, A. Malik, V.K. Singh, Water Air Soil Pollut. 170 (2005) 383–404. [3] K.P. Singh, A. Malik, D. Mohan, S. Sinha, Water Res. 38 (2004) 3980–3992. [4] K.P. Singh, A. Basant, A. Malik, G. Jain, Ecol. Model. 220 (2009) 888–895. [5] J.W. Einax, A. Aulinger, W.V. Tumpling, A. Prange, J. Anal. Chem. 363 (1999) 655–661.


[6] K.P. Singh, N. Basant, A. Malik, S. Sinha, G. Jain, Chemometr. Intell. Lab. Syst. 95 (2009) 18–30. [7] K.P. Singh, A. Malik, D. Mohan, S. Sinha, V.K. Singh, Anal. Chim. Acta 532 (2005) 15–25. [8] Y. Zhang, C. Ma, Chem. Eng. Sci. 66 (2011) 64–72. [9] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, G.-H. Fu, Chemometr. Intell. Lab. Syst. 107 (2011) 106–115. [10] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, J. Wang, Pattern Recogn. 38 (2005) 1788–1890. [11] H. Li, Y. Liang, Q. Xu, Chemometr. Intell. Lab. Syst. 95 (2009) 188–198. [12] D. Cozzolino, W.U. Cynkar, N. Shah, P. Smith, Food Res. Int 44 (2011) 181–186. [13] S.H. Woo, C.O. Jeon, Y.S. Yun, H. Choi, C.S. Lee, D.S. Lee, J. Hazard. Mater. 161 (2009) 538–544. [14] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998. [15] Y. Pan, J. Jiang, R. Wang, H. Cao, Chemometr. Intell. Lab. Syst. 92 (2008) 169–178. [16] J. Qu, M.J. Zuo, Measurement 43 (2010) 781–791. [17] M. Kovacevic, B. Bajat, B. Gajic, Geoderma 154 (2010) 340–347. [18] A. Mucherino, P. Papajorgji, M.P. Paradalos, Oper. Res. 9 (2009) 121–140. [19] B. Schollopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, V. Vapnik, IEEE Trans. Signal Process. 45 (1997) 2758–2765. [20] V. Vapnik, S. Golowich, A.J. Simola, Adv. Neural Inform. Process. Syst. 9 (1996) 281–287. [21] K. Kavaklioglu, Appl. Energy 88 (2010) 368–375. [22] T. Rumpf, A.K. Mahlein, U. Steiner, E.C. Oerke, H.W. Dehne, L. Plumer, Comput. Electron. Agric. 74 (2010) 91–99. [23] N. Cristianine, J.S. Taylor, An Introduction to Support Vector Machine and other Kernel based Learning Methods, Cambridge, 2000. [24] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002. [25] J. Lutsa, F. Ojedaa, R. Van de Plasa, B. De Moora, S. Van Huffela, J.A.K. Suykensa, Anal. Chim. Acta 665 (2010) 129–145. [26] S.W. Lin, Z.J. Lee, S.C. Chen, T.Y. Tseng, Appl. Soft. Comput. 8 (2008) 1505–1512. [27] S. Ekici, Expert Syst. Appl. 36 (2009) 9859–9868. [28] J.F. Wang, X. Liu, Y.L. Liao, H.Y. Chen, W.X. Li, X.Y. Zheng, Biomed. Environ. Sci. 23 (2010) 167–172.

[29] A.B. Widodo, S. yang, Mech. Syst. Signal Process. 21 (2007) 2560–2574. [30] X. Xie, W.T. Liu, B. Tang, Remote Sens. Environ. 112 (2008) 1846–1855. [31] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Berlin, Springer, 1999. [32] V. Cherkassky, Y. Ma, Neural Networks 17 (2004) 113–126. [33] C. Wu, X. Lv, X. Cao, Y. Mo, C. Chen, Int. J. Phys. Sci. 5 (2010) 2523–2527. [34] R. Noori, A.R. Karbassi, K. Moghaddamnia, D. Han, M.H. Zokaei-Ashtiani, A. Farokhnia, N. Ghafari Gousheh, J. Hydrol. 401 (2011) 177–189. [35] J. Wang, H.Y. Du, H.X. Liu, X.J. Yao, Z.D. Hu, B.T. Fan, Talanta 73 (2007) 147–156. [36] A.K.S. Jardine, D. Lin, D. Danjevic, Mech. Syst. Signal Process. 20 (2006) 1483–1510. [37] N. Basant, S. Gupta, A. Malik, K.P. Singh, Chemometr. Intell. Lab. Syst. 104 (2010) 172–180. [38] M. Daszykowski, S. Semeels, K. Kaczmarck, P. Van Espen, C. Croux, B. Walczak, Chemometr. Intell. Lab. Syst. 85 (2007) 269–277. [39] B. Ustun, W.J. Melssen, M. Oudenhuijzen, L.M.C. Buydens, Anal. Chim. Acta 544 (2005) 292–305. [40] W.J. Wang, Z.B. Xu, W.Z. Lu, X.Y. Zhang, Neurocomputing 55 (2003) 643–663. [41] C.H. Wu, G.H. Tzeng, R.H. Lin, Expert Syst. Appl. 36 (2009) 4725–4735. [42] S.S. Keerthi, C.J. Lin, Neural Comput. 15 (2001) 1667–1689. [43] X. Li, D. Lord, Y. Zhang, Y. Xic, Accident Anal. Prev. 40 (2008) 1611–1618. [44] R. Noori, M.A. Abdoli, A. Ameri, M. Jalili-Ghazzizade, Environ. Prog. Sustain. Energy 28 (2009) 249–258. [45] D. Han, L. Chan, N. Zhu, J. Hydroinform 09.4 (2007) 267–276. [46] C.W. Hsu, C.C. Chang, A Practical Guide to Support Vector Classification, 2003, http://www.csie/ntu.edu.tw/∼cjlin/papers/guide/guide.pdf. [47] K.P. Singh, A. Malik, V.K. Singh, D. Mohan, S. Sinha, Anal. Chim. Acta 550 (2005) 82–91. [48] K.P. Singh, N. Basant, A. Malik, G. Jain, Anal. Chim. Acta 658 (2010) 1–11. [49] C. Karul, S. Soyupak, A.F. Clesiz, N. Akbay, E. German, Ecol. Model. 134 (2000) 145–152. [50] T. Ross, J. Appl. Biotechnol. 81 (1996) 501–508. [51] J.E. Nash, I.V. Sutcliffe, J. Hydrol. 10 (1970) 282–290. [52] S. Palani, S. Liong, P. Tkalich, Mar. Pollut. Bull. 56 (2008) 1586–1597. [53] S.T. Chen, P.S. Yu, J. Hydrol. 340 (2007) 63–77.