Adaptive kernel smoothing regression for spatio-temporal environmental datasets

Federico Montesino Pouzols a,*,1, Amaury Lendasse b,c,d

a Department of Biosciences, University of Helsinki, Biocenter 3, Viikinkaari 1, PO Box 65, FI-00014 University of Helsinki, Finland
b Department of Information and Computer Science, Aalto University, Espoo, Finland
c IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
d Computational Intelligence Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel Lardizabal 1, Donostia/San Sebastián, Spain

Neurocomputing 90 (2012) 59–65

* Corresponding author. Tel.: +358 9 191 57866. E-mail addresses: federico.montesinopouzols@helsinki.fi (F. Montesino Pouzols), amaury.lendasse@aalto.fi (A. Lendasse). URLs: http://www.helsinki.fi/bioscience/consplan/ (F. Montesino Pouzols), http://research.ics.tkk.fi/eiml/ (A. Lendasse).
1 The work of F.M.P. was supported by a Marie Curie Intra-European Fellowship for Career Development (grant agreement PIEF-GA-2009-237450) within the European Community's Seventh Framework Programme (FP7/2007–2013).

Available online 23 March 2012

Abstract

A method for performing kernel smoothing regression in an incremental, adaptive manner is described. A simple and fast combination of incremental vector quantization with kernel smoothing regression using an adaptive bandwidth is shown to be effective for online modeling of environmental datasets. The approach proposed is to apply kernel smoothing regression on an incremental estimation of the (evolving) probability distribution of the incoming data stream rather than on the whole sequence of observations. The method is illustrated on publicly available datasets corresponding to the Tropical Atmosphere Ocean array and the Helsinki Commission hydrographic database for the Baltic Sea.

© 2012 Elsevier B.V. All rights reserved.

Keywords: Kernel smoothing regression; Adaptive regression; Vector quantization; Spatio-temporal models; Environmental applications; Evolving intelligent systems

1. Introduction

We describe a method for performing kernel smoothing regression in an adaptive manner. The aim of this work is to define efficient, incremental and adaptive regression methods that can be applied sequentially to streams of incoming observations of continuous data. The motivation is the need for simple and efficient regression methods that can cope with large, diverse and evolving datasets in biological and environmental applications.

The idea of adaptive regression has been explored in different contexts, and a large number of methods for both linear and nonlinear regression are well established in different fields of computer science. For example, the multivariate adaptive regression splines (MARS) method [1–3] builds models as a summation of weighted basis functions following a divide-and-conquer strategy that aims to adapt locally. However, most research efforts so far have concentrated on offline regression.

Biological and environmental sciences have seen a great deal of development and attention over the last few decades. An impressive improvement in observational capabilities and measurement procedures has led to large databases and online monitoring systems. Biological and environmental datasets are normally defined by regular or irregular spatial fields, possibly three-dimensional, for which multivariate observations, such as temperature, salinity, nutrients, pollutants or air pressure, are recorded across time. Biological and environmental processes are usually part of intricate networks of dynamical processes in which evolution in time is a key aspect.

Evolving, online or adaptive intelligent systems [4] are meant to be applied to sequential data or streams of data. These systems distinguish themselves from conventional offline learning methods and earlier online methods in that their structure (in addition to their parameters) evolves in order to account for new data as it becomes available. Recently, there has been an increase of interest in this field; especially during the last decade, many advances have been made in the area of evolving neuro-fuzzy systems for modeling and control [4–6]. Two advantages of these methods are especially relevant for spatio-temporal biological and environmental datasets. On the one hand, they rely on simple and fast algorithms, usually operating in a one-pass manner, so large datasets can be processed efficiently. On the other hand, their parameters and, more importantly, their structure evolve to accommodate new data, so large datasets can be processed online in a fully adaptive manner.

The approach proposed here is to perform kernel smoothing regression on an estimated representation of the (potentially evolving) probability distribution rather than on the whole sequence of observations. This is achieved by performing vector quantization on the incoming stream. This way, kernel smoothing regression is performed on an incrementally computed density estimate. In addition, the kernel bandwidth is adapted online. All the steps involved are incremental, and the method is thus suitable for online learning. The method is simple, fast and adapts in time to evolving streams of continuous data.

The remainder of the paper is organized as follows. Section 2 describes the proposed method. The method is evaluated and compared in Section 3. Results and implications are further discussed in Section 4.

2. Proposed method

Kernel regression, also called kernel smoothing regression in order to avoid confusion with other kernel methods, is a non-parametric approach to estimating the conditional expectation of a random variable y [2,7,8]: E(y|x) = f(x), where y and x are random variables and f(·) is a non-parametric function. The kernel smoothing regression approach is based on kernel density estimation. It is assumed that the model estimate has the following form: f̂(x) = y + ε, i.e., the random variable modeled can be expressed as the sum of a deterministic, functional component and a noise component. The Nadaraya–Watson kernel regression method for function estimation is the particular case in which the Gaussian kernel is used. If n observations of input–output pairs (x_i, y_i) are available, the estimator f̂(·) for a given input observation x_0 is defined as follows:

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{n} K_h(x_0, x_i)\, y_i}{\sum_{i=1}^{n} K_h(x_0, x_i)},$$

where h is the bandwidth or smoothing parameter and K_h is the kernel function [9]. Some common examples of kernels are the Gaussian, Epanechnikov, biweight, rectangular and triangular kernels [9,7]. A special case is the uniform kernel, which can be considered the kernel used in the naive density estimation method [10,11].

Vector quantization (VQ) is an unsupervised method with parallels to methods for clustering and learning densities such as k-means and Voronoi diagrams [2]. It is a practical and popular approach in signal processing and machine learning for lossy data compression and correction as well as for density estimation, i.e., the process of deriving from observed data an estimate of an underlying probability density function. A key aspect of VQ for the purposes of this work is that it allows one to approximate the probability distribution of a process by the distribution of prototypes or codewords. In fact, the area closer to a particular codeword than to any other is inversely proportional to the density in that region of the input domain.

The approach proposed here is to perform kernel regression on an incremental estimation of the (potentially evolving) probability distribution of the incoming data stream rather than on the whole sequence of observations. This is done in two stages. First, VQ is incrementally performed on the incoming stream. Second, kernel smoothing regression for each incoming observation is computed using the codebook resulting from the first stage as an estimation of the probability distribution of the incoming data. In addition, the kernel bandwidth is adapted online. All the steps required are incremental, and the method is thus suitable for online learning. The method is simple, adapts locally to fit evolving streams of data, and is fast, with run-time complexity proportional to the number of observations and their dimensionality. A scheme of this method is shown in Fig. 1. For the sake of brevity we will refer to it as KSR-VQ.
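To make the estimator above concrete, here is a minimal sketch of zero-order (Nadaraya–Watson) kernel smoothing regression with a Gaussian kernel; the function name and the per-dimension bandwidth vector are illustrative choices, not from the paper.

```python
import numpy as np

def nadaraya_watson(x0, X, y, h):
    """Nadaraya-Watson estimate f_hat(x0) from stored pairs (x_i, y_i).

    x0: query point, shape (d,); X: inputs, shape (n, d);
    y: outputs, shape (n,); h: per-dimension bandwidths, shape (d,).
    """
    u = (X - x0) / h                          # scaled differences
    w = np.exp(-0.5 * np.sum(u * u, axis=1))  # Gaussian kernel weights K_h(x0, x_i)
    return np.dot(w, y) / np.sum(w)           # weighted average of stored outputs
```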

Fig. 1. Global scheme of the KSR-VQ method: the sequence of observations feeds adaptive vector quantization; the resulting codebook (an estimated PDF), together with the adaptively estimated kernel bandwidth, feeds kernel smoothing regression, which produces the sequence of estimations.

KSR-VQ takes advantage of the density matching capability of VQ. The method is designed to be fast and suitable for streams of nonstationary data and unbounded size. The two stages of KSR-VQ are detailed in what follows.

2.1. Adaptive vector quantization

The first stage of KSR-VQ is performed adaptively and in an incremental manner. Observations are processed one at a time. Let m be the current number of prototypes in the codebook, initialized to 0, and M be the maximum number of prototypes. Algorithm 1 shows the simple version of vector quantization used in this paper. It should be noted that no sensitivity parameters are used.

Algorithm 1. Simple Adaptive Vector Quantization.

Input: sequence of observations, X = {x_i ∈ R^d, i = 1, ...}
Output: codebook, an (initially empty) set of prototypes, C = {p_j ∈ R^d, j = 1, ..., m}, m ≤ M
while new observations x_i arrive do
    if m < M then
        add x_i to C as a new prototype, p_{m+1}
    else
        find in C the nearest prototype p_{NN(i)} to x_i
        update the codebook with learning rate α, moving p_{NN(i)} towards the sample point: p_{NN(i)} ← (1 − α) p_{NN(i)} + α x_i, α ∈ [0, 1]
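As an illustration, Algorithm 1 can be transcribed almost line by line into Python. The class name is ours; the defaults α = 0.25 and M = 1000 are the values used in the experiments of Section 3.

```python
import numpy as np

class AdaptiveVQ:
    """Simple adaptive vector quantization (a transcription of Algorithm 1)."""

    def __init__(self, M=1000, alpha=0.25):
        self.M = M                  # maximum codebook size
        self.alpha = alpha          # learning rate in [0, 1]
        self.codebook = []          # prototypes p_j in R^d

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if len(self.codebook) < self.M:
            self.codebook.append(x.copy())    # grow the codebook until capacity M
        else:
            C = np.asarray(self.codebook)
            j = int(np.argmin(np.sum((C - x) ** 2, axis=1)))  # nearest prototype
            # Move the winner towards the sample: p <- (1 - alpha) p + alpha x
            self.codebook[j] = (1 - self.alpha) * self.codebook[j] + self.alpha * x
```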

As will be shown in Section 3, relatively small codebooks of a few hundred prototypes can achieve satisfactory performance in a rather general setup.

2.2. Adaptive kernel smoothing regression

It is generally accepted that local adaptation of the kernel bandwidth parameter is of major importance for obtaining accurate models [7,9]. However, finding optimal or good values for the bandwidth parameter, and furthermore adapting it locally, is not a trivial task [12,2,7,9].


It is possible to define an estimator for the bandwidth or smoothing parameters that is optimal for normal distributions [7], in the sense that the mean integrated square error is minimized. For the multivariate case it is as follows:

$$h_{opt,j} = \sigma_j \left( \frac{4}{n(d+2)} \right)^{1/(d+4)}, \quad j = 1, \ldots, d,$$

where d is the input dimension, n is the number of observations, and σ_j is the standard deviation in the jth dimension. As an enhanced standard deviation estimator, the median absolute deviation estimator can be used to approximate σ_j in a robust manner, as described for global bandwidth estimation in [7]. This way, even though the parameter estimators h_opt,j are defined as optimal for normal distributions, they still perform well in more general cases. In KSR-VQ the scale term, σ_j, for each input dimension j is estimated using the median absolute deviation and a scaling factor:

$$\sigma_j = \mathrm{median}\{\, |x_{ij} - \mathrm{median}\{x_{ij}\}| \,\} / 0.6745,$$

where the median is calculated online over the observations i = 1, ... using the incremental statistics method of Manku et al. [13], whose approximation guarantees apply for arbitrary value distributions and arrival distributions of the dataset. The scaling factor ensures that for normal distributions σ_j is the standard deviation. This defines a clear criterion for the selection of the bandwidth, i.e., for model selection, without the need for validation procedures.

For simplicity, in this paper we restrict our analysis to zero-order or Nadaraya–Watson kernel smoothing regression. That is, KSR-VQ builds nonlinear locally constant models.

The time complexity of the method is linear, O(n), for a stream of n observations. That is, the method takes constant time for each observation. This constant is proportional to the size of the codebook, M, and the problem dimensionality, d (assuming M ≪ n and d ≪ n). M can in actuality be rather small, as shown experimentally in the next section. The space complexity is constant, O(1), and also proportional to both d and M.

Let us now discuss the relation of the proposed KSR-VQ to other methods in the literature. The use of clustering methods for the identification of neural network models and neuro-fuzzy systems is a well known approach [14–16,4]. Recently, the FLEXFIS method [17] for the identification of evolving fuzzy systems has been proposed; it uses vector quantization in order to identify clusters and corresponding fuzzy rules in an incremental manner. The KSR-VQ method presented here uses the vector quantization technique as well. However, the interpretation of VQ here is rather different: VQ is applied to the incoming data stream in order to approximate the probability distribution of the process by means of prototypes. These prototypes can then be used to perform kernel smoothing regression under the assumption that they represent the input data distribution well.

It is also worth noting that the method proposed here is meant for performing kernel smoothing regression rather than kernel density estimation. A method for performing kernel density estimation on large datasets has been proposed recently [18]; it is, however, aimed at reducing computational cost rather than at adaptive estimation. A fundamental difference between the bandwidth adaptation proposed here and popular bandwidth adaptation methods used for kernel density estimation, found for example in the GNU R np package [12], is that here the adaptation is performed online in order to respond to the evolution of incoming data streams, rather than offline to adapt to the varying density over the input domain. In addition, the evolving bandwidth parameter is used to perform kernel regression on an incrementally updated codebook that represents an evolving estimation of the probability distribution of the incoming data stream.
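For illustration, the bandwidth rule above can be sketched as follows. The medians here are computed exactly over the data seen so far; KSR-VQ itself uses the one-pass approximate-median algorithm of Manku et al. [13], so this exact version is only a stand-in.

```python
import numpy as np

def adaptive_bandwidth(X_seen):
    """Per-dimension bandwidths h_opt_j from the MAD-based scale estimate.

    X_seen: observations so far, shape (n, d). Exact medians are used
    for clarity; an online deployment would track approximate medians.
    """
    n, d = X_seen.shape
    med = np.median(X_seen, axis=0)
    sigma = np.median(np.abs(X_seen - med), axis=0) / 0.6745   # robust sigma_j
    return sigma * (4.0 / (n * (d + 2))) ** (1.0 / (d + 4))    # h_opt per dimension
```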

3. Experiments

First, offline methods are applied in order to obtain reference values. Then, different online adaptive methods are compared: DENFIS, eTS and KSR-VQ. In what follows, we describe the methods applied, the datasets used and the results.

3.1. Methods

Two well known offline methods for regression are included in this study: multivariate adaptive regression splines (MARS) [1,2] and multivariate kernel smoothing regression (KSR) [7,9,19]. These methods provide the reference values against which the online adaptive methods are compared.

MARS is a nonlinear, non-parametric regression technique that can be viewed as a generalization of stepwise linear regression. MARS models perform adaptive nonlinear regression by extending linear models; in the simplest version this is done using piecewise linear basis functions. This way, MARS models can be seen as a weighted sum of basis functions of three basic types: constant, hinge, and products of two or more hinge functions. MARS models are built in two phases, a forward and a backward pass, where the second pass prunes the result of the first in order to avoid overfitting. Different variants of MARS have been proposed, including combinations of piecewise linear and piecewise cubic models. The MARS method has been successfully applied and is widely accepted in environmental and biological sciences [20].

KSR-VQ is also compared here against two well known methods in the field of evolving systems: Evolving Takagi–Sugeno (eTS) [21,4] and DENFIS [6,22]. First-order Takagi–Sugeno systems are built in both cases. eTS was applied using global parameter estimation with a recursive least squares (RLS) filter [23], using the default parameters of the implementation. DENFIS is one particular implementation of the more general Evolving Connectionist Systems (ECOS) framework [22].

We specify here some implementation details for the sake of reproducibility. The ARESLab Adaptive Regression Splines toolbox for Matlab/Octave [3] was used as the implementation of MARS models. Default parameter values were set, for a maximum of 25 basis functions in the forward model building phase, and the regression was restricted to second-degree interactions, i.e., only pairwise products of basis functions are generated. For KSR-VQ the update constant for vector quantization is α = 0.25 and a codebook of at most 1000 prototypes was used. For DENFIS we used the implementation available from the Knowledge Engineering and Discovery Research Institute (KEDRI), available online from http://www.aut.ac.nz/research/research-institutes/kedri/books. Experiments with eTS were performed using the eFSLab toolbox [24], available online from http://eden.dei.uc.pt/~dourado/.
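To make the protocol concrete, the following is our reconstruction of how the pieces sketched in Section 2 combine into a single predict-then-update loop, reusing the nadaraya_watson, AdaptiveVQ and adaptive_bandwidth sketches above. The assumption that the codebook quantizes the joint (x, y) space, and the warm-up threshold, are our illustrative choices, not details from the paper.

```python
import numpy as np

def run_ksr_vq(stream, M=1000, alpha=0.25, warmup=10):
    """Online loop: predict each sample from the current codebook, then update.

    stream yields (x, y) pairs; prototypes store the joint vector (x, y),
    so the codebook provides both the inputs and the outputs used by the
    kernel smoother. Keeping all inputs for the bandwidth estimate is a
    simplification; KSR-VQ tracks approximate medians in one pass.
    """
    vq = AdaptiveVQ(M=M, alpha=alpha)
    seen, preds = [], []
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if len(seen) >= warmup:
            C = np.asarray(vq.codebook)                        # shape (m, d + 1)
            h = np.maximum(adaptive_bandwidth(np.asarray(seen)), 1e-12)
            preds.append(nadaraya_watson(x, C[:, :-1], C[:, -1], h))
        else:
            preds.append(np.nan)                               # still warming up
        vq.update(np.append(x, y))                             # quantize the joint sample
        seen.append(x)
    return preds
```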


3.2. Materials

Two publicly available environmental databases are considered in order to illustrate the suitability of the proposed method for environmental applications, and especially for environmental data streams: UCI El Nino and the Helsinki Commission (HELCOM) hydrographic database for the Baltic Sea. Figs. 2 and 3 show the locations of the observations.

Fig. 2. Coordinates of the buoys of the TAO array in the UCI El Nino dataset (axes: longitude (°) vs. latitude (°)).

Fig. 3. Locations of measurements in the HELCOM Baltic dataset (axes: longitude (°) vs. latitude (°)).

As of October 2010, the HELCOM database consists of 623,181 multivariate observations of up to 62 variables, including physical, chemical and biological data, from 1900 up to the present time, with implications for multiple fields such as physical oceanography, marine biology and climatology [25,26]. It is available from the oceanographic database of the International Council for the Exploration of the Sea.

The UCI El Nino dataset [27] was collected by the Tropical Atmosphere Ocean (TAO) array during the period 1980–1998. Measuring oceanographic and surface meteorological variables is important for improved detection, understanding and prediction of seasonal-to-interannual climate variations originating in the tropics, especially those related to the El Nino/Southern Oscillation (ENSO) phenomenon. The TAO array provides real-time oceanographic and surface meteorological data to scientists, climate researchers and weather prediction centers around the world. This particular dataset corresponds to nearly 70 moored buoys spanning the equatorial Pacific, with recordings from March 7, 1980 to June 15, 1998. The dataset originally includes 12 variables that describe the sequence and date of observation, latitude, longitude, zonal winds, meridional winds, relative humidity, air temperature and sea surface temperature.

In the "UCI El Nino-SST" problem defined from this dataset, the sea surface temperature (SST) has to be modeled as a function of six inputs: time, latitude, longitude, zonal and meridional winds, and air temperature. For the HELCOM Baltic database, two regression problems are defined, corresponding to the dissolved oxygen concentration and the salinity of the surface layer (0–25 m).

Fig. 4 shows, as an example, the air temperature values recorded at the first buoy included in the UCI El Nino dataset, located at approximately 0° N, 110° E. A daily observation is usually taken; however, a considerable number of missing values can be observed, as well as some long missing sequences, a recurrent problem in the class of datasets under study [28].

Fig. 4. Air temperature (°C) samples for the first buoy of the UCI El Nino dataset, 1980–1998.

3.3. Results

For the offline approaches, fitting errors are shown in Table 1. Errors are given as the normalized root mean square error (NRMSE), i.e., the RMSE divided by the standard deviation of the target sequence, and the symmetric mean absolute percentage error (SMAPE). Standard deviations of the absolute errors (Std AE) are indicated as well, together with the run-time required. (These results were obtained using a standard PC with 8 GB of RAM and an Intel Core 2 Quad Q9550 CPU with a maximum frequency of 2.8 GHz, running Matlab R2010b on the GNU/Linux operating system; tests were run with no significant competing load.) The superiority of MARS in terms of accuracy is clear. It comes, however, at the expense of a longer run-time, which can be an order of magnitude higher than that of KSR.

Online regression methods are compared in Table 2. The time column shows the processor time consumed by the learning process in the same environment used for the offline models; again, tests were run with no significant competing load. The lowest NRMSE achieved for each dataset is highlighted in boldface. Among the conclusions about KSR-VQ that can be drawn from Table 2 are that (a) it achieves satisfactory accuracy in general and (b) it is consistently the fastest method. Results for these and other methods on different, generic benchmark datasets can be found in [29].

Let us analyze the effect of additive noise on the performance of KSR-VQ. To this end, we use a synthetic dataset and introduce additive noise in a controlled environment. The Mackey–Glass series is an example of a chaotic system [5] that can describe a complex physiological process such as a certain form of blood cell regulation. We include it here as a synthetic instance for comparison purposes, as it has been used extensively in the literature [21,6]. Concrete parameters and further details for reproducibility can be found in [5,29]. Table 3 shows the results obtained for three variants of the Mackey–Glass time series. MG-Large is an extended sequence (100,000 observations) of the Mackey–Glass series, generated in the same way as the Mackey–Glass dataset [29], which consists of 500 observations. MGL-Noise is generated by combining the MG-Large series with additive Gaussian noise of zero mean and a standard deviation of 13% of the standard deviation of the noise-free MG-Large time series.
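For reference, the two error measures can be computed as follows; note that several SMAPE variants exist, and the one shown (absolute error scaled by the mean of the absolute values) is an assumption, since the paper does not spell out its exact formula.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE divided by the standard deviation of the target sequence."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.std(y_true)

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (one common variant)."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)
```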


Table 1
Comparison of offline regression methods: kernel smoothing regression (KSR) and MARS using piecewise linear and piecewise cubic models (best NRMSE for each dataset in boldface).

Dataset              | Method       | NRMSE     | Std AE    | SMAPE | Run-time (s)
UCI El Nino-SST      | MARS linear  | 4.934e-01 | 3.387e-01 | 8.19  | 3.34e+03
                     | MARS cubic   | 5.071e-01 | 3.510e-01 | 8.31  | 3.10e+03
                     | KSR          | 5.872e-01 | 3.981e-01 | 9.71  | 1.26e+03
Baltic dissolved O2  | MARS linear  | 5.169e-01 | 3.589e-01 | 7.79  | 3.14e+03
                     | MARS cubic   | 5.201e-01 | 3.644e-01 | 7.92  | 3.26e+03
                     | KSR          | 5.664e-01 | 4.003e-01 | 8.38  | 2.13e+02
Baltic salinity      | MARS linear  | 2.419e-01 | 1.887e-01 | 10.8  | 3.00e+03
                     | MARS cubic   | 2.397e-01 | 1.870e-01 | 10.7  | 3.22e+03
                     | KSR          | 4.045e-01 | 3.023e-01 | 16.8  | 1.80e+02

Table 2
Comparison of adaptive regression methods: accuracy and run-time (best NRMSE for each dataset in boldface).

Dataset              | Method  | NRMSE     | Std AE    | SMAPE | Run-time (s)
UCI El Nino-SST      | DENFIS  | 2.388e-01 | 1.832e-01 | 3.50  | 1.42e+03
                     | eTS     | 6.287e-01 | 4.309e-01 | 10.2  | 5.54e+03
                     | KSR-VQ  | 3.107e-01 | 9.653e-02 | 4.21  | 1.59e+02
Baltic dissolved O2  | DENFIS  | 4.349e-01 | 3.364e-01 | 5.94  | 3.95e+02
                     | eTS     | 6.470e-01 | 4.400e-01 | 9.85  | 8.72e+02
                     | KSR-VQ  | 3.832e-01 | 2.816e-01 | 5.40  | 2.58e+01
Baltic salinity      | DENFIS  | 2.915e-01 | 2.335e-01 | 11.3  | 5.65e+02
                     | eTS     | 4.300e-01 | 2.928e-01 | 24.7  | 8.86e+02
                     | KSR-VQ  | 2.802e-01 | 2.441e-01 | 8.46  | 2.48e+01

Table 3
Effect of stream length and additive noise: accuracy and run-time (best NRMSE for each dataset in boldface).

Dataset       | Method  | NRMSE     | Std AE    | Run-time (s)
Mackey–Glass  | DENFIS  | 3.749e-01 | 2.511e-01 | 1.40e+00
              | eTS     | 4.467e-01 | 2.554e-01 | 1.00e+00
              | KSR-VQ  | 2.830e-01 | 2.023e-01 | 7.96e-02
MG-Large      | DENFIS  | 3.487e-01 | 2.194e-01 | 5.20e+02
              | eTS     | 4.544e-01 | 2.665e-01 | 5.65e+03
              | KSR-VQ  | 9.354e-02 | 6.437e-02 | 3.66e+02
MGL-Noise     | DENFIS  | 4.568e-01 | 2.875e-01 | 5.47e+02
              | eTS     | 4.843e-01 | 2.821e-01 | 7.43e+03
              | KSR-VQ  | 3.523e-01 | 2.220e-01 | 3.58e+02

It can be observed that KSR-VQ can take advantage of additional samples as they arrive. It is also the most accurate method both for the noise-free case and for the case with noise. This illustrates that KSR-VQ can provide very accurate models for smooth, noise-free streams while still performing well on noisier streams.

4. Discussion

An important implication of the results above is that KSR-VQ can be preferable to standard KSR even for offline modeling. In fact, KSR-VQ is competitive with the highly accurate MARS models, generally at a lower computational cost. It should also be noted that the adaptive selection of the kernel parameter or bandwidth in multivariate kernel smoothing regression can be seen as an indirect way of variable scaling.

The focus of this paper is on streams of data, and especially on streams of unbounded size. This imposes strict constraints, as computational complexity becomes the major limiting factor and a vast range of conventional methods in machine learning and computational intelligence are not feasible [30,31]. Other, alternative approaches can be used when the dataset size is limited [32]. A number of online algorithms have been proposed for training diverse types of learning machines [33], often motivated by better scalability to large datasets. Such is the case of incremental algorithms for support vector machines [34] and incremental algorithms for single hidden layer feedforward neural networks, such as certain variants of the extreme learning machine [35]. Nevertheless, coping with nonstationarities in data streams requires methods and systems whereby not only the parameters but also the structure of the learned model is adapted in order to account for the observed variations in new data. A number of recent results have led to methods for dealing with dataset shift in machine learning [36]; in this context, the aim is to further develop probabilistic models that can cope with different forms of dataset shift. As introduced above, in the field of evolving intelligent systems [31] several methods have been proposed that can adapt to evolving data streams. These methods are to a large extent based on artificial neural networks and neuro-fuzzy systems. The field revolves around the interpretable, rule-based structure of these systems and the definition of procedures such as rule addition, removal and merging. This approach has led to diverse successful applications over the last decade [31]. Two well known methods in this field, eTS and DENFIS, have been used here for comparison purposes.

Learning high-dimensional datasets is outside the scope of this paper. A possible approach for high-dimensional problems would be to introduce a dimensionality reduction stage. To this end, incremental procedures for computing linear decompositions, such as incremental PCA [37], could be employed for dimensionality reduction in combination with the regression methods analyzed.

In environmental and biological processes, noise and nonstationarities are recurrent issues; we discuss these two issues in what follows. There are different, intertwined and difficult-to-control sources of noise in the datasets motivating this work. Just to name a few, changes in measurement protocols over the years, recalibration, replacement and addition of instruments, as well as environmental conditions are typical sources of noise [25,26]. Furthermore, sample selection bias and covariate shift can be significant, with diverse spatial and temporal patterns. For instance, besides an irregular and changing distribution of sampling points (see Fig. 3), in the kind of applications addressed in this paper, sensor failure and difficult environmental conditions are very common sources of sample selection bias and covariate shift.

Regarding the effect of noise on the accuracy of KSR-VQ models, the results in Table 2 show satisfactory performance on real-world noisy environmental datasets, and the results in Table 3 confirm this point in a controlled environment. It is also worth mentioning that the run-time of KSR-VQ is independent of the residual variance of the data, an example of which can be observed in Table 3. In contrast, the run-time of the other adaptive methods analyzed is data dependent: both eTS and DENFIS are fast methods and can process incoming data in constant time, but this constant evolves in a way that is hard to predict.

Regarding dynamic changes in the underlying process, the datasets used for these experiments include several cases where temporal and spatial nonstationary behavior has been identified. For example, climate regime drifts and major, sudden shifts [38,39], as well as severe decadal trends with different effects in the northern and southern parts of the Baltic Sea [40,25,26], have been extensively described. Particularly in the climate and ecosystems research community, the analysis of regime shifts is of major importance. The interpretation of changes in KSR-VQ models for shift detection is an area for further research.

KSR-VQ is an online learning method that is able to adapt to changes in the incoming data distribution. While the target of this work is to model nonstationary streams of data, this kind of method is also useful for stationary data when resources are limited or datasets are large [34], and when data is redundant [41].

5. Conclusion

The proposed KSR-VQ method consists of two stages. First, VQ is performed on the incoming data in order to approximate the probability distribution of the process by means of a codebook of prototypes. Then, multivariate kernel smoothing regression is performed on the set of prototypes. Using VQ makes it possible not only to process large data streams efficiently but also to perform kernel smoothing regression in an adaptive manner. The advantages of KSR-VQ over offline KSR, as well as its competitiveness with established adaptive and evolving methods, have been illustrated on two environmental datasets. KSR-VQ shows good generalization capabilities in spite of its simplicity and low computational requirements.

The adaptiveness of KSR-VQ is twofold. First, smoothing regression is performed on an incrementally updated, evolving estimation of the probability distribution of the incoming data stream. Second, the kernel bandwidth is adapted online using a criterion based on the median absolute deviation estimator, which can be computed efficiently online. Experiments showed that this method is remarkably accurate despite its simplicity.

References

[1] J.H. Friedman, Multivariate adaptive regression splines, Ann. Stat. 19 (1) (1991) 1–141.
[2] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer Series in Statistics, Springer, New York, NY, USA, 2009.

[3] G. Jekabsons, ARESLab: Adaptive Regression Splines Toolbox for Matlab/Octave, Faculty of Computer Science and Information Technology, Riga Technical University, September 2010. http://www.cs.rtu.lv/jekabsons/regression.html
[4] P. Angelov, N. Kasabov, D. Filev (Eds.), Evolving Intelligent Systems: Methodology and Applications, IEEE Series on Computational Intelligence, Wiley–IEEE Press, 2010.
[5] F.M. Pouzols, A. Lendasse, Evolving fuzzy optimally pruned extreme learning machine for regression problems, Evolving Syst. 1 (1) (2010) 43–58.
[6] N.K. Kasabov, Q. Song, DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction, IEEE Trans. Fuzzy Syst. 10 (2) (2002) 144–154.
[7] A.W. Bowman, A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach, Oxford University Press, 1997.
[8] M.P. Wand, M.C. Jones, Kernel Smoothing, Chapman and Hall/CRC, 1994.
[9] J.S. Simonoff, Smoothing Methods in Statistics, Springer Series in Statistics, Springer, 1998.
[10] L. Györfi, M. Kohler, A. Krzyżak, H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, 2010.
[11] B. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall/CRC, 1986.
[12] P. Hall, J.S. Racine, Q. Li, Cross-validation and the estimation of conditional probability densities, J. Am. Stat. Assoc. 99 (468) (2004) 1015–1026.
[13] G.S. Manku, S. Rajagopalan, B.G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, ACM SIGMOD Rec. 27 (2) (1998) 426–435.
[14] F.M. Pouzols, A.B. Barros, Automatic clustering-based identification of autoregressive fuzzy inference models for time series, Neurocomputing 73 (10) (2010) 1937–1949. http://dx.doi.org/10.1016/j.neucom.2009.11.028
[15] J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, Upper Saddle River, New Jersey, 1997.
[16] F.M. Pouzols, A. Lendasse, A.B. Barros, Autoregressive time series prediction by means of fuzzy inference systems using nonparametric residual variance estimation, Fuzzy Sets Syst. 161 (4) (2010) 471–497.
[17] E.D. Lughofer, FLEXFIS: a robust incremental learning approach for evolving Takagi–Sugeno fuzzy models, IEEE Trans. Fuzzy Syst. 16 (6) (2008) 1393–1410.
[18] T.-H. Fan, D.K.J. Lin, K.-F. Cheng, Regression analysis for massive datasets, Data Knowl. Eng. 61 (3) (2007) 554–562.
[19] D.W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley, 1992.
[20] J. Elith, J. Leathwick, Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines, Diversity Distrib. 13 (3) (2007) 265–275.
[21] P.P. Angelov, D.P. Filev, An approach to online identification of Takagi–Sugeno fuzzy models, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 34 (1) (2004) 484–498.
[22] N. Kasabov, Evolving Connectionist Systems: The Knowledge Engineering Approach, 2nd ed., Springer, 2007.
[23] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., John Wiley and Sons Ltd., 2000.
[24] A. Dourado, L. Aires, J. Victor, eFSLab: developing evolving fuzzy systems from data in a friendly environment, in: Proceedings of the 10th European Control Conference, Prague, Czech Republic, 2009, pp. 922–927.
[25] The BACC Author Team, Assessment of Climate Change for the Baltic Sea Basin, 1st ed., Regional Climate Studies, Springer-Verlag, 2008.
[26] H. Ojaveer, A. Jaanus, B.R. MacKenzie, G. Martin, S. Olenin, et al., Status of biodiversity in the Baltic Sea, PLoS One 5 (9) (2010) e12467. http://dx.doi.org/10.1371/journal.pone.0012467
[27] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, October 2010. http://archive.ics.uci.edu/ml/
[28] W.W. Hsieh, Machine Learning Methods in the Environmental Sciences: Neural Networks and Kernels, Cambridge University Press, Cambridge, UK, 2009.
[29] F.M. Pouzols, A. Lendasse, Adaptive kernel smoothing regression using vector quantization, in: IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS), 2011, pp. 85–92.
[30] L. Bottou, O. Bousquet, The tradeoffs of large scale learning, in: J.C. Platt, D. Koller, Y. Singer, S. Roweis (Eds.), Advances in Neural Information Processing Systems, vol. 28, NIPS Foundation, 2008, pp. 161–168.
[31] P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, IEEE Series on Computational Intelligence, John Wiley and Sons/IEEE Press, 2010.
[32] I. Olier, A. Vellido, Advances in clustering and visualization of time series using GTM through time, Neural Networks 21 (7) (2008) 904–913.
[33] J. Kivinen, A.J. Smola, R.C. Williamson, Online learning with kernels, IEEE Trans. Signal Process. 52 (8) (2004) 2165–2176.
[34] A. Bordes, S. Ertekin, J. Weston, L. Bottou, Fast kernel classifiers with online and active learning, J. Mach. Learn. Res. 6 (2005) 1579–1619.
[35] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Networks 20 (8) (2009) 1352–1357.
[36] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009.
[37] J. Weng, Y. Zhang, W.-S. Hwang, Candid covariance-free incremental principal component analysis, IEEE Trans. Pattern Anal. Mach. Intell. 25 (8) (2003) 1034–1040.
[38] A.A. Tsonis, K. Swanson, S. Kravtsov, A new dynamical mechanism for major climate shifts, Geophys. Res. Lett. 34 (2007) L13705.
[39] B. deYoung, R. Harris, J. Alheit, G. Beaugrand, N. Mantua, L. Shannon, Detecting regime shifts in the ocean: data considerations, Prog. Oceanogr. 60 (2) (2005) 143–164.
[40] J. Hänninen, I. Vuorinen, P. Hjelt, Climatic factors in the Atlantic control the oceanographic and ecological changes in the Baltic Sea, Limnol. Oceanogr. 45 (3) (2000) 703–710.
[41] L. Bottou, Stochastic learning, in: O. Bousquet, U. von Luxburg (Eds.), Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence (LNAI), vol. 3176, Springer-Verlag, Berlin, 2004, pp. 146–168.

Federico Montesino Pouzols is a postdoctoral researcher at the Biodiversity Conservation Informatics Group, Center of Excellence in Metapopulation Biology, Department of Biosciences, Faculty of Biological and Environmental Sciences, University of Helsinki. Previously, he was a Marie Curie postdoctoral fellow with the Environmental and Industrial Machine Learning Group, Helsinki University of Technology. He earned a B.Sc. in Computer Science and Physics in 1999 and an Engineering degree in Computing in 2002, both from the University of Seville, followed by an M.Sc. in intelligent microelectronic systems in 2005 and a Ph.D. in 2009, with a thesis on computational intelligence applied to mining and control of network traffic, at CSIC and the University of Seville. During 1999–2003 he worked in the telecommunications and computing industry. From 2004 to 2007 he was a Researcher and Assistant Professor at the University of Seville, and in 2008–2009 he was a Researcher at the Institute of Microelectronics of Seville, CSIC. His broad area of interest is computational data analysis and its applications to the modeling of environmental and ecological processes. His interests span machine learning, computational intelligence, spatio-temporal models and nonlinear dynamics, as well as various applications in ecology and biodiversity conservation.


Amaury Lendasse was born in 1972 in Belgium. He received an M.S. degree in Mechanical Engineering from the Université Catholique de Louvain (Belgium) in 1996, an M.S. in Control in 1997, and a Ph.D. in 2003 from the same university. In 2003 he was a postdoctoral researcher in the Computational Neurodynamics Laboratory at the University of Memphis. Since 2004, he has been a Senior Researcher and Docent at the Adaptive Informatics Research Centre of the Aalto University School of Science and Technology (previously Helsinki University of Technology) in Finland. He created and leads the Environmental and Industrial Machine Learning (previously Time Series Prediction and Chemoinformatics) Group. He is chairman of the annual European Symposium on Time Series Prediction (ESTSP) and a member of the editorial boards and program committees of several journals and conferences on machine learning. He is the author or coauthor of around 100 scientific papers in international journals, books or communications to conferences with reviewing committees. His research includes time series prediction, chemometrics, variable selection, noise variance estimation, determination of missing values in temporal databases, nonlinear approximation in financial problems, functional neural networks and classification.