A time-series prediction framework using sequential learning algorithms and dimensionality reduction within a sparsification approach


Journal Pre-proof

S. Garcia-Vega, E. A. León-Gómez, G. Castellanos-Dominguez

PII: S0167-8655(18)30800-6
DOI: https://doi.org/10.1016/j.patrec.2019.11.031
Reference: PATREC 7715

To appear in: Pattern Recognition Letters

Received date: 8 October 2018
Revised date: 18 October 2019
Accepted date: 22 November 2019

Please cite this article as: S. Garcia-Vega, E.A. León-Gómez, G. Castellanos-Dominguez, A time-series prediction framework using sequential learning algorithms and dimensionality reduction within a sparsification approach, Pattern Recognition Letters (2019), doi: https://doi.org/10.1016/j.patrec.2019.11.031

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Research Highlights

• We propose a framework that addresses three main open issues of kernel-based adaptive filters.
• Our proposal optimizes the bandwidth and learning-rate parameters using stochastic gradient algorithms.
• A sparsification approach based on dimensionality reduction is introduced to remove redundant samples.
• Results show that our proposal converges to low values of mean-square-error.


Pattern Recognition Letters journal homepage: www.elsevier.com

A time-series prediction framework using sequential learning algorithms and dimensionality reduction within a sparsification approach

S. Garcia-Vega∗∗, E. A. León-Gómez, G. Castellanos-Dominguez

Universidad Nacional de Colombia, Signal Processing and Recognition Group, Campus La Nubia, 170003, Manizales, Colombia

ABSTRACT

Adaptive kernel filters are sequential learning algorithms that operate in a particular functional space called a reproducing kernel Hilbert space. However, their performance depends on the selection of two hyper-parameters, i.e., the kernel bandwidth and the learning-rate. Besides, as these algorithms train the model using a sequence of input vectors, their computation scales with the number of samples. In this work, we propose to address these challenges of sequential learning algorithms. The proposed framework, unlike similar methods, maximizes the correntropy function to optimize the bandwidth and learning-rate parameters. Further, we introduce a sparsification strategy based on dimensionality reduction to remove redundant samples. The framework is tested on both synthetic and real-world data sets, showing convergence to relatively low values of mean-square-error.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

The prediction of time series has been found useful in fields such as weather forecasting, financial markets, and electricity load prediction, among other streaming data applications [1, 2]. However, their underlying models are usually unknown, which represents a challenge when building estimations of these time series. In practice, prediction tasks are performed using statistical methods and variants of regressive models [3], meaning that they may not work under non-stationary conditions [4]. Thus, data-driven approaches based on neural networks (NNs) have been used to predict time-series data. However, they usually demand long training times and may get stuck in local minima [5]. Other data-driven approaches that have proven useful in prediction tasks are kernel methods, which embed data into a potentially infinite-dimensional feature space. In contrast to NNs, kernel-based adaptive filters have convex optimization and moderate computational complexity. Even though the choice of a kernel is non-trivial and usually depends on the specific application, kernel methods pose three main open issues [6]: i) selection of an appropriate kernel bandwidth; ii) learning-rate

∗∗Corresponding author.
E-mail addresses: [email protected] (S. Garcia-Vega), [email protected] (E. A. León-Gómez), [email protected] (G. Castellanos-Dominguez).

parameter tuning; and iii) selection of samples to train the model. Having a significant influence on the learning performance, the kernel bandwidth controls the mapping smoothness and can be set manually or estimated in advance by Silverman's rule [7], penalizing functions [8], or cross-validation [9]. For determining an optimal bandwidth in kernel-based adaptive filters, however, an approximation in a joint space must be performed, which is different from density estimation. Thus, a stochastic gradient-based bandwidth optimization is developed in [10], showing that a variable bandwidth helps kernel-based adaptive filters converge faster and achieve better accuracy. Yet, joint optimization of the bandwidth and learning-rate parameters remains an open matter. Another parameter to optimize is the learning-rate, which reflects a trade-off between misadjustment and speed of adaptation. If this parameter is too large, the risk of overfitting increases, while a small learning-rate decreases misadjustment but also lengthens convergence time. A variety of adaptive learning-rate methods have been proposed to improve the performance of the standard least-mean-square algorithm [11, 12]. However, these methods may not work properly on kernel-based adaptive filters, as they originate from a different problem [10]. Thus, this parameter is usually calculated off-line, meaning that it remains unchanged through iterations [13]. Lastly, the main bottleneck of kernel-based adaptive filters is that their computation scales with the number of samples [6]. These algorithms train the model using a sequence

of input vectors with their target predictions. Thus, we aim at selecting only an important subset of the training data (also known as a dictionary) to train the model [14]. Samples stored in the dictionary should cover, as much as possible, the area where new samples are likely to appear. In practice, this means storing "sparse" samples, which is why this technique is also known as sparsification [15]. However, if this technique is too restrictive, the model may not be trained correctly and poor performance can be expected, while if the technique is too relaxed, then sparsification no longer makes sense [16]. Here, we propose a framework that addresses the previous three open challenges in kernel-based adaptive filters. In contrast to similar methods, the proposed framework sequentially optimizes the bandwidth and learning-rate parameters using stochastic gradient algorithms that maximize the correntropy function [17]. Further, a sparsification approach based on dimensionality reduction is proposed to remove redundant samples. The framework is validated on both synthetic and real-world datasets. Simulation results show that our proposal avoids overfitting, reduces computational complexity, and provides stable solutions in real-world applications.

2. Implementation of kernel-based adaptive filters

Given a continuous input-output mapping f: U→R, the goal is to learn f using a sequence of input-output samples {u1, y1}, ..., {ut, yt}, where ut is an m-dimensional input vector that belongs to the input set U⊂Rm, while yt∈R is the output over the time domain t∈N. The mapping function f can be learned using kernel-based adaptive filters, as they have shown stable performance in modeling non-linear systems, giving the following sequential rule through the time domain (as detailed in Appendix A):

    f_0 = 0
    e_t = y_t − f_{t−1}(u_t)                                        (1)
    f_t = f_{t−1} + η e_t κ_σ(u_t, ·)

where κ_σ(·,·)∈R+ is a Mercer kernel with a bandwidth σ∈R+, e_t is the prediction error, and η∈R+ is the learning-rate. In this work, we propose to optimize both the bandwidth and the learning-rate by minimizing the prediction error (see Equation (1)), using the following stages.

Kernel bandwidth optimization. We propose to optimize the adaptive filter parameters using the correntropy cost function expressed over time as follows [18]:

    J_t = arg max_{∀σ,η} exp(−e_t²(σ_t, η_t)/(2λ²))                 (2)

where λ∈R+ is the correntropy bandwidth, which measures the similarity between data points. In terms of the first optimizing value, we perform the kernel bandwidth estimation using the gradient descent method, that is:

    σ_t = σ_{t−1} + β ∂J_t/∂σ_{t−1}                                  (3)

where β∈R+ is the step-size parameter and σ_t is the kernel bandwidth. Therefore, taking into account Equations (1) to (3), the kernel bandwidth estimation results in the following rule (see Appendix B):

    σ_t = σ_{t−1} + α η_{t−1} e_t e_{t−1} ||u_t − u_{t−1}||² κ_{σ_{t−1}}(u_t, u_{t−1})    (4)

where α = J_t β/(λ² σ_{t−1}³) and ||·|| is the ℓ2 norm.

Learning-rate optimization. The gradient-descent estimation, similarly to Equation (3), yields the following learning-rate update:

    η_t = η_{t−1} + β ∂J_t/∂η_{t−1}                                  (5)

resulting in the sequential rule as below (see Equation (B.3)):

    η_t = η_{t−1} + β″ e_t e_{t−1} κ_σ(u_t, u_{t−1})                 (6)

where β″ = β exp(−e_t²/2λ²) and β∈R+ is the step-size parameter.

Dimensionality reduction. Given a high-dimensional finite set U = {u_i∈R^m : i∈[1, t−1]} that holds m features extracted at t−1 samples, dimensionality reduction aims to obtain a low-dimensional representation V = {v_i∈R^n : i∈[1, t−1]}, where n < m. To this end, the pairwise similarities of both sets are stored in the matrices

    P = [p_ij] ∈ R^{(t−1)×(t−1)}                                     (7a)
    Q = [q_ij] ∈ R^{(t−1)×(t−1)}                                     (7b)

where p_ij = κ_{σ_U}(u_i, u_j) and q_ij = κ_{σ_V}(v_i, v_j), while the Mercer kernels are κ_{σ_U}: R^m×R^m→R+ and κ_{σ_V}: R^n×R^n→R+, with σ_U, σ_V∈R+ the corresponding kernel sizes. These kernels are assumed to be Gaussian due to their universal approximating capability, desirable smoothness, and numeric stability [16]. The similarities of the high- and low-dimensional spaces can be computed as follows:

    p_ij = exp(−||u_i − u_j||²/(2σ_U²))                              (8a)
    q_ij = exp(−||v_i − v_j||²/(2σ_V²))                              (8b)

In this work, the kernel-based framework is devised so that the more correctly the points v_i and v_j explain the similarity between the high-dimensional data points u_i and u_j, the more alike the kernel values p_ij and q_ij become. The main rationale behind this strategy is to find a low-dimensional data representation V such that the mismatch between p_ij and q_ij is minimized. Consequently, the following cost function is proposed:

    C = (1/((t−1)−1)) E{ |p_ij − q_ij|/p_ij : ∀i, j∈[1, t−1], j≠i },   C∈R    (9)

where E{·} denotes the expectation operator. We suggest minimizing the previous cost function by using a gradient descent method, yielding the following learning rule:

    v_i^k = v_i^{k−1} − µ ∂C/∂v_i^{k−1}                              (10)

where µ∈R+ is the step-size parameter and v_i^{k−1} is the low-dimensional representation of u_i at iteration k−1. Finally, combining Equations (8a), (8b) and (9), we propose the following gradient update rule (see Appendix C):

    v_i^k = v_i^{k−1} − µ′ E{ (v_i^{k−1} − v_j^{k−1}) (p_ij q_ij^{k−1} − (q_ij^{k−1})²) / (p_ij |p_ij − q_ij^{k−1}|) : ∀j∈[1, t−1], j≠i }
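To make this update concrete, the short sketch below (our illustration, not the authors' code) performs one gradient pass with Gaussian similarities as in Equations (8a) and (8b); it uses the plain sum of Appendix C in place of the expectation operator, and the names gaussian_similarity and embedding_step are ours.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Pairwise Gaussian similarities exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def embedding_step(U, V, sigma_u, sigma_v, mu):
    """One gradient-descent pass over the low-dimensional points V (Equation (10))."""
    T = U.shape[0]                        # number of stored samples, i.e. t - 1
    P = gaussian_similarity(U, sigma_u)   # fixed high-dimensional similarities, Eq. (8a)
    Q = gaussian_similarity(V, sigma_v)   # current low-dimensional similarities, Eq. (8b)
    mu_prime = mu / (sigma_v ** 2 * (T - 1))   # effective step size (mu' of Appendix C)
    V_new = V.copy()
    for i in range(T):
        grad = np.zeros(V.shape[1])
        for j in range(T):
            if j == i:
                continue
            num = P[i, j] * Q[i, j] - Q[i, j] ** 2
            den = P[i, j] * abs(P[i, j] - Q[i, j]) + 1e-12  # guard against p_ij == q_ij
            grad += (V[i] - V[j]) * num / den
        V_new[i] = V[i] - (mu_prime / T) * grad
    return V_new

# usage (one refinement pass over stored inputs U and their current codes V):
# V = embedding_step(U, V, sigma_u=0.2, sigma_v=0.2, mu=0.1)
```

Iterating embedding_step over several passes yields the low-dimensional codes v_i on which the sparsification test described next operates.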

Then, introducing the quantization size ε∈R+ [6], the following sparsification strategy is proposed: i) if min_{1≤i≤t−1} ||v_t − v_i|| ≤ ε, the weight of the dictionary sample closest to u_t is updated; ii) if min_{1≤i≤t−1} ||v_t − v_i|| > ε, the input sample u_t is added to the dictionary. This strategy aims to select the input data that encode the global structures, allowing the framework to store the samples that are most likely to appear.
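The sketch below (ours, written directly from the equations above rather than from the authors' implementation) strings the pieces together: the prediction recursion of Equation (1), the bandwidth and learning-rate updates of Equations (4) and (6), and the quantization rule just described. For brevity, the dictionary test is applied in the input space instead of on the low-dimensional codes v_t, and no safeguards (e.g., a lower bound on σ) are included.

```python
import numpy as np

def gauss(x, c, sigma):
    """Gaussian (Mercer) kernel between an input x and a dictionary center c."""
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

def predict(x, centers, coeffs, sigma):
    """f_{t-1}(x): kernel expansion over the stored dictionary (f_0 = 0)."""
    return sum(a * gauss(x, c, sigma) for c, a in zip(centers, coeffs))

def adaptive_kernel_filter(U, y, sigma0=np.sqrt(0.5), eta0=0.0, beta=0.1, lam=1.0, eps=0.1):
    centers, coeffs = [], []                 # dictionary and its weights
    sigma, eta = sigma0, eta0
    e_prev, u_prev = 0.0, None
    squared_errors = []
    for u_t, y_t in zip(U, y):
        u_t = np.asarray(u_t, dtype=float)
        e_t = y_t - predict(u_t, centers, coeffs, sigma)       # Eq. (1)
        J_t = np.exp(-e_t ** 2 / (2.0 * lam ** 2))             # correntropy cost, Eq. (2)
        if u_prev is not None:
            k_prev = gauss(u_t, u_prev, sigma)
            alpha = J_t * beta / (lam ** 2 * sigma ** 3)
            # bandwidth update, Eq. (4)
            sigma = sigma + (alpha * eta * e_t * e_prev
                             * np.sum((u_t - u_prev) ** 2) * k_prev)
            # learning-rate update, Eq. (6), with beta'' = beta * J_t
            eta = eta + beta * J_t * e_t * e_prev * k_prev
        # quantization-based sparsification (here in the input space)
        if centers:
            dists = [np.linalg.norm(u_t - c) for c in centers]
            i_min = int(np.argmin(dists))
        if not centers or dists[i_min] > eps:
            centers.append(u_t)                                # add a new dictionary atom
            coeffs.append(eta * e_t)
        else:
            coeffs[i_min] += eta * e_t                         # update the closest atom
        e_prev, u_prev = e_t, u_t
        squared_errors.append(e_t ** 2)
    return centers, coeffs, sigma, eta, squared_errors
```

In this form the dictionary grows only when a sample falls outside the ε-ball of every stored center, which is what keeps its size bounded while σ and η keep adapting.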

3. Results


The proposed framework is validated in prediction tasks using the mean-square-error (MSE) as the performance measure. After each iteration over the training set, the learned filter is used to compute the MSE value on the test set [16]. The following kernel-based adaptive filters are used for comparison purposes: i) Kernel least-mean-square (KLMS), which is the simplest kernel-based adaptive strategy [20]; ii) Quantized kernel least-mean-square (QKLMS), which is a KLMS variant [6]; iii) Kernel least-mean-square with variable kernel bandwidth (KLMS-VKS) [10]; and iv) Kernel least-mean-square with a variable learning rate (KLMS-VSS) [12]. The set-up of the compared methods is as follows: i) the learning-rate is adjusted at η=0.2 for KLMS and QKLMS, while the initial learning-rate is set at η1=0 for KLMS-VKS, KLMS-VSS, and our proposal; ii) the kernel bandwidth is set at σ=√(1/2) for KLMS and QKLMS, which is also the initial bandwidth for our proposal, KLMS-VKS, and KLMS-VSS, i.e., σ1=√(1/2); iii) the quantization size ε is set at 0.05 and 0.1 for QKLMS and our proposal, respectively; iv) the learning rate β is set at 0.1; v) the correntropy bandwidth is set at λ=1; and vi) in the dimensionality reduction method, k=1000, n=2, µ=0.1, and σ_U, σ_V=0.2. Note that the kernel parameters were adjusted heuristically.
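For reference, the settings listed above can be gathered into a single configuration object; the dictionary below only restates those values, and the key names are ours.

```python
# Hyper-parameter settings of the compared methods (key names are illustrative).
experiment_config = {
    "KLMS":     {"eta": 0.2, "sigma": 0.5 ** 0.5},
    "QKLMS":    {"eta": 0.2, "sigma": 0.5 ** 0.5, "quantization_eps": 0.05},
    "KLMS-VKS": {"eta_init": 0.0, "sigma_init": 0.5 ** 0.5},
    "KLMS-VSS": {"eta_init": 0.0, "sigma_init": 0.5 ** 0.5},
    "Proposal": {
        "eta_init": 0.0,
        "sigma_init": 0.5 ** 0.5,
        "quantization_eps": 0.1,
        "beta": 0.1,      # step size of the bandwidth and learning-rate updates
        "lambda": 1.0,    # correntropy bandwidth
        "dim_reduction": {"k": 1000, "n": 2, "mu": 0.1, "sigma_U": 0.2, "sigma_V": 0.2},
    },
}
```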

Fig. 1. Results of each compared adaptive filter on Mackey-Glass time-series prediction: (a) learning curves (testing MSE versus iteration), (b) dictionary size versus iteration, (c) test set prediction (desired signal and predictions).

Testing is carried out on the following datasets:

Mackey-Glass. This is a short-term signal set generated by a chaotic system whose states are governed by a set of time-delayed differential equations. The task is to predict the current value using the previous ten consecutive samples. The data, as suggested in [16], are normalized for computational convenience. The training set covers the first 500 samples, while another 100 consecutive samples are used as the test set.

Figure 1(a) shows the MSE results achieved by each compared solution versus the number of iterations. A quick glance shows that KLMS-VSS yields the worst MSE values, with abrupt changes during training, which may suggest that the algorithm is easily trapped in local minima. The relatively low MSE values of KLMS and KLMS-VKS indicate a more stable performance through iterations. However, as seen from Figure 1(b), their dictionary sizes grow linearly during training. Note that these algorithms do not incorporate any sparsification technique, which is a significant drawback for implementation in online applications. In contrast, the number of samples of QKLMS grows very slowly, resulting in a final network of size only 150. Although QKLMS and the proposed framework achieve similar MSE values, the former requires a significantly larger dictionary (see Figure 1(b)). In particular, the proposed framework reaches the lowest network size through iterations, suggesting that its sparsification strategy helps to hold the most relevant samples for prediction tasks (see Figure 1(c)). Both the proposed kernel bandwidth and learning-rate updates help to improve convergence time while maintaining competitive performance in later iterations. For example, Table 1 displays the MSE evolution over the test set, showing that the proposed framework achieves the lowest MSE at iteration 100. Consequently, our proposal converges to relatively low values of MSE, avoids overfitting, and provides stable solutions in real-world applications.

Wind Speed. This collection holds hourly wind speed records from the northern region of Colombia¹. The task is to predict the current value using the previous ten consecutive samples. The training set covers September 24, 2008, to October 31, 2008, while the test set covers May 28, 2009, to June 02, 2009.

Figure 2(a) displays the MSE results achieved by each compared solution versus the number of iterations. Although KLMS-VSS again produces the poorest MSE values, the displayed MSE evolution shows that its performance becomes even worse because of the increased complexity of real-world data. Note that the MSE decreases more slowly for all methods than in the learning curves of the synthetic results (see Figure 1(a)), indicating the presence of highly non-stationary dynamics. This issue makes the compared methods demand more time to correctly encode the most relevant samples of this time series. The evolution curves, shown in Figure 2(b), make clear that the proposed framework reaches the lowest dictionary size during training while maintaining a competitive MSE performance (see Figure 2(c)). Thus, the variable bandwidth and learning-rate incorporated by our framework help the kernel-based adaptive filter to converge faster without a significant loss of accuracy. However, the convergence speed may be adversely affected if their initial values are inappropriately chosen. Consequently, suitable initial values can be selected using one of the methods discussed in this account, such as Silverman's rule of thumb. Finally, as seen in Table 2, the proposed framework is an alternative to enhance convergence speed while maintaining high accuracy, with the additional benefit of demanding a condensed dictionary size.

¹ The dataset is publicly available at http://www.ideam.gov.co/solicitud-de-informacion

Fig. 2. Results of each compared adaptive filter on wind speed prediction: (a) learning curves (testing MSE versus iteration), (b) dictionary size versus iteration, (c) test set prediction (desired signal and predictions).

Table 1. Results on Mackey-Glass time-series prediction at different iterations.

Method      Measure            100      200      300      400      500
KLMS        MSE                0.016    0.017    0.007    0.006    0.004
            Dictionary size    100      200      300      400      500
QKLMS       MSE                0.017    0.017    0.007    0.006    0.004
            Dictionary size    80       103      126      136      150
KLMS-VKS    MSE                0.021    0.019    0.011    0.008    0.005
            Dictionary size    100      200      300      400      500
KLMS-VSS    MSE                0.028    0.021    0.016    0.013    0.011
            Dictionary size    100      200      300      400      500
Proposal    MSE                0.016    0.007    0.007    0.006    0.004
            Dictionary size    57       71       86       100      104

Table 2. Results on wind speed prediction at different iterations (dates along the training period).

Method      Measure            28/09/08   06/10/08   14/10/08   23/10/08   31/10/08
KLMS        MSE                0.253      0.299      0.249      0.084      0.115
            Dictionary size    100        300        500        700        900
QKLMS       MSE                0.252      0.302      0.255      0.087      0.122
            Dictionary size    81         193        280        357        371
KLMS-VKS    MSE                0.241      0.311      0.253      0.066      0.094
            Dictionary size    100        300        500        700        900
KLMS-VSS    MSE                0.307      0.481      0.263      0.056      0.063
            Dictionary size    100        300        500        700        900
Proposal    MSE                0.262      0.311      0.272      0.074      0.095
            Dictionary size    62         88         96         108        121

4. Conclusion

We introduce a framework, based on kernel adaptive filters, that addresses three main challenges of their online implementation, i.e., the selection of the bandwidth, the learning-rate, and the training samples. In particular, we address the first two challenges using a stochastic gradient algorithm that maximizes the correntropy function, which improves convergence time while maintaining the robustness and simplicity of kernel-based adaptive filters. The proposed adaptive bandwidth, together with the adaptive learning-rate, provides an effective mechanism to mitigate the detrimental effect of outliers, and it is intrinsically different from the threshold used by conventional techniques. The third challenge is addressed using a dimensionality reduction method that incorporates a sparsification strategy, employing a kernel-based cost function that quantifies the global structures of the training samples. Thus, the proposed sparsification strategy keeps the samples that are most likely to appear during the prediction task, starting with an empty dictionary and gradually adding new samples. However, the proposed strategy may be adversely affected by a small number of training samples, as it is more challenging to identify global structures under this scenario. The validation on both datasets shows that our proposal converges to relatively low values of mean-square-error, avoiding overfitting while providing stable solutions in real-world applications. We are in the process of extending our research to information-theoretic measures and of introducing a hyper-parameter tuning procedure into the compared methods.

Appendix A. Kernel least-mean-square algorithm

Let ϕ: U→F be a kernel-induced mapping that transforms the input u_t into a high-dimensional feature space F (an inner product space) as ϕ(u_t). A Mercer kernel κ_σ: U×U→R induces a mapping ϕ such that the kernel trick holds [21], i.e., ϕ(u_t)⊤ϕ(u′) = κ_σ(u_t, u′), where u′ is a new input. Thus, when the least-mean-square algorithm is applied to the sample sequence {ϕ(u_t), y_t}, the following sequential rule holds:

    ω_0 = 0
    e_t = y_t − ω_{t−1}⊤ ϕ(u_t)          (A.1)
    ω_t = ω_{t−1} + η e_t ϕ(u_t)

where e_t is the prediction error at iteration t, ω_t = η Σ_{j=1}^{t} e_j ϕ(u_j) is the weight vector in F, and η is the learning-rate parameter. Then, when a new input u′ arrives at the filter, the output is computed in the input space by kernel evaluations as ω_t⊤ ϕ(u′) = η Σ_{j=1}^{t} e_j κ_σ(u_j, u′), which leads to the following sequential rule in the original space [20]:

    f_0 = 0
    e_t = y_t − f_{t−1}(u_t)             (A.2)
    f_t = f_{t−1} + η e_t κ_σ(u_t, ·)

where f_t is the estimate of the input-output nonlinear mapping at time t. Note that f_{t−2} does not depend on σ_{t−1}. Thus, the kernel least-mean-square algorithm produces a growing radial-basis-function network by allocating a new kernel unit for every new example, with input u_t as the center and η e_t as the coefficient. The center set is termed the dictionary, while the coefficients are the weights.
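As an illustration (ours, not taken from the authors), the rule in Equation (A.2) with a fixed bandwidth and learning-rate fits in a few lines; every input becomes a new radial-basis-function unit, so the dictionary grows with the number of samples.

```python
import numpy as np

def klms(U, y, eta=0.2, sigma=np.sqrt(0.5)):
    """Plain KLMS, Equation (A.2): each input becomes a center with weight eta * e_t."""
    centers, weights, errors = [], [], []
    for u_t, y_t in zip(U, y):
        u_t = np.asarray(u_t, dtype=float)
        # f_{t-1}(u_t): kernel expansion over the current dictionary (f_0 = 0)
        f_prev = sum(w * np.exp(-np.sum((u_t - c) ** 2) / (2.0 * sigma ** 2))
                     for c, w in zip(centers, weights))
        e_t = y_t - f_prev              # prediction error
        centers.append(u_t)             # new kernel unit centered at u_t
        weights.append(eta * e_t)       # its coefficient
        errors.append(e_t)
    return centers, weights, errors
```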

Appendix B. Optimization of kernel bandwidth and learning-rate parameters

Provided a step-size value β∈R+, the cost function J_t in Equation (2) is maximized in terms of either optimizing parameter ζ = {σ, η} through the gradient descent method as follows:

    ζ_t = ζ_{t−1} + β ∂J_t/∂ζ_{t−1}      (B.1)

The learning rule in Equation (B.1) can be unfolded as below:

    ζ_t = ζ_{t−1} + β ∂/∂ζ_{t−1} exp(−e_t²/2λ²)
        = ζ_{t−1} + β exp(−e_t²/2λ²) ∂/∂ζ_{t−1} (−e_t²/2λ²)
        = ζ_{t−1} − (β/2λ²) exp(−e_t²/2λ²) ∂e_t²/∂ζ_{t−1}
        = ζ_{t−1} − (β/2λ²) exp(−e_t²/2λ²) ∂/∂ζ_{t−1} (y_t − f_{t−1}(u_t))²
        = ζ_{t−1} − β′ ∂/∂ζ_{t−1} ( y_t² − 2 y_t f_{t−1}(u_t) + f_{t−1}²(u_t) )

where β′ = (β/2λ²) exp(−e_t²/2λ²). From Equation (A.2), it holds for the σ parameter that

    f_{t−1}(u_t) = f_{t−2}(u_t) + η_{t−1} e_{t−1} κ_{σ_{t−1}}(u_t, u_{t−1})

so that the following expression takes place:

    σ_t = σ_{t−1} − β′ [ −2 y_t ∂/∂σ_{t−1} ( η_{t−1} e_{t−1} κ_{σ_{t−1}}(u_t, u_{t−1}) ) + ∂/∂σ_{t−1} f_{t−1}²(u_t) ]

Besides, we assume the Mercer kernel κ_{σ_{t−1}} to be a Gaussian kernel, that is, κ_{σ_{t−1}}(u_t, u_{t−1}) = exp(−||u_t − u_{t−1}||²/2σ_{t−1}²), so that we obtain:

    σ_t = σ_{t−1} − β′ [ −2 y_t η_{t−1} e_{t−1} κ_{σ_{t−1}}(u_t, u_{t−1}) ||u_t − u_{t−1}||²/σ_{t−1}³
                         + 2 f_{t−1}(u_t) η_{t−1} e_{t−1} κ_{σ_{t−1}}(u_t, u_{t−1}) ||u_t − u_{t−1}||²/σ_{t−1}³ ]

Lastly, the gradient update yields as follows:

    σ_t = σ_{t−1} + α η_{t−1} e_t e_{t−1} ||u_t − u_{t−1}||² κ_{σ_{t−1}}(u_t, u_{t−1})    (B.2)

where α = J_t β/(λ² σ_{t−1}³) and ||·|| stands for the ℓ2 norm.

In the case of the η parameter, the sequential rule in Equation (A.2) is as follows:

    f_{t−1}(u_t) = f_{t−2}(u_t) + η_{t−1} e_{t−1} κ_σ(u_t, u_{t−1})

Therefore, the gradient update of η yields as below:

    η_t = η_{t−1} + β″ e_t e_{t−1} κ_σ(u_t, u_{t−1})    (B.3)

where β″ = β exp(−e_t²/2λ²).
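As a quick numerical sanity check of Equation (B.2) (our own script, with arbitrary test values), the closed-form increment can be compared against a central-difference approximation of β ∂J_t/∂σ_{t−1}; the expansion f_{t−1}(u_t) = f_{t−2}(u_t) + η_{t−1} e_{t−1} κ_{σ_{t−1}}(u_t, u_{t−1}) is assumed, with f_{t−2}(u_t) held fixed.

```python
import numpy as np

u_t = np.array([0.3, -0.1, 0.4])
u_prev = np.array([0.1, 0.2, -0.2])
y_t, f_rest = 0.7, 0.2          # target and the sigma-independent part f_{t-2}(u_t)
eta_prev, e_prev = 0.3, 0.4     # previous learning-rate and prediction error
lam, beta, sigma = 1.0, 0.1, 0.8
D = np.sum((u_t - u_prev) ** 2) # squared l2 distance ||u_t - u_{t-1}||^2

def kernel(s):
    return np.exp(-D / (2.0 * s ** 2))

def J(s):
    """Correntropy cost J_t as a function of the bandwidth only."""
    e_t = y_t - (f_rest + eta_prev * e_prev * kernel(s))
    return np.exp(-e_t ** 2 / (2.0 * lam ** 2))

# closed-form increment of Eq. (B.2): alpha * eta_{t-1} * e_t * e_{t-1} * D * kappa
e_t = y_t - (f_rest + eta_prev * e_prev * kernel(sigma))
alpha = J(sigma) * beta / (lam ** 2 * sigma ** 3)
closed_form = alpha * eta_prev * e_t * e_prev * D * kernel(sigma)

# numerical increment: beta * dJ/dsigma by central differences
h = 1e-6
numerical = beta * (J(sigma + h) - J(sigma - h)) / (2.0 * h)

print(closed_form, numerical)   # the two increments agree to high precision
```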

Appendix C. Cost function optimization of dimensionality reduction

Likewise, the minimization of the cost function C is performed using a gradient descent method as follows:

    v_i^k = v_i^{k−1} − µ ∂C/∂v_i^{k−1}    (C.1)

where v_i^{k−1} is the low-dimensional representation of u_i at iteration k−1 and µ∈R+ is the learning-rate parameter, yielding the following expression:

    v_i^k = v_i^{k−1} − (µ/((t−1)((t−1)−1))) Σ_{j≠i} ∂/∂v_i^{k−1} ( |p_ij − q_ij^{k−1}| / p_ij )
          = v_i^{k−1} − (µ/((t−1)((t−1)−1))) Σ_{j≠i} ∂/∂v_i^{k−1} ( √((p_ij − q_ij^{k−1})²) / p_ij )
          = v_i^{k−1} − (µ′/(t−1)) Σ_{j≠i} (v_i^{k−1} − v_j^{k−1}) (p_ij q_ij^{k−1} − (q_ij^{k−1})²) / (p_ij |p_ij − q_ij^{k−1}|)    (C.2)

where µ′ = µ/(σ_V² ((t−1)−1)).

Remark. Equation (C.2) aims to find the points v_i and v_j that minimize the mismatch between p_ij and q_ij. Note that the similarity p_ij does not change during the optimization process (see Equation (8a)), but this is not the case for q_ij. Thus, the following scenarios may appear: (i) q_ij^{k−1} > p_ij: here, v_i is updated with relatively high values at each iteration, i.e., (p_ij q_ij^{k−1} − (q_ij^{k−1})²)/(p_ij |p_ij − q_ij^{k−1}|) < 0.
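To make the Remark concrete, the toy computation below (ours) evaluates the per-pair factor of Equation (C.2) for one case with q_ij > p_ij and one with q_ij < p_ij; the similarity values are arbitrary and only illustrate the sign behaviour.

```python
def c2_factor(p_ij, q_ij):
    """Per-pair factor of Eq. (C.2): (p_ij*q_ij - q_ij^2) / (p_ij * |p_ij - q_ij|)."""
    return (p_ij * q_ij - q_ij ** 2) / (p_ij * abs(p_ij - q_ij))

# scenario (i): the low-dimensional similarity overshoots, q_ij > p_ij
print(c2_factor(p_ij=0.4, q_ij=0.7))   # -1.75  -> negative factor, v_i is pushed away from v_j

# opposite case: q_ij < p_ij
print(c2_factor(p_ij=0.7, q_ij=0.4))   # ~0.571 -> positive factor, v_i is pulled towards v_j
```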

Acknowledgement

This work was supported by "Programa Doctoral de Becas Colciencias – Convocatoria 617" and "Caracterización Morfológica de Estructuras Cerebrales por Técnicas de Imagen para el Tratamiento Mediante Implantación Quirúrgica de Neuroestimuladores en la Enfermedad de Parkinson – Código 110180763808 Colciencias".

References

[1] Weifeng Liu, Lianbo Zhang, Dapeng Tao, and Jun Cheng. Reinforcement online learning for emotion prediction by using physiological signals. Pattern Recognition Letters, 107:123–130, 2018.
[2] M. Das et al. Data-driven approaches for meteorological time series prediction: A comparative study of the state-of-the-art computational intelligence techniques. Pattern Recognition Letters, 105:155–164, 2018.
[3] H. Yang, Z. Pan, Q. Tao, and J. Qiu. Online learning for vector autoregressive moving-average time series prediction. Neurocomputing, 2018.
[4] Kuilin Chen and Jie Yu. Short-term wind speed prediction using an unscented Kalman filter based state-space support vector regression approach. Applied Energy, 113:690–705, 2014.
[5] Ye Ren, P. N. Suganthan, and Narasimalu Srikanth. A comparative study of empirical mode decomposition-based short-term wind speed forecasting methods. IEEE Transactions on Sustainable Energy, 6(1):236–244, 2015.
[6] Badong Chen, Songlin Zhao, Pingping Zhu, and José C. Príncipe. Quantized kernel least mean square algorithm. IEEE Transactions on Neural Networks and Learning Systems, 23(1):22–32, 2012.
[7] S. Sheather. Density estimation. Statistical Science, pages 588–597, 2004.
[8] W. Härdle. Applied Nonparametric Regression. Number 19. Cambridge University Press, 1990.
[9] Senjian An, Wanquan Liu, and Svetha Venkatesh. Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression. Pattern Recognition, 40(8):2154–2162, 2007.
[10] B. Chen, J. Liang, N. Zheng, and J. Príncipe. Kernel least mean square with adaptive kernel size. Neurocomputing, 191:95–106, 2016.
[11] Y. Li et al. Zero-attracting variable-step-size least mean square algorithms for adaptive sparse channel estimation. International Journal of Adaptive Control and Signal Processing, 29(9):1189–1206, 2015.
[12] Q. Niu et al. A new variable step size LMS adaptive algorithm. In Chinese Control and Decision Conference, pages 1–4. IEEE, 2018.
[13] Yunfei Zheng, Shiyuan Wang, Jiuchao Feng, and Chi K. Tse. A modified quantized kernel least mean square algorithm for prediction of chaotic time series. Digital Signal Processing, 48:130–136, 2016.
[14] Paul Honeine. Analyzing sparse dictionaries for online learning with kernels. IEEE Transactions on Signal Processing, 63(23):6343–6353, 2015.
[15] S. Zhang, H. Cao, S. Yang, Y. Zhang, and X. Hei. Sequential outlier criterion for sparsification of online adaptive filtering. IEEE Transactions on Neural Networks and Learning Systems, (99):1–15, 2018.
[16] W. Liu, J. Príncipe, and S. Haykin. Kernel Adaptive Filtering: A Comprehensive Introduction, volume 57. John Wiley & Sons, 2011.
[17] Weifeng Liu, Puskal P. Pokharel, and José C. Príncipe. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Transactions on Signal Processing, 55(11):5286–5298, 2007.
[18] W. Wang et al. Convergence performance analysis of an adaptive kernel width MCC algorithm. AEU, 76:71–76, 2017.
[19] C. Saide et al. Nonlinear adaptive filtering using kernel-based algorithms with dictionary adaptation. International Journal of Adaptive Control and Signal Processing, 29(11):1391–1410, 2015.
[20] W. Liu, P. P. Pokharel, and J. Príncipe. The kernel least-mean-square algorithm. IEEE Transactions on Signal Processing, 56(2):543–554, 2008.
[21] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.