Kernel least mean square with adaptive kernel size

Badong Chen (a), Junli Liang (b), Nanning Zheng (a), José C. Príncipe (a,c)

(a) Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an 710049, China
(b) School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
(c) Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
Article history: Received 13 February 2014; received in revised form 1 October 2015; accepted 6 January 2016. Communicated by Steven Hoi.

Abstract
Kernel adaptive filters (KAF) are a class of powerful nonlinear filters developed in reproducing kernel Hilbert space (RKHS). The Gaussian kernel is usually the default kernel in KAF algorithms, but selecting the proper kernel size (bandwidth) is still an important open issue, especially for learning with small sample sizes. In previous research, the kernel size was set manually or estimated in advance by Silverman's rule based on the sample distribution. This study aims to develop an online technique for optimizing the kernel size of the kernel least mean square (KLMS) algorithm. A sequential optimization strategy is proposed, and a new algorithm is developed, in which the filter weights and the kernel size are both sequentially updated by stochastic gradient algorithms that minimize the mean square error (MSE). Theoretical results on convergence are also presented. The excellent performance of the new algorithm is confirmed by simulations on static function estimation, short-term chaotic time series prediction and real-world Internet traffic prediction.
Keywords: Kernel methods; Kernel adaptive filtering; Kernel least mean square; Kernel selection
1. Introduction

Kernel based methods are successfully used in machine learning and nonlinear signal processing due to their inherent advantages of convex optimization and universality in the space of L2 functions. By mapping the input data into a feature space associated with a Mercer kernel, many efficient nonlinear algorithms can be developed thanks to the kernel trick. Popular kernel methods include the support vector machine (SVM) [1,2], kernel regularization networks [3], kernel principal component analysis (KPCA) [4], and kernel Fisher discriminant analysis (KFDA) [5]. These nonlinear algorithms show significant performance improvement over their linear counterparts.

Online kernel learning [36–39,47–58] has also been extensively studied in the machine learning and statistical signal processing literature, and it provides efficient alternatives to approximate a desired nonlinearity incrementally. As the training data are sequentially presented to the learning system, online learning requires, in general, much less memory and computational cost. Recently, kernel based online algorithms for adaptive filtering have been developed and have become an emerging area of research [6]. Kernel adaptive filters (KAF) are derived in reproducing kernel Hilbert spaces (RKHS) [7,8], by using the linear structure and inner product of this space to implement the well-established linear adaptive filtering algorithms, which correspond to nonlinear filters in the original input space.
Typical KAF algorithms include the kernel least mean square (KLMS) [9,10], kernel affine projection algorithms (KAPA) [11], kernel recursive least squares (KRLS) [12], and extended kernel recursive least squares (EX-KRLS) [13]. With a radially symmetric Gaussian kernel, they create a growing radial basis function (RBF) network to learn the network topology and adapt the free parameters directly from the training data. Among these KAF algorithms, the KLMS is the simplest and fastest to implement, yet very effective.

There are two main open challenges in KAF algorithms. The first is their growing structure with each sample, which results in increasing computational costs and memory requirements, especially in continuous adaptation scenarios. In order to curb the network growth and to obtain a compact representation, a variety of sparsification techniques have been applied, where only the important input data are accepted as new centers. The presently available sparsification criteria include the novelty criterion [14], the approximate linear dependency (ALD) criterion [12], the surprise criterion [15], and so on. In a recent work [16], we proposed a novel method, the quantized kernel least mean square (QKLMS) algorithm, to compress the input space and hence constrain the network size, which is shown to be very effective in yielding a compact network with desirable accuracy.

Selecting a proper Mercer kernel is the second remaining problem that should be addressed when implementing kernel adaptive filtering algorithms, especially when the training data
size is small. In this case, the kernel selection includes two parts: first, the kernel type is chosen, and second, its parameters are determined. Among various kernels, the Gaussian kernel is very popular and is usually the default choice in kernel adaptive filtering due to its universal approximating capability, desirable smoothness and numerical stability. The normalized Gaussian kernel is

\kappa(u, u') = \exp\left( -\frac{\| u - u' \|^2}{2\sigma^2} \right)    (1)

where the free parameter σ (σ > 0) is called the kernel size (also known as the kernel bandwidth or smoothing parameter). In fact, the Gaussian kernel is strictly positive definite and as such produces an RKHS that is dense [8], so linear algorithms in this RKHS are universal approximators of smooth L2 functions. In principle this means that, in the large sample size regime, the asymptotic properties of the mean square approximation are independent of the kernel size σ [46]. The kernel size in KAF therefore mainly affects the dynamics of learning: because in the initial steps the sample size is always small, both the accuracy for batch learning and the convergence properties for online learning depend upon the kernel size. This should be contrasted with the effect of the kernel size in classification, where the kernel size controls both the accuracy and the generalization of the optimal solution [1,2].

To date, many methods for selecting a kernel size for the Gaussian kernel have been borrowed from the areas of statistics, nonparametric regression and kernel density estimation. The most popular methods for the selection of the kernel size are cross-validation (CV) [17–21], which can always be used since the kernel size is a free parameter, penalizing functions [18], plug-in methods [18,22], Silverman's rule [23] and other rules of thumb [24]. Cross-validation, penalizing functions and plug-in methods are computationally intensive and are not suitable for online kernel learning. Silverman's rule is widely accepted in kernel density estimation, although it is derived under a Gaussian assumption and is usually not appropriate for multimodal distributions. Besides the fixed kernel size, some adaptive or varying kernel size algorithms can also be found in the literature [25–28]. This topic is also closely related to the techniques of multi-kernel learning, or learning the kernel, in the machine learning literature [40–45]. There the goal is typically to learn a combination of kernels based on some optimization method, but in KAF this approach is normally avoided due to the computational complexity [6].

All the above mentioned methods, however, are not suitable for determining an optimal kernel size in online kernel adaptive filtering, since they either are batch mode methods or originate from a different problem, such as kernel density estimation. Given that in online learning the number of samples is large and not specified a priori, the final solution will be practically independent of the kernel size. The real issue is therefore how to speed up convergence to the neighborhood of the optimal solution, which will also provide smaller network sizes. In the present work, by treating the kernel size as an extra parameter for the optimization, a novel sequential optimization framework is proposed for the KLMS algorithm. The new optimization paradigm allows for an online adaptation algorithm. At each iteration cycle, the filter weights and the kernel size are both sequentially updated to minimize the mean square error (MSE).
As the kernel size is updated sequentially, the proposed algorithm is computationally very simple. The new algorithm can also be combined with the quantization method so as to yield a compact model.

The rest of the paper is organized as follows. In Section 2, we briefly revisit the KLMS algorithm. In Section 3, we propose a sequential optimization strategy for the kernel size in KLMS, and then derive a simple stochastic gradient algorithm to adapt the kernel size. In Section 4, we give some theoretical results on the convergence. Specifically, we derive the energy conservation
relation in RKHS, and on this basis we derive a sufficient condition for the mean square convergence, and arrive at a theoretical value of the steady-state excess mean-square error (EMSE). In Section 5, we present simulation examples on static function estimation, short term chaotic time series prediction and real world Internet traffic prediction to confirm the satisfactory performance of the KLMS with adaptive kernel size. Finally, in Section 6, we present the conclusion.
2. KLMS

Suppose the goal is to learn a continuous input–output mapping f: U → Y based on a sequence of input–output examples (training data) {u(i), y(i)}, i = 1, ..., N, where U ⊆ ℝ^m is the input domain and Y ⊆ ℝ is the desired output space. The hypothesis space for learning is assumed to be a reproducing kernel Hilbert space (RKHS) ℋ_k associated with a Mercer kernel κ(u, u'), a continuous, symmetric, and positive-definite function κ: U × U → ℝ [7]. To find such a function f, one may solve the regularized least squares regression in ℋ_k:

\min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} \left( y(i) - f(u(i)) \right)^2 + \gamma \, \| f \|_{\mathcal{H}_k}^2    (2)

where \| \cdot \|_{\mathcal{H}_k} denotes the norm in ℋ_k and γ ≥ 0 is the regularization factor that controls the smoothness of the solution. As the inner product in the RKHS satisfies the reproducing property, namely \langle f, \kappa(u, \cdot) \rangle_{\mathcal{H}_k} = f(u), (2) can be rewritten as

\min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} \left( y(i) - \langle f, \kappa(u(i), \cdot) \rangle_{\mathcal{H}_k} \right)^2 + \gamma \, \| f \|_{\mathcal{H}_k}^2    (3)

By the representer theorem [8], the solution of (2) can be expressed as a linear combination of kernels:

f(u) = \sum_{i=1}^{N} \alpha_i \, \kappa(u(i), u)    (4)

The coefficient vector can be calculated as \alpha = (\mathbf{K} + \gamma \mathbf{I})^{-1} \mathbf{y}, where \mathbf{y} = [y(1), \cdots, y(N)]^T and \mathbf{K} \in \mathbb{R}^{N \times N} is the Gram matrix with elements \mathbf{K}_{ij} = \kappa(u(i), u(j)). Solving the previous least squares problem usually requires significant memory and computational burden due to the necessity of calculating a large Gram matrix, whose dimension equals the number of input patterns. The KAF algorithms, however, provide efficient alternatives that build the solution incrementally, without explicitly computing the Gram matrix.

Denote by f_i the estimated mapping (hypothesis) at iteration i. The KLMS algorithm can be expressed as [6]

\begin{cases} f_0 = 0 \\ f_i = f_{i-1} + \eta \, e(i) \, \kappa(u(i), \cdot) \end{cases}    (5)

where η denotes the step size and e(i) is the instantaneous prediction error at iteration i, e(i) = y(i) - f_{i-1}(u(i)); that is, the instantaneous error depends only on the difference between the desired response at the current time and the evaluation of the current sample u(i) with the previous system model f_{i-1}. The learned mapping of KLMS at iteration N is therefore

f_N(u) = \eta \sum_{i=1}^{N} e(i) \, \kappa(u(i), u)    (6)
This is a very appealing result: the solution to the unknown nonlinear mapping is built incrementally, one step at a time, as a growing RBF network whose centers are the samples and whose fitting coefficients are automatically determined by the current errors.
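To make the recursion (5)–(6) concrete, the following is a minimal sketch of a KLMS filter with the Gaussian kernel (1). The class and variable names (KLMS, centers, coeffs) are ours, not the paper's, and the sketch omits any practical concerns such as sparsification.

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    """Normalized Gaussian kernel of Eq. (1)."""
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / (2.0 * sigma ** 2))

class KLMS:
    """Kernel least mean square, Eqs. (5)-(6), with a fixed kernel size."""
    def __init__(self, eta=0.5, sigma=0.1):
        self.eta = eta        # step size
        self.sigma = sigma    # kernel size (bandwidth)
        self.centers = []     # stored inputs u(i)
        self.coeffs = []      # stored coefficients eta * e(i)

    def predict(self, u):
        # f_{i-1}(u) = sum_j coeff_j * kappa(center_j, u); empty sum gives f_0 = 0
        return sum(a * gaussian_kernel(c, u, self.sigma)
                   for c, a in zip(self.centers, self.coeffs))

    def update(self, u, y):
        e = y - self.predict(u)            # prediction error e(i)
        self.centers.append(np.asarray(u, dtype=float))
        self.coeffs.append(self.eta * e)   # new center with coefficient eta * e(i)
        return e
```

In use, the filter is simply fed the training pairs one at a time, e.g. `for u, y in data: model.update(u, y)`.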
Taking advantage of the incremental nature of the KLMS updates, the KLMS adaptation is in essence the solution of the following incremental regularized least squares problem:

\min_{f_i \in \mathcal{H}_k} \left( y(i) - f_i(u(i)) \right)^2 + \frac{1-\eta}{\eta} \, \| f_i - f_{i-1} \|_{\mathcal{H}_k}^2    (7)

Letting \Delta f_i = f_i - f_{i-1}, (7) is equivalent to

\min_{\Delta f_i \in \mathcal{H}_k} \left( e(i) - \Delta f_i(u(i)) \right)^2 + \frac{1-\eta}{\eta} \, \| \Delta f_i \|_{\mathcal{H}_k}^2    (8)

From (8), one may observe: (1) KLMS learning at iteration i is equivalent to solving a regularized least squares problem in which the previous hypothesis f_{i-1} is frozen and only the adjustment term Δf_i is optimized; (2) in this least squares problem, there is only one training example involved, namely {u(i), e(i)}; (3) the regularization factor is directly related to the step size via γ = (1 - η)/η.
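As a quick check of the claim that (8) recovers the KLMS update (this short verification is ours, not part of the original text): by the representer theorem the minimizer of (8) has the form \Delta f_i = a \, \kappa(u(i), \cdot), and since \kappa(u(i), u(i)) = 1 and \| \kappa(u(i), \cdot) \|_{\mathcal{H}_k}^2 = 1,

\begin{aligned}
J(a) &= \left( e(i) - a \right)^2 + \frac{1-\eta}{\eta} \, a^2, \\
\frac{dJ}{da} &= -2\left( e(i) - a \right) + 2\,\frac{1-\eta}{\eta}\, a = 0
\;\Longrightarrow\; \frac{a}{\eta} = e(i)
\;\Longrightarrow\; a = \eta \, e(i),
\end{aligned}

so \Delta f_i = \eta \, e(i) \, \kappa(u(i), \cdot), which is exactly the correction term added by KLMS in (5).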
In the rest of the paper, the Mercer kernel κ(u, u') is assumed to be the Gaussian kernel. In addition, to explicitly show the kernel size dependence, we denote the Gaussian kernel by κ_σ(u, u') and the induced RKHS by ℋ_σ.

3. KLMS with adaptive kernel size

The kernel is a crucial factor in all kernel methods in the sense that it defines the similarity between data points. For the Gaussian kernel, this similarity depends on the kernel size selected. If the kernel size is too large, all the data will look similar in the RKHS (with inner products all close to 1), and the procedure reduces to linear regression. On the other hand, if the kernel size is too small, all the data will look distinct (with inner products all close to 0), and the system fails to do inference on unseen data that fall between the training points.

Up to now the KLMS has only been studied with a constant kernel size, so the full flexibility of the solution has not been exploited. In fact, the sequential learning rule (5) builds the current estimate of f from two additive parts: the previous hypothesis f_{i-1} and a correction term proportional to the prediction error on the new data. In principle, one RKHS can be used to compute the previous hypothesis while the correction term in (5) is computed in a different RKHS, which can be efficiently done by changing the kernel size. This is the motivating idea pursued in this paper, and it has two fundamental components: (1) we have to formalize this approach; (2) we have to find an easy way of implementing it from samples. In the following, we propose an approach to sequentially optimize the KLMS with a variable kernel size.

A. Sequential optimization of the kernel size in KLMS

In order to determine an optimal kernel size for KLMS, one should define in advance a cost function for the optimality. To make this precise, we suppose the training data {u(i), y(i)} are random, and that there is an absolutely continuous probability measure P (usually unknown) on the product space U × Y from which the data are drawn. The measure P defines a regression function

f(u) = \int_Y y \, dP(y \mid u)    (9)

where P(y|u) is the conditional measure on Y given u. In this situation, the function f can be said to be the desired mapping that needs to be estimated. Thus a measurement of the error in f_i (which is updated by KLMS) is

J_1 = \int_U \left( f - f_i \right)^2 dP(u)    (10)
where P(u) is the marginal measure on U. The optimization should then find the kernel size that minimizes this error. Since in practice f is usually unknown, one can use the mean square error as an alternative cost for optimization:

J_2 = \int_{U \times Y} \left( y - f_i(u) \right)^2 dP(u, y)    (11)

The cost J_2 can be easily estimated from sample data. This is especially important when the probability measure P is unknown. Of course, the kernel size in KLMS can be optimized in batch mode, that is, the optimization is performed only after presenting the whole training data. Then, combining (6) and (11) yields the optimization

\sigma^* = \arg\min_{\sigma \in \mathbb{R}^+} \int_{U \times Y} \left[ y - \eta \sum_{i=1}^{N} e(i) \, \kappa_\sigma(u(i), u) \right]^2 dP(u, y)    (12)

where σ* stands for the optimal kernel size. As KLMS is an online learning algorithm, we are more interested in a sequential optimization framework, which allows the kernel size to be sequentially optimized across iterations. To this end, we propose the following sequential optimization:

\sigma_i = \arg\min_{\sigma_i \in \mathbb{R}^+} \int_{U \times Y} \left( y - f_{i-1}(u) - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), u) \right)^2 dP(u, y)    (13)
where the previous hypothesis f_{i-1} is frozen and σ_i denotes the kernel size at iteration i.

Remark. By (13), the kernel size is optimized sequentially. Thus at iteration i, the initial learning step determines an optimal value of the kernel size σ_i (the old kernel sizes remain unchanged), followed by the addition of a new center using KLMS with this new kernel size. Learning with a varying kernel size implies performing the adaptation in a different RKHS at each iteration cycle, since changing the kernel size modifies the inner product of the Hilbert space. For the KLMS, this learning paradigm is indeed feasible, because at each iteration cycle the old centers remain frozen and only a new center is added; in other words, the correction term Δf_i is just a feature vector in the current RKHS.

Remark. A more reasonable approach would be to jointly optimize the kernel size and the step size (which corresponds to a regularization factor). In this work, however, for simplicity the step size is assumed to be fixed and only the kernel size is optimized.

Suppose the training data {u(i), y(i)} are independent, identically distributed (i.i.d.). The hypothesis f_i, which depends on the previous training data, will then be independent of the future training data. Hence the mean square prediction error at iteration i+1, conditioned on f_i, equals

\begin{aligned}
E\left[ e^2(i+1) \mid f_i \right] &= \int_{U \times Y} \left( y(i+1) - f_i(u(i+1)) \right)^2 dP\left( u(i+1), y(i+1) \mid f_i \right) \\
&= \int_{U \times Y} \left( y(i+1) - f_i(u(i+1)) \right)^2 dP\left( u(i+1), y(i+1) \right) \\
&= \int_{U \times Y} \left( y - f_i(u) \right)^2 dP(u, y)
\end{aligned}    (14)

where P(u(i+1), y(i+1) | f_i) denotes the probability measure of (u(i+1), y(i+1)) conditioned on f_i. Since the mapping update at iteration i directly affects the prediction error at iteration i+1, according to (14) the sequential optimization problem (13) can be equivalently defined as searching for a value of the kernel size σ_i such that the conditional mean square error E[e²(i+1) | f_i] is minimized:

\begin{aligned}
\sigma_i &= \arg\min_{\sigma_i \in \mathbb{R}^+} E\left[ e^2(i+1) \mid f_i \right] \\
&= \arg\min_{\sigma_i \in \mathbb{R}^+} E\left[ e^2(i+1) \mid f_{i-1} + \eta \, e(i) \, \kappa_{\sigma_i}(u(i), \cdot) \right]
\end{aligned}    (15)
To understand the above optimization in more detail, we consider a nonlinear regression model in which the output data y(i) are related to the input vectors u(i) via

y(i) = f(u(i)) + v(i)    (16)

where f(·) denotes the unknown nonlinear mapping that needs to be estimated and v(i) stands for the disturbance noise. In this case, the prediction error e(i) can be expressed as

e(i) = y(i) - f_{i-1}(u(i)) = \tilde{f}_{i-1}(u(i)) + v(i)    (17)

where \tilde{f}_{i-1} \triangleq f - f_{i-1} is the residual mapping at iteration i-1. The mean square error at iteration i+1, conditioned on f_i, is

\begin{aligned}
E\left[ e^2(i+1) \mid f_i \right] &= \int_{U \times V} \left( \tilde{f}_i(u(i+1)) + v(i+1) \right)^2 dP_{uv}(i+1) \\
&= \int_{U \times V} \left( \tilde{f}_{i-1}(u(i+1)) + v(i+1) - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), u(i+1)) \right)^2 dP_{uv}(i+1)
\end{aligned}    (18)

where V ⊆ ℝ denotes the noise space and P_{uv}(i+1) denotes the probability measure of (u(i+1), v(i+1)).

Remark. One can see from (18) that the optimal kernel size at iteration i depends upon the residual mapping \tilde{f}_{i-1}, the prediction error e(i), the step size η, and the joint distribution P_{uv}, which is quite different from the optimal kernel sizes in problems of density estimation. Theoretically, given the desired mapping and the joint distribution P_{uv}, the optimal kernel sizes can be solved sequentially. This is, however, a rather tedious and impractical procedure, since an involved nonlinear optimization would have to be solved at each iteration cycle. More importantly, in practice the desired mapping, the noise, and the input distribution are usually unknown. Next, we develop a stochastic gradient algorithm to adapt the kernel size, without resorting to any prior knowledge.

B. KLMS with adaptive kernel size

As discussed previously, at each iteration cycle the kernel size in KLMS can be optimized by minimizing the mean square error at the next iteration (conditioned on the learned mapping at the current iteration). In this sense, one can update the current kernel size using the current prediction error; that is, at iteration i, when the prediction error e(i) is available, the kernel size σ_i can be updated by minimizing the instantaneous squared error, and a stochastic gradient algorithm can be readily derived as follows:

\begin{aligned}
\sigma_i &= \sigma_{i-1} - \mu \frac{\partial}{\partial \sigma_{i-1}} e^2(i) \\
&= \sigma_{i-1} - 2\mu \, e(i) \frac{\partial}{\partial \sigma_{i-1}} \left[ \tilde{f}_{i-1}(u(i)) + v(i) \right] \\
&= \sigma_{i-1} - 2\mu \, e(i) \frac{\partial}{\partial \sigma_{i-1}} \left[ \tilde{f}_{i-2}(u(i)) + v(i) - \eta \, e(i-1) \, \kappa_{\sigma_{i-1}}(u(i-1), u(i)) \right] \\
&\overset{(a)}{=} \sigma_{i-1} + \rho \, e(i-1) \, e(i) \frac{\partial}{\partial \sigma_{i-1}} \kappa_{\sigma_{i-1}}(u(i-1), u(i)) \\
&= \sigma_{i-1} + \rho \, e(i-1) \, e(i) \, \| u(i-1) - u(i) \|^2 \, \kappa_{\sigma_{i-1}}(u(i-1), u(i)) / \sigma_{i-1}^3
\end{aligned}    (19)

where (a) follows from the fact that \tilde{f}_{i-2} does not depend on σ_{i-1}, and ρ = 2μη is the step size for the kernel size adaptation.

Remark. The above algorithm is computationally very simple, since the kernel size is updated sequentially: only the kernel size of the new center is updated, and those of the old centers remain frozen. The initial value of the kernel size can be set manually or calculated roughly in advance using Silverman's rule based on the input distribution.

From (19), we have the following observations:

1) The direction of the gradient depends upon the signs of the prediction errors e(i-1) and e(i). Specifically, when the signs of e(i-1) and e(i) are the same, the kernel size will increase, while when the signs of e(i-1) and e(i) are different, the kernel size will decrease. This is reasonable, since intuitively the signs of two successive errors contain information about the smoothness of the desired mapping. If there is little sign change, the desired mapping is likely a "smooth function" and a larger kernel size is desirable; if the sign changes frequently, the desired mapping is likely a "zig-zag function", and in this case a smaller kernel size is usually better.

2) The magnitude of the gradient depends on the input data through ‖u(i-1) - u(i)‖² κ_{σ_{i-1}}(u(i-1), u(i)). The value of this term will be nearly zero when the distance between u(i-1) and u(i) is either very small or very large. This is also reasonable, since when u(i) is very close to u(i-1), a sign change between e(i-1) and e(i) only implies a very local fluctuation, while when u(i) is very far away from u(i-1), the sign change between e(i-1) and e(i) contains little information about the smoothness of the desired mapping.

3) The magnitude of the gradient depends on σ_{i-1} through κ_{σ_{i-1}}(u(i-1), u(i))/σ_{i-1}³. For the case u(i) ≠ u(i-1), this term approaches zero when σ_{i-1} is either very small or very large. Although a vanishing gradient does not by itself imply that σ_i is bounded, intuitively, with a proper initial value, the kernel size will be adjusted within a reasonable range.

Combining (5) and (19), we obtain the KLMS with adaptive kernel size:

\begin{cases}
f_0 = 0 \\
e(i) = y(i) - f_{i-1}(u(i)) \\
\sigma_i = \sigma_{i-1} + \rho \, e(i-1) \, e(i) \, \| u(i-1) - u(i) \|^2 \, \kappa_{\sigma_{i-1}}(u(i-1), u(i)) / \sigma_{i-1}^3 \\
f_i = f_{i-1} + \eta \, e(i) \, \kappa_{\sigma_i}(u(i), \cdot)
\end{cases}    (20)

Remark. The computational complexity of the algorithm (20) is of the same order of magnitude as that of the original KLMS algorithm, namely O(i) at iteration i, because both algorithms share the same most time-consuming part, the calculation of the prediction error.
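To make the recursion (20) concrete, the following is a minimal sketch of the adaptive-kernel-size KLMS. The class and variable names are ours; each center keeps the kernel size that was current when it was added, as the text describes. The initial kernel size sigma0 can be set manually or, as noted above, estimated roughly by Silverman's rule (not shown). No safeguard against the kernel size drifting to a nonpositive value is included, since the paper does not discuss one.

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

class AdaptiveKLMS:
    """KLMS with adaptive kernel size, a sketch of recursion (20)."""
    def __init__(self, eta=0.5, rho=0.025, sigma0=1.0):
        self.eta, self.rho = eta, rho
        self.sigma = sigma0                     # current kernel size sigma_i
        self.centers, self.coeffs, self.sigmas = [], [], []
        self.prev_u, self.prev_e = None, None   # u(i-1), e(i-1)

    def predict(self, u):
        # f_{i-1}(u): each stored center uses its own (frozen) kernel size
        return sum(a * gaussian_kernel(c, u, s)
                   for c, a, s in zip(self.centers, self.coeffs, self.sigmas))

    def update(self, u, y):
        u = np.asarray(u, dtype=float)
        e = y - self.predict(u)                  # e(i)
        if self.prev_u is not None:              # kernel-size update in (20)
            d2 = np.sum((self.prev_u - u) ** 2)
            k = gaussian_kernel(self.prev_u, u, self.sigma)
            self.sigma += self.rho * self.prev_e * e * d2 * k / self.sigma ** 3
        self.centers.append(u)                   # add new center u(i)
        self.coeffs.append(self.eta * e)         # coefficient eta * e(i)
        self.sigmas.append(self.sigma)           # freeze its kernel size sigma_i
        self.prev_u, self.prev_e = u, e
        return e

# Illustrative usage (our own example, not from the paper):
# model = AdaptiveKLMS(eta=0.5, rho=0.025, sigma0=1.0)
# for u, y in training_pairs:
#     model.update(u, y)
```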
Similar to the original KLMS algorithm, the new algorithm also produces a growing RBF network, whose size increases linearly with the number of training data. In order to obtain a compact model (a network with as few centers as possible) and reduce the computational and memory costs, some sparsification or quantization method can be applied. Here, we only adopt a quantization approach. In [16], we used the idea of quantization to compress the input space of KLMS and efficiently constrain the network size growth. The learning rule of the quantized KLMS (QKLMS) is

\begin{cases}
f_0 = 0 \\
e(i) = y(i) - f_{i-1}(u(i)) \\
f_i = f_{i-1} + \eta \, e(i) \, \kappa_\sigma(Q[u(i)], \cdot)
\end{cases}    (21)

where Q[·] denotes a quantization operator in the input space U. In QKLMS (21), the centers are limited to the quantization codebook C, and the network size can never be larger than the codebook size, size(C). At iteration i, we just add η e(i) to the coefficient of the code-vector closest to the current input u(i). A simple online vector quantization (VQ) method was proposed in [16].

Suppose that the online VQ method in [16] is adopted and the kernel size of QKLMS varies as a function of the codebook index. Denote by ε the quantization size and by d(u(i), C) the distance between u(i) and C: d(u(i), C) = \min_{1 \le j \le \mathrm{size}(C)} \| u(i) - C_j \|, where C_j denotes the jth element of the codebook C. Then at iteration i, if d(u(i), C) > ε, a new code-vector (u(i)) will be added into the codebook, i.e. C = {C, u(i)}. In this case, a new center (u(i)) will also be allocated, and its kernel size can be computed in a similar way as in (19):

\sigma_j = \sigma_{j-1} + \rho \, e(j-1) \, e(j) \, \| C_j - C_{j-1} \|^2 \, \kappa_{\sigma_{j-1}}(C_j, C_{j-1}) / \sigma_{j-1}^3    (22)

where σ_j denotes the kernel size corresponding to the jth code-vector C_j, and e(j) denotes the prediction error at the iteration when the code-vector C_j is added. If d(u(i), C) ≤ ε, no new center is added and no kernel size update is performed.

4. Convergence analysis

This section gives some theoretical results on the convergence of the algorithm (20). The unknown system is assumed to be a nonlinear regression model given by (16). First, let us define the a priori error e_a(i) and the a posteriori error e_p(i) as follows:

e_a(i) = \tilde{f}_{i-1}(u(i)), \qquad e_p(i) = \tilde{f}_i(u(i))    (23)

with \tilde{f}_{i-1} and \tilde{f}_i the residual mappings at iterations i-1 and i, respectively.

A. Energy conservation relation

In adaptive filtering theory, the energy conservation relation provides a powerful tool for mean square convergence analysis [29–33]. In our recent studies [16,34,46], this important relation has been extended into the RKHS. Before carrying out the mean square convergence analysis for the mapping update in (20), we derive the corresponding energy conservation relation. The mapping update in (20) can be expressed as a residual-mapping update:

\tilde{f}_i = \tilde{f}_{i-1} - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), \cdot)    (24)

Due to the variable kernel size, at each iteration the correction term in (24) is computed in a different RKHS. In order to derive an energy conservation relation in a fixed RKHS, one should find an RKHS that contains all the correction terms. Below we give an important lemma.

Lemma 1. Let U ⊆ ℝ^m be any set with nonempty interior. Then the RKHS ℋ_{σ'} induced by the Gaussian kernel κ_{σ'}(u, u') on U contains the function κ_σ(u, ·) if and only if σ > √2 σ'/2. For such σ, the function κ_σ(u, ·) has norm in ℋ_{σ'}

\| \kappa_\sigma(u, \cdot) \|_{\mathcal{H}_{\sigma'}} = \left( \frac{\sigma^2}{\sigma' \sqrt{2\sigma^2 - \sigma'^2}} \right)^{m/2}    (25)

Remark. The above lemma is a direct consequence of Theorem 2 in [35]. For the case σ < √2 σ'/2, the Hilbert space ℋ_{σ'} will not contain the function κ_σ(u, ·). It is worth noting that in this case the function κ_σ(u, ·) can still be arbitrarily "close" to a member of ℋ_{σ'}, because ℋ_{σ'} is dense in the space of continuous functions on U provided U is compact.

Now we select a fixed kernel size σ' satisfying σ' < √2 σ_min, where σ_min = min{σ_i}. By Lemma 1, the RKHS ℋ_{σ'} contains all the correction terms. By the reproducing property of ℋ_{σ'}, the prediction error e(i), the a priori error e_a(i), and the a posteriori error e_p(i) can be expressed as

\begin{cases}
e(i) = \langle \tilde{f}_{i-1}, \kappa_{\sigma'}(u(i), \cdot) \rangle_{\mathcal{H}_{\sigma'}} + v(i) \\
e_a(i) = \langle \tilde{f}_{i-1}, \kappa_{\sigma'}(u(i), \cdot) \rangle_{\mathcal{H}_{\sigma'}} \\
e_p(i) = \langle \tilde{f}_i, \kappa_{\sigma'}(u(i), \cdot) \rangle_{\mathcal{H}_{\sigma'}}
\end{cases}    (26)

Further, one can derive the relationship between e_a(i) and e_p(i):

e_p(i) = e_a(i) - \eta \, e(i) \, \kappa_{\sigma_i}(u(i), u(i)) = e_a(i) - \eta \, e(i)    (27)

Hence

\tilde{f}_i = \tilde{f}_{i-1} + \left( e_p(i) - e_a(i) \right) \kappa_{\sigma_i}(u(i), \cdot)    (28)

Squaring both sides of (28) in the RKHS ℋ_{σ'}, we derive

\begin{aligned}
\| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 &= \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 + \left( e_p(i) - e_a(i) \right)^2 \langle \kappa_{\sigma_i}(u(i), \cdot), \kappa_{\sigma_i}(u(i), \cdot) \rangle_{\mathcal{H}_{\sigma'}} + 2\left( e_p(i) - e_a(i) \right) \langle \tilde{f}_{i-1}, \kappa_{\sigma_i}(u(i), \cdot) \rangle_{\mathcal{H}_{\sigma'}} \\
&= \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 + \left( e_p(i) - e_a(i) \right)^2 \langle \kappa_{\sigma_i}(u(i), \cdot), \kappa_{\sigma'}(u(i), \cdot) + \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} + 2\left( e_p(i) - e_a(i) \right) \langle \tilde{f}_{i-1}, \kappa_{\sigma'}(u(i), \cdot) + \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \\
&= \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 + e_p^2(i) - e_a^2(i) + \varepsilon(i)
\end{aligned}    (29)

where \| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 = \langle \tilde{f}_i, \tilde{f}_i \rangle_{\mathcal{H}_{\sigma'}} is named the residual mapping power (RMP) at iteration i, \delta_i(\cdot) = \kappa_{\sigma_i}(u(i), \cdot) - \kappa_{\sigma'}(u(i), \cdot), and

\varepsilon(i) = \left( e_p(i) - e_a(i) \right)^2 \langle \kappa_{\sigma_i}(u(i), \cdot), \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} + 2\left( e_p(i) - e_a(i) \right) \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}}    (30)

It follows that

\| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 + e_a^2(i) = \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 + e_p^2(i) + \varepsilon(i)    (31)

Remark. Eq. (31) is referred to as the energy conservation relation for the KLMS with adaptive kernel size. If the kernel size is fixed, say σ_i ≡ σ', we have δ_i(·) = 0 and hence ε(i) = 0. In this case, (31) reduces to the original energy conservation relation for the KLMS [46]:

\| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 + e_a^2(i) = \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 + e_p^2(i)    (32)

which, in form, is identical to the energy conservation relation for the normalized LMS (NLMS) algorithm.

B. Sufficient condition for mean square convergence

Substituting e_p(i) = e_a(i) - η e(i) into (31) and taking expectations of both sides, we have

\begin{aligned}
E\left[ \| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 \right] - E\left[ \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 \right] &= -2\eta \, E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] + \eta^2 E\left[ e^2(i) \left( 1 + \langle \kappa_{\sigma_i}(u(i), \cdot), \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] \\
&\overset{(b)}{=} -2\eta \, E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] + \eta^2 E\left[ e^2(i) \left( 1 + \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right) \right]
\end{aligned}    (33)

where (b) comes from

\langle \kappa_{\sigma_i}(u(i), \cdot), \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} = \langle \delta_i(\cdot) + \kappa_{\sigma'}(u(i), \cdot), \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} = \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 + \delta_i(u(i)) = \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2    (34)

since \delta_i(u(i)) = \kappa_{\sigma_i}(u(i), u(i)) - \kappa_{\sigma'}(u(i), u(i)) = 0. It follows easily that

E\left[ \| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 \right] \le E\left[ \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 \right] \quad \Longleftarrow \quad \eta \le \frac{ 2 E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] }{ E\left[ e^2(i) \left( 1 + \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right) \right] }    (35)
If for all i the step size η satisfies the inequality

0 < \eta \le \frac{ 2 E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] }{ E\left[ e^2(i) \left( 1 + \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right) \right] }    (36)

the RMP in the RKHS ℋ_{σ'} will be monotonically decreasing (and hence convergent). The inequality (36) implies, for all i,

E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] > 0

So a sufficient condition for the mean square convergence (monotonic decrease of the RMP) is, for all i,

\begin{cases}
E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] > 0 \\[4pt]
0 < \eta \le \dfrac{ 2 E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] }{ E\left[ e^2(i) \left( 1 + \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right) \right] }
\end{cases}    (37)

Remark. In expectation, e²(i) is at least the noise variance, whereas e_a(i) tends to zero (or very close to zero) if all goes well. So in order for the step size η to satisfy (36), it should be a variable step size that approaches zero at some rate. In addition, if (37) holds, one can only conclude that the RMP will monotonically decrease, not that it necessarily converges to zero.

C. Steady-state excess mean square error

Further, we take the limit of (33) as i → ∞:

\lim_{i \to \infty} E\left[ \| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 \right] - \lim_{i \to \infty} E\left[ \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 \right] = -2\eta \lim_{i \to \infty} E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] + \eta^2 \lim_{i \to \infty} E\left[ e^2(i) \left( 1 + \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right) \right]    (38)

If the RMP reaches the steady state,

\lim_{i \to \infty} E\left[ \| \tilde{f}_i \|_{\mathcal{H}_{\sigma'}}^2 \right] = \lim_{i \to \infty} E\left[ \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}}^2 \right]    (39)

then it holds that

\lim_{i \to \infty} E\left[ e(i) \left( e_a(i) + \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right) \right] = \frac{\eta}{2} \lim_{i \to \infty} E\left[ e^2(i) \left( 1 + \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right) \right]    (40)

In order to derive the steady-state excess mean-square error (EMSE) \lim_{i \to \infty} E[e_a^2(i)] (the a priori error power is also referred to as the excess mean-square error in adaptive filtering), we give two assumptions:

A1: The noise v(i) is zero-mean, independent, identically distributed, and independent of the input sequence {u(i)};
A2: The squared a priori error e_a²(i) and ‖δ_i(·)‖²_{ℋ_{σ'}} are uncorrelated.

Remark. Assumption A1 is commonly used in the convergence analysis of adaptive filtering algorithms [29]. This assumption implies the independence between v(i) and e_a(i).

Under assumptions A1 and A2, (40) becomes

\lim_{i \to \infty} E\left[ e_a^2(i) \right] + \lim_{i \to \infty} E\left[ e_a(i) \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right] = \frac{\eta}{2} \left( \lim_{i \to \infty} E\left[ e_a^2(i) \right] + \xi_v^2 \right) \left( 1 + \lim_{i \to \infty} E\left[ \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right] \right)    (41)

where ξ_v² is the noise variance. Then we have

\lim_{i \to \infty} E\left[ e_a^2(i) \right] = \frac{ \eta \xi_v^2 (1 + \varsigma) - 2\tau }{ 2 - \eta (1 + \varsigma) }    (42)

where \varsigma = \lim_{i \to \infty} E\left[ \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right] and \tau = \lim_{i \to \infty} E\left[ e_a(i) \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right]. By Lemma 1, ς can also be expressed as

\varsigma = \lim_{i \to \infty} E\left[ \| \kappa_{\sigma_i}(u(i), \cdot) - \kappa_{\sigma'}(u(i), \cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right] = \lim_{i \to \infty} E\left[ \| \kappa_{\sigma_i}(u(i), \cdot) \|_{\mathcal{H}_{\sigma'}}^2 \right] - 1 = \lim_{i \to \infty} E\left[ \left( \frac{\sigma_i^2}{\sigma' \sqrt{2\sigma_i^2 - \sigma'^2}} \right)^m \right] - 1    (43)

Remark. Although both ς and τ depend on the kernel size σ', the steady-state EMSE itself does not depend on σ'. This can be easily understood by noticing that the residual mapping \tilde{f}_i has no relation to σ'.

To further investigate the steady-state EMSE, we consider the case in which, as i → ∞, σ_i and σ_∞ are very close and satisfy |σ_i - σ_∞| < (2 - √2)σ_∞/2, where σ_∞ = \lim_{i \to \infty} E[σ_i]. In this case we have σ_∞ < √2 σ_i as i → ∞. Then at the steady-state stage, we can set σ' = σ_∞. Therefore

\varsigma = \lim_{i \to \infty} E\left[ \left( \frac{\sigma_i^2}{\sigma_\infty \sqrt{2\sigma_i^2 - \sigma_\infty^2}} \right)^m \right] - 1 \approx 0    (44)

and

|\tau| = \left| \lim_{i \to \infty} E\left[ e_a(i) \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right] \right| \le \lim_{i \to \infty} E\left[ \left| e_a(i) \right| \left| \langle \tilde{f}_{i-1}, \delta_i(\cdot) \rangle_{\mathcal{H}_{\sigma'}} \right| \right] \le \lim_{i \to \infty} E\left[ \left| e_a(i) \right| \, \| \tilde{f}_{i-1} \|_{\mathcal{H}_{\sigma'}} \| \delta_i(\cdot) \|_{\mathcal{H}_{\sigma'}} \right] \approx 0    (45)

It follows that

\lim_{i \to \infty} E\left[ e_a^2(i) \right] = \frac{ \eta \xi_v^2 (1 + \varsigma) - 2\tau }{ 2 - \eta (1 + \varsigma) } \approx \frac{ \eta \xi_v^2 }{ 2 - \eta }    (46)

Remark. It has been shown that the steady-state EMSE of the KLMS (with a fixed kernel size) equals η ξ_v²/(2 - η), which is not related to the specific value of the kernel size [46]. From (46) one observes that, when the kernel size σ_i converges to a neighborhood of a certain constant σ_∞, the adaptation of the kernel size also has little effect on the steady-state EMSE. This will be confirmed later by simulation results. We should point out that, although the kernel size may have little effect on the KLMS steady-state accuracy (in terms of the EMSE), it has a significant influence on the convergence speed. In most practical situations, the training data are finite and the algorithm may never reach the steady state. In these cases the kernel size also has a significant influence on the final accuracy (not the steady-state accuracy).

5. Simulation results
In this section, we present simulation results that illustrate the performance of the proposed algorithm. The illustrative examples presented include static function approximation, short-term chaotic time series prediction and real world Internet traffic prediction.
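Throughout the examples, the learning curves are testing-MSE (or EMSE) curves obtained by evaluating the filter learned up to each iteration on held-out data, as described in the text. A minimal evaluation loop of this kind (our own scaffolding, usable with either of the sketches above or any object exposing update/predict) is:

```python
import numpy as np

def testing_mse_curve(model, train_pairs, test_inputs, test_targets):
    """Train online; after each update, evaluate the current filter on a held-out test set."""
    test_targets = np.asarray(test_targets, dtype=float)
    curve = []
    for u, y in train_pairs:
        model.update(u, y)
        preds = np.array([model.predict(ut) for ut in test_inputs])
        curve.append(np.mean((test_targets - preds) ** 2))
    return curve
```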
1) Static function approximation
Consider a simple static function estimation problem in which the desired output data are generated by

y(i) = \cos(8 u(i)) + v(i)    (47)

where the input u(i) is uniformly distributed over [-π, π], and v(i) is white Gaussian noise with variance 0.0001. For the KLMS with different kernel sizes, the average convergence curves (in terms of the EMSE) over 1000 Monte Carlo runs are shown in Fig. 1. In the simulation, the step sizes for all the cases are set at η = 0.5. For the KLMS with adaptive kernel size, the step size for the kernel size adaptation is set at ρ = 0.025, and the initial kernel size is set to 1.0. From Fig. 1, we see clearly that the kernel size has a significant influence on the convergence speed. In this example, the kernel sizes σ = 1.0 and σ = 0.5 produce a rather slow convergence speed. The kernel sizes σ = 0.05 and σ = 0.1 work very well, and in particular, the kernel size σ = 0.1 achieves a fast convergence speed and the smallest final EMSE (at the 5000th iteration). The kernel size σ = 0.35 (selected by Silverman's rule) works, but obviously the performance is not as good. Although the initial kernel size is set to 1.0 (with which the algorithm is almost stalled), the KLMS with adaptive kernel size (σ_i) can still converge to a very small EMSE at the final iteration. This can be clearly explained from Fig. 2, where the evolution curve of the adaptive kernel size σ_i is plotted: the adaptive kernel size σ_i converges to a desirable value between 0.1 and 0.2.

Fig. 1. Convergence curves with different kernel sizes.

Fig. 2. Evolution of the adaptive kernel size σ_i.

Fig. 3 shows the learned mappings at the final iteration for different kernel sizes. The desired mapping f(u) = cos(8u) is also plotted in Fig. 3 for comparison purposes. One can see that for the cases σ = 0.05, σ = 0.1 and σ = σ_i, the learned mappings match the desired mapping very well, while for σ = 1.0 and σ = 0.5 the learned mappings deviate severely from the desired function. For the kernel size σ = 0.35, there is still some visible deviation between the learned mapping and the desired one. A more detailed comparison is presented in Table 1, where the EMSE at the final iteration is summarized.

Fig. 3. Learned mappings at final iteration.

Table 1
EMSE at final iteration for different kernel sizes.

Kernel size             EMSE at final iteration
σ = 0.05                0.00006 ± 0.00022
σ = 0.1                 0.00005 ± 0.00010
σ = 0.35                0.0019 ± 0.0068
σ = 0.5                 0.3798 ± 0.4468
σ = 1.0                 0.6573 ± 0.6846
Adaptive kernel size    0.00007 ± 0.00027

As illustrated in Fig. 1, the initial convergence speed of the KLMS with adaptive kernel size can still be very slow if the initial kernel size is inappropriately chosen. In order to improve the initial convergence speed, one can select a suitable initial kernel size using a method such as Silverman's rule. For the present example, if we set the initial kernel size to 0.35, the convergence speed of the new algorithm is improved significantly. This can be clearly seen from Fig. 4, in which the learning curves for σ = 0.1 and σ = σ_i (with σ_0 = 0.35) are shown.

Fig. 4. Convergence curves for σ = 0.1 and the adaptive kernel size with initial value determined by Silverman's rule.

The kernel size influences the convergence speed (see Fig. 1) and the final accuracy with finite training data (see Table 1), but it has little effect on the steady-state EMSE with infinite training data. In order to confirm this theoretical prediction, we perform another set of simulations with the same settings, except that now many more iterations are run. For different kernel sizes and iterations, the EMSEs (obtained as averages over a window of 2000 samples) are listed in Table 2. One can see that for the kernel sizes σ = 0.05, σ = 0.1 and σ = σ_i, the algorithms almost reach the steady state before the 100,000th iteration. For the kernel size σ = 0.35, the algorithm attains its steady state at around the 800,000th iteration. For the cases σ = 0.5 and σ = 1.0, it is hard to obtain the steady-state EMSE via simulation since the convergence speed is too slow.
In Table 2, the simulated steady-state EMSEs for different kernel sizes (σ = 0.05, σ = 0.1, σ = σ_i and σ = 0.35) are very close and approach 0.000033333, the theoretical value of the steady-state EMSE calculated using (46).

Table 2
EMSE at different iterations for different kernel sizes.

Kernel size             10,000        50,000        100,000       200,000       800,000
σ = 0.05                0.00004802    0.00003507    0.00003332    0.00003334    0.00003333
σ = 0.1                 0.00003977    0.00003332    0.00003333    0.00003331    0.00003332
σ = 0.35                0.001042      0.0001523     0.00008837    0.00005514    0.00003334
σ = 0.5                 0.2686        0.05134       0.01211       0.001312      0.0001208
σ = 1.0                 0.6457        0.6207        0.6024        0.5830        0.5684
Adaptive kernel size    0.00004859    0.00003581    0.00003334    0.00003332    0.00003334

2) Short term chaotic time series prediction

The second example is about short-term chaotic time series prediction. Consider the Lorenz oscillator whose state equations are

\begin{cases}
\dfrac{dx}{dt} = -\beta x + y z \\[4pt]
\dfrac{dy}{dt} = \delta (z - y) \\[4pt]
\dfrac{dz}{dt} = -x y + \rho y - z
\end{cases}    (48)

where the parameters are set as β = 4.0, δ = 30, and ρ = 45.92. The second state y is picked for the prediction task, and a sample time series is shown in Fig. 5.

Fig. 5. Time series produced by Lorenz system.
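For readers who wish to reproduce a series of this kind, the following sketch generates the second state of (48) with a simple Euler discretization and frames it as a five-tap prediction problem, as described below. The integration step, the initial state and the one-sample-per-step sampling are our own assumptions; the paper does not specify how the series was generated numerically.

```python
import numpy as np

def lorenz_series(n, dt=0.001, beta=4.0, delta=30.0, rho=45.92, x0=(1.0, 1.0, 1.0)):
    """Second state y of the Lorenz system (48), integrated by the Euler method."""
    x, y, z = x0
    out = np.empty(n)
    for i in range(n):
        dx = -beta * x + y * z
        dy = delta * (z - y)
        dz = -x * y + rho * y - z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out[i] = y
    return out

# Five-tap prediction pairs: input u(i) = [y(i-5), ..., y(i-1)], target y(i)
series = lorenz_series(1100)
pairs = [(series[i - 5:i], series[i]) for i in range(5, len(series))]
```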
Here the goal is to predict the value of the current sample y(i) using the previous five consecutive samples. We continue to compare the performances of the KLMS with different kernel sizes. In the simulations below, the step sizes for the mapping update are set at η = 0.1, and the step size for the kernel size adaptation is set at ρ = 0.05. The initial value of the adaptive kernel size σ_i is set to 1.0. For the kernel sizes σ = 1.0, 5.5 (selected by Silverman's rule), 10, 15, 20, 30 and the adaptive kernel size σ_i, the convergence curves in terms of the testing MSE are illustrated in Fig. 6. For each kernel size, 20 independent simulations were run with different segments of the time series. In each segment, 1000 samples are used as the training data and another 100 as the test data. At each iteration, the testing MSE is computed based on the test data using the learned filter at that iteration. It can be seen from Fig. 6 that the kernel size σ = 15 achieves the best performance with the smallest testing MSE at the final iteration. Also, as expected, the adaptive kernel size yields a satisfactory performance very close to the best result.

Fig. 6. Convergence curves in terms of the testing MSE for different kernel sizes.
Fig. 7 shows the evolution of the adaptive kernel size (see the solid line). Interestingly, we observe that the adaptive kernel size σ_i converges quickly to very close to the desirable value 15. The mean ± deviation results of the testing MSE at the final iteration are given in Table 3.

Fig. 7. Evolution of the kernel size σ_i in Lorenz time series prediction.

Table 3
Testing MSE at final iteration.

Kernel size             Testing MSE
σ = 1.0                 85.3740 ± 2.6436
σ = 5.5                 1.8477 ± 0.2751
σ = 10                  1.0450 ± 0.1338
σ = 15                  0.9490 ± 0.0276
σ = 20                  1.3876 ± 0.0281
σ = 30                  3.4480 ± 0.1370
Adaptive kernel size    1.0264 ± 0.1018

Further, we would like to evaluate the performance when the quantization method is applied to curb the growth of the network size. The experimental setting is the same as before, except that now the quantization method is used and the quantization size is set at ε = 4.0. The convergence curves for different kernel sizes are shown in Fig. 8. Once again, the adaptive kernel size works very well and obtains a performance very close to the best one. As shown in Fig. 7, the adaptive kernel size σ_i still converges close to the value 15 (see the dotted line). Fig. 9 illustrates the evolution curve of the network size. One can see that with quantization the network size grows very slowly, and the final network size is only around 75. Table 4 shows the mean ± deviation results of the testing MSE at the final iteration.

Fig. 8. Convergence curves in terms of the testing MSE for different kernel sizes (with quantization).

Fig. 9. Network size evolution in Lorenz time series prediction (with quantization).

Table 4
Testing MSE at final iteration (with quantization).

Kernel size             Testing MSE
σ = 1.0                 109.2801 ± 1.1974
σ = 5.5                 2.5486 ± 0.4112
σ = 10                  1.3183 ± 0.1509
σ = 15                  1.2982 ± 0.0262
σ = 20                  2.1173 ± 0.0215
σ = 30                  4.8447 ± 0.2272
Adaptive kernel size    1.2626 ± 0.0890
3) Internet traffic prediction

In the third example, we consider a real-world Internet traffic dataset from the homepage of Prof. Paulo Cortez (http://www3.dsi.uminho.pt/pcortez/series/A5M.txt). Fig. 10 illustrates a sequence of the Internet traffic samples. The task is to predict the current value of the sample using the previous ten consecutive samples. For convenience of computation, the Internet traffic data are normalized into the interval [0, 1]. Again we compare the performance of the KLMS with different kernel sizes. In the simulation, the step sizes for the mapping update are set at η = 0.01, and the step size for the kernel size adaptation is set to ρ = 0.025. The initial value of the adaptive kernel size (σ_i) is 0.05. For the fixed kernel sizes (0.05, 0.1, 0.5, 1.0, 5.0) and the adaptive kernel size σ_i, the convergence curves in terms of the testing MSE are shown in Fig. 11. In this example, 4000 samples are used as the training data and another 500 samples as the test data. At each iteration, the testing MSE is computed based on the test data using the learned filter at that iteration. From Fig. 11, we see that the kernel size σ = 0.5 achieves the best performance with the smallest testing MSE during the final stage (strictly speaking, in this figure the kernel size σ = 1.0 is also good, so the interval [0.5, 1.0] is a good range for the kernel size). Also evidently, the proposed adaptive kernel size achieves a satisfactory performance that is close to the best one. The evolution curve of the adaptive kernel size is shown in Fig. 12. One can observe that the adaptive kernel size σ_i changes very quickly at the initial stage and converges slowly to a neighborhood of the desirable value 0.5 at the final stage. The results on the real-world Internet traffic dataset confirm again the effectiveness of the proposed algorithm.

Fig. 10. Internet traffic time series.

Fig. 11. Convergence curves with different kernel sizes in Internet traffic prediction.

Fig. 12. Evolution of the kernel size σ_i in Internet traffic prediction.
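The data preparation used in this example (min-max normalization to [0, 1] and a ten-tap time embedding) can be sketched as follows. This is our own scaffolding; the URL is taken from the text, and we assume the file is a plain numeric column once downloaded (otherwise header lines would need to be skipped).

```python
import numpy as np

# Assume the traffic series from A5M.txt has been saved locally, one value per line.
raw = np.loadtxt("A5M.txt")

# Normalize into [0, 1] as described in the text, then build ten-tap input/target pairs.
x = (raw - raw.min()) / (raw.max() - raw.min())
pairs = [(x[i - 10:i], x[i]) for i in range(10, len(x))]
```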
6. Conclusion

The kernel function implicitly defines the feature space and plays a central role in all kernel methods. In kernel adaptive filtering (KAF) algorithms, the Gaussian kernel (a radial basis function kernel) is usually the default kernel. The kernel size (or bandwidth) of the Gaussian kernel controls the smoothness of the mapping and has a significant influence on the learning performance. How to select a proper kernel size is thus a crucial problem in KAF algorithms. Some existing techniques (e.g. Silverman's rule) for selecting a kernel size can be applied, but they are not appropriate for a KAF algorithm, since the problem is approximation in a joint space (the input and the desired), which is different from density estimation. In this work, we propose an approach for sequentially optimizing the kernel size for the kernel least mean square (KLMS), a simple yet efficient KAF algorithm. At each iteration cycle, the kernel size is adjusted by a stochastic gradient based algorithm to minimize the mean square error. The proposed algorithm is computationally very simple and easy to implement. Theoretical results on convergence are also presented. Based on the energy conservation relation in RKHS, we derive a sufficient condition for the mean square convergence and obtain a theoretical steady-state excess mean-square error (EMSE). Simulation results confirm the theoretical predictions and show that the adaptive kernel size can automatically converge to a proper value, helping KLMS converge faster and achieve better accuracy. In future work, it will be of interest to extend this approach to kernels of any form (not restricted to the Gaussian kernel). In particular, we will study how to sequentially optimize the kernel function using the ideas of multi-kernel learning or learning the kernel. Another interesting line of study is how to jointly optimize the kernel size and the step size.

Acknowledgments

The authors would like to thank Mr. Zhengda Qin, a student at Xi'an Jiaotong University, for performing the simulations on Internet traffic prediction. This work was supported by the National Natural Science Foundation of China (No. 61372152) and the 973 Program (No. 2015CB351703).
References

[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995. [2] B. Scholkopf, A.J. Smola, Learning with Kernels, Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, Cambridge, MA, USA, 2002. [3] F. Girosi, M. Jones, T. Poggio, Regularization theory and neural networks architectures, Neural Comput. 7 (1995) 219–269. [4] B. Scholkopf, A.J. Smola, K. Muller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (1998) 1299–1319.
[5] M.H. Yang, Kernel eigenfaces vs kernel fisherfaces: face recognition using kernel methods, in Proceedings of the 5th IEEE ICAFGR, Washington D.C., 2002, pp. 215–220. [6] W. Liu, J. Principe, S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010. [7] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68 (1950) 337–404. [8] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (1998) 121–167. [9] W. Liu, P. Pokharel, J. Principe, The kernel least mean square algorithm, IEEE Trans. Signal Process. 56 (2008) 543–554. [10] P. Bouboulis, S. Theodoridis, Extension of Wirtinger's calculus to reproducing kernel Hilbert spaces and the complex kernel LMS, IEEE Trans. Signal Process. 59 (2011) 964–978. [11] W. Liu, J. Principe, Kernel affine projection algorithm, EURASIP J. Adv. Signal Process. (2008) 12, http://dx.doi.org/10.1155/2008/784292. [12] Y. Engel, S. Mannor, R. Meir, The kernel recursive least-squares algorithm, IEEE Trans. Signal Process. 52 (2004) 2275–2285. [13] W. Liu, Il Park, Y. Wang, J.C. Principe, Extended kernel recursive least squares algorithm, IEEE Trans. Signal Process. 57 (2009) 3801–3814. [14] J. Platt, A resource-allocating network for function interpolation, Neural Comput. 3 (1991) 213–225. [15] W. Liu, Il Park, J.C. Principe, An information theoretic approach of designing sparse kernel adaptive filters, IEEE Trans. Neural Netw. 20 (2009) 1950–1961. [16] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel least mean square algorithm, IEEE Trans. Neural Netw. Learn. Syst. 23 (1) (2012) 22–32. [17] G. Wahba, Spline Models for Observational Data, SIAM, Philadelphia, PA, 1990. [18] W. Hardle, Applied Nonparametric Regression, Cambridge University Press, Cambridge, UK, 1992. [19] J. Racine, An efficient cross-validation algorithm for window width selection for nonparametric kernel regression, Commun. Stat.: Simulat. Comput. 22 (1993) 1107–1114. [20] G.C. Cawlay, N.L.C. Talbot, Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recogn. 36 (2003) 2585–2592. [21] S. An, W. Liu, S. Venkatesh, Fast cross-validation algorithm for least squares support vector machine and kernel ridge regression, Pattern Recogn. 40 (2007) 2154–2162. [22] E. Herrmann, Local bandwidth choice in kernel regression estimation, J. Comput. Graph. Stat. 6 (1) (1997) 35–54. [23] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, CRC, 1986. [24] M.C. Jones, J.S. Marron, S.J. Sheather, A brief survey of bandwidth selection for density estimation, J. Am. Stat. Assoc. 91 (433) (1996) 401–407. [25] C. Brunsdon, Estimating probability surfaces for geographical point data: an adaptive kernel algorithm, Comput. Geosci. 21 (7) (1995) 877–894. [26] V. Katkovnik, I. Shmulevich, Kernel density estimation with adaptive varying window size, Pattern Recog. Lett. 23 (2002) 1641–1648. [27] J. Yuan, L. Bo, K. Wang, T. Yu, Adaptive spherical Gaussian kernel in sparse Bayesian learning framework for nonlinear regression, Expert Syst. Appl. 36 (2009) 3982–3989. [28] A. Singh, J.C. Principe, Information theoretic learning with adaptive kernels, Signal Process. 91 (2011) 203–213. [29] A.H. Sayed, Fundamentals of Adaptive Filtering, Wiley, NJ, 2003. [30] T.Y. Al-Naffouri, A.H. Sayed, Adaptive filters with error nonlinearities: mean-square analysis and optimum design, EURASIP J. Appl. Signal Process.
4 (2001) 192–205. [31] N.R. Yousef, A.H. Sayed, A unified approach to the steady-state and tracking analysis of adaptive filters, IEEE Trans. Signal Processing 49 (2001) 314–324. [32] T.Y. Al-Naffouri, A.H. Sayed, Transient analysis of adaptive filters with error nonlinearities, IEEE Trans. Signal Processing 51 (2003) 653–663. [33] B. Chen, L. Xing, J. Liang, N. Zheng, J.C. Principe, Steady-state mean-square error analysis for adaptive filtering under the Maximum Correntropy Criterion, IEEE Signal Process. Lett. 21 (7) (2014) 880–884. [34] S. Zhao, B. Chen, J.C. Principe, Kernel adaptive filtering with maximum correntropy criterion, in Proceedings of International Joint Conference on Neural Networks (IJCNN), San Jose, California, USA, 2011, pp. 2012–2017. [35] H.Q. Minh, Further properties of Gaussian reproducing kernel Hilbert spaces, 2012. 〈http://arxiv.org/abs/1210.6170〉. [36] J. Kivinen, A.J. Smola, R.C. Williamson, Online learning with kernels, IEEE Trans. Signal Process. 52 (8) (2004) 2165–2176. [37] K. Slavakis, S. Theodoridis, I. Yamada, Online kernel-based classification using adaptive projection algorithms, IEEE Trans. Signal Process. 56 (7) (2008) 2781–2796. [38] F. Orabona, J. Keshet, B. Caputo, Bounded kernel-based online learning, J. Mach. Learn. Res. 10 (2009) 2643–2666. [39] P. Zhao, S.C.H. Hoi, R. Jin, Double updating online learning, J. Mach. Learn. Res. 12 (2011) 1587–1615. [40] M. Gonen, E. Alpaydin, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[41] M. Herbster, Relative loss bounds and polynomial time predictions for the KLMS-NET algorithm, in Proceedings of the ALT 2004, 2004, pp. 309–323. [42] C.S. Ong, A.J. Smola, R.C. Williamson, Learning the kernel with hyperkernels, J. Mach. Learn. Res. 6 (2005) 1043–1071. [43] A. Argyriou, C.A. Micchelli, M. Pontil, Learning convex combinations of continuously parameterized basic kernels, COLT 2005 (2005) 338–352. [44] R. Jin, S.C.H. Hoi, T. Yang, Online multiple kernel learning: algorithms and mistake bounds, in Proceedings of the ALT 2010, 2010, pp. 390–404. [45] F. Orabona, J. Luo, B. Caputo, Multi kernel learning with online-batch optimization, J. Mach. Learn. Res. 13 (2012) 165–191. [46] B. Chen, S. Zhao, P. Zhu, J. Principe, Mean square convergence analysis for kernel least mean square algorithm, Signal Process. 92 (2012) 2624–2632. [47] Dekel Ofer, Shai Shalev-shwartz, Yoram Singer, The Forgetron: A Kernel-based Perceptron on a Fixed Budget, Advances in Neural Information Processing Systems (NIPS), 2006. [48] Orabona Francesco, Joseph Keshet, Barbara Caputo, The projectron: a bounded kernel-based perceptron, in Proceedings of the 25th international conference on Machine learning (ICML), ACM, 2008. [49] Wang Zhuang, Slobodan Vucetic, Online passive-aggressive algorithms on a budget, in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010. [50] Zhao, Peilin, Steven C.H. Hoi, Bduol: double updating online learning on a fixed budget, in: Machine Learning and Knowledge Discovery in Databases, Springer, Berlin Heidelberg, 2012, pp. 810–826. [51] Zhao, Peilin, et al., Fast bounded online gradient descent algorithms for scalable kernel-based online learning, arXiv preprint arXiv:1206.4633, 2012. [52] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel recursive least squares algorithm, IEEE Trans. Neural Netw. Learn. Syst. 24 (9) (2013) 1484–1491. [53] Steven C.H. Hoi, Rong Jin, Peilin Zhao, Tianbao Yang, Online multiple kernel classification, Mach. Learn. 90 (2) (2013) 289–316. [54] Steven C.H. Hoi, Jialei Wang, Peilin Zhao, Libol: a library for online learning algorithms, J. Mach. Learn. Res. 15 (2014) 495–499. [55] Peilin Zhao, Steven C.H. Hoi, Jialei Wang, Bin Li, Online transfer learning, Artif. Intell. 216 (2014) 76–102. [56] Wenwu He, James T. Kwok, Simple randomized algorithms for online learning with kernels, Neural Netw. 60 (2014) 17–24. [57] X. Xu, H. Qu, J. Zhao, X. Yang, B. Chen, Quantized kernel least mean square with desired signal smoothing, Electron. Lett. 51 (18) (2015) 1457–1459. [58] C. Saide, R. Lengelle, P. Honeine, C. Richard, R. Achkar, Nonlinear adaptive filtering using kernel based algorithms with dictionary adaptation, Int. J. Adapt. Control Signal Process. (2015), http://dx.doi.org/10.1002/acs.2548.
Badong Chen received the B.S. and M.S. degrees in control theory and engineering from Chongqing University, in 1997 and 2003, respectively, and the Ph.D. degree in computer science and technology from Tsinghua University in 2008. He was a Post-Doctoral Researcher with Tsinghua University from 2008 to 2010, and a Post-Doctoral Associate at the University of Florida Computational NeuroEngineering Laboratory (CNEL) during the period October, 2010 to September, 2012. He is currently a professor at the Institute of Artificial Intelligence and Robotics (IAIR), Xi'an Jiaotong University. His research interests are in signal processing, information theory, machine learning, and their applications in cognitive science and engineering. He has published 2 books, 3 chapters, and over 100 papers in various journals and conference proceedings. Dr. Chen is an IEEE senior member and an associate editor of IEEE Transactions on Neural Networks and Learning Systems and has been on the editorial boards of Applied Mathematics and Entropy.
Junli Liang received the Ph.D. degree in signal and information processing from the Institute of Acoustics, Chinese Academy of Sciences. At present, he works as a professor in the School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China. His research interests include radar and biomedical imaging, image processing, signal processing, and pattern recognition, etc.
Nanning Zheng graduated from the Department of Electrical Engineering, Xi'an Jiaotong University, Xi'an, China, in 1975, and received the M.S. degree in information and control engineering from Xi'an Jiaotong University in 1981 and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1985. He is currently a professor and director of the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include computer vision, pattern recognition and image processing, and hardware implementation of intelligent systems. Prof. Zheng became a member of the Chinese Academy of Engineering in 1999, and he is the Chinese Representative on the Governing Board of the International Association for Pattern Recognition. He is an IEEE Fellow and also serves as an executive deputy editor of the Chinese Science Bulletin.
Jose C. Principe is currently the Distinguished Professor of Electrical and Biomedical Engineering at the University of Florida, Gainesville, where he teaches advanced signal processing and artificial neural networks (ANNs) modeling. He is BellSouth Professor and Founder and Director of the University of Florida Computational Neuro-Engineering Laboratory (CNEL). He is involved in biomedical signal processing, in particular, the electroencephalogram (EEG) and the modeling and applications of adaptive systems. Dr. Príncipe is the past Editor-in-Chief of the IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, past President of the International Neural Network Society, and former Secretary of the Technical Committee on Neural Networks of the IEEE Signal Processing Society. He is an IEEE Fellow and an AIMBE Fellow and a recipient of the IEEE Engineering in Medicine and Biology Society Career Service Award. He is also a former member of the Scientific Board of the Food and Drug Administration, and a member of the Advisory Board of the McKnight Brain Institute at the University of Florida.