Kernel least mean square with adaptive kernel size

Kernel least mean square with adaptive kernel size

Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ Contents lists available at ScienceDirect Neurocomputing journal homepage: www.elsevier.com/locate/neucom Kernel le...

1MB Sizes 2 Downloads 135 Views

Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Kernel least mean square with adaptive kernel size Badong Chen a,n, Junli Liang b, Nanning Zheng a, José C. Príncipe a,c a b c

Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an 710049, China School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA

art ic l e i nf o

a b s t r a c t

Article history: Received 13 February 2014 Received in revised form 1 October 2015 Accepted 6 January 2016 Communicated by: Steven Hoi

Kernel adaptive filters (KAF) are a class of powerful nonlinear filters developed in Reproducing Kernel Hilbert Space (RKHS). The Gaussian kernel is usually the default kernel in KAF algorithms, but selecting the proper kernel size (bandwidth) is still an open important issue especially for learning with small sample sizes. In previous research, the kernel size was set manually or estimated in advance by Silverman's rule based on the sample distribution. This study aims to develop an online technique for optimizing the kernel size of the kernel least mean square (KLMS) algorithm. A sequential optimization strategy is proposed, and a new algorithm is developed, in which the filter weights and the kernel size are both sequentially updated by stochastic gradient algorithms that minimize the mean square error (MSE). Theoretical results on convergence are also presented. The excellent performance of the new algorithm is confirmed by simulations on static function estimation, short term chaotic time series prediction and real world Internet traffic prediction. & 2016 Published by Elsevier B.V.

Keywords: Kernel methods Kernel adaptive filtering Kernel least mean square Kernel selection

1. Introduction Kernel based methods are successfully used in machine learning and nonlinear signal processing due to their inherent advantages of convex optimization and universality in the space of L2 functions. By mapping the input data into a feature space associated with a Mercer kernel, many efficient nonlinear algorithms can be developed, thanks to the kernel trick. Popular kernel methods include support vector machine (SVM) [1,2], kernel regularization network [3], kernel principal component analysis (KPCA) [4], and kernel Fisher discriminant analysis (KFDA) [5], etc. These nonlinear algorithms show significant performance improvement over their linear counterparts. Online kernel learning [36–39,47–58] has also been extensively studied in the machine learning and statistical signal processing literature, and it provides efficient alternatives to approximate a desired nonlinearity incrementally. As the training data are sequentially presented to the learning system, online learning requires, in general, much less memory and computational cost. Recently, kernel based online algorithms for adaptive filtering have been developed and have become an emerging area of research [6]. Kernel adaptive filters (KAF) are derived in Reproducing Kernel Hilbert Spaces (RKHS) [7,8], by using the linear structure and inner n

Corresponding author. Tel.: þ 86 29 8266 8802x8009; fax: þ 86 29 8266 8672. E-mail addresses: [email protected] (B. Chen), [email protected] (J. Liang), [email protected]fl.edu (J.C. Príncipe).

product of this space to implement the well-established linear adaptive filtering algorithms that correspond to nonlinear filters in the original input space. Typical KAF algorithms include the kernel least mean square (KLMS) [9,10], kernel affine projection algorithms (KAPA) [11], kernel recursive least squares (KRLS) [12], and extended kernel recursive least squares (EX-KRLS) [13], etc. With a radially symmetric Gaussian kernel they create a growing radialbasis function (RBF) network to learn the network topology and adapt free parameters directly from the training data. Among these KAF algorithms, the KLMS is the simplest, and fastest to implement yet very effective. There are two main open challenges in the KAF algorithms . The first is their growing structure with each sample, which results in increasing computational costs and memory requirements especially in continuous adaptation scenarios. In order to curb the network growth and to obtain a compact representation, a variety of sparsification techniques have been applied, where only the important input data are accepted as new centers. The presently available sparsification criteria include the novelty criterion [14], approximate linear dependency (ALD) criterion [12] surprise criterion [15], and so on. In a recent work [16], we have proposed a novel method, the quantized kernel least mean square (QKLMS) algorithm, to compress the input space and hence constrain the network size which is shown to be very effective in yielding a compact network with desirable accuracy. Selecting a proper Mercer kernel is the second remaining problem that should be addressed when implementing kernel adaptive filtering algorithms, especially when the training data

http://dx.doi.org/10.1016/j.neucom.2016.01.004 0925-2312/& 2016 Published by Elsevier B.V.

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

2

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

size is small. In this case, the kernel selection includes two parts: first, the kernel type is chosen, and second, its parameters are determined. Among various kernels, the Gaussian kernel is very popular and is usually a default choice in kernel adaptive filtering due to its universal approximating capability, desirable smoothness and numeric stability. The normalized Gaussian kernel is   κ ðu; u0 Þ ¼ exp  ‖u  u0 ‖2 =2σ 2 ð1Þ where the free parameter σ (σ 4 0) is called the kernel size (also known as the kernel bandwidth or smoothing parameter). In fact, the Gaussian kernel is strictly positive definite and as such produces a RKHS that is dense [8] and as such linear algorithms in this RKHS are universal approximators of smooth L2 functions. In principle this means that in the large sample size regime the asymptotic properties of the mean square approximation are independent of the kernel size σ [46]. This means that the kernel size in KAF only affects the dynamics of learning, because in the initial steps the sample size is always small, therefore both the accuracy for batch learning and the convergence properties for online learning are dependent upon the kernel size. This should be contrasted with the effect of the kernel size in classification where the kernel size controls both the accuracy and the generalization of the optimal solution [1,2]. Up to now, there are many methods for selecting a kernel size for the Gaussian kernel borrowed from the areas of statistics, nonparametric regression and kernel density estimation. The most popular methods for the selection of the kernel size are: cross-validation (CV) [17–21] which can always be used since the kernel size is a free parameter, penalizing functions [18], plug-in methods [18,22], Silverman's rule [23] and other rules of thumb [24]. The cross-validation, penalizing functions, and plug-in methods are computationally intensive and are not suitable for online kernel learning. The Silverman's rule is widely accepted in kernel density estimation although it is derived under a Gaussian assumption and is usually not appropriate for multimodal distributions. Besides the fixed kernel size, some adaptive or varying kernel size algorithms can also be found in the literature [25–28]. This topic is also closely related to the techniques of multi-kernel learning or learning the kernel in the machine learning literature [40–45]. There the goal is typically to learn a combination of kernels based on some optimization methods, but in KAF this approach is normally avoided due to the computational complexity [6]. All the above mentioned methods, however, are not suitable for determining an optimal kernel size in online kernel adaptive filtering, since they either are batch mode methods or originate from a different problem, such as the kernel density estimation. Given that in online learning the number of samples is large and not specified a priori, the final solution will be practically independent of the kernel size. The real issue is therefore how to speed up convergence to the neighborhood of the optimal solution, which will also provide smaller network sizes. In the present work, by treating the kernel size as an extra parameter for the optimization, a novel sequential optimization framework is proposed for the KLMS algorithm. The new optimization paradigm allows for an online adaptation algorithm. At each iteration cycle, the filter weights and the kernel size are both sequentially updated to minimize the mean square error (MSE). As the kernel size is updated sequentially, the proposed algorithm is computationally very simple. The new algorithm can also be incorporated in the quantization method so as to yield a compact model. The rest of the paper is organized as follows. In Section 2, we briefly revisit the KLMS algorithm. In Section 3, we propose a sequential optimization strategy for the kernel size in KLMS, and then derive a simple stochastic gradient algorithm to adapt the kernel size. In Section 4, we give some theoretical results on the convergence. Specifically, we derive the energy conservation

relation in RKHS, and on this basis we derive a sufficient condition for the mean square convergence, and arrive at a theoretical value of the steady-state excess mean-square error (EMSE). In Section 5, we present simulation examples on static function estimation, short term chaotic time series prediction and real world Internet traffic prediction to confirm the satisfactory performance of the KLMS with adaptive kernel size. Finally, in Section 6, we present the conclusion.

2. KLMS Suppose the goal is to learn a continuous input–output mapping f : U-Y based on a sequence of input–output examples  N (training data) uðiÞ; yðiÞ i ¼ 1 , where U  ℝm is the input domain, Y  ℝ is the desired output space. The hypothesis space for learning is assumed to be a Reproducing Kernel Hilbert Space (RKHS) ℋk associated with a Mercer kernel κ ðu; u0 Þ, a continuous, symmetric, and positive-definite function κ : U  U-ℝ [7]. To find such a function f , one may solve the regularized least squares regression in ℋk : min

f A ℋk

N X i¼1

ðyðiÞ  f ðuðiÞÞÞ þ γ ‖f ‖2ℋk 2

ð2Þ

where ‖:‖ℋk denotes the norm in ℋk , γ Z 0 is the regularization factor that controls the smoothness of the solution. As the inner product in RKHS satisfies the reproducing property, namely, D   f κ ðu; :Þ ℋ ¼ f ðuÞ, (2) can be rewritten as k 2 N  X  min þ γ ‖f ‖2ℋk ð3Þ yðiÞ  〈 f κ ðuðiÞ; :Þ〉ℋk f A ℋk

i¼1

By the representer theorem [8], the solution of (2) can be expressed as a linear combination of kernels: f ðuÞ ¼

N X

αi κ ðuðiÞ; uÞ

ð4Þ

i¼1

 1 The coefficient vector can be calculated as α ¼ K þ γ I y, T NN is the Gram matrix with where y ¼ ½yð1Þ; ⋯; yðNÞ , and K A ℝ elements Kij ¼ κ ðuðiÞ; uðjÞÞ. Solving the previous least squares problem usually requires significant memory and computational burden due to the necessity of calculating a large Gram matrix, whose dimension equals the number of input patterns. The KAF algorithms, however, provide efficient alternatives that build the solution incrementally, without explicitly computing the Gram matrix. Denote f i the estimated mapping (hypothesis) at iteration i. The KLMS algorithm can be expressed as [6] ( f0 ¼0 ð5Þ f i ¼ f i  1 þ ηκ ðuðiÞ; :ÞeðiÞ where η denotes the step size, eðiÞ is the instantenous prediction error at iteration i, eðiÞ ¼ yðiÞ  f i  1 ðuðiÞÞ, i.e. the instantenous error only depends upon the difference between the desired response at the current time and the evaluation of the current sample (uðiÞ) with the previous system model (f i  1 ). The learned mapping of KLMS, at iteration N, will be f N ðuÞ ¼ η

N X

eðiÞκ ðuðiÞ; uÞ

ð6Þ

i¼1

This is a very nice result because it states that the solution to the unknown nonlinear mapping is done incrementally one step at a time, with a growing RBF network, where the centers are the samples and the fitting parameter is automatically determined as the current error.

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Taking advantage of the incremental nature of the KLMS updates, the KLMS adaptation is in essence the solution of the following incremental regularized least squares problem:  2 1  η ‖f i f i  1 ‖2ℋk min yðiÞ  f i ðuðiÞÞ þ

η

f i A ℋk

Letting Δf i ¼ f i  f i  1 , (7) is equivalent to

 2 1  η



min eðiÞ  Δf i ðuðiÞÞ þ

Δf i 2ℋk

η

Δf i A ℋk

ð7Þ

where PðuÞ is the marginal measure on U. Then the optimization should find the kernel size that minimizes this error. Since in  practice f is usually unknown, one can use the mean square error as an alternative cost for optimization: Z 2 y  f i ðuÞ dPðu; yÞ ð11Þ J2 ¼ UY

ð8Þ

From (8), one may observe: (1) KLMS learning at iteration i is equivalent to solving a regularized least squares problem, in which the previous hypothesis f i  1 is frozen, and only the adjustment term Δf i is optimized; (2) in this least squares problem, there is   only one training example involved, i.e. uðiÞ; eðiÞ ; (3) the regularization factor is directly related to the step-size via γ ¼ ð1  ηÞ=η. In the rest of the paper, the Mercer kernel κ ðu; u0 Þ is assumed to be the Gaussian kernel. In addition, to explicitly show the kernel size dependence, we denote the Gaussian kernel by κ σ ðu; u0 Þ, and the induced RKHS by ℋσ .

3. KLMS with adaptive kernel size The kernel is a crucial factor in all kernel methods in the sense that it defines the similarity between data points. For the Gaussian kernel, this similarity depends on the kernel size selected. If the kernel size is too large, then all the data will look similar in the RKHS (with inner products all close to 1), and the procedure reduces to linear regression. On the other hand, if the kernel size is too small, then all the data will look distinct (with inner products all close to 0), and the system fails to do inference on unseen data that fall between the training points. Up to now the KLMS has been only studied with a constant kernel size, so all the elegance of the solution has not been fully recognized. In fact, the sequential learning algorithm (5) builds the current estimate of f from two additive parts: the previous hypothesis f i  1 and a correction term proportional to the prediction error on new data. In principle we can possibly use one RKHS to compute the previous hypothesis and change the RKHS to compute the correction term in (5), which can be efficiently done by changing the kernel size. This is the motivating idea that we pursue in this paper, and it has two fundamental components: (1) we have to formalize this approach; (2) we have to find an easy way of implementing it from samples. In the following, we propose an approach to sequentially optimize the KLMS with a variable kernel size. A. Sequential optimization of the kernel size in KLMS In order to determine an optimal kernel size for KLMS, one should define in advance a cost function for the optimality. To   make this precise, we suppose the training data uðiÞ; yðiÞ are random, and there is an absolutely continuous probability measure P (usually unknown) on the product space U  Y from which the data are drawn. The measure P defines a regression function: Z  f ðuÞ ¼ ydP ðyjuÞ ð9Þ Y

where P ðyjuÞ is the conditional measure on u  Y. In this situation,  the function f can be said to be the desired mapping that needs to be estimated. Thus a measurement of the error in f i (which is updated by KLMS) is Z   2 f  f i dPðuÞ ð10Þ J1 ¼ U

3

The cost J 2 can be easily estimated from sample data. This is especially important when the probability measure P is unknown. Of course, the kernel size in KLMS can be optimized in batch mode, that is, the optimization is performed only after presenting the whole training data. Then, combining (6) and (11) yields the optimization: " #2 Z N X  σ ¼ arg min yη eðiÞκ σ ðuðiÞ; uÞ dPðu; yÞ ð12Þ σ Aℝþ

UY

i¼1

where σ  stands for the optimal kernel size. As KLMS is an online learning algorithm, we are more interested in a sequential optimization framework, which allows the kernel size to be sequentially optimized across iterations. To this end, we propose the following sequential optimization: Z 2 σ i ¼ arg min y f i  1 ðuÞ  ηeðiÞκ σ i ðuðiÞ; uÞ dPðu; yÞ ð13Þ σi A ℝ þ

UY

where the previous hypothesis f i  1 is frozen, and σ i denotes the kernel size at iteration i. Remark. By (13), the kernel size is optimized sequentially. Thus at iteration i, the initial learning step will determine an optimal value of the kernel size σ i (the old kernel sizes remain unchanged), followed by the addition of a new center using KLMS with this new kernel size. Learning with a varying kernel size implies at each iteration cycle to perform adaptation in a different RKHS since changing the kernel size modifies the inner product of the Hilbert space. For the KLMS, this learning paradigm is indeed feasible, because at each iteration cycle, the old centers remain frozen, and only a new center is added, or in other words, the correction term Δf i is just a feature vector in the current RKHS. Remark. A more reasonable approach should be to jointly optimize the kernel size and the step size (corresponding to a regularization factor). In this work, however, for simplicity the step size is assumed to be fixed and only the kernel size is optimized.   Suppose the training data uðiÞ; yðiÞ are independent, identically distributed (i.i.d.). The hypothesis f i , which depends on the previous training data, will be independent of the future training data. Then the mean square prediction error at iteration i þ 1, conditioned on f i , equals

Z    2  E e2 ði þ1Þf i ¼ yðiþ 1Þ  f i ðuðiþ 1ÞÞ dP uði þ 1Þ; yði þ1Þf i UY Z 2 yði þ 1Þ  f i ðuði þ 1ÞÞ dP ðuði þ1Þ; yði þ1ÞÞ ¼ UY Z 2 ¼ y  f i ðuÞ dPðu; yÞ ð14Þ UY

   where P uði þ 1Þ; yði þ 1Þf i denotes the probability measure of ðuði þ 1Þ; yði þ 1ÞÞ conditioned on f i . The mapping update at iteration i will affect directly the prediction error at iteration i þ 1, according to (14) the sequential optimization problem (13) can then be equivalently defined to search a value of the kernel size  σi such that the conditional mean square error E e2 ði þ 1Þf i is minimized:



σ i ¼ arg minE e2 ði þ 1Þf i σi A ℝ þ

 ¼ arg minE e2 ði þ 1Þf i  1 þ ηeðiÞκ σ i ðuðiÞ; :Þ σi A ℝ þ

ð15Þ

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

4

To understand the above optimization in more detail, we con  sider a nonlinear regression model  inwhich the output data yðiÞ are related to the input vectors uðiÞ via 

yðiÞ ¼ f ðuðiÞÞ þvðiÞ

ð16Þ



where f ð:Þ denotes the unknown nonlinear mapping that needs to be estimated, and vðiÞ stands for the disturbance noise. In this case, the prediction error eðiÞ can be expressed as eðiÞ ¼ yðiÞ f i  1 ðuðiÞÞ ¼ f~ i  1 ðuðiÞÞ þ vðiÞ

ð17Þ

 where f~ i  1 9 f  f i  1 is the residual mapping at iteration i  1. The mean square error at iteration iþ 1, conditioned on f i , is Z  2  E e2 ði þ 1Þf i ¼ f~ i ðuði þ1ÞÞ þ vði þ 1Þ dP uv ði þ 1Þ UV

Z ¼ UV

0 B B @

f~ i  1 ðuðiþ 1ÞÞ þ vði þ 1Þ 

ηeðiÞκ σi ðuðiÞ; uði þ1ÞÞ

12 C C dP uv ðiþ 1Þ A

ð18Þ

where V  ℝ denotes the noise space, and P uv ði þ 1Þ denotes the probability measure of ðuði þ 1Þ; vðiþ 1ÞÞ. Remark. One can see from (18) that the optimal kernel size at iteration i depends upon the residual mapping f~ i  1 , prediction error eðiÞ, step size η, and the joint distribution P uv , which is much different from the optimal kernel sizes in problems of density estimation. Theoretically, given the desired mapping and the joint distribution P uv , the optimal kernel sizes can be solved sequentially. This is, however, a rather tedious and impractical procedure since we have to solve an involved nonlinear optimization at each iteration cycle. More importantly, in practice the desired mapping, the noise, and the input distribution are usually unknown. Next, we will develop a stochastic gradient algorithm to adapt the kernel size, without resorting to any prior knowledge.

1) The direction of the gradient depends upon the signs of the prediction errors eði  1Þ and eðiÞ. Specifically, when the signs of eði 1Þ and eðiÞ are the same, the kernel size will increase; while when the signs of eði  1Þ and eðiÞ are different, the kernel size will decrease. This is reasonable, since intuitively the signs of two successive errors contain the information about the smoothness of the desired mapping. If there is little sign change, the desired mapping is likely a “smoothing function” and a larger kernel size is desirable; while if the sign changes frequently, the desired mapping is likely a “zig zag function”, and in this case a smaller kernel size is usually better. 2) The magnitude of the gradient depends on the input data through ‖uði 1Þ  uðiÞ‖2 κ σ i  1 ðuði 1Þ; uðiÞÞ. The value of this term will be nearly zero when the distance between uði  1Þ and uðiÞ is very small or very large. This is also reasonable, since when uðiÞ is very close to uði 1Þ, the sign change between eði  1Þ and eðiÞ only implies a “very local fluctuation”; while when uðiÞ is very far away from uði 1Þ, the sign change between eði  1Þ and eðiÞ contains little information about the smoothness of the desired mapping. 3) The magnitude of the gradient depends on σ i  1 through κ σ i  1 ðuði  1Þ; uðiÞÞ=σ 3i  1 . For the case uðiÞ a uði  1Þ, this term will approach zero when σ i  1 is very small or very large. Although gradient goes to zero does not imply that σ i is bounded, but intuitively, with a proper initial value, the kernel size will be adjusted within a reasonable range. Combining (5) and (19), we obtain the KLMS with adaptive kernel size: 8 f0 ¼0 > > > > < eðiÞ ¼ yðiÞ  f i  1 ðuðiÞÞ

σ i ¼ σ i  1 þ ρeði 1ÞeðiÞ‖uði  1Þ  uðiÞ‖2  κ σ i  1 ðuði  1Þ; uðiÞÞ=σ 3i  1 > > > > :f ¼f þ ηeðiÞκ σ ðuðiÞ; :Þ i

As discussed previously, at each iteration cycle, the kernel size in KLMS can be optimized by minimizing the mean square error at next iteration (conditioned on the learned mapping at the current iteration). In this sense, one can update the current kernel size using the current prediction error; that is, at iteration i, when the prediction error eðiÞ is available, the kernel size σ i can be updated by minimizing the instantaneous squared error, and a stochastic gradient algorithm can be readily derived as follows: ∂ 2 e ðiÞ ∂σ i  1 i ∂ h~ f i  1 ðuðiÞÞ þ vðiÞ ¼ σ i  1  2μeðiÞ ∂σ i  1 i ∂ h~ f ðuðiÞÞ þ vðiÞ  ηeði 1Þκ σ i  1 ðuði  1Þ; uðiÞÞ ¼ σ i  1  2μeðiÞ ∂σ i  1 i  2

σi ¼ σi  1  μ

¼ σ i  1 þ ρeði  1ÞeðiÞ

∂ κ σ ðuði  1Þ; uðiÞÞ ∂σ i  1 i  1

¼ σ i  1 þ ρeði  1ÞeðiÞ‖uði 1Þ  uðiÞ‖2  κ σ i  1 ðuði 1Þ; uðiÞÞ=σ 3i  1 ð19Þ where (a) follows from the fact that f~ i  2 does not depend on σ i  1 , ρ ¼ 2μη is the step size for the kernel size adaptation. Remark. The above algorithm is computationally very simple, since the kernel size is updated sequentially, where only the kernel size of the new center is updated, and those of the old centers remain frozen. The initial value of the kernel size can be set manually or calculated roughly using Silverman's rule based on the input distribution in advance. From (19), we have the following observations:

i

ð20Þ

B. KLMS with adaptive kernel size

ðaÞ

i1

Remark. The computational complexity of the algorithm (20) is in the same order of magnitude as that of the original KLMS algorithm, which equals OðiÞ at iteration i, because both algorithms share the same most time-consuming part, that is, the calculation of the prediction error. Similar to the original KLMS algorithm, the new algorithm also produces a growing RBF network, whose network size increases linearly with the number of training data. In order to obtain a compact model (a network with as few centers as possible) and reduce the computational and memory costs, some sparsification or quantization methods can be applied. Here, we only adopt a quantization approach. In [16], we use an idea of quantization to compress the input space of KLMS and constrain efficiently the network size growth. The learning rule of the quantized KLMS (QKLMS) is 8 > :f ¼f i i  1 þ ηeðiÞκ σ ðQ ½uðiÞ; :Þ where Q ½: denotes a quantization operator in input space U. In QKLMS (21), the centers are limited to the quantization codebook C, and the network size can never be larger than the size of the codebook sizeðC Þ. At iteration i, we just add ηeðiÞ to the coefficient of the code-vector closest to the current input uðiÞ. A simple online vector quantization (VQ) method was proposed in [16]. Suppose that the online VQ method in [16] is adopted and the kernel size of QKLMS is varying as a function of the codebook index. Denote ε the quantization size, and dðuðiÞ; C Þ the distance

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

between uðiÞ and C: dðuðiÞ; C Þ ¼ min1 r j r sizeðC Þ uðiÞ  C j k, where C j denotes the jth element of the codebook C. Then at iteration i, if dðuðiÞ; C Þ 4 ε, a new (uðiÞ) will be added into the  code-vector  codebook, i.e. C ¼ C; uðiÞ . In this case, a new center (uðiÞ) will also be allocated, and its kernel size can be computed in a similar way as in (19):   σ j ¼ σ j  1 þ ρeðj  1ÞeðjÞ‖C j  C j  1 ‖2 κ σj  1 C j ; C j  1 =σ 3j  1 ð22Þ where σ j denotes the kernel size corresponding to the jth codevector C j , and eðjÞ denotes the prediction error at the iteration when the code-vector C j is added. If dðuðiÞ; C Þ r ε, there is no new center added and no kernel size update. 4. Convergence analysis This section gives some theoretical results on the convergence of the algorithm (20). The unknown system is assumed to be a nonlinear regression model given by (16). First, let us define the a priori error ea ðiÞ and a posteriori error ep ðiÞ as follows: ea ðiÞ ¼ f~ i  1 ðuðiÞÞ;

ep ðiÞ ¼ f~ i ðuðiÞÞ

ð23Þ

with f~ i  1 and f~ i the residual mappings at iteration i  1 and i, respectively. A. Energy conservation relation In adaptive filtering theory, the energy conservation relation provides a powerful tool for the mean square convergence analysis [29–33]. In our recent studies [16,34,46], this important relation has been extended into the RKHS. Before carrying out the mean square convergence analysis for the mapping update in (20), we derive the corresponding energy conservation relation. The mapping update in (20) can be expressed as a residualmapping update: f~ i ¼ f~ i  1  ηeðiÞκ σ i ðuðiÞ; :Þ

ð24Þ

Further, one can derive the relationship between ea ðiÞ and ep ðiÞ: ep ðiÞ ¼ ea ðiÞ  ηeðiÞκ σ i ðuðiÞ; uðiÞÞ ¼ ea ðiÞ  ηeðiÞ

Lemma 1. Let U  ℝm be any set with nonempty interior. Then the 0 RKHS ℋσ  induced by the Gaussian kernel pffiffiffiκ σ  ðu; u Þ on U contains the function κ σ ðu; :Þ if and only if σ 4 2σ  =2 . For such σ , the function κ σ ðu; :Þ has norm in ℋσ  : !m=2

σ2 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi σ  2σ 2  σ 2

‖κ σ ðu; :Þ‖ℋσ ¼

ð25Þ

Remark. The above lemma is p a ffiffiffidirect consequence of the Theorem 2 in [35]. For the case σ o 2σ  =2, the Hilbert space ℋσ  will not contain the function κ σ ðu; :Þ. It is worth noting that in this case, the function κ σ ðu; :Þ can still be arbitrarily "close" to a member in ℋσ  , because ℋσ  is dense in the space of continuous functions on U provided U is compact. pffiffiffi Now we select a fixed kernel size σ  , satisfying σ  o 2σ min , where σ min ¼ minfσ i g. By Lemma 1, the RKHS ℋσ  will contain all the correction terms. By the reproducing property of the RKHS ℋσ  , the prediction error eðiÞ, a priori error ea ðiÞ, and a posteriori error ep ðiÞ can be expressed as 8   eðiÞ ¼ 〈f~ i  1 κ σ  ðuðiÞ; :Þ ℋσ þ vðiÞ > > > > D E < ea ðiÞ ¼ f~ i  1 jκ σ  ðuðiÞ; :Þ ð26Þ ℋσ  > > D > > : e ðiÞ ¼ f~ κ ðuðiÞ; :Þ p σ i



ℋσ 

ð27Þ

Hence   f~ i ¼ f~ i  1 þ ep ðiÞ ea ðiÞ κ σ i ðuðiÞ; :Þ

ð28Þ

Squaring both sides of (28) in RKHS ℋσ  , we derive   2 D  ‖f~ i ‖2ℋσ ¼ ‖f~ i  1 ‖2ℋσ þ ep ðiÞ  ea ðiÞ κ σi ðuðiÞ; :Þκ σi ðuðiÞ; :Þ ℋσ   D  þ 2 ep ðiÞ  ea ðiÞ f~ i  1 κ σ i ðuðiÞ; :Þ ℋσ    2 D  ¼ ‖f~ i  1 ‖2ℋσ þ ep ðiÞ  ea ðiÞ κ σi ðuðiÞ; :Þκ σ ðuðiÞ; :Þ þ δi ð:Þ ℋσ   D  þ 2 ep ðiÞ  ea ðiÞ f~ i  1 κ σ  ðuðiÞ; :Þ þ δi ð:Þ ℋσ





ð29Þ ¼ ‖f~ i  1 ‖2ℋσ þe2p ðiÞ  e2a ðiÞ þ εðiÞ   E  is named the residual mapping power where ‖f~ i ‖2ℋσ ¼ f~ i f~ i ℋσ 

(RMP) at iteration i, δi ð:Þ ¼ κ σ i ðuðiÞ; :Þ  κ σ  ðuðiÞ; :Þ, and Dh i     εðiÞ ¼ ep ðiÞ  ea ðiÞ 2 κ σi ðuðiÞ; :Þ þ 2 ep ðiÞ  ea ðiÞ f~ i  1 δi ð:Þ ℋσ



ð30Þ It follows that ‖f~ i ‖2ℋσ þ e2a ðiÞ ¼ ‖f~ i  1 ‖2ℋσ þ e2p ðiÞ þ εðiÞ

ð31Þ

Remark. Eq. (31) is referred to as the energy conservation relation for the KLMS with adaptive kernel size. If the kernel size is fixed, say σ i  σ  , we have δi ð:Þ ¼ 0, and hence εðiÞ ¼ 0. In this case, (31) reduces to the original energy conservation relation for the KLMS [46]: ‖f~ ‖2 þ e2 ðiÞ ¼ ‖f~ ‖2 þ e2 ðiÞ ð32Þ i ℋσ 

Due to the variable kernel size, at each iteration the correction term in (24) is computed in a different RKHS. In order to derive an energy conservation relation in a fixed RKHS, one should find a RKHS that contains all the correction terms. Below we give an important lemma.

5

a

i  1 ℋσ 

p

which, in form, is identical to the energy conservation relation for the normalized LMS (NLMS) algorithm. B. Sufficient condition for mean square convergence Substituting ep ðiÞ ¼ ea ðiÞ  ηeðiÞ into (31) and taking expectations of both sides, we have h i h i h  D i   E ‖f~ i ‖2ℋσ  E ‖f~ i  1 ‖2ℋσ ¼  2ηE eðiÞ ea ðiÞ þ f~ i  1 δi ð:Þ ℋσ  h  D i   þ η2 E e2 ðiÞ 1 þ κ σ i ðuðiÞ; :Þδi ð:Þ ℋσ  h  D i   ðbÞ ~  ¼  2ηE eðiÞ ea ðiÞ þ f i  1 δi ð:Þ ℋσ  h  i ð33Þ þ η2 E e2 ðiÞ 1 þ ‖δi ð:Þ‖2ℋσ where (b) comes from D D     κ σ i ðuðiÞ; :Þδi ð:Þ ℋσ ¼ δi ð:Þ þ κ σ  ðuðiÞ; :Þδi ð:Þ ℋσ D   ¼ κ σ  ðuðiÞ; :Þδi ð:Þ ℋσ þ‖δi ð:Þ‖2ℋσ 

¼ δi ðuðiÞÞ þ ‖δi ð:Þ‖2ℋσ ¼ ‖δi ð:Þ‖2ℋσ It follows easily that h i h i E ‖f~ i ‖2ℋσ r E ‖f~ i  1 ‖2ℋσ D i h    2E eðiÞ ea ðiÞ þ f~ i  1 δi ð:Þ ℋσ  h  i 3ηr E e2 ðiÞ 1 þ ‖δi ð:Þ‖2ℋσ

ð34Þ

ð35Þ

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

6

If 8 i, the step-size η satisfies the inequality i h    2E eðiÞ ea ðiÞ þ 〈 f~ i  1 δi ð:Þ ℋσ h  i  0oηr E e2 ðiÞ 1 þ ‖δi ð:Þ‖2ℋσ

where ξv is the noise variance. Then we have 2

ð36Þ

the RMP in RKHS ℋσ  will be monotonically decreasing (and hence convergent). The inequality (36) implies 8 i h  D i   40 E eðiÞ ea ðiÞ þ f~ i  1 δi ð:Þ ℋσ 

So a sufficient condition for the mean square convergence (monotonic decrease of the RMP) will be, 8 i, i 8 h   ~ > > E eðiÞ ea ðiÞ þ 〈f i  1 δi ð:Þiℋσ 4 0 > > < h    i  2E eðiÞ ea ðiÞ þ f~ i  1 δi ð:Þ ð37Þ > 0oηr h  ℋiσ > > > : E e2 ðiÞ 1 þ ‖δi ð:Þ‖2ℋσ  Remark. Since e2 ðiÞ is in expectation at least the noise variance, but ea ðiÞ tends to zero (or possibly very close to zero) if all goes well. So in order for the step-size η to satisfy (36), it should be a variable step-size that approaches zero at some rate. In addition, if (37) holds, one can only conclude that the RMP will monotonically decrease, but not necessarily converge to zero. A. Steady-state excess mean square error Further, we take the limit of (33) as i-1: h i h i lim E ‖f~ i ‖2ℋσ  lim E ‖f~ i  1 ‖2ℋσ

i-1

i-1

ð38Þ

i-1

If the RMP reaches the steady-state, h i h i lim E ‖f~ i ‖2ℋσ ¼ lim E ‖f~ i  1 ‖2ℋσ

i-1

i-1

ð39Þ

then it holds

 D  η h  i   ¼ lim E e2 ðiÞ 1 þ ‖δi ð:Þ‖2ℋσ lim E eðiÞ ea ðiÞ þ f~ i  1 δi ð:Þ ℋσ  2 i-1 i-1 ð40Þ In order to the steady-state excess mean-square error1 derive (EMSE) lim E e2a ðiÞ , we give two assumptions: i-1

A1: The noise vðiÞ is zero-mean, independent, identically dis  tributed, and independent of the input sequence uðiÞ ; A2: The squared a priori error e2a ðiÞ and ‖δi ð:Þ‖2ℋσ are uncorrelated. Remark. The assumption A1 is commonly used in the convergence analysis for adaptive filtering algorithms [29]. This assumption implies the independence between vðiÞ and ea ðiÞ. Under the assumptions A1 and A2, (40) becomes h D i   lim E e2a ðiÞ þ lim E ea ðiÞ f~ i  1 δi ð:Þ ℋσ

i-1

i-1

6B ¼ lim E4@ i-1

σi C 7 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffiA 5  1 σ  2σ 2i  σ 2 2



ð43Þ

Remark. Although both ς and τ depend on the kernel size σ  , the steady-state EMSE itself does not depend on σ  . This can be easily understood if we notice the fact that the residual mapping f~ i has no relation to σ  . To further investigate the steady-state EMSE, we consider the case in which as i-1, σ i and σ 1 are very close and satisfy  pffiffiffi jσ i  σ 1 jo 2  2 σ 1 =2, where σ 1 ¼ lim E½σ i . In this case we i-1 pffiffiffi have σ 1 o 2σ i asi-1. Then at the steady-state stage, we can set σ  ¼ σ 1 . Therefore 1m 20 3 ð44Þ

 h D i    jτj ¼ lim E ea ðiÞ f~ i  1 δi ð:Þ ℋσ j  i-1 h D i    r lim E ea ðiÞ f~ i  1 δi ð:Þ ℋσ j  i-1 h i  r lim E ea ðiÞ‖f~ i  1 ‖ℋσ ‖δi ð:Þ‖ℋσ i-1

0

ð45Þ

It follows that ηξ2 ð1 þ ςÞ  2τ ηξ2v  lim E e2a ðiÞ ¼ v 2  ηð1 þ ςÞ 2η

i-1

ð46Þ

Remark. It has been shown that the steady-state EMSE of the 2 KLMS (with a fixed kernel size) equals ηξv =ð2  ηÞ, which is not related to the specific value of the kernel size [46]. From (46) one observes that, when the kernel size σ i converges to a neighborhood of a certain constant (σ 1 ), the adaptation of kernel size also has little effect on the steady-state EMSE. This will be confirmed later by simulation results. We should point out here that, although the kernel size may have little effect on the KLMS steadystate accuracy (in terms of the EMSE), it has significant influence on the convergence speed. In most practical situations, the training data are finite and the algorithm may never reach the steady state. In these cases the kernel size also has significant influence on the final accuracy (not the steady-state accuracy).

5. Simulation results



 h i η 2 lim E e2a ðiÞ þ ξv 1 þ lim E ‖δi ð:Þ‖2ℋσ ¼ 2 i-1 i-1 

i-1

and



h  i þ η2 lim E e2 ðiÞ 1 þ ‖δi ð:Þ‖2ℋσ

i-1

Lemma 1, ς can also be expressed as h i ς ¼ lim E ‖δi ð:Þ‖2ℋσ i-1 h i ¼ lim E ‖κ σ i ðuðiÞ; :Þ  κ σ  ðuðiÞ; :Þ‖2ℋσ i-1 h i ¼ lim E ‖κ σ i ðuðiÞ; :Þ‖2ℋσ  1 i-1 1m 3 20

σ 2i B C 7 ffiA  15  0 ς ¼ lim E6 4@ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi i-1 σ  2σ 2i  σ 2

 D    ¼  2η lim E eðiÞ ea ðiÞ þ f~ i  1 δi ð:Þ ℋσ i-1

ηξ2 ð1 þ ςÞ  2τ ð42Þ lim E e2a ðiÞ ¼ v 2  ηð1 þ ςÞ i-1 h i h D i   where ς ¼ lim E ‖δi ð:Þ‖2ℋσ , and τ ¼ lim E ea ðiÞ f~ i  1 δi ð:Þ ℋσ . By

ð41Þ

1 The a priori error power is also referred to as the excess mean-square error in adaptive filtering.

In this section, we present simulation results that illustrate the performance of the proposed algorithm. The illustrative examples presented include static function approximation, short-term chaotic time series prediction and real world Internet traffic prediction.

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

7

1) Static function approximation

σ σ σ σ σ

= 0.05 = 0.1 = 0.35 = 0.5 = 1.0 adaptive kernel size desired

2.5 Consider a simple static function estimation problem in which the desired output data are generated by ð47Þ

where the input uðiÞ is uniformly distributed over ½  π ; π , and   vðiÞ is a white Gaussian noise with variance 0.0001. For the KLMS with different kernel sizes, the average convergence curves (in terms of the EMSE) over 1000 Monte Carlo runs are shown in Fig. 1. In the simulation, the step-sizes for all the cases are set at η ¼ 0:5. For the KLMS with adaptive kernel size, the step-size for the kernel size adaptation is set at ρ ¼ 0:025, and the initial kernel size is set as 1.0. From Fig. 1, we see clearly that the kernel size has significant influence on the convergence speed. In this example, the kernel size σ ¼ 1:0 and σ ¼ 0:5 produce a rather slow convergence speed. The kernel sizes σ ¼ 0:05 and σ ¼ 0:1 work very well, and in particular, the kernel size σ ¼ 0:1 achieves a fast convergence speed and the smallest final EMSE (at the 5000th iteration). The kernel size σ ¼ 0:35 (selected by Silverman's rule) works, but obviously the performance is not so good. Although the initial kernel size is set as 1.0 (with which the algorithm is almost stalled), the KLMS with adaptive kernel size (σ i ) can still converge to a very small EMSE at the final iteration. This can be clearly 0

10

σ = 0.5

-1

10

σ = 0.35 (Silverman)

-2

10 EMSE

σ = 1.0

σ = 0.05 -3

10

-4

10

σ = 0.1

adaptive kernel size

-5

10

0

1000

2000 3000 iteration

4000

5000

Fig. 1. Convergence curves with different kernel sizes.

1 0.9 0.8 0.7

σ

i

0.6 0.5 0.4 0.3 0.2 0.1

0

1000

2000 3000 iteration

4000

Fig. 2. Evolution of the adaptive kernel size σi.

5000

1.5 f(u)

yðiÞ ¼ cos ð8uðiÞÞ þ vðiÞ

2

1 0.5 0 -0.5 -1 -3

-2

-1

0 u

1

2

3

Fig. 3. Learned mappings at final iteration.

Table 1 EMSE at final iteration for different kernel sizes. Kernel size

EMSE at final iteration

σ ¼ 0:05 σ ¼ 0:1 σ ¼ 0:35 σ ¼ 0:5 σ ¼ 1:0 Adaptive kernel size

0.000067 0.00022 0.000057 0.00010 0.00197 0.0068 0.3798 7 0.4468 0.65737 0.6846 0.000077 0.00027

explained from Fig. 2, where the evolution curve of the adaptive kernel size σ i has been plotted. In Fig. 2, the adaptive kernel size σ i converges to a desirable value between 0.1 and 0.2. Fig. 3 shows the learned mappings at final iteration for different kernel sizes.  The desired mapping f ðuÞ ¼ cos ð8uÞ is also plotted in Fig. 3 for comparison purpose. One can see for the cases σ ¼ 0:05, σ ¼ 0:1 and σ ¼ σ i , the learned mappings match the desired mapping very well, while when σ ¼ 1:0 and 0:5, the learned mappings deviate severely from the desired function. For the kernel size σ ¼ 0:35, there is still some visible deviation between the learned mapping and the desired one. A more detailed comparison is also presented in Table 1, where the EMSE at final iteration is summarized. As illustrated in Fig. 1, the initial convergence speed of the KLMS with adaptive kernel size can still be very slow if the initial kernel size is inappropriately chosen. In order to improve the initial convergence speed, one can select a suitable initial kernel size using a certain method such as Silverman's rule. For the present example, if we set the initial kernel size to be 0.35, the convergence speed of the new algorithm will be improved significantly. This can be clearly seen from Fig. 4, in which the learning curves for σ ¼ 0:1 and σ ¼ σ i (with σ 0 ¼ 0:35) are shown. The kernel size will influence the convergence speed (see Fig. 1) and the final accuracy with finite training data (see Table 1), but it has little effect on the steady-state EMSE with infinite training data. In order to confirm this theoretical prediction, we perform another set of simulations with the same settings, except now much more iterations are run. For different kernel sizes and iterations, the EMSEs (obtained as the averages over a window of 2000 samples) are listed in Table 2. One can see for the kernel sizes σ ¼ 0:05, σ ¼ 0:1 and σ ¼ σ i , the algorithms almost reach the steady-state before the 100,000th iteration. For the kernel size σ ¼ 0:35, the algorithm attains its steady-state at around the 800,000th iteration. For the cases σ ¼ 0:5 and σ ¼ 1:0, it is hard to

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

8

obtain the steady-state EMSE via simulation since the convergence speed is too slow. In Table 2, the simulated steady-state EMSEs for different kernel sizes (σ ¼ 0:05, σ ¼ 0:1, σ ¼ σ i and σ ¼ 0:35) are very close and approach to 0.000033333, the theoretical value of the steady-state EMSE calculated using (46). 2) Short term chaotic time series prediction

another 100 as the test data. At each iteration, the testing MSE is computed based on the test data using the learned filter at that iteration. It can be seen from Fig. 6 that the kernel size σ ¼ 15 achieves the best performance with the smallest testing MSE at final iteration. Also, as expected, the adaptive kernel size yields a satisfactory performance very close to the best result. Fig. 7 shows

The second example is about short term chaotic time series prediction. Consider the Lorenz oscillator whose state equations are 8 dx ¼  β x þ yz > > < dt dy ¼ δðz  yÞ ð48Þ dt > > : dz ¼ xy þ ρy  z

40 30 20 10

where the parameters are set as β ¼ 4:0, δ ¼ 30, and ρ ¼ 45:92. The second state y is picked for the prediction task and a sample time series is shown in Fig. 5. Here the goal is to predict the value of the current sample yðiÞ using the previous five consecutive samples. We continue to compare the performances of the KLMS with different kernel sizes. In the simulations below, the step-sizes for the mapping update are set at η ¼ 0:1, and the step-size for the kernel size adaptation is set at ρ ¼ 0:05. The initial value of the adaptive kernel size σ i is set as 1.0. For the kernel sizes σ ¼ 1:0, 5:5 (selected by Silverman's rule), 10, 15, 20, 30 and the adaptive kernel size σ i , the convergence curves in terms of the testing MSE are illustrated in Fig. 6. For each kernel size, 20 independent simulations were run with different segments of the time series. In each segment, 1000 samples are used as the training data and

value

dt

0 -10 -20 -30 -40

0

1000

2000 sample

4000

Fig. 5. Time series produced by Lorenz system.

3

0

10

10

σ = 1.0 σ = 5.5 (Silverman) σ = 10 σ = 15 σ = 20 σ = 30 adaptive kernel size

-1

10

2

10

10

testing MSE

-2

EMSE

3000

adaptive kernel size

-3

10

1

10

-4

10

σ = 0.1 -5

10

0

1000

2000 3000 iteration

4000

5000

Fig. 4. Convergence curves for σ¼ 0.1 and the adaptive kernel size with initial value determined by Silverman's rule.

0

10

0

200

400 600 iteration

800

1000

Fig. 6. Convergence curves in terms of the testing MSE for different kernel sizes.

Table 2 EMSE at different iterations for different kernel sizes. Iterations Kernel size

10,000

50,000

100,000

200,000

800,000

σ ¼ 0:05 σ ¼ 0:1 σ ¼ 0:35 σ ¼ 0:5 σ ¼ 1:0 Adaptive kernel size

0.00004802 0.00003977 0.001042 0.2686 0.6457 0.00004859

0.00003507 0.00003332 0.0001523 0.05134 0.6207 0.00003581

0.00003332 0.00003333 0.00008837 0.01211 0.6024 0.00003334

0.00003334 0.00003331 0.00005514 0.001312 0.5830 0.00003332

0.00003333 0.00003332 0.00003334 0.0001208 0.5684 0.00003334

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

9

100

20 without quantization with quantization

80

network size

σi

15

10

5

0

60

40

20

0

200

400

600 iteration

800

0

1000

Fig. 7. Evolution of the kernel size σi in Lorenz time series prediction.

200

400 600 iteration

800

1000

Fig. 9. Network size evolution in Lorenz time series prediction (with quantization).

Table 4 Testing MSE at final iteration (with quantization).

Table 3 Testing MSE at final iteration. Kernel size

Testing MSE

σ ¼ 1:0 σ ¼ 5:5 σ ¼ 10 σ ¼ 15 σ ¼ 20 σ ¼ 30 Adaptive kernel size

85.37407 2.6436 1.84777 0.2751 1.0450 7 0.1338 0.94907 0.0276 1.38767 0.0281 3.4480 7 0.1370 1.02647 0.1018

Kernel size

Testing MSE

σ ¼ 1:0 σ ¼ 5:5 σ ¼ 10 σ ¼ 15 σ ¼ 20 σ ¼ 30 Adaptive kernel size

109.28017 1.1974 2.54867 0.4112 1.31837 0.1509 1.2982 7 0.0262 2.11737 0.0215 4.84477 0.2272 1.2626 7 0.0890

9

3

σ = 1.0 σ = 5.5 (Silverman) σ = 10 σ = 15 σ = 20 σ = 30 adaptive kernel size

2

10

1

10

9

x 10

8 7 6

value (bits)

10

testing MSE

0

5 4 3 2 1

0

10

0

200

400 600 iteration

800

1000

Fig. 8. Convergence curves in terms of the testing MSE for different kernel sizes (with quantization).

the evolution of the adaptive kernel size (see the solid line). Interestingly, we observe the adaptive kernel size σ i converges quickly and very close to the desirable value 15. The mean deviation results of the testing MSE at final iteration are given in Table 3. Further, we would like to evaluate the performance when the quantization method is applied to curb the growth of the network size. The experimental setting is the same as before, except now

0

0

1000

2000

3000

4000

5000

sample Fig. 10. Internet traffic time series.

the quantization method is used and the quantization size is set at

ε ¼ 4:0. The convergence curves for different kernel sizes are demonstrated in Fig. 8. Once again, the adaptive kernel size works very well, and obtains a performance very close to the best one. As shown in Fig. 7, the adaptive kernel size σ i still converges close to the value 15 (see the dotted line). Fig. 9 illustrates the evolution curve of the network size. One can see that with quantization the network size grows very slowly, and the final network size is only

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

10

0

10

σ = 0.05 σ = 0.1 σ = 0.5 σ = 1.0 σ = 5.0 adaptive kernel size

-1

testing MSE

10

-2

10

computed based on the test data using the learned filter at that iteration. From Fig. 11, we see that the kernel size σ ¼ 0:5 achieves the best performance with the smallest testing MSE during the final stage (Strictly speaking, in this figure the kernel size σ ¼ 1:0 is also good, thus the interval [0.5, 1.0] is a good range of the kernel size). Also evidently, the proposed adaptive kernel size achieves a satisfactory performance that is close to the best one. The evolution curve of the adaptive kernel size is shown in Fig. 12. One can observe that the adaptive kernel size σ i changes very quickly at the initial stage and converges slowly to a neighborhood of the desirable value 0.5 at the final stage. The results on real-world Internet traffic dataset confirm again the effectiveness of the proposed algorithm.

6. Conclusion -3

10

0

1000

2000 iteration

3000

4000

Fig. 11. Convergence curves with different kernel sizes in Internet traffic prediction.

0.6

0.5

σi

0.4

0.3

0.2

0.1 0.05

0

1000

2000

3000

4000

iteration Fig. 12. Evolution of the kernel size σi in Internet traffic prediction.

around 75. Table 4 shows the mean deviation results of the testing MSE at final iteration.

The kernel function implicitly defines the feature space and plays a central role in all kernel methods. In kernel adaptive filtering (KAF) algorithms, the Gaussian kernel (a radial basis function kernel) is usually a default kernel. The kernel size (or bandwidth) of the Gaussian kernel controls the smoothness of the mapping and has significant influence on the learning performance. How to select a proper kernel size is a very crucial problem in KAF algorithms. Some existing techniques (e.g. Silverman's rule) for selecting a kernel size can be applied, but they are not appropriate for a KAF algorithm since the problem is approximation in a joint space (the input and the desired), which is different from density estimation. In this work, we propose an approach for sequentially optimizing the kernel size for the kernel least mean square (KLMS), a simple yet efficient KAF algorithm. At each iteration cycle, the kernel size is adjusted by a stochastic gradient based algorithm to minimizing the mean square error. The proposed algorithm is computationally very simple and easy to implement. Theoretical results on convergence are also presented. Based on the energy conservation relation in RKHS, we derive a sufficient condition for the mean square convergence, and obtain a theoretical steadystate excess mean-square error (EMSE). Simulation results confirm the theoretical prediction, and show the adaptive kernel size can automatically converge to a proper value, so as to help KLMS converge faster and achieve better accuracy. In future study, it is of interest to extend this work to the case where the kernel is of any form (not restricted to the Gaussian kernel). Especially, we will study how to sequentially optimize the kernel function using the idea of multi-kernel learning or learning the kernel. Another interesting line of study is how to jointly optimize the kernel size and the step size.

3) Internet traffic prediction Acknowledgments In the third example, we consider a real world Internet tra ffic dataset from the homepage of Prof. Paulo Cortez (http://www3. dsi.uminho.pt/pcortez/series/A5M.txt). Fig. 10 illustrates a sequence of the Internet traffic samples. The task is to predict the current value of the sample using the previous ten consecutive samples. For convenience of computation, the Internet traffic data are normalized into the interval [0, 1]. Again we compare the performance of the KLMS with different kernel sizes. In the simulation, the step-sizes for the mapping update are set at η ¼ 0:01, and the step-size for the kernel size adaptation is set to ρ ¼ 0:025. The initial value of the adaptive kernel size (σ i ) is 0.05. For the fixed kernel sizes ð0:05; 0:1 ; 0:5; 1:0; 5:0Þ and the adaptive kernel size σ i , the convergence cures in terms of the testing MSE are demonstrated in Fig. 11. In this example, 4000 samples are used as the training data and another 500 samples as the test data. At each iteration, the testing MSE is

The authors would like to thank Mr. Zhengda Qin, a student at Xi'an Jiaotong University for performing the simulations on Internet traffic prediction. This work was supported by National Natural Science Foundation of China (No. 61372152) and 973 Program (No. 2015CB351703).

References [1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995. [2] B. Scholkopf, A.J. Smola, Learning with Kernels, Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, Cambridge, MA, USA, 2002. [3] F. Girosi, M. Jones, T. Poggio, Regularization theory and neural networks architectures, Neural Comput. 7 (1995) 219–269. [4] B. Scholkopf, A.J. Smola, K. Muller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (1998) 1299–1319.

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ [5] M.H. Yang, Kernel eigenfaces vs kernel fisherfaces: face recognition using kernel methods, in Proceedings of the 5th IEEE ICAFGR, Washington D.C., 2002, pp. 215–220. [6] W. Liu, J. Principe, S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010. [7] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68 (1950) 337–404. [8] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (1998) 121–167. [9] W. Liu, P. Pokharel, J. Principe, The kernel least mean square algorithm, IEEE Trans. Signal Process. 56 (2008) 543–554. [10] P. Bouboulis, S. Theodoridis, Extension of Wirtinger's calculus to reproducing kernel Hilbert spaces and the complex kernel LMS, IEEE Trans. Signal Process. 59 (2011) 964–978. [11] W. Liu, J. Principe, Kernel affine projection algorithm, EURASIP J. Adv. Signal Process. (2008) 12, http://dx.doi.org/10.1155/2008/784292. [12] Y. Engel, S. Mannor, R. Meir, The kernel recursive least-squares algorithm, IEEE Trans. Signal Process. 52 (2004) 2275–2285. [13] W. Liu, Il Park, Y. Wang, J.C. Principe, Extended kernel recursive least squares algorithm, IEEE Trans. Signal Process. 57 (2009) 3801–3814. [14] J. Platt, A resource-allocating network for function interpolation, Neural Comput 3 (1991) 213–225. [15] W. Liu, Il Park, J.C. Principe, An information theoretic approach of designing sparse kernel adaptive filters, IEEE Trans. Neural Netw. 20 (2009) 1950–1961. [16] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel least mean square algorithm, IEEE Trans. Neural Netw. Learn. Syst 23 (1) (2012) 22–32. [17] G. Wahba, Spline Models for Observational Data, SIAM, Philadelphia, PA, 1990. [18] W. Hardle, Applied Nonparametric Regression, Cambridge University Press, Cambridge, UK, 1992. [19] J. Racine, An efficient cross-validation algorithm for window width selection for nonparametric kernel regression, Commun. Stat. : Simulat. Comput. 22 (1993) 1107–1114. [20] G.C. Cawlay, N.L.C. Talbot, Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recogn. 36 (2003) 2585–2592. [21] S. An, W. Liu, S. Venkatesh, Fast cross-validation algorithm for least squares support vector machine and kernel ridge regression, Pattern Recogn. 40 (2007) 2154–2162. [22] E. Herrmann, Local bandwidth choice in kernel regression estimation, J. Comput. Graph. Stat. 6 (1) (1997) 35–54. [23] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, CRC, 1986. [24] M.C. Jones, J.S. Marron, S.J. Sheather, A brief survey of bandwidth selection for density estimation, J. Am. Stat. Assoc. 91 (433) (1996) 401–407. [25] C. Brunsdon, Estimating probability surfaces for geographical point data: an adaptive kernel algorithm, Comput. Geosci. 21 (7) (1995) 877–894. [26] V. Katkovnik, I. Shmulevich, Kernel density estimation with adaptive varying window size, Pattern Recog. Lett. 23 (2002) 1641–1648. [27] J. Yuan, L. Bo, K. Wang, T. Yu, Adaptive spherical Gaussian kernel in sparse Bayesian learning framework for nonlinear regression, Expert Syst. Appl. 36 (2009) 3982–3989. [28] A. Singh, J.C. Principe, Information theoretic learning with adaptive kernels, Signal Process. 91 (2011) 203–213. [29] A.H. Sayed, Fundamentals of Adaptive Filtering, Wiley, NJ, 2003. [30] T.Y. Al-Naffouri, A.H. Sayed, Adaptive filters with error nonlinearities: meansquare analysis and optimum design, EURASIP J. Appl. Signal Process. 4 (2001) 192–205. [31] N.R. Yousef, A.H. Sayed, A unified approach to the steady-state and tracking analysis of adaptive filters, IEEE Trans. Signal Processing 49 (2001) 314–324. [32] T.Y. Al-Naffouri, A.H. Sayed, Transient analysis of adaptive filters with error nonlinearities, IEEE Trans. Signal Processing 51 (2003) 653–663. [33] B. Chen, L. Xing, J. Liang, N. Zheng, J.C. Principe, Steady-state mean-square error analysis for adaptive filtering under the Maximum Correntropy Criterion, IEEE Signal Process. Lett. 21 (7) (2014) 880–884. [34] S. Zhao, B. Chen, J.C. Principe, Kernel adaptive filtering with maximum correntropy criterion, in Proceedings of International Joint Conference on Neural Networks (IJCNN), San Jose, California, USA, 2011, pp. 2012–2017. [35] H.Q. Minh, Further properties of Gaussian reproducing kernel Hilbert spaces, 2012. 〈http://arxiv.org/abs/1210.6170〉. [36] J. Kivinen, A.J. Smola, R.C. Williamson, Online learning with kernels, IEEE Trans. Signal Process. 52 (8) (2004) 2165–2176. [37] K. Slavakis, S. Theodoridis, I. Yamada, Online kernel-based classification using adaptive projection algorithms, IEEE Trans. Signal Process. 56 (7) (2008) 2781–2796. [38] F. Orabona, J. Keshet, B. Caputo, Bounded kernel-based online learning, J. Mach. Learn. Res. 10 (2009) 2643–2666. [39] P. Zhao, S.C.H. Hoi, R. Jin, Double updating online learning, J. Mach. Learn. Res. 12 (2011) 1587–1615. [40] M. Gonen, E. Alpaydin, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.

11

[41] M. Herbster, Relative loss bounds and polynomial time predictions for the KLMS-NET algorithm, in Proceedings of the ALT 2004, 2004, pp. 309–323. [42] C.S. Ong, A.J. Smola, R.C. Williamson, Learning the kernel with hyperkernels, J. Mach. Learn. Res. 6 (2005) 1043–1071. [43] A. Argyriou, C.A. Micchelli, M. Pontil, Learning convex combinations of continuously parameterized basic kernels, COLT 2005 (2005) 338–352. [44] R. Jin, S.C.H. Hoi, T. Yang, Online multiple kernel learning: algorithms and mistake bounds, in Proceedings of the ALT 2010, 2010, pp. 390–404. [45] F. Orabona, J. Luo, B. Caputo, Multi kernel learning with online-batch optimization, J. Mach. Learn. Res. 13 (2012) 165–191. [46] B. Chen, S. Zhao, P. Zhu, J. Principe, Mean square convergence analysis for kernel least mean square algorithm, Signal Process. 92 (2012) 2624–2632. [47] Dekel Ofer, Shai Shalev-shwartz, Yoram Singer, The Forgetron: A Kernel-based Perceptron on a Fixed Budget, Advances in Neural Information Processing Systems (NIPS), 2006. [48] Orabona Francesco, Joseph Keshet, Barbara Caputo, The projectron: a bounded kernel-based perceptron, in Proceedings of the 25th international conference on Machine learning (ICML), ACM, 2008. [49] Wang Zhuang, Slobodan Vucetic, Online passive-aggressive algorithms on a budget, in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010. [50] Zhao, Peilin, Steven C.H. Hoi, Bduol: double updating online learning on a fixed budget, in: Machine Learning and Knowledge Discovery in Databases, Springer, Berlin Heidelberg, 2012, pp. 810–826. [51] Zhao, Peilin, et al., Fast bounded online gradient descent algorithms for scalable kernel-based online learning, arXiv preprint arXiv:1206.4633, 2012. [52] B. Chen, S. Zhao, P. Zhu, J.C. Principe, Quantized kernel recursive least squares algorithm, IEEE Trans. Neural Netw. Learn. Syst. 24 (9) (2013) 1484–1491. [53] Steven C.H. Hoi, Rong Jin, Peilin Zhao, Tianbao Yang, Online multiple kernel classification, Mach. Learn. 90 (2) (2013) 289–316. [54] Steven C.H. Hoi, Jialei Wang, Peilin Zhao, Libol: a library for online learning algorithms, J. Mach. Learn. Res. 15 (2014) 495–499. [55] Peilin Zhao, Steven C.H. Hoi, Jialei Wang, Bin Li, Online transfer learning, Artif. Intell. 216 (2014) 76–102. [56] Wenwu He, James T. Kwok, Simple randomized algorithms for online learning with kernels, Neural Netw. 60 (2014) 17–24. [57] X. Xu, H. Qu, J. Zhao, X. Yang, B. Chen, Quantized kernel least mean square with desired signal smoothing, Electron. Lett. 51 (18) (2015) 1457–1459. [58] C. Saide, R. Lengelle, P. Honeine, C. Richard, R. Achkar, Nonlinear adaptive filtering using kernel based algorithms with dictionary adaptation, Int. J. Adapt. Control Signal Process. (2015), http://dx.doi.org/10.1002/acs.2548.

Badong Chen received the B.S. and M.S. degrees in control theory and engineering from Chongqing University, in 1997 and 2003, respectively, and the Ph.D. degree in computer science and technology from Tsinghua University in 2008. He was a Post-Doctoral Researcher with Tsinghua University from 2008 to 2010, and a Post-Doctoral Associate at the University of Florida Computational NeuroEngineering Laboratory (CNEL) during the period October, 2010 to September, 2012. He is currently a professor at the Institute of Artificial Intelligence and Robotics (IAIR), Xi'an Jiaotong University. His research interests are in signal processing, information theory, machine learning, and their applications in cognitive science and engineering. He has published 2 books, 3 chapters, and over 100 papers in various journals and conference proceedings. Dr. Chen is an IEEE senior member and an associate editor of IEEE Transactions on Neural Networks and Learning Systems and has been on the editorial boards of Applied Mathematics and Entropy.

Junli Liang received the Ph.D. degree in signal and information processing from the Institute of Acoustics, Chinese Academy of Sciences. At present, he works as a professor in the School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China. His research interests include radar and biomedical imaging, image processing, signal processing, and pattern recognition, etc.

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i

12

B. Chen et al. / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Nanning Zheng graduated from the Department of Electrical Engineering, Xi'an Jiaotong University, Xi'an, China, in 1975, and received the M.S. degree in information and control engineering from Xi'an Jiaotong University in 1981 and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1985. He is currently a professor and director of the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include computer vision, pattern recognition and image processing, and hardware implementation of intelligent systems. Prof. Zheng became a member of the Chinese Academy of Engineering in 1999, and he is the Chinese Representative on the Governing Board of the International Association for Pattern Recognition. He is an IEEE Fellow and also serves as an executive deputy editor of the Chinese Science Bulletin.

Jose C. Principe is currently the Distinguished Professor of Electrical and Biomedical Engineering at the University of Florida, Gainesville, where he teaches advanced signal processing and artificial neural networks (ANNs) modeling. He is BellSouth Professor and Founder and Director of the University of Florida Computational Neuro-Engineering Laboratory (CNEL). He is involved in biomedical signal processing, in particular, the electroencephalogram (EEG) and the modeling and applications of adaptive systems. Dr. Príncipe is the past Editor-in-Chief of the IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, past President of the International Neural Network Society, and former Secretary of the Technical Committee on Neural Networks of the IEEE Signal Processing Society. He is an IEEE Fellow and an AIMBE Fellow and a recipient of the IEEE Engineering in Medicine and Biology Society Career Service Award. He is also a former member of the Scientific Board of the Food and Drug Administration, and a member of the Advisory Board of the McKnight Brain Institute at the University of Florida.

Please cite this article as: B. Chen, et al., Kernel least mean square with adaptive kernel size, Neurocomputing (2016), http://dx.doi.org/ 10.1016/j.neucom.2016.01.004i