Digital Signal Processing 40 (2015) 154–163
Smoothed least mean p-power error criterion for adaptive filtering

Badong Chen a,*, Lei Xing a, Zongze Wu b, Junli Liang c, José C. Príncipe a,d, Nanning Zheng a

a Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China
b School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China
c School of Automation & Information Engineering, Xi'an University of Technology, Xi'an 710048, China
d Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA

* Corresponding author. Fax: +86 29 82668672. E-mail address: [email protected] (B. Chen).
Article history: Available online 25 February 2015

Abstract
In this paper, we propose a novel error criterion for adaptive filtering, namely the smoothed least mean p-power (SLMP) error criterion, which aims to minimize the mean p-power of the error plus an independent and scaled smoothing variable. Some important properties of the SLMP criterion are presented. In particular, we show that if the smoothing variable is symmetric and zero-mean, and p is an even number, then the SLMP error criterion will become a weighted sum of the even-order moments of the error, and as the smoothing factor (i.e. the scale factor) is large enough, this new criterion will be approximately equivalent to the well-known mean square error (MSE) criterion. Based on the proposed error criterion, we develop a new adaptive filtering algorithm and its kernelized version, and derive a theoretical value of the steady-state excess mean square error (EMSE). Simulation results suggest that the new algorithms with proper choice of the smoothing factor may perform quite well. © 2015 Elsevier Inc. All rights reserved.
Keywords: Adaptive filtering; LMP; SLMP; EMSE
1. Introduction

Given two random variables, d ∈ R, an unknown real-valued parameter to be estimated, and U ∈ R^m, a vector of observations (measurements), an estimator of d can be defined as a function of U: d̂ = g(U), where g is obtained by minimizing a certain error criterion:

$$ g^{*} = \arg\min_{g\in G} E\big[\phi(e)\big] = \arg\min_{g\in G} E\big[\phi\big(d-g(U)\big)\big] = \arg\min_{g\in G}\iint \phi\big(x-g(y)\big)\, p_{dU}(x,y)\,dx\,dy \tag{1} $$

where E denotes the expectation operator, e = d − g(U) is the estimation error, G stands for the collection of all measurable functions of U, p_dU denotes the joint probability density function (PDF) of (d, U), and φ(·) is a loss function, generally a strictly convex function satisfying: i) φ(e) ≥ 0; ii) φ(−e) = φ(e); iii) |e_1| < |e_2| ⇒ φ(e_1) ≤ φ(e_2). A key issue in the above estimation problem is the choice of the loss function φ. A very popular choice is φ(e) = e², in which case the estimator is simply the minimum mean-square-error (MSE) estimator. Due to its mathematical simplicity and robustness, the MSE criterion has been widely used. In the context of adaptive filtering, the desired value and the input vector stand for, respectively, the unknown variable and the observation vector. The well-known Widrow–Hoff least mean square (LMS) algorithm is based on the minimization of the following MSE cost function [1–3]:

$$ J_{\rm MSE} = E\big[e^{2}(i)\big] \tag{2} $$

where the error is

$$ e(i) = d(i) - W(i)^{T}U(i) \tag{3} $$

with d(i) ∈ R being the desired value at time i, W(i) = [w_1(i), w_2(i), ···, w_m(i)]^T ∈ R^m the weight vector of the adaptive filter, and U(i) ∈ R^m the input vector (regressor vector), which is, in general, defined as

$$ U(i) = \big[u(i), u(i-1), \cdots, u(i-m+1)\big]^{T} \tag{4} $$

where u(i) is the input signal. However, the MSE criterion may perform poorly in many situations, especially in nonlinear and non-Gaussian situations, as it captures only the second-order statistics in the data. In order to take into account higher-order (or lower-order) statistics and to improve the convergence performance in realistic scenarios, many alternative optimality criteria beyond the
second-order statistics have been applied in adaptive filtering. Typical examples include the least mean fourth (LMF) and least mean p-power (LMP) criteria [4,5],¹ the least mean mixed-norm (LMMN) criterion [6], the minimum risk-sensitive criterion [7], the minimum error entropy (MEE) criterion [8–11], and the maximum correntropy criterion (MCC) [12–16]. In a recent paper, a new family of adaptive filtering algorithms is developed based on a logarithmic cost function [17]. The LMP criterion uses the mean p-power error as the cost function (φ(e) = |e|^p, p ∈ N) [5]:
$$ J_{\rm LMP} = E\big[|e|^{p}\big] = \iint \big|x-g(y)\big|^{p}\, p_{dU}(x,y)\,dx\,dy \tag{5} $$
which includes the MSE (p = 2) and LMF (p = 4) criteria as special cases. It is computationally simple, and has proven successful in numerous applications [5,18]. As pointed out in [5], when used as an error criterion in adaptive filtering, the LMP has some useful properties: 1) compared with the MSE, it may produce a better solution if the performance function has different optimum solutions for various p; 2) the steepest descent algorithm based on the LMP error criterion with p > 2 (especially when p = 4) may have better convergence performance (i.e. achieve either faster convergence speed or lower misadjustment); 3) the adaptive filter based on the LMP error criterion with p < 2 (e.g. when p = 1) is robust to impulsive noises. When p is too large (say p > 4), however, the LMP-based algorithms are likely to diverge at the initial stage, and to converge slowly at the steady-state stage.

In this work, we propose a new error criterion, namely the smoothed least mean p-power (SLMP) error criterion, which aims to minimize the mean p-power of the error plus an independent and scaled smoothing variable. If the distribution of the smoothing variable is symmetric and zero-mean, and p is an even number, i.e. p = 2k (k ∈ N), then the SLMP error criterion becomes a weighted sum of the even-order moments of the error, in which the weights are controlled by the smoothing factor (i.e. the scale factor); further, when the smoothing factor is large enough, this new criterion is approximately equivalent to the MSE criterion. Based on the SLMP error criterion, we develop an adaptive algorithm and study its convergence performance. By properly choosing the smoothing factor, the SLMP algorithm may achieve a much better performance than the original LMP algorithm. Since the kernel method is a powerful technique for generalizing a linear algorithm to a nonlinear one [19,20], we also present a kernel extension of the proposed algorithm.

The rest of the paper is organized as follows. In Section 2, we give the definition of the SLMP error criterion and present some important properties. In Section 3, we develop the new adaptive algorithms and analyze the mean and mean square performance. In Section 4, we present simulation results to confirm the theoretical analyses and demonstrate the desirable performance of the proposed algorithms. Finally, we give the concluding remarks in Section 5.

¹ The least mean p-power algorithm was described for the first time in [4], although that reference emphasizes the LMF.

2. Smoothed least mean p-power error criterion
2.1. SLMP criterion

Definition. Let e = d − g(U) be the estimation error. Then the smoothed LMP (SLMP) cost is defined as (p ∈ N)

$$ J_{\rm SLMP} = E\big[|e + h\xi|^{p}\big] \tag{6} $$

where ξ is a smoothing variable, which is a random variable that is independent of e, and h ≥ 0 is a smoothing factor.

If the smoothing variable ξ has a continuous PDF K(x), then the PDF of hξ can be expressed as

$$ K_{h}(x) = \frac{1}{h}K\!\left(\frac{x}{h}\right) \tag{7} $$

Since ξ is independent of e, the PDF of the random variable e + hξ will be

$$ p_{e+h\xi}(x) = (p_{e} * K_{h})(x) = \int_{-\infty}^{\infty} p_{e}(\tau)\,K_{h}(x-\tau)\,d\tau \tag{8} $$

where * denotes the convolution operator, and p_e denotes the PDF of e. Therefore, we have

$$ \begin{aligned} J_{\rm SLMP} &= E\big[|e+h\xi|^{p}\big] = \int_{-\infty}^{\infty}|x|^{p}\, p_{e+h\xi}(x)\,dx \\ &= \int_{-\infty}^{\infty}|x|^{p}\int_{-\infty}^{\infty} p_{e}(\tau)\,K_{h}(x-\tau)\,d\tau\,dx \\ &= \int_{-\infty}^{\infty} p_{e}(\tau)\left(\int_{-\infty}^{\infty}|x|^{p}K_{h}(x-\tau)\,dx\right)d\tau \\ &= E\big[\phi_{h}(e)\big] \end{aligned} \tag{9} $$

where φ_h(e) = ∫_{−∞}^{∞} |x|^p K_h(x − e) dx. If the PDF K_h(x) is symmetric, then φ_h(x) = (φ * K_h)(x), where φ(x) = |x|^p. In this case, the loss function φ_h(·) of the SLMP is a "smoothed" version (by convolution) of the loss function of the LMP (hence the name "smoothed LMP"). Given a distribution (usually with zero mean and unit variance) of the smoothing variable ξ, one can easily derive the analytical expression of the smoothed loss function φ_h(e). Fig. 1 shows the loss functions of the SLMP (p = 6) criterion with different smoothing variable distributions (Gaussian, Uniform, Binary) and smoothing factors (h = 0.5, 1.0, 1.5), where the three smoothing variable distributions are
$$ \begin{cases} \text{Binary:} & K(x)=\dfrac{1}{2}\big[\delta(x+1)+\delta(x-1)\big] \\[6pt] \text{Uniform:} & K(x)=\begin{cases}\dfrac{1}{2\sqrt{3}}, & \text{if } -\sqrt{3}\le x\le\sqrt{3} \\ 0, & \text{otherwise}\end{cases} \\[6pt] \text{Gaussian:} & K(x)=\dfrac{1}{\sqrt{2\pi}}\exp\!\left(-\dfrac{x^{2}}{2}\right) \end{cases} \tag{10} $$

Fig. 1. Loss functions of the SLMP (p = 6) criterion with different smoothing variable distributions (Gaussian, Uniform, Binary) and smoothing factors (h = 0.5, 1.0, 1.5).

2.2. Some important properties

Property 1. As h → 0, the SLMP criterion is equivalent to the LMP criterion.

Property 2. If the smoothing variable ξ is zero-mean, then the SLMP criterion with p = 2 is equivalent to the MSE criterion.

Proof. When the smoothing variable ξ is zero-mean and p = 2, we have

$$ J_{\rm SLMP} = E\big[|e+h\xi|^{2}\big] = E\big[e^{2}\big] + h^{2}E\big[\xi^{2}\big] \tag{11} $$

Since the term h²E[ξ²] is independent of the error e, we get J_SLMP ∼ J_MSE = E[e²], where "∼" denotes the equivalence relation. □

Property 3. If the smoothing variable ξ is symmetric and zero-mean, and p is an even number, p = 2k (k ∈ N), then the SLMP cost will be a weighted sum of the even-order moments of the error.

Proof. It is easy to derive

$$ \begin{aligned} J_{\rm SLMP} &= E\big[|e+h\xi|^{2k}\big] = E\left[\sum_{q=0}^{2k} C_{2k}^{q}\, e^{q}(h\xi)^{2k-q}\right] \\ &\overset{(a)}{=} \sum_{q=0}^{2k} C_{2k}^{q}\, h^{2k-q} E\big[e^{q}\big]E\big[\xi^{2k-q}\big] \\ &\overset{(b)}{=} \sum_{l=0}^{k} C_{2k}^{2l}\, h^{2(k-l)} E\big[e^{2l}\big]E\big[\xi^{2(k-l)}\big] = \sum_{l=0}^{k}\lambda_{l}\,E\big[e^{2l}\big] \end{aligned} \tag{12} $$

where λ_l = C_{2k}^{2l} h^{2(k−l)} E[ξ^{2(k−l)}], (a) holds because ξ is independent of e, and (b) follows from the assumption that the PDF of ξ is symmetric and zero-mean. □

Remark. According to Property 3, under certain conditions the SLMP criterion is equivalent to a linear combination of several even-order LMP criteria (or the LMF family criteria), where the combination weights are determined by the smoothing variable ξ and the smoothing factor h. Since the SLMP criterion can capture the information contained in more moments of the error, one can expect that algorithms developed under the SLMP criterion may behave more efficiently.

Property 4. If the smoothing variable ξ is symmetric and zero-mean, and p is an even number, then as h is large enough, the SLMP criterion will be approximately equivalent to the MSE criterion.

Proof. Since p is an even number, we have J_SLMP = E[(e + hξ)^p] = h^p E[((1/h)e + ξ)^p], and

$$ J_{\rm SLMP} \sim E\big[(\tau e + \xi)^{p}\big] \tag{13} $$

where τ = 1/h. As h → ∞, we have τ → 0, and

$$ \begin{aligned} \psi(\tau) &= E\big[(\tau e + \xi)^{p}\big] \approx \psi(0) + \psi'(0)\tau + \frac{1}{2}\psi''(0)\tau^{2} \\ &= E\big[\xi^{p}\big] + pE\big[\xi^{p-1}e\big]\tau + \frac{p(p-1)}{2}E\big[e^{2}\xi^{p-2}\big]\tau^{2} \\ &\overset{(c)}{=} E\big[\xi^{p}\big] + \frac{p(p-1)\tau^{2}}{2}E\big[e^{2}\big]E\big[\xi^{p-2}\big] = E\big[\xi^{p}\big] + \lambda E\big[e^{2}\big] \end{aligned} \tag{14} $$

where λ = (p(p−1)τ²/2) E[ξ^{p−2}], and (c) holds because ξ is independent of e, and is symmetric and zero-mean. Thus, as h → ∞, we have J_SLMP ∼ E[e²]. □

Property 5. If the smoothing variable ξ is symmetric and zero-mean, and p is an even number, then as e is small enough, the SLMP criterion will be approximately equivalent to the MSE criterion.

Proof. Since p is an even number, we have J_SLMP = E[(e + hξ)^p], and

$$ \begin{aligned} J_{\rm SLMP} &= E\big[(e+h\xi)^{p}\big] = E_{e}E_{\xi}\big[(e+h\xi)^{p}\big] \\ &= E_{e}E_{\xi}\Big[(h\xi)^{p} + p(h\xi)^{p-1}e + \tfrac{1}{2}p(p-1)(h\xi)^{p-2}e^{2} + o\big(e^{2}\big)\Big] \\ &\overset{(d)}{=} E_{e}E_{\xi}\Big[(h\xi)^{p} + \tfrac{1}{2}p(p-1)(h\xi)^{p-2}e^{2} + o\big(e^{2}\big)\Big] \\ &\approx E_{e}\Big[E_{\xi}\big[(h\xi)^{p}\big] + E_{\xi}\big[\tfrac{1}{2}p(p-1)(h\xi)^{p-2}\big]\,e^{2}\Big] = c_{1} + c_{2}E\big[e^{2}\big] \end{aligned} \tag{15} $$

where (d) comes from the fact that ξ is independent of e and is symmetric around zero, c₁ = E[(hξ)^p], and c₂ = E[½p(p−1)(hξ)^{p−2}]. So, as e → 0, we have J_SLMP ∼ E[e²]. □
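To make Property 3 concrete, the short numerical sketch below (our own illustration, not code from the paper; the sample size and seed are arbitrary) computes the weights λ_l = C_{2k}^{2l} h^{2(k−l)} E[ξ^{2(k−l)}] for a zero-mean, unit-variance binary smoothing variable and checks that Σ_l λ_l E[e^{2l}] matches a direct Monte Carlo estimate of E[(e + hξ)^{2k}].

```python
# Numerical check of Property 3 (illustrative sketch, not from the paper).
# For binary xi over {-1, +1}, E[xi^(2(k-l))] = 1 for every l.
import numpy as np
from math import comb

rng = np.random.default_rng(0)
p, h = 6, 1.5                               # even order p = 2k and smoothing factor
k = p // 2

e = rng.standard_normal(200_000)            # any i.i.d. error samples would do
xi = rng.choice([-1.0, 1.0], size=e.size)   # zero-mean, unit-variance binary xi

# Weights lambda_l = C(2k, 2l) * h^(2(k-l)) * E[xi^(2(k-l))]
lam = [comb(2 * k, 2 * l) * h ** (2 * (k - l)) * np.mean(xi ** (2 * (k - l)))
       for l in range(k + 1)]

weighted_moments = sum(lam[l] * np.mean(e ** (2 * l)) for l in range(k + 1))
direct_cost = np.mean((e + h * xi) ** p)    # J_SLMP estimated directly

print(weighted_moments, direct_cost)        # the two agree up to Monte Carlo error
```

The same weights λ_l reappear later in the SLMP adaptive algorithm, where they control how strongly each even-order moment of the error contributes to the update.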
Property 6. The SLMP cost equals the asymptotic value of the empirical LMP cost estimated by the kernel density estimation (KDE) approach with a fixed kernel function K_h(x), that is

$$ J_{\rm SLMP} = \lim_{N\to\infty}\hat{J}_{\rm LMP} = \lim_{N\to\infty}\int_{-\infty}^{\infty}|x|^{p}\,\hat{p}_{e}(x)\,dx \tag{16} $$

in which p̂_e(x) denotes the estimated PDF of the error:

$$ \hat{p}_{e}(x) = \frac{1}{N}\sum_{i=1}^{N}K_{h}(x-e_{i}) \tag{17} $$

where {e_1, ···, e_N} are N independent, identically distributed (i.i.d.) error samples.

Proof. According to the theory of kernel density estimation, as the sample number N → ∞, the estimated PDF will uniformly converge (with probability 1) to the true PDF convolved with the kernel function [10,21]:

$$ \hat{p}_{e}(x)\ \xrightarrow{\ N\to\infty\ }\ (p_{e}*K_{h})(x) = p_{e+h\xi}(x) \tag{18} $$

It follows easily that

$$ \lim_{N\to\infty}\int_{-\infty}^{\infty}|x|^{p}\,\hat{p}_{e}(x)\,dx = \int_{-\infty}^{\infty}|x|^{p}\,p_{e+h\xi}(x)\,dx = E\big[|e+h\xi|^{p}\big] = J_{\rm SLMP}\qquad\Box \tag{19} $$
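As an illustration of Property 6 (our own sketch, not taken from the paper), the empirical SLMP cost can be computed from error samples as (1/N)Σ_i φ_h(e_i); for the binary smoothing variable of (10), the smoothed loss has the simple closed form φ_h(e) = ½(|e + h|^p + |e − h|^p), so the estimate below approaches E[|e + hξ|^p] as the number of samples grows. The error distribution and parameter values are placeholders.

```python
# Empirical SLMP cost via the smoothed loss (illustrative sketch).
# For binary xi over {-1, +1}: phi_h(e) = 0.5 * (|e + h|^p + |e - h|^p).
import numpy as np

def empirical_slmp_cost(errors, p=6, h=1.5):
    """Sample average of the smoothed loss phi_h over the error samples."""
    e = np.asarray(errors, dtype=float)
    return 0.5 * np.mean(np.abs(e + h) ** p + np.abs(e - h) ** p)

rng = np.random.default_rng(1)
e_samples = 0.3 * rng.standard_normal(100_000)          # i.i.d. error samples (placeholder)
xi = rng.choice([-1.0, 1.0], size=e_samples.size)       # binary smoothing variable
print(empirical_slmp_cost(e_samples, p=6, h=1.5))        # smoothed-loss average
print(np.mean(np.abs(e_samples + 1.5 * xi) ** 6))        # direct estimate of E|e + h*xi|^p
```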
Remark. The smoothed LMP criterion is similar to the smoothed MEE (SMEE) criterion studied in [22]. The SMEE cost is defined as the entropy of the error plus an independent random variable (i.e. the smoothing variable), which is the limit (as the sample size tends to infinity) of the empirical error entropy estimated by the KDE method with a fixed kernel function equal to the PDF of the smoothing variable [22]. The SLMP criterion is also inspired by the maximum correntropy criterion (MCC), which has been shown to be equivalent to a smoothed maximum a posteriori (MAP) criterion [16].

3. Smoothed least mean p-power adaptive algorithm
3.1. SLMP algorithm

One can derive an adaptive algorithm to update the weight vector of the adaptive filter under the SLMP criterion. A simple stochastic-gradient-based algorithm is

$$ W(i+1) = W(i) - \eta\,\frac{\partial \phi_{h}\big(e(i)\big)}{\partial W(i)} \tag{20} $$

where η denotes the step-size, and φ_h(e(i)) is the instantaneous value of the SLMP cost at time i, that is

$$ \phi_{h}\big(e(i)\big) = \int_{-\infty}^{\infty}|x|^{p}K_{h}\big(x-e(i)\big)\,dx = E\Big[\big|e(i)+h\xi\big|^{p}\Big] \tag{21} $$

where the expectation is taken with respect to ξ. The evaluation of this expectation is not easy in general. However, by Property 3, if p is an even number (p = 2k) and ξ is symmetric around zero, the instantaneous SLMP cost can be expressed as

$$ \phi_{h}\big(e(i)\big) = \sum_{l=0}^{k}\lambda_{l}\,e(i)^{2l} \tag{22} $$

where λ_l = C_{2k}^{2l} h^{2(k−l)} E[ξ^{2(k−l)}]. In this case, the adaptive algorithm (20) can be derived as

$$ \begin{aligned} W(i+1) &= W(i) - \eta\,\frac{\partial}{\partial W(i)}\left[\sum_{l=0}^{k}\lambda_{l}\,e(i)^{2l}\right] \\ &= W(i) + \eta\left[\sum_{l=1}^{k}2l\lambda_{l}\,e(i)^{2l-1}\right]U(i) \end{aligned} \tag{23} $$

which is a mixture of the LMF family adaptive algorithms, in which the mixture weights are controlled by the smoothing variable ξ and the smoothing factor h. In the rest of the paper, for simplicity we always assume that p is an even number and ξ is symmetric around zero. We refer to (23) as the SLMP algorithm.

Remark. Given a smoothing variable ξ, the performance of the SLMP algorithm is significantly influenced by the smoothing factor h. As h → 0, the SLMP algorithm reduces to the original LMP algorithm, while as h → ∞, the SLMP algorithm converges to the popular LMS algorithm (see Property 4). Thus, the smoothing factor h provides a nice bridge between LMP and LMS. The distribution of the smoothing variable ξ also has an influence on the performance of the SLMP algorithm. However, as shown in the simulation results, this influence is much smaller than the influence of h.

In recent years, a family of nonlinear adaptive filtering algorithms, called the kernel adaptive filtering (KAF) algorithms, has been proposed, which are developed by mapping the input data into a high (possibly infinite) dimensional reproducing kernel Hilbert space (RKHS) and implementing a linear adaptive algorithm in this space (see [19,20] for details about the KAF algorithms). With a Gaussian kernel, a KAF algorithm is basically a growing radial basis function (RBF) network with the input vectors as the centers [19]. A new kernel adaptive filtering algorithm can be obtained by simply extending the SLMP algorithm into the kernel space:

$$ \begin{cases} f_{0} = 0 \\ e(i) = d(i) - f_{i-1}\big(U(i)\big) \\ f_{i} = f_{i-1} + \eta\left[\displaystyle\sum_{l=1}^{k}2l\lambda_{l}\,e(i)^{2l-1}\right]\kappa\big(U(i),\cdot\big) \end{cases} \tag{24} $$

where f_i denotes the estimated nonlinear mapping at time i, and κ(·,·) denotes a Mercer kernel. We refer to (24) as the kernel SLMP (KSLMP) algorithm. There are many other variants of the SLMP algorithm, such as the variable step-size SLMP (VSS-SLMP) and normalized SLMP (NSLMP), in which the step-size changes across iterations or is divided by the squared norm of the input vector.

Remark. The computational complexity of the SLMP algorithm (23) is almost the same as that of the original LMP algorithm, which is O(m). The only extra computational effort needed is to calculate the term Σ_{l=1}^{k} 2lλ_l e(i)^{2l−1}, which is not expensive since generally k is small.

3.2. Convergence performance analysis

The SLMP algorithm (23) can be written in a general form as

$$ W(i) = W(i-1) + \eta f\big(e(i)\big)U(i) \tag{25} $$

where

$$ f\big(e(i)\big) = \sum_{l=1}^{k}2l\lambda_{l}\,e(i)^{2l-1} \tag{26} $$

Assume that the desired signal d(i) can be expressed as

$$ d(i) = W_{0}^{T}U(i) + v(i) \tag{27} $$

where W_0 denotes an unknown weight vector that needs to be estimated, and v(i) stands for the disturbance noise, with variance σ_v². It follows that

$$ e(i) = \widetilde{W}(i-1)^{T}U(i) + v(i) = e_{a}(i) + v(i) \tag{28} $$

where W̃(i−1) = W_0 − W(i−1) is the weight error vector at iteration i−1, and e_a(i) = W̃(i−1)^T U(i) is referred to as the a priori error.

First, we analyze the mean convergence performance of the SLMP. In [23], the mean convergence behavior of a general class of adaptive filters was studied. According to [23], one can immediately obtain the following results:

Theorem 1. The SLMP algorithm is convergent in the mean if the step-size η satisfies

$$ 0 < \eta < \frac{2}{f(\lambda_{\max}) + \frac{1}{2}f^{(2)}(\lambda_{\max})\,\sigma_{v}^{2}} \tag{29} $$

where f^{(2)} denotes the second-order derivative of the function f(x) = Σ_{l=1}^{k} 2lλ_l x^{2l−1}, and λ_max is the largest eigenvalue of the input correlation matrix R_U = E[U(i)U(i)^T]. Further, if one unit of time corresponds to one iteration, the time constant corresponding to the ith eigenvalue of R_U is

$$ \tau_{i} \approx \frac{1}{\eta\big(f(\lambda_{i}) + \frac{1}{2}f^{(2)}(\lambda_{i})\,\sigma_{v}^{2}\big)} \tag{30} $$

Next, we analyze the mean square convergence performance. According to the well-known energy conservation relation [3,24,25], the following equality holds:

$$ E\Big[\big\|\widetilde{W}(i)\big\|^{2}\Big] = E\Big[\big\|\widetilde{W}(i-1)\big\|^{2}\Big] - 2\eta E\big[e_{a}(i)f\big(e(i)\big)\big] + \eta^{2}E\Big[\big\|U(i)\big\|^{2}f^{2}\big(e(i)\big)\Big] \tag{31} $$

Therefore, if the step-size η is chosen such that for all i

$$ \eta \le \frac{2E\big[e_{a}(i)f\big(e(i)\big)\big]}{E\big[\|U(i)\|^{2}f^{2}\big(e(i)\big)\big]} = \frac{2E\Big[\sum_{l=1}^{k}2l\lambda_{l}\,e_{a}(i)\,e(i)^{2l-1}\Big]}{E\Big[\|U(i)\|^{2}\big(\sum_{l=1}^{k}2l\lambda_{l}\,e(i)^{2l-1}\big)^{2}\Big]} \tag{32} $$

then the sequence of weight error powers E[‖W̃(i)‖²] will be decreasing and convergent. One can further analyze the mean square transient behavior of the algorithm, but this is not an easy task, since we have to evaluate the expectations E[Σ_{l=1}^{k} 2lλ_l e_a(i)e(i)^{2l−1}] and E[‖U(i)‖²(Σ_{l=1}^{k} 2lλ_l e(i)^{2l−1})²]. In the following, we present only some analysis results about the steady-state mean square performance. Our approach of analysis is based on the Taylor expansion method [15,26]. When the filter reaches the steady state, the distributions of e_a(i) and e(i) are independent of i, and one can simply write e_a and e by omitting the time index. The following lemma holds [24]:

Lemma 1. If the filter is stable and reaches the steady state, then

$$ 2E\big[e_{a}f(e)\big] = \eta\,\mathrm{Tr}(R_{U})\,E\big[f^{2}(e)\big] \tag{33} $$

provided that ‖U(i)‖² is asymptotically uncorrelated with f²(e(i)).

Remark. The rationality of the assumption that ‖U(i)‖² and f²(e(i)) are asymptotically uncorrelated has been discussed in [24].

Let S = lim_{i→∞} E[e_a²(i)] = E[e_a²] be the steady-state excess mean square error (EMSE). Below we will derive an approximate analytical expression of S. First, we give two assumptions:

A1: The noise v(i) is zero-mean, independent, identically distributed, and is independent of the input.
A2: The a priori error e_a(i) is zero-mean and independent of the noise.

Taking the Taylor expansion of f(e) with respect to e_a around v, we have

$$ f(e) = f(e_{a}+v) = f(v) + f'(v)\,e_{a} + \frac{1}{2}f''(v)\,e_{a}^{2} + o\big(e_{a}^{2}\big) \tag{34} $$

where o(e_a²) denotes the third and higher-order terms, and

$$ f'(v) = \sum_{l=1}^{k}2l(2l-1)\lambda_{l}\,v^{2l-2},\qquad f''(v) = \sum_{l=2}^{k}2l(2l-1)(2l-2)\lambda_{l}\,v^{2l-3} \tag{35} $$

If E[o(e_a²)] is small enough, then based on the assumptions A1 and A2 we can derive

$$ E\big[e_{a}f(e)\big] = E\big[e_{a}f(v) + f'(v)\,e_{a}^{2} + o\big(e_{a}^{2}\big)\big] \approx E\big[f'(v)\big]\,S \tag{36} $$

$$ E\big[f^{2}(e)\big] \approx E\big[f^{2}(v)\big] + E\big[f(v)f''(v) + \big|f'(v)\big|^{2}\big]\,S \tag{37} $$

Substituting (36) and (37) into (33), we obtain

$$ S = \frac{\eta\,\mathrm{Tr}(R_{U})\,E\big[f^{2}(v)\big]}{2E\big[f'(v)\big] - \eta\,\mathrm{Tr}(R_{U})\,E\big[f(v)f''(v) + |f'(v)|^{2}\big]} \tag{38} $$

Further, substituting (26) and (35) into (38) yields

$$ S = \frac{\eta\,\mathrm{Tr}(R_{U})\,E\Big[\big(\sum_{l=1}^{k}2l\lambda_{l}v^{2l-1}\big)^{2}\Big]}{2E\Big[\sum_{l=1}^{k}2l(2l-1)\lambda_{l}v^{2l-2}\Big] - \eta\,\mathrm{Tr}(R_{U})\,E\Big[\big(\sum_{l=1}^{k}2l\lambda_{l}v^{2l-1}\big)\big(\sum_{l=2}^{k}2l(2l-1)(2l-2)\lambda_{l}v^{2l-3}\big) + \big(\sum_{l=1}^{k}2l(2l-1)\lambda_{l}v^{2l-2}\big)^{2}\Big]} \tag{39} $$

When the step-size η is small enough, (39) can be simplified to

$$ S = \frac{\eta\,\mathrm{Tr}(R_{U})\,E\Big[\big(\sum_{l=1}^{k}2l\lambda_{l}v^{2l-1}\big)^{2}\Big]}{2E\Big[\sum_{l=1}^{k}2l(2l-1)\lambda_{l}v^{2l-2}\Big]} \tag{40} $$
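To make the SLMP update (23) and its general form (25)–(26) concrete, here is a minimal system-identification sketch (our own illustration under the stated assumptions: p = 2k even and a zero-mean, unit-variance binary smoothing variable, so E[ξ^{2(k−l)}] = 1; the step-size, noise level, and iteration count are placeholders, while the unknown weight vector is the one used in Section 4).

```python
# Minimal SLMP system-identification sketch (our illustration, not the authors' code).
import numpy as np
from math import comb

def slmp_lambdas(p, h):
    """lambda_l = C(2k,2l) h^(2(k-l)) E[xi^(2(k-l))]; binary xi -> all even moments are 1."""
    k = p // 2
    return [comb(2 * k, 2 * l) * h ** (2 * (k - l)) for l in range(k + 1)]

def f_error(e, lam):
    """Error nonlinearity f(e) = sum_{l=1..k} 2l * lambda_l * e^(2l-1), cf. (26)."""
    return sum(2 * l * lam[l] * e ** (2 * l - 1) for l in range(1, len(lam)))

rng = np.random.default_rng(2)
p, h = 6, 1.5                                   # order and smoothing factor
eta = 1e-5                                      # small step-size keeps the high-order update stable
W0 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1])   # unknown system (as in Section 4)
m = W0.size
W = np.zeros(m)                                 # null initial weight vector
lam = slmp_lambdas(p, h)

n_iter = 20000
u = rng.standard_normal(n_iter + m)             # zero-mean, unit-power Gaussian input
for i in range(n_iter):
    U = u[i:i + m][::-1]                        # regressor U(i), cf. (4)
    d = W0 @ U + 0.1 * rng.standard_normal()    # desired signal d(i) = W0^T U(i) + v(i), cf. (27)
    e = d - W @ U                               # error e(i)
    W = W + eta * f_error(e, lam) * U           # SLMP update, cf. (23)/(25)

print(np.sum((W0 - W) ** 2))                    # final weight error power, should be small
```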
Remark. Given a distribution of the noise v, one can evaluate the expectations in (39) and obtain a theoretical value of the steady-state EMSE. We should emphasize that the steady-state EMSE of (39) is derived under the assumption that the steady-state a priori error e_a is small, such that its third and higher-order terms are negligible. When the step-size or noise power is too large, the a priori error will also be large. In this situation, the derived EMSE value will not characterize the performance accurately enough.

4. Simulation results

In this section, we present simulation results to confirm the theoretical analyses and demonstrate the performance of the proposed SLMP (or KSLMP) algorithm. In the simulations below, unless mentioned otherwise, the smoothing variable ξ is zero-mean binary distributed with unit variance, and the order p is 6.

First, we show the theoretical and simulated steady-state performance of the SLMP algorithm. The filter length is 20, and the input signal is a zero-mean white Gaussian process with unit variance. The disturbance noise is assumed to be zero-mean and uniformly distributed, and the smoothing factor is set at h = 1.5. The steady-state EMSEs with different noise variances and step sizes are plotted in Fig. 2, where the simulated EMSEs are computed as an average over 100 independent Monte Carlo simulations; in each Monte Carlo simulation, 50 000 iterations were run to ensure that the algorithm reaches the steady state, and the steady-state EMSE was obtained as an average over the last 1000 iterations. One can observe: i) the steady-state EMSEs increase with both noise variance and step size; ii) when the noise variance or step size is small, the steady-state EMSEs computed by simulations match very well the theoretical values computed by (39); iii) when the noise variance or step size becomes larger, however, the experimental results may gradually differ from the theoretical values, and this confirms the theoretical prediction.
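As a hedged illustration of how the theoretical value can be evaluated in this setting (our own sketch; the noise variance, step-size, and sample size are placeholders), the expectations in the small-step-size approximation (40) can be estimated by Monte Carlo over the noise and combined with Tr(R_U) = m for a white, unit-variance input of length m = 20.

```python
# Evaluate the small-step-size EMSE approximation (40) by Monte Carlo (illustrative sketch).
import numpy as np
from math import comb

def slmp_lambdas(p, h):
    k = p // 2
    return [comb(2 * k, 2 * l) * h ** (2 * (k - l)) for l in range(k + 1)]  # binary xi assumed

def theoretical_emse(eta, trace_RU, v_samples, p=6, h=1.5):
    lam = slmp_lambdas(p, h)
    k = len(lam) - 1
    f_v  = sum(2 * l * lam[l] * v_samples ** (2 * l - 1) for l in range(1, k + 1))         # f(v)
    fp_v = sum(2 * l * (2 * l - 1) * lam[l] * v_samples ** (2 * l - 2) for l in range(1, k + 1))  # f'(v)
    return eta * trace_RU * np.mean(f_v ** 2) / (2.0 * np.mean(fp_v))                       # cf. (40)

rng = np.random.default_rng(3)
sigma_v = 0.3                                          # placeholder noise standard deviation
a = np.sqrt(3.0) * sigma_v                             # uniform on [-a, a] has variance sigma_v^2
v = rng.uniform(-a, a, size=1_000_000)                 # zero-mean uniform noise samples
print(theoretical_emse(eta=1e-4, trace_RU=20.0, v_samples=v))   # Tr(R_U) = m for white unit-variance input
```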
Fig. 2. Theoretical and simulated steady-state EMSEs with different noise variances and step sizes.
Second, we investigate the stability problems of the SLMP algorithm. The steady-state performance is valid only when the algorithm does not diverge. In many cases, however, the proposed algorithm may diverge especially at the initial convergence stage. Here, we present some simulation results about the probability of divergence of the new algorithm. The weight vector of the unknown system is assumed to be W 0 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1] T , and the initial weight vector of the adaptive filter is a null vector. The input signal and the disturbance noise are both zero-mean and unit-power Gaussian. The probabilities of divergence for different smoothing factors are illustrated in Fig. 3. To evaluate the probability of divergence, for each smoothing factor 1000 independent Monte Carlo simulations were performed and in each simulation, 1000 iterations were run. We labeled a learning curve as “diverging” if at the last iteration the weight error power
‖W_0 − W(i)‖² was larger than 100. For comparison purposes, we also plot in Fig. 3 the probability of divergence of the LMF algorithm. In the simulation, the step sizes for SLMP and LMF are both set to 0.5. We have the following observations: i) when the smoothing factor is very small (by Property 1, in this case the SLMP is approximately equivalent to the LMP), the probability of divergence of the SLMP can be larger than that of the LMF; ii) when the smoothing factor becomes larger (by Property 4, the SLMP becomes closer to the LMS), the probability of divergence of the SLMP can be much smaller than that of the LMF and approaches zero as the smoothing factor becomes very large. The simulation results suggest that a larger smoothing factor makes the new algorithm more stable. Moreover, the step size also influences the stability of the algorithms. The probabilities of divergence for different step sizes are shown in Fig. 4. Obviously, a smaller step size leads to a more stable algorithm.

Third, we compare the performance of the SLMP with that of the LMP algorithms (p = 1, 2, 4, 6) and the least mean mixed-norm (LMMN) algorithm [6]. In the simulation, we use the normalized versions of the SLMP, LMP and LMMN (denoted by NSLMP, NLMP and NLMMN), since in general the normalized algorithms are more robust and faster than the original versions. The unknown system is assumed to be the same as that in the previous simulation. Here, we consider four noise distributions: Binary, Gaussian, MixGaussian, and Laplace, where the Binary noise is distributed over {−1, 1} with probability mass Pr{X = −1} = Pr{X = 1} = 0.5, the Gaussian and Laplace noises are distributed with zero mean and unit variance, and the MixGaussian noise is
Fig. 3. Probabilities of divergence with different smoothing factors.
Fig. 4. Probabilities of divergence with different step sizes.
Fig. 5. Convergence curves in different noises: (a) Binary; (b) Gaussian; (c) MixGaussian; (d) Laplace.
generated by adding a zero-mean Gaussian noise with variance 1.0 to a Binary noise. A sketch of how these noise sequences can be generated is given after this paragraph. The convergence curves in terms of the weight error power ‖W_0 − W(i)‖² averaged over 100 independent Monte Carlo runs are shown in Fig. 5. In each simulation, the step-sizes are chosen such that all the algorithms have approximately the same initial convergence speed. The smoothing factor in SLMP is experimentally chosen such that the algorithm achieves a desirable performance. Specifically, for Binary, Gaussian, MixGaussian, and Laplace noises, the smoothing factor h is set at 0.1, 20, 7.0, and 10, respectively. From the simulation results we observe: i) when the noise is Binary, the NSLMP (with a smaller h) and the NLMP with p = 6 may achieve a much lower misadjustment than the others; ii) when the noise is Gaussian, the NSLMP (with a larger h) can perform almost as well as the NLMS (p = 2) and NLMMN (by Property 4, when h is large enough, the NSLMP will be approximately equivalent to the NLMS); iii) when the noise is MixGaussian, the NSLMP may perform better than all the others; iv) when the noise is Laplacian, the NLMP with p ≥ 4 performs very poorly; in particular, the NLMP with p = 6 diverges and its convergence curve cannot be plotted, but in this case the NSLMP, NLMMN, and NLMP with p = 1, 2 can still work well.
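For reproducibility, the sketch below shows one way the four noise sequences used in this comparison could be generated (our own illustration, not the authors' code; the Laplace scaling uses the standard unit-variance parameterization).

```python
# Generate the four noise types used in the comparison (illustrative sketch).
import numpy as np

def make_noise(kind, n, rng):
    if kind == "binary":                              # {-1, +1}, equiprobable
        return rng.choice([-1.0, 1.0], size=n)
    if kind == "gaussian":                            # zero mean, unit variance
        return rng.standard_normal(n)
    if kind == "mixgaussian":                         # binary plus N(0, 1)
        return rng.choice([-1.0, 1.0], size=n) + rng.standard_normal(n)
    if kind == "laplace":                             # zero mean, unit variance -> scale 1/sqrt(2)
        return rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=n)
    raise ValueError(kind)

rng = np.random.default_rng(4)
noises = {k: make_noise(k, 10_000, rng) for k in ("binary", "gaussian", "mixgaussian", "laplace")}
print({k: float(np.var(v)) for k, v in noises.items()})   # binary/gaussian/laplace ~ 1, mixgaussian ~ 2
```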
Fourth, we investigate how the smoothing variable and smoothing factor affect the convergence performance. Suppose the noise is MixGaussian, formed by adding a zero-mean Gaussian noise with variance 0.64 to a Binary noise. Fig. 6 illustrates the average convergence curves with different smoothing variables, where the distributions (Binary, Uniform, Gaussian) of the smoothing variables are given by (10), and the smoothing factor is fixed at h = 6.0. Fig. 7 shows the average convergence curves with different smoothing factors, where the smoothing variable is Binary distributed over {−1, 1}. From Fig. 6 and Fig. 7 we observe: i) the performance of the SLMP algorithm is influenced by both the smoothing variable and the smoothing factor; ii) the influence of the smoothing factor on the performance is more significant than that of the smoothing variable. Hence, it is very important to choose a proper smoothing factor to obtain desirable performance. Fig. 8 shows the steady-state EMSEs with different smoothing factors. In this example, the steady-state EMSE reaches its minimum near h = 1.5. In practice, the smoothing factor can be set manually. How to choose the best smoothing factor, however, is a challenging problem and is beyond the scope of this paper.

Finally, we show the performance of the KSLMP algorithm in chaotic time series prediction. Let us consider the Mackey–Glass time series generated from the following time-delay ordinary differential equation [19]:
Fig. 6. Convergence curves with different smoothing variables.
Fig. 8. EMSEs with different smoothing factors.
Fig. 7. Convergence curves with different smoothing factors.
$$ \frac{dx(t)}{dt} = -b\,x(t) + \frac{a\,x(t-\tau)}{1 + x(t-\tau)^{10}} \tag{41} $$

Fig. 9. Learning curves of Mackey–Glass time series prediction.
with b = 0.1, a = 0.2, and τ = 30. The dimension of the time embedding is 10; that is, the previous ten samples are used as the input to predict the present value (which is the desired response). A segment of 500 samples is used as the training data and another 100 as the test data. The data are corrupted by additive Gaussian noise with zero mean and standard deviation 0.04. At each iteration, the testing mean square error is computed on the test set using the kernel adaptive filter (which is fixed during the testing phase). The learning curves are shown in Fig. 9. For comparison purposes, we also plot in Fig. 9 the learning curves of the KLMP (kernel LMP) with different p values (p = 1, 2, 4, 6). In the simulation, the Gaussian kernel with bandwidth √2/2 is chosen as the Mercer kernel. For the KSLMP, the smoothing factor is set at h = 0.6. Simulation results indicate that the KSLMP algorithm performs very well, achieving a fast convergence speed and a high accuracy. Although the KLMP (p = 2) may achieve almost the same accuracy as the KSLMP, its convergence speed is slower. One can also see that in this example, when p = 1, the KLMP converges very slowly.
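A minimal sketch of the KSLMP recursion (24) for this kind of one-step prediction is given below (our own illustration, not the authors' code). The order p = 6, smoothing factor h = 0.6, embedding dimension 10, and kernel bandwidth √2/2 follow the setup quoted above; the Mackey–Glass data are replaced by a placeholder signal, the step-size is an assumption, and the Gaussian kernel parameterization κ(x, u) = exp(−‖x − u‖²/(2σ²)) is ours.

```python
# Kernel SLMP (KSLMP) sketch for one-step prediction (illustrative, not the authors' code).
import numpy as np
from math import comb

def slmp_lambdas(p, h):
    k = p // 2
    return [comb(2 * k, 2 * l) * h ** (2 * (k - l)) for l in range(k + 1)]  # binary xi assumed

def gauss_kernel(X, u, bandwidth):
    # kappa(x, u) = exp(-||x - u||^2 / (2 * bandwidth^2)); X holds stored centers row-wise.
    d2 = np.sum((X - u) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kslmp_train(inputs, targets, eta=0.05, p=6, h=0.6, bandwidth=np.sqrt(2) / 2):
    lam = slmp_lambdas(p, h)
    k = len(lam) - 1
    centers, alphas = [], []                      # growing RBF network, cf. (24)
    for i in range(len(targets)):
        if centers:
            y = float(np.dot(alphas, gauss_kernel(np.array(centers), inputs[i], bandwidth)))
        else:
            y = 0.0                               # f_0 = 0
        e = targets[i] - y                        # e(i) = d(i) - f_{i-1}(U(i))
        g = sum(2 * l * lam[l] * e ** (2 * l - 1) for l in range(1, k + 1))
        centers.append(inputs[i])                 # new center U(i)
        alphas.append(eta * g)                    # coefficient of kappa(U(i), .)
    return np.array(centers), np.array(alphas), bandwidth

# Placeholder signal standing in for the Mackey-Glass series; embedding dimension 10.
rng = np.random.default_rng(5)
x = 0.5 * np.sin(0.3 * np.arange(600)) + 0.02 * rng.standard_normal(600)
m = 10
U = np.array([x[i:i + m] for i in range(len(x) - m)])
d = x[m:]
centers, alphas, bw = kslmp_train(U[:500], d[:500])
pred = np.array([np.dot(alphas, gauss_kernel(centers, u, bw)) for u in U[500:]])
print(np.mean((d[500:] - pred) ** 2))             # testing mean square error
```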
5. Conclusion

The smoothed least mean p-power (SLMP) error criterion has been proposed for adaptive filtering, which minimizes the mean p-power of the error plus an independent and scaled smoothing variable. When the smoothing factor (or the scale factor) is zero, the SLMP reduces to the original LMP criterion. If the smoothing variable is symmetric and zero-mean, and p is an even number, then the SLMP error criterion is a weighted sum of the even-order moments of the error; further, as the smoothing factor tends to infinity, the new criterion becomes approximately equivalent to the commonly used mean square error (MSE) criterion. Based on the proposed error criterion, a new adaptive filtering algorithm and its kernelized version have been developed, and the convergence issue has been studied. The theoretical value of the steady-state excess mean square error (EMSE) has been derived. Simulation results have confirmed the theoretical predictions and shown the desirable performance of the new algorithms.

There are some important problems for future study. First, the selection of the PDF of the smoothing variable and the value of the
smoothing factor, especially the latter, is crucial. An improper choice may lead to poor performance. In general, a larger smoothing factor makes the algorithm more stable but converge more slowly. If the noise distribution is light-tailed and the variance is also small, a smaller smoothing factor is better, since the probability of divergence is usually very small in a high-SNR environment. To make the new algorithms more practical, it is very important to find effective ways to optimize or adjust the PDF and the smoothing factor for any particular situation. Second, the theoretical analysis of the convergence behavior should be carried out in more detail. The convergence speed and the stability problems, such as the probability of divergence, are of particular interest. In this direction, the theoretical analysis results about the LMF algorithm [27–30] are worth consulting.

Acknowledgments

This work was supported by the 973 Program (no. 2015CB351703) and the National Natural Science Foundation of China (no. 61372152, no. 61271210).

References

[1] S. Haykin, Adaptive Filtering Theory, 3rd ed., Prentice Hall, NY, 1996.
[2] B. Widrow, E. Walach, Adaptive Inverse Control, Prentice Hall, NY, 1996.
[3] A.H. Sayed, Fundamentals of Adaptive Filtering, Wiley, NJ, 2003.
[4] E. Walach, B. Widrow, The least mean fourth (LMF) adaptive algorithm and its family, IEEE Trans. Inf. Theory IT-30 (2) (1984) 275–283.
[5] S.C. Pei, C.C. Tseng, Least mean p-power error criterion for adaptive FIR filter, IEEE J. Sel. Areas Commun. 12 (9) (1994) 1540–1547.
[6] J.A. Chambers, O. Tanrikulu, A.G. Constantinides, Least mean mixed-norm adaptive filtering, Electron. Lett. 30 (19) (1994) 1574–1575.
[7] R.K. Boel, M.R. James, I.R. Petersen, Robustness and risk-sensitive filtering, IEEE Trans. Autom. Control 47 (3) (2002) 451–461.
[8] D. Erdogmus, J.C. Principe, From linear adaptive filtering to nonlinear information processing, IEEE Signal Process. Mag. 23 (6) (2006) 15–33.
[9] B. Chen, J. Hu, L. Pu, Z. Sun, Stochastic gradient algorithm under (h, φ)-entropy criterion, Circuits Syst. Signal Process. 26 (2007) 941–960.
[10] J.C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, Springer, New York, 2010.
[11] B. Chen, Y. Zhu, J. Hu, J.C. Principe, System Parameter Identification: Information Criteria and Algorithms, Newnes, 2013.
[12] W. Liu, P.P. Pokharel, J.C. Principe, Correntropy: properties and applications in non-Gaussian signal processing, IEEE Trans. Signal Process. 55 (11) (2007) 5286–5298.
[13] A. Singh, J.C. Principe, Using correntropy as a cost function in linear adaptive filters, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2009, pp. 2950–2955.
[14] S. Zhao, B. Chen, J.C. Principe, Kernel adaptive filtering with maximum correntropy criterion, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2011, pp. 2012–2017.
[15] B. Chen, L. Xing, J. Liang, N. Zheng, J.C. Principe, Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion, IEEE Signal Process. Lett. 21 (7) (2014) 880–884.
[16] B. Chen, J.C. Principe, Maximum correntropy estimation is a smoothed MAP estimation, IEEE Signal Process. Lett. 19 (8) (2012) 491–494.
[17] M.O. Sayin, N.D. Vanli, S.S. Kozat, A novel family of adaptive filtering algorithms based on the logarithmic cost, arXiv preprint arXiv:1311.6809, 2013.
[18] Y. Xiao, Y. Tadokoro, K. Shida, Adaptive algorithm based on least mean p-power error criterion for Fourier analysis in additive noise, IEEE Trans. Signal Process. 47 (4) (1999) 1172–1181.
[19] W. Liu, J. Principe, S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.
[20] B. Chen, L. Li, W. Liu, J.C. Principe, Nonlinear adaptive filtering in kernel spaces, in: Springer Handbook of Bio-/Neuroinformatics, Springer, Berlin/Heidelberg, 2014, pp. 715–734.
[21] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, New York, NY, USA, 1986.
[22] B. Chen, J.C. Principe, On the smoothed minimum error entropy criterion, Entropy 14 (11) (2012) 2311–2323.
[23] E. Santana, A.K. Barros, R.C.S. Freire, On the time constant under general error criterion, IEEE Signal Process. Lett. 14 (8) (2007) 533–536.
[24] T.Y. Al-Naffouri, A.H. Sayed, Adaptive filters with error nonlinearities: mean-square analysis and optimum design, EURASIP J. Appl. Signal Process. 4 (2001) 192–205.
[25] T.Y. Al-Naffouri, A.H. Sayed, Transient analysis of adaptive filters with error nonlinearities, IEEE Trans. Signal Process. 51 (2003) 653–663.
[26] B. Lin, R. He, X. Wang, B. Wang, The steady-state mean-square error analysis for least mean p-order algorithm, IEEE Signal Process. Lett. 16 (2009) 176–179.
[27] V.H. Nascimento, J.C.M. Bermudez, Probability of divergence for the least-mean fourth (LMF) algorithm, IEEE Trans. Signal Process. 54 (4) (2006) 1376–1385.
[28] P.I. Hubscher, J.C.M. Bermudez, V.H. Nascimento, A mean-square stability analysis of the least mean fourth (LMF) adaptive algorithm, IEEE Trans. Signal Process. 55 (8) (2007) 4018–4028.
[29] E. Eweda, Global stabilization of the least mean fourth algorithm, IEEE Trans. Signal Process. 60 (3) (2012) 1473–1477.
[30] E. Eweda, Dependence of the stability of the least mean fourth algorithm on target weights non-stationarity, IEEE Trans. Signal Process. 62 (7) (2014) 1634–1643.
Badong Chen received the B.S. and M.S. degrees in control theory and engineering from Chongqing University, China, in 1997 and 2003, respectively, and the Ph.D. degree in computer science and technology from Tsinghua University, China, in 2008. He was a Post-Doctoral Researcher with Tsinghua University from 2008 to 2010, and a Post-Doctoral Associate at the University of Florida Computational NeuroEngineering Laboratory (CNEL) from October 2010 to September 2012. He is currently a professor at the Institute of Artificial Intelligence and Robotics (IAIR), Xi'an Jiaotong University, China. His research interests are in signal processing, information theory, machine learning, and their applications in cognitive science and engineering. He has published 2 books, 3 chapters, and over 70 papers in various journals and conference proceedings. Dr. Chen is an IEEE senior member and an associate editor of IEEE Transactions on Neural Networks and Learning Systems, and has been on the editorial boards of Applied Mathematics and Entropy.

Lei Xing received his B.S. degree in automation science and technology from Xi'an Jiaotong University, China, in 2014. He is currently pursuing his Ph.D. degree at the Institute of Artificial Intelligence and Robotics (IAIR), Xi'an Jiaotong University. His current research interests focus on information theoretic learning (ITL) and kernel adaptive filtering (KAF).

Zongze Wu was born on February 9, 1975 in Chongqing, China. He is an associate professor at the South China University of Technology. He received his B.S., M.S. and Ph.D. degrees from Xi'an Jiaotong University in 1995, 2002 and 2005, respectively. His research interests include artificial intelligence, image processing, and big data. Dr. Wu was awarded the Guangdong Province Science and Technology First Award and was a Microsoft Fellow of Asia. He has published one book chapter and more than 30 scientific papers, and holds 22 authorized patents of invention.

Junli Liang was born in China. He received his Ph.D. degree in signal and information processing from the Chinese Academy of Sciences in 2007. So far, he has published 20 papers in IEEE Trans. Neural Networks and Learning Systems, IEEE Trans. Signal Processing, IEEE Trans. Image Processing, IEEE Trans. Antennas and Propagation, IEEE Trans. Industrial Informatics, IEEE Trans. Aerospace and Electronic Systems, etc. His current research interests include radar and sonar signal processing, image processing and recognition, etc.

José C. Príncipe is currently the Distinguished Professor of Electrical and Biomedical Engineering at the University of Florida, Gainesville, where he teaches advanced signal processing and artificial neural networks (ANNs) modeling. He is BellSouth Professor and Founder and Director of the University of Florida Computational NeuroEngineering Laboratory (CNEL). He is involved in biomedical signal processing, in particular the electroencephalogram (EEG), and the modeling and applications of adaptive systems. Dr. Príncipe is the past Editor-in-Chief of the IEEE Transactions on Biomedical Engineering, past President of the International Neural Network Society, and former Secretary of the Technical Committee on Neural Networks of the IEEE Signal Processing Society. He is an IEEE Fellow and an AIMBE Fellow, and a recipient of the IEEE Engineering in Medicine and Biology Society Career Service Award. He is also a former member of the Scientific Board of the Food and Drug Administration, and a member of the Advisory Board of the McKnight Brain Institute at the University of Florida.
Nanning Zheng graduated from the Department of Electrical Engineering, Xi'an Jiaotong University, Xi'an, China, in 1975, and received the M.S. degree in information and control engineering from Xi'an Jiaotong University in 1981 and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1985. He is currently a professor and director of the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include computer vision, pattern recognition and image processing, and hardware implementation of intelligent systems. Prof. Zheng became a member of the Chinese Academy of Engineering in 1999, and he is the Chinese Representative on the Governing Board of the International Association for Pattern Recognition. He is an IEEE Fellow and serves as an executive deputy editor of the Chinese Science Bulletin.