State space maximum correntropy filter




Signal Processing 130 (2017) 152–158



Xi Liu, Hua Qu, Jihong Zhao, Badong Chen*
School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
* Corresponding author. E-mail address: [email protected] (B. Chen).

Article history: Received 4 February 2016; received in revised form 23 June 2016; accepted 24 June 2016; available online 25 June 2016.

Abstract

The state space recursive least squares (SSRLS) filter is a recent addition to the well-known recursive least squares (RLS) family of filters, and it can achieve an excellent tracking performance by overcoming some limitations of the standard RLS algorithm. However, when the underlying system is disturbed by heavy-tailed non-Gaussian impulsive noises, the performance of SSRLS deteriorates significantly. The main reason is that the SSRLS is derived under the minimum mean square error (MMSE) criterion, which is not well suited to estimation problems under non-Gaussian noises. To overcome this issue, we propose in this paper a novel linear filter, called the state space maximum correntropy (SSMC) filter, which is derived under the maximum correntropy criterion (MCC) instead of the MMSE. Since MCC is well suited to non-Gaussian signal processing, the SSMC performs very well in non-Gaussian noises, especially when the signals are corrupted by impulsive noises. A simple illustrative example is presented to demonstrate the desirable performance of the new algorithm. © 2016 Elsevier B.V. All rights reserved.

Keywords: State space recursive least squares (SSRLS); Maximum correntropy criterion (MCC); State space maximum correntropy (SSMC)

1. Introduction

Adaptive filtering is widely used in many applications, such as spectrum estimation [1], object tracking [2] and frequency estimation [3]. Owing to its simplicity and fast convergence speed, the recursive least squares (RLS) algorithm [4–6] has been one of the most popular adaptive filters. In recent years, the state space recursive least squares (SSRLS) filter [7–9] was developed by Malik as a new addition to the RLS family. Compared with the standard RLS algorithm, the SSRLS may achieve a better tracking performance. However, its performance may degrade when the system is disturbed by non-Gaussian noises. One main reason is that the SSRLS is derived from the well-known minimum mean square error (MMSE) criterion, which is optimal only under Gaussian assumptions. In fact, the SSRLS is very sensitive to large outliers and its performance may deteriorate significantly in heavy-tailed non-Gaussian environments. Therefore, alternative optimization criteria beyond MMSE should be adopted to overcome this issue.

In the last few years, the optimization criteria of information theoretic learning (ITL) [10,11] have gained increasing attention. These criteria use information theoretic quantities estimated directly from data, instead of the usual second-order statistical measures (e.g. variance or mean square error), as the optimization costs for machine learning and signal processing.



ITL costs can capture higher-order statistics of the data and offer potentially significant performance improvements in adaptive filtering applications. ITL links information theory, nonparametric estimators, and reproducing kernel Hilbert spaces (RKHS) in a simple and unconventional way. In particular, the correntropy in ITL is a nonlinear similarity measure in kernel space which has its roots in Renyi's entropy [12–16]. Correntropy is also a local similarity measure, and hence insensitive to large outliers [17–26]. In supervised learning, such as regression, the problem can be formulated as maximizing the correntropy between the model output and the target response. This optimization criterion is called the maximum correntropy criterion (MCC) [10,11]. Recently, the MCC has been successfully applied to robust adaptive filtering in impulsive noise environments [10,14–16,27,28]. It is worth noting that the MCC solution cannot be obtained in closed form even for a simple linear regression problem, so one usually solves it with an iterative update algorithm such as a gradient-based method [14–16,27]. Gradient-based methods are simple, but they depend on a free parameter, the step-size, and usually converge slowly. The fixed-point iterative algorithm is an alternative way to obtain the MCC solution, which involves no step-size and may converge very fast [10,29,30]. A sufficient condition that guarantees the convergence of the fixed-point MCC algorithm was given in [31].

To improve the performance of state space based adaptive filtering algorithms such as SSRLS in non-Gaussian noises, we develop a new filter in this work, called the state space maximum correntropy (SSMC) filter, to estimate the state of a linear system.


The SSMC is derived based on the MCC criterion and a fixed-point iterative algorithm, and it can achieve an excellent performance in heavy-tailed non-Gaussian environments by suppressing the adverse effects of large outliers. The contributions of this work are as follows: (1) most previous papers on MCC filtering focus mainly on adaptive filters with a simple FIR structure, whereas in the present work we study a state-space model and derive an efficient adaptive algorithm to improve the robustness against large outliers; (2) we adopt a fixed-point algorithm instead of a gradient-based method to achieve a fast convergence speed; (3) we propose a sliding window approach to reduce the computational complexity.

The rest of the paper is organized as follows. In Section 2, we briefly introduce the MCC, the state space model and the SSRLS filter. In Section 3, we derive the SSMC filter and present some results on convergence analysis. In Section 4, we provide an illustrative example to show the superiority of the new algorithm. Finally, in Section 5, we give the concluding remarks.


2. Preliminaries

2.1. Maximum correntropy criterion

Correntropy is a generalized similarity measure between two random variables. Given two random variables X, Y \in \mathbb{R} with joint distribution function F_{XY}(x, y), correntropy is defined by

V(X, Y) = E[\kappa(X, Y)] = \int \kappa(x, y)\, dF_{XY}(x, y)    (1)

where E denotes the expectation operator, and \kappa(\cdot, \cdot) is a shift-invariant Mercer kernel. In this paper, unless mentioned otherwise, the kernel function is the Gaussian kernel given by

\kappa(x, y) = G_\sigma(e) = \exp\left( -\frac{e^2}{2\sigma^2} \right)    (2)

where e = x - y, and σ > 0 stands for the kernel bandwidth.

In most practical situations, only a limited number of data are available and the joint distribution F_{XY} is usually unknown. In these cases, one can estimate the correntropy using a sample mean estimator:

\hat{V}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(e(i))    (3)

where e(i) = x(i) - y(i), and \{x(i), y(i)\}_{i=1}^{N} are N samples drawn from F_{XY}. Taking the Taylor series expansion of the Gaussian kernel, we have

V(X, Y) = \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n \sigma^{2n} n!}\, E\!\left[ (X - Y)^{2n} \right]    (4)

As one can see, the correntropy is a weighted sum of all even-order moments of the random variable X - Y. The kernel bandwidth appears as a parameter weighting the second-order and higher-order moments. With a very large σ (compared to the dynamic range of the data), the correntropy will be dominated by the second-order moment.

Given a sequence of error data \{e(i)\}_{i=1}^{N}, the cost function of MCC is given by

J_{MCC} = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(e(i))    (5)

Suppose the goal is to learn a parameter vector W of an adaptive model, and let x(i) and y(i) denote, respectively, the model output and the desired response, where x(i) = W^T u(i), with u(i) being the input. The MCC based learning can then be formulated as the following optimization problem:

\hat{W} = \arg\max_{W \in \Omega} \frac{1}{N} \sum_{i=1}^{N} G_\sigma(e(i))    (6)

where \hat{W} denotes the optimal solution, and Ω denotes a feasible set of the parameter.

Remark 1. ITL costs may offer significant performance improvements in adaptive filtering applications. Besides MCC, the minimum error entropy (MEE) [10,11,32–38] is another important optimization criterion in ITL. Compared with MCC, the MEE is computationally more expensive due to a double summation. In the present work, our focus is mainly on the MCC criterion.
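To make the quantities above concrete, the following is a minimal Python sketch of the Gaussian kernel (2), the sample correntropy estimator (3) and the MCC cost (6) for a linear model; the function names and the toy data are illustrative and not part of the original paper.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    # Gaussian kernel G_sigma(e) = exp(-e^2 / (2 sigma^2)), Eq. (2)
    return np.exp(-np.asarray(e) ** 2 / (2.0 * sigma ** 2))

def sample_correntropy(x, y, sigma):
    # Sample-mean correntropy estimator, Eq. (3)
    return np.mean(gaussian_kernel(np.asarray(x) - np.asarray(y), sigma))

def mcc_cost(W, U, d, sigma):
    # MCC cost (6) for a linear model with output W^T u(i) and desired response d(i)
    e = d - U @ W
    return np.mean(gaussian_kernel(e, sigma))

# Toy check: a single large outlier barely changes the MCC cost, whereas it
# would dominate a mean-square-error cost.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 3))
W_true = np.array([1.0, -0.5, 2.0])
d = U @ W_true + 0.1 * rng.normal(size=200)
d[10] += 50.0  # impulsive outlier
print(mcc_cost(W_true, U, d, sigma=1.0))
```

Because the kernel value of a large error is close to zero, the outlier contributes almost nothing to the cost, which is the intuition behind the robustness of MCC as a local similarity measure.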

2.2. State space model

Consider the following unforced linear discrete-time system:

x(k+1) = A\, x(k)    (7)

y(k) = C\, x(k) + v(k)    (8)

where x(k) \in \mathbb{R}^n denotes the n-dimensional state vector at time k, y(k) \in \mathbb{R}^m represents the m-dimensional output vector, v(k) is the measurement noise, A stands for the system matrix (or state transition matrix) and is assumed to be invertible, and C represents the observation matrix. The matrix pair (A, C) is assumed to be l-step observable [39]. Notice that the process noise is absent in the above system. In this case, y(k) can be viewed as a deterministic signal corrupted by observation noise. If there is a process noise in (7), the Kalman filter [40,26,41] will be a common tool for estimating the state vector.

2.3. State space recursive least squares

Assume that we have a series of p consecutive measurements, with the last one taken at time k. Our goal is to fit this series of measurements to the model (7) and (8) according to the method of least squares. From (7) and (8), one can write the observation vector y_p(k) as follows:

y_p(k) = \begin{bmatrix} y(k-p+1) \\ y(k-p+2) \\ \vdots \\ y(k-1) \\ y(k) \end{bmatrix}
       = \begin{bmatrix} C\,x(k-p+1) \\ C\,x(k-p+2) \\ \vdots \\ C\,x(k-1) \\ C\,x(k) \end{bmatrix} + v_p(k)
       = \begin{bmatrix} C A^{-p+1} x(k) \\ C A^{-p+2} x(k) \\ \vdots \\ C A^{-1} x(k) \\ C\, x(k) \end{bmatrix} + v_p(k)    (9)

where v_p(k) = [\, v(k-p+1)^T \;\; v(k-p+2)^T \;\; \cdots \;\; v(k-1)^T \;\; v(k)^T \,]^T. For simplicity, (9) can be written as

y_p(k) = H_p(k)\, x(k) + v_p(k)    (10)

where H_p(k) = [\, (C A^{-p+1})^T \;\; (C A^{-p+2})^T \;\; \cdots \;\; (C A^{-1})^T \;\; C^T \,]^T. Note that if p \geq l, the matrix H_p(k) is of full rank because of the above observability assumption [39]. In this case, the least-squares solution of (10) is given by

\hat{x}(k) = \left( H_p^T(k)\, H_p(k) \right)^{-1} H_p^T(k)\, y_p(k)    (11)

Defining θ ≤ 1 as the forgetting factor, the corresponding weighted solution of (10) is

\hat{x}(k) = \left( H_p^T\, W_p(k)\, H_p \right)^{-1} H_p^T\, W_p(k)\, y_p(k)    (12)



where

W_p(k) = \mathrm{diag}\!\left( \theta^{p-1} I_m,\; \theta^{p-2} I_m,\; \ldots,\; \theta I_m,\; I_m \right),

with I_m being an m × m identity matrix.

Now, suppose that the measurements start at time k = 0. Since the current estimate is based on all the previous measurements, setting p = k+1 in (9) we have

H_{k+1}(k) = [\, (C A^{-k})^T \;\; (C A^{-k+1})^T \;\; \cdots \;\; (C A^{-1})^T \;\; C^T \,]^T    (13)

In this case, the observation vector is

y_{k+1}(k) = [\, y(0)^T \;\; y(1)^T \;\; \cdots \;\; y(k-1)^T \;\; y(k)^T \,]^T

and the weighting matrix is given by

W_{k+1}(k) = \mathrm{diag}\!\left( \theta^{k} I_m,\; \theta^{k-1} I_m,\; \ldots,\; \theta I_m,\; I_m \right)

From (12) we have

\hat{x}(k) = \left( H_{k+1}^T(k)\, W_{k+1}(k)\, H_{k+1}(k) \right)^{-1} H_{k+1}^T(k)\, W_{k+1}(k)\, y_{k+1}(k)    (14)

The SSRLS algorithm calculates the solution (14) in a recursive manner like the standard RLS (see [7,8] for details).
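The batch form (14) can be evaluated directly, as in the short Python sketch below; the helper names (build_H, build_W, batch_weighted_ls) are introduced here purely for illustration, and this is a plain batch computation rather than the recursive SSRLS update of [7,8].

```python
import numpy as np

def build_H(A, C, k):
    # Stack C A^{-k}, C A^{-k+1}, ..., C A^{-1}, C as in Eq. (13)
    A_inv = np.linalg.inv(A)
    return np.vstack([C @ np.linalg.matrix_power(A_inv, k - j) for j in range(k + 1)])

def build_W(theta, k, m):
    # Weighting matrix diag(theta^k I_m, ..., theta I_m, I_m)
    return np.kron(np.diag([theta ** (k - j) for j in range(k + 1)]), np.eye(m))

def batch_weighted_ls(A, C, ys, theta):
    # Batch solution (14): x_hat(k) = (H^T W H)^{-1} H^T W y_{k+1}(k)
    k = len(ys) - 1
    m = np.atleast_1d(ys[0]).size
    H = build_H(A, C, k)
    W = build_W(theta, k, m)
    y = np.concatenate([np.atleast_1d(yi) for yi in ys])
    return np.linalg.solve(H.T @ W @ H, H.T @ W @ y)
```

The recursive SSRLS of [7,8] produces the same estimate while avoiding the growing batch computation at every time step.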

3. State space maximum correntropy

3.1. Derivation of the algorithm

In this part, we derive the state space maximum correntropy (SSMC) algorithm. For the linear system (7) and (8), assuming that the measurements start at time k = 0 and the last one is taken at time k, we propose the following cost function with respect to x(k) under MCC:

J(x(k)) = \frac{1}{k+1} \sum_{i=0}^{k} \theta^{k-i} \left( G_\sigma\!\left( y_1(i) - (C A^{-k+i})_1\, x(k) \right) + \cdots + G_\sigma\!\left( y_m(i) - (C A^{-k+i})_m\, x(k) \right) \right)    (15)

where y_j(i) is the j-th element of y(i), and (C A^{-k+i})_j is the j-th row of C A^{-k+i}. The optimal estimate of x(k) is then computed as

\hat{x}(k) = \arg\max_{x(k)} J(x(k))    (16)

The optimal solution of (16) can be obtained by solving

\frac{\partial J(x(k))}{\partial x(k)} = 0    (17)

Setting this gradient to zero yields a set of weighted normal equations, whose solution is

x(k) = \left( H_{k+1}^T(k)\, B_{k+1}(k)\, H_{k+1}(k) \right)^{-1} H_{k+1}^T(k)\, B_{k+1}(k)\, y_{k+1}(k)    (18)

where

B_{k+1}(k) = \mathrm{diag}\!\left( \theta^{k} G_{0,k},\; \theta^{k-1} G_{1,k},\; \ldots,\; \theta G_{k-1,k},\; G_{k,k} \right)    (19)

with G_{b,k} = \mathrm{diag}\!\left( G_\sigma\!\left( y_1(b) - (C A^{b-k})_1\, x(k) \right),\; \ldots,\; G_\sigma\!\left( y_m(b) - (C A^{b-k})_m\, x(k) \right) \right).

Let us define

\chi(k) = H_{k+1}^T(k)\, B_{k+1}(k)\, H_{k+1}(k)    (20)

\gamma(k) = H_{k+1}^T(k)\, B_{k+1}(k)\, y_{k+1}(k)    (21)

Then, we have

x(k) = \chi(k)^{-1}\, \gamma(k)    (22)

Since both χ(k) and γ(k) depend on x(k), the optimal solution (22) is a fixed-point equation of x(k). Thus, one can rewrite (22) as

x(k) = f(x(k))    (23)

where f(·) denotes the corresponding fixed-point mapping. A fixed-point iterative algorithm can then be readily obtained as

\hat{x}(k)_{t+1} = f(\hat{x}(k)_t)    (24)

where \hat{x}(k)_t denotes the estimated solution at the t-th fixed-point iteration.

With the above derivations, we summarize the SSMC algorithm as follows (a minimal code sketch of these steps is given at the end of this subsection):

(1) Choose a proper kernel bandwidth σ and a small positive number ε, and initialize \hat{x}(0) = 0.
(2) Let t = 1 and \hat{x}(k)_0 = A\, \hat{x}(k-1).
(3) Use (25) to compute \hat{x}(k)_t:

\hat{x}(k)_t = \left( H_{k+1}^T(k)\, \tilde{B}_{k+1}(k)_{t-1}\, H_{k+1}(k) \right)^{-1} H_{k+1}^T(k)\, \tilde{B}_{k+1}(k)_{t-1}\, y_{k+1}(k)    (25)

where

\tilde{B}_{k+1}(k)_{t-1} = \mathrm{diag}\!\left( \theta^{k} \tilde{G}_{0,k},\; \theta^{k-1} \tilde{G}_{1,k},\; \ldots,\; \theta \tilde{G}_{k-1,k},\; \tilde{G}_{k,k} \right)    (26)

with \tilde{G}_{b,k} = \mathrm{diag}\!\left( G_\sigma\!\left( y_1(b) - (C A^{b-k})_1\, \hat{x}(k)_{t-1} \right),\; \ldots,\; G_\sigma\!\left( y_m(b) - (C A^{b-k})_m\, \hat{x}(k)_{t-1} \right) \right).

(4) Compare the estimate at the current step with the estimate at the previous step. If (27) holds, set \hat{x}(k) = \hat{x}(k)_t and continue to step (5); otherwise, set t+1 → t and go back to step (3):

\frac{\left\| \hat{x}(k)_t - \hat{x}(k)_{t-1} \right\|}{\left\| \hat{x}(k)_{t-1} \right\|} \leq \varepsilon    (27)

(5) Set k+1 → k and, after acquiring a new measurement y(k), go back to step (2).

Remark 2. Although the proposed SSMC algorithm loses the perfect recursive structure of SSRLS, it can achieve an excellent performance in non-Gaussian environments. The kernel bandwidth σ is a crucial parameter in SSMC. When σ is large enough, the SSMC will behave like the SSRLS. Actually, the following theorem holds.

Theorem 1. When the kernel bandwidth σ → ∞, the SSMC reduces to the SSRLS algorithm.

Proof. See Appendix A.
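The following Python sketch implements steps (1)–(5) in batch form under the stated equations (25)–(27); it is an assumed reference implementation written for this presentation (not the authors' code), with the illustrative build_H helper from the SSRLS sketch redefined so that the block is self-contained.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    return np.exp(-np.asarray(e) ** 2 / (2.0 * sigma ** 2))

def build_H(A, C, k):
    # Stack C A^{-k}, ..., C A^{-1}, C as in Eq. (13)
    A_inv = np.linalg.inv(A)
    return np.vstack([C @ np.linalg.matrix_power(A_inv, k - j) for j in range(k + 1)])

def ssmc_estimate(A, C, ys, theta, sigma, eps=1e-6, x_prev=None, max_iter=100):
    # One SSMC time step: fixed-point iteration (25)-(27) at time k = len(ys) - 1
    k = len(ys) - 1
    m = np.atleast_1d(ys[0]).size
    H = build_H(A, C, k)
    y = np.concatenate([np.atleast_1d(yi) for yi in ys])        # y_{k+1}(k)
    w = np.repeat([theta ** (k - b) for b in range(k + 1)], m)  # forgetting weights

    # step (2): initialize with the one-step prediction A x_hat(k-1)
    x_t = A @ x_prev if x_prev is not None else np.zeros(A.shape[0])
    for _ in range(max_iter):
        e = y - H @ x_t                                    # residual per scalar measurement
        b = w * gaussian_kernel(e, sigma)                  # diagonal of B~_{k+1}(k), Eq. (26)
        HtB = H.T * b                                      # H^T B~
        x_new = np.linalg.solve(HtB @ H, HtB @ y)          # Eq. (25)
        # step (4): relative-change stopping rule, Eq. (27)
        if np.linalg.norm(x_new - x_t) <= eps * max(np.linalg.norm(x_t), 1e-12):
            return x_new
        x_t = x_new
    return x_t
```

Note that, unlike SSRLS, each time step re-solves a weighted normal equation whose weights depend on the current residuals, which is exactly where the robustness to outliers (and the extra computational cost) comes from.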

3.2. Some improvements of the algorithm

Since correntropy is a local similarity measure, the algorithm may not converge to the optimal solution, especially when the initial values deviate greatly from the true values. In this case, one can use the SSRLS to bring the algorithm into the neighbourhood of the optimal solution, and then switch to the SSMC to continue the adaptation. Moreover, the more samples there are, the higher the computational complexity of the SSMC. To reduce the computational complexity when the sample number is large, one can use only the latest samples within a sliding window to estimate the state, as sketched below.
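A minimal illustration of the sliding-window idea follows; it simply truncates the measurement history to the last M samples before running the fixed-point iteration, reusing the hypothetical ssmc_estimate helper from the previous sketch.

```python
def ssmc_sliding_window(A, C, ys, theta, sigma, M, eps=1e-6, x_prev=None):
    # Keep only the latest M measurements; the stacked model (13) then starts
    # at C A^{-(M-1)} but still refers to the state at the current time k.
    return ssmc_estimate(A, C, ys[-M:], theta, sigma, eps=eps, x_prev=x_prev)
```

This caps the per-step cost at that of an M-measurement batch, at the price of discarding older data; the trade-off is quantified in Table 4 of Section 4.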

3.3. Convergence of the fixed-point iterations

Below we present a sufficient condition that guarantees the convergence of the fixed-point iterations in SSMC. The proof is similar to that of [31] and is therefore not included here. Let \| \cdot \|_p be an l_p-norm of a vector, or an induced norm of a matrix defined by \| A \|_p = \max_{\|X\|_p \neq 0} \frac{\| A X \|_p}{\| X \|_p} with p ≥ 1, and let \lambda_{\min}[\cdot] denote the minimum eigenvalue of a matrix. Denote H_{k+1}(k) by H(k) for simplicity, and let (y_{k+1}(k))_i be the i-th element of y_{k+1}(k) and (H(k))_i the i-th row of H(k). According to the results of [31], the following theorem holds.

Theorem 2. Suppose that

\beta > \zeta = \frac{ \sqrt{n}\, \sum_{i=1}^{m(k+1)} \left| (y_{k+1}(k))_i \right| \, \left\| (H(k))_i \right\|_1 }{ \lambda_{\min}\!\left[ \sum_{i=1}^{m(k+1)} (H(k))_i^T (H(k))_i \right] }

and σ ≥ max{σ*, σ†}, in which σ* is the solution of the equation φ(σ) = β, with

\phi(\sigma) = \frac{ \sqrt{n}\, \sum_{i=1}^{m(k+1)} \left| (y_{k+1}(k))_i \right| \, \left\| (H(k))_i \right\|_1 }{ \lambda_{\min}\!\left[ \sum_{i=1}^{m(k+1)} G_\sigma\!\left( \beta \left\| (H(k))_i \right\|_1 + \left| (y_{k+1}(k))_i \right| \right) (H(k))_i^T (H(k))_i \right] }, \quad \sigma \in (0, \infty)    (28)

and σ† is the solution of the equation ψ(σ) = α (0 < α < 1), with

\psi(\sigma) = \frac{ \sqrt{n}\, \sum_{i=1}^{m(k+1)} \left( \beta \left\| (H(k))_i \right\|_1 + \left| (y_{k+1}(k))_i \right| \right) \left\| (H(k))_i \right\|_1 \left( \beta \left\| (H(k))_i^T (H(k))_i \right\|_1 + \left\| (H(k))_i^T (y_{k+1}(k))_i \right\|_1 \right) }{ \sigma^2\, \lambda_{\min}\!\left[ \sum_{i=1}^{m(k+1)} G_\sigma\!\left( \beta \left\| (H(k))_i \right\|_1 + \left| (y_{k+1}(k))_i \right| \right) (H(k))_i^T (H(k))_i \right] }, \quad \sigma \in (0, \infty)    (29)

Then it holds that \| f(x(k)) \|_1 \leq \beta and \| \nabla_{x(k)} f(x(k)) \|_1 \leq \alpha for all x(k) \in \{ x(k) \in \mathbb{R}^n : \| x(k) \|_1 \leq \beta \}, where \nabla_{x(k)} f(x(k)) denotes the n × n Jacobian matrix of f(x(k)) with respect to x(k), that is

\nabla_{x(k)} f(x(k)) = \left[ \frac{\partial}{\partial x_1(k)} f(x(k)) \;\; \cdots \;\; \frac{\partial}{\partial x_n(k)} f(x(k)) \right]    (30)

with

\frac{\partial}{\partial x_j(k)} f(x(k)) = - N_{ww}^{-1} \left( \sum_{i=1}^{m(k+1)} \left[ \frac{1}{\sigma^2}\, e_i(k)\, (H(k))_i^{\,j}\, G_\sigma(e_i(k)) \right] (H(k))_i^T (H(k))_i \right) f(x(k)) + N_{ww}^{-1} \left( \sum_{i=1}^{m(k+1)} \left[ \frac{1}{\sigma^2}\, e_i(k)\, (H(k))_i^{\,j}\, G_\sigma(e_i(k)) \right] (H(k))_i^T (y_{k+1}(k))_i \right)

where e_i(k) = (y_{k+1}(k))_i - (H(k))_i\, x(k), N_{ww} = \sum_{i=1}^{m(k+1)} G_\sigma(e_i(k))\, (H(k))_i^T (H(k))_i, and (H(k))_i^{\,j} is the j-th element of (H(k))_i.

By Theorem 2, if the kernel bandwidth σ is larger than a certain value, we have the following condition:

\| f(x(k)) \|_1 \leq \beta, \qquad \| \nabla_{x(k)} f(x(k)) \|_1 \leq \alpha < 1    (31)

By the Banach fixed-point theorem [29], given an initial state estimate satisfying \| \hat{x}(k)_0 \|_1 \leq \beta, the fixed-point iteration algorithm in SSMC will surely converge to a unique fixed point in the range x(k) \in \{ x(k) \in \mathbb{R}^n : \| x(k) \|_1 \leq \beta \}, provided that the kernel bandwidth σ is larger than a certain value (e.g. max{σ*, σ†}). Theorem 2 implies that the kernel bandwidth has a significant effect on the convergence behavior of SSMC. If the kernel bandwidth σ is too small, the algorithm will diverge or converge very slowly. A larger kernel bandwidth in general ensures a faster convergence speed but usually leads to a poor performance in impulsive noises. In practical situations, the kernel bandwidth can be set manually or optimized by trial and error.

4. An illustrative example

We consider the following state-space model:

x(k) = \begin{bmatrix} \cos\omega & -\sin\omega \\ \sin\omega & \cos\omega \end{bmatrix} x(k-1)    (32)

y(k) = \begin{bmatrix} 1 & 1 \end{bmatrix} x(k) + v(k), \qquad x(0) = \begin{bmatrix} \sin\varphi \\ \cos\varphi \end{bmatrix}    (33)

with ω = π/18 and φ = π/3. The forgetting factor is set at θ = 0.98. First, we consider the case in which the noise v(k) is Gaussian, that is

v(k) \sim N(0,\, 0.01)
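The short Python sketch below shows how this rotation model and both noise settings used in this section can be simulated; it assumes the hypothetical ssmc_estimate helper from the Section 3 sketch and is not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

omega, phi, theta = np.pi / 18, np.pi / 3, 0.98
A = np.array([[np.cos(omega), -np.sin(omega)],
              [np.sin(omega),  np.cos(omega)]])   # Eq. (32)
C = np.array([[1.0, 1.0]])                        # Eq. (33)
x = np.array([np.sin(phi), np.cos(phi)])          # true initial state x(0)

def noise(kind):
    if kind == "gaussian":                        # v(k) ~ N(0, 0.01)
        return rng.normal(0.0, np.sqrt(0.01))
    # mixed-Gaussian impulsive noise: 0.9 N(0, 0.01) + 0.1 N(0, 1)
    return rng.normal(0.0, np.sqrt(0.01)) if rng.random() < 0.9 else rng.normal(0.0, 1.0)

ys, x_hat = [], np.zeros(2)
for k in range(500):
    if k > 0:
        x = A @ x                                 # state propagation, Eq. (32)
    ys.append(C @ x + noise("gaussian"))          # scalar measurement y(k)
    x_hat = ssmc_estimate(A, C, ys, theta, sigma=0.5, eps=1e-6, x_prev=x_hat)
```

The paper additionally runs SSRLS for the first five steps before switching to SSMC; that warm-up is omitted here for brevity, and for long runs the sliding-window variant of Section 3.2 keeps the per-step cost bounded.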

Figs. 1 and 2 show, respectively, the probability densities of the x1 and x2 estimation errors obtained with different filters in Gaussian noise. In the simulation, we set ε = 10^{-6} for SSMC. In addition, to improve the convergence performance, we use the SSRLS to estimate the state during the initial five steps before switching to SSMC. Since the noise is Gaussian, the SSRLS achieves a desirable performance in this example, with a concentrated error distribution. We can also see that when the kernel bandwidth is too small, the SSMC may achieve a poor performance, while when the kernel bandwidth becomes larger, its performance approaches that of the SSRLS. Actually, we prove in Appendix A that when σ → ∞, the SSMC reduces to the SSRLS. In general, one can choose a larger kernel bandwidth to improve the performance in Gaussian noises.

Fig. 1. Probability densities of x1 estimation errors with different filters in Gaussian noise.

Fig. 2. Probability densities of x2 estimation errors with different filters in Gaussian noise.

Second, we consider the case in which the observation noise is an impulsive non-Gaussian noise with a mixed-Gaussian distribution, that is

v(k) \sim 0.9\, N(0,\, 0.01) + 0.1\, N(0,\, 1)

Figs. 3 and 4 illustrate, respectively, the probability densities of the x1 and x2 estimation errors with different filters in non-Gaussian noise. Once again, we set ε = 10^{-6}. As one can see, in impulsive noises, when the kernel bandwidth is too small or too large, the performance of SSMC is not good. However, with a proper kernel bandwidth (say σ = 0.5), the SSMC can outperform the SSRLS significantly, achieving a desirable error distribution with a higher peak and smaller dispersion. Again, when σ is very large, SSMC achieves almost the same performance as the SSRLS.

Fig. 3. Probability densities of x1 estimation errors with different filters in non-Gaussian noise.

Fig. 4. Probability densities of x2 estimation errors with different filters in non-Gaussian noise.

Table 1 shows the average fixed-point iteration numbers for every time step with different σ (where the threshold is set to ε = 10^{-6}), computed as averages over 100 independent Monte Carlo runs, with each run containing 500 time steps. It is evident that the larger the kernel bandwidth, the faster the convergence speed. In general, the fixed-point algorithm in SSMC converges to the optimal solution in only a few iterations. It should be noted that when the kernel bandwidth is too large, the SSMC loses its robustness against impulsive noises.

Table 1. Average iteration numbers for every time step with different σ.

Filter | Average iteration numbers
SSMC (σ = 0.3, ε = 10^{-6}) | 5.134160
SSMC (σ = 0.5, ε = 10^{-6}) | 4.554560
SSMC (σ = 0.7, ε = 10^{-6}) | 4.337320
SSMC (σ = 1.0, ε = 10^{-6}) | 4.115900
SSMC (σ = 3.0, ε = 10^{-6}) | 3.185320
SSMC (σ = 5.0, ε = 10^{-6}) | 2.976920

We further investigate the influence of the threshold ε on the performance. Table 2 illustrates the MSEs of x1 and x2 with different ε (where the kernel bandwidth is set at σ = 0.5), and Table 3 presents the corresponding average fixed-point iteration numbers. Here the MSEs and average iteration numbers are all computed as averages over 100 independent Monte Carlo runs, and in each run 500 samples (time steps) are used for evaluation. One can see that a smaller ε usually results in slightly lower MSEs but needs more iterations to converge. Clearly, the influence of ε is not significant compared with that of the kernel bandwidth σ.

Table 2. MSEs of x1 and x2 with different ε.

Filter | MSE of x1 | MSE of x2
SSMC (σ = 0.5, ε = 10^{-1}) | 0.030993 | 0.019068
SSMC (σ = 0.5, ε = 10^{-2}) | 0.016601 | 0.014468
SSMC (σ = 0.5, ε = 10^{-4}) | 0.016533 | 0.014445
SSMC (σ = 0.5, ε = 10^{-6}) | 0.016531 | 0.014443
SSMC (σ = 0.5, ε = 10^{-8}) | 0.016531 | 0.014443

Table 3. Average iteration numbers for every time step with different ε.

Filter | Average iteration numbers
SSMC (σ = 0.5, ε = 10^{-1}) | 1.013700
SSMC (σ = 0.5, ε = 10^{-2}) | 1.043027
SSMC (σ = 0.5, ε = 10^{-4}) | 2.716137
SSMC (σ = 0.5, ε = 10^{-6}) | 4.533868
SSMC (σ = 0.5, ε = 10^{-8}) | 6.429592

As mentioned in Section 3, a sliding window method can be applied to reduce the computational complexity of SSMC. In this last part, we show how the sliding window length (denoted by M here) influences the performance of the algorithm. Table 4 gives the MSEs (over 100 independent Monte Carlo runs) of x1 and x2 with different sliding window lengths. In the simulation, the parameters are set at σ = 0.5 and ε = 10^{-6}. It can be seen that the larger the sliding window length, the smaller the MSEs. In practical applications, the sliding window length can be set by making a trade-off between performance and computational complexity.

Table 4. MSEs of x1 and x2 with different sliding window lengths.

Filter | MSE of x1 | MSE of x2
SSMC (M = 20) | 1.3030 × 10^{-3} | 9.7989 × 10^{-4}
SSMC (M = 30) | 1.0911 × 10^{-3} | 6.1459 × 10^{-4}
SSMC (M = 50) | 8.0045 × 10^{-4} | 4.6232 × 10^{-4}
SSMC (M = 100) | 6.6682 × 10^{-4} | 3.5132 × 10^{-4}
SSMC (M = 200) | 6.2922 × 10^{-4} | 3.2356 × 10^{-4}
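For readers who want to reproduce this style of evaluation, a minimal Monte Carlo MSE sketch is given below; the run and step counts mirror the setup described above, while the run_filter callback is a hypothetical helper (not from the paper) that would wrap the simulation and filtering code sketched earlier.

```python
import numpy as np

def monte_carlo_mse(run_filter, n_runs=100, n_steps=500):
    """Average squared estimation errors of x1 and x2 over independent runs.

    run_filter(n_steps) is assumed to return two arrays of shape (n_steps, 2):
    the true states and the corresponding filter estimates.
    """
    sq_err = np.zeros(2)
    for _ in range(n_runs):
        x_true, x_est = run_filter(n_steps)
        sq_err += np.mean((x_true - x_est) ** 2, axis=0)
    return sq_err / n_runs  # [MSE of x1, MSE of x2]
```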

5. Conclusion

A new state space based adaptive filtering algorithm, called the state space maximum correntropy (SSMC) filter, has been developed in this work. It is derived by using the maximum correntropy criterion (MCC) as the optimality criterion, instead of the celebrated minimum mean square error (MMSE) criterion. In particular, the proposed SSMC uses a fixed-point algorithm to update the solution. The convergence of the fixed-point iteration is ensured if the kernel bandwidth is larger than a certain value. The SSMC behaves like the SSRLS when the kernel bandwidth is large enough. With a proper kernel bandwidth, however, the SSMC exhibits superior performance, as shown by the simulation results, especially when the system is disturbed by non-Gaussian impulsive noises.

Acknowledgments

This work was supported by the 863 Program (No. 2015AA015702) and the National Natural Science Foundation of China (No. 61371087).

Appendix A. Proof of Theorem 1

First, we have the following limit with point-wise convergence:

\lim_{\sigma \to \infty} G_\sigma\!\left( y_i(b) - (C A^{b-k})_i\, x(k) \right) = \lim_{\sigma \to \infty} \exp\!\left( - \frac{ \left( y_i(b) - (C A^{b-k})_i\, x(k) \right)^2 }{ 2\sigma^2 } \right) = 1    (A.1)

Next, we show that the above convergence is also uniform. Let D be the maximum value of \left| y_i(b) - (C A^{b-k})_i\, x(k) \right|, which surely exists for a finite number of samples. It follows easily that

\max \left| G_\sigma\!\left( y_i(b) - (C A^{b-k})_i\, x(k) \right) - 1 \right| = 1 - \exp\!\left( - \frac{D^2}{2\sigma^2} \right)

Assume that \left| G_\sigma(x) - 1 \right| < \delta for all |x| \leq D and a sufficiently large σ. Then we have 1 - \exp\!\left( - \frac{D^2}{2\sigma^2} \right) < \delta, that is, \sigma > \sqrt{ \frac{D^2}{2 \ln\!\left( 1/(1-\delta) \right)} }. Therefore, the function sequence G_\sigma\!\left( y_i(b) - (C A^{b-k})_i\, x(k) \right) is uniformly convergent with limit 1. From the above result, we obtain

\lim_{\sigma \to \infty} G_{b,k} = \lim_{\sigma \to \infty} \mathrm{diag}\!\left( G_\sigma\!\left( y_1(b) - (C A^{b-k})_1\, x(k) \right),\; \ldots,\; G_\sigma\!\left( y_m(b) - (C A^{b-k})_m\, x(k) \right) \right) = \mathrm{diag}(1, \ldots, 1) = I_m    (A.2)

Then, by (19) and (A.2), we have

\lim_{\sigma \to \infty} B_{k+1}(k) = \mathrm{diag}\!\left( \theta^{k} I_m,\; \theta^{k-1} I_m,\; \ldots,\; \theta I_m,\; I_m \right) = W_{k+1}(k)

In this case, the optimal solution (18) is identical to the weighted LS solution (14), which can be recursively updated by the SSRLS. This completes the proof.

References

[1] M.B. Malik, A. Hassan, I. Ghazi, M.U. Hakeem, On spectrum estimation using state space recursive least squares (SSRLS), in: Proceedings of 8th International Multitopic Conference (INMIC), 2004, pp. 196–200.
[2] M.B. Malik, A. Hasan, N. ur Rehman, M. Shahzad, State-space recursive least-squares based object tracking, in: Proceedings of 8th International Multitopic Conference (INMIC), 2004, pp. 180–183.
[3] J. Dai, J. Wang, H. Chen, D. Li, Estimating the frequency in power system based on state space recursive least squares, in: Proceedings of 3rd IEEE Conference on Industrial Electronics and Applications (ICIEA), 2008, pp. 1444–1447.
[4] W. Liu, J.C. Principe, S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, John Wiley & Sons, Hoboken, NJ, 2010.
[5] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1996.
[6] T. Kailath, A. Sayed, B. Hassibi, Linear Estimation, Prentice-Hall, Englewood Cliffs, NJ, 2000.
[7] M.B. Malik, State-space RLS, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 6, 2003, pp. 645–648.
[8] M.B. Malik, State-space recursive least-squares: Part I and II, Signal Process. 84 (2004) 1709–1728.
[9] M.B. Malik, H. Qureshi, R.A. Bhatti, Tracking of linear time varying systems by state-space recursive least-squares, in: Proceedings of International Symposium on Circuits and Systems (ISCAS), vol. 3, 2004, pp. 305–308.
[10] J.C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, Springer, New York, USA, 2010.
[11] B. Chen, Y. Zhu, J. Hu, J.C. Principe, System Parameter Identification: Information Criteria and Algorithms, Newnes, Oxford, UK, 2013.
[12] W. Liu, P.P. Pokharel, J.C. Principe, Correntropy: properties and applications in non-Gaussian signal processing, IEEE Trans. Signal Process. 55 (11) (2007) 5286–5298.
[13] B. Chen, J.C. Principe, Maximum correntropy estimation is a smoothed MAP estimation, IEEE Signal Process. Lett. 19 (2012) 491–494.
[14] B. Chen, L. Xing, J. Liang, N. Zheng, J.C. Principe, Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion, IEEE Signal Process. Lett. 21 (7) (2014) 880–884.
[15] W. Ma, H. Qu, G. Gui, L. Xu, J. Zhao, B. Chen, Maximum correntropy criterion based sparse adaptive filtering algorithms for robust channel estimation under non-Gaussian environments, J. Frankl. Inst. 352 (7) (2015) 2708–2727.
[16] A. Singh, J.C. Principe, Using correntropy as a cost function in linear adaptive filters, in: Proceedings of International Joint Conference on Neural Networks (IJCNN), 2009, pp. 2950–2955.
[17] L. Chen, H. Qu, J. Zhao, B. Chen, J.C. Principe, Efficient and robust deep learning with correntropy-induced loss function, Neural Comput. Appl. 27 (4) (2016) 1019–1031.
[18] X. Chen, J. Yang, J. Liang, Q. Ye, Recursive robust least squares support vector regression based on maximum correntropy criterion, Neurocomputing 97 (2012) 63–73.
[19] R. He, B. Hu, X. Yuan, L. Wang, Robust Recognition via Information Theoretic Learning, Springer, Amsterdam, The Netherlands, 2014.
[20] A. Singh, J.C. Principe, A loss function for classification based on a robust similarity metric, in: Proceedings of International Joint Conference on Neural Networks (IJCNN), 2010, pp. 1–6.
[21] A. Gunduz, J.C. Principe, Correntropy as a novel measure for nonlinearity tests, Signal Process. 89 (1) (2009) 14–23.
[22] R. He, W. Zheng, B. Hu, Maximum correntropy criterion for robust face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1561–1576.
[23] R. He, B. Hu, W. Zheng, X. Kong, Robust principal component analysis based on maximum correntropy criterion, IEEE Trans. Image Process. 20 (6) (2011) 1485–1494.
[24] J. Xu, J.C. Principe, A pitch detector based on a generalized correlation function, IEEE Trans. Audio Speech Lang. Process. 16 (8) (2008) 1420–1432.
[25] R.J. Bessa, V. Miranda, J. Gama, Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting, IEEE Trans. Power Syst. 24 (4) (2009) 1657–1666.
[26] B. Chen, X. Liu, H. Zhao, J.C. Principe, Maximum correntropy Kalman filter, arXiv:1509.04580.
[27] L. Shi, Y. Lin, Convex combination of adaptive filters under the maximum correntropy criterion in impulsive interference, IEEE Signal Process. Lett. 21 (11) (2014) 1385–1388.
[28] B. Chen, L. Xing, H. Zhao, N. Zheng, J.C. Principe, Generalized correntropy for robust adaptive filtering, IEEE Trans. Signal Process. 64 (13) (2016) 3376–3387.
[29] R.P. Agarwal, M. Meehan, D. O'Regan, Fixed Point Theory and Applications, Cambridge University Press, Cambridge, UK, 2001.
[30] A. Singh, J.C. Principe, A closed form recursive solution for maximum correntropy training, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 2070–2073.
[31] B. Chen, J. Wang, H. Zhao, N. Zheng, J.C. Principe, Convergence of a fixed-point algorithm under maximum correntropy criterion, IEEE Signal Process. Lett. 22 (10) (2015) 1723–1727.
[32] B. Chen, Y. Zhu, J. Hu, Mean-square convergence analysis of ADALINE training with minimum error entropy criterion, IEEE Trans. Neural Netw. 21 (7) (2010) 1168–1179.
[33] W. Ma, H. Qu, J. Zhao, B. Chen, J.C. Principe, Sparsity aware minimum error entropy algorithms, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2179–2183.
[34] B. Chen, P. Zhu, J.C. Principe, Survival information potential: a new criterion for adaptive system training, IEEE Trans. Signal Process. 60 (3) (2012) 1184–1194.
[35] B. Chen, J. Wang, H. Zhao, J.C. Principe, Insights into entropy as a measure of multivariate variability, Entropy 18 (2016) 196. http://dx.doi.org/10.3390/e18050196.
[36] D. Erdogmus, J.C. Principe, An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems, IEEE Trans. Signal Process. 50 (7) (2002) 1780–1786.
[37] D. Erdogmus, J.C. Principe, Generalized information potential criterion for adaptive system training, IEEE Trans. Neural Netw. 13 (5) (2002) 1035–1044.
[38] I. Santamaria, D. Erdogmus, J.C. Principe, Entropy minimization for supervised digital communications channel equalization, IEEE Trans. Signal Process. 50 (5) (2002) 1184–1192.
[39] W.J. Rugh, Linear System Theory, Prentice-Hall, Englewood Cliffs, NJ, 2001.
[40] R.E. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME—J. Basic Eng. Ser. D 82 (1960) 35–45.
[41] N.E. Nahi, Estimation Theory and Applications, Wiley, New York, NY, 1969.