Fuzzy theoretic approach to signals and systems: Static systems

Fuzzy theoretic approach to signals and systems: Static systems

Information Sciences 418–419 (2017) 668–702 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate...

3MB Sizes 0 Downloads 98 Views

Information Sciences 418–419 (2017) 668–702

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

Fuzzy theoretic approach to signals and systems: Static systemsR Mohit Kumar a,b,1, Yihua Mao d,1, Yuhao Wang c, Taorong Qiu c, Yang Chenggen c, Weiping Zhang c,∗ a

Binhai Industrial Technology Research Institute of Zhejiang University, Tianjin 300301, China Faculty of Computer Science and Electrical Engineering, University of Rostock, Germany c Department of Electronic Information Engineering, Nanchang University, Nanchang, China d Zhejiang University College of Civil Engineering and Architecture, Hangzhou 310027, China b

a r t i c l e

i n f o

Article history: Received 29 September 2016 Revised 1 August 2017 Accepted 12 August 2017 Available online 14 August 2017 Keywords: Modeling Membership functions Variational optimization

a b s t r a c t “Fuzzy Theoretic Approach to Signals and Systems” assumes all system variables and parameters as uncertain (i.e. being characterized by membership functions), develops a mathematical theory for analytically determining the membership functions on system variables and parameters, derives algorithms for estimating the parameters of membership functions, and establishes robustness and convergence properties of the estimation algorithms. The membership functions are analytically determined via solving a variational optimization problem that maximizes the “over-uncertainties-averaged-log-membership” of the observed data around an initial guess. This paper develops the analytical fuzzy theory for the particular case of a multi-input single-output static system affected by noises. The theory facilitates designing an adaptive filtering algorithm. The robustness of the adaptive filtering algorithm is proved theoretically via a mathematical analysis. Numerical experiments further demonstrate the robustness of the filtering algorithm. A comparison of the algorithm with the state-of-art methods is made using the practical biomedical applications related to the modeling and analysis of heart rate signals for assessing the physiological state of an individual. © 2017 Elsevier Inc. All rights reserved.

1. Introduction The data-driven models capture regularities in the data. The learning of a model is the process of adjusting the flexible model parameters to best explain the observed data. The observed data would inevitably possess a certain degree of uncertainty regarding the representation of the true process. The uncertainties, if not taken care of, would lead to erroneously learned models. The aim of this study is to formalize the idea of applying membership functions to take care of uncertainties during the learning of data models. The modeling methods fit the observed input-output data (x, y) through adjustment of model parameters α while taking into consideration the unobserved (latent or hidden) variables h. Specifically, the central

R ∗

1

This work was supported partially by National Natural Science Foundation of China (NSFC: 61662045). Corresponding author. E-mail address: [email protected] (W. Zhang). These authors contributed equally to the paper as first authors.

http://dx.doi.org/10.1016/j.ins.2017.08.048 0020-0255/© 2017 Elsevier Inc. All rights reserved.

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

669

theme of this study is to determine optimal membership functions on observed data (x, y), unknown model parameters α , and unobserved (obviously unknown) variables h. Despite the fuzzy theory being a densely studied subject, most (if not all) fuzzy researchers have to rely on numerical methods for the design of fuzzy systems. Several previous studies have focused on the estimation theory of linear parameters of Takagi-Sugeno type fuzzy models while relying on ad-hoc numerical algorithms for determining membership functions. This is understandable as the design criterion is of optimizing a certain objective functional which due to nonlinearity can’t be analytically optimized with respect to membership function parameters. Those studies have applied the mathematical tools of system and control theory for linear fuzzy model parameters [9–11,15–18]. The numerical algorithms extensively used in literature for designing fuzzy systems includes multi-objective evolutionary [1,2,7,8,24,25] and data clustering algorithms [5,6,19,22]. On the other hand, treating the fuzzy model parameters as random variables, the concept of variational Bayes borrowed from probability theory has been applied to design stochastic fuzzy systems [12–14]. The author in [4] has also suggested a fuzzy approach to signal theory in parallel to the probabilistic approach via incorporating uncertainty in the description of mathematical transformations typically used for signal analysis. However, our focus is on the learning aspect of an uncertain signal model and not on introducing new fuzzy based mathematical transformations to extract signal features. While the research presented in [14] provides a mathematical framework to design both fuzzy filtering and analysis algorithms in a unified manner, the developed theory being built up for stochastic fuzzy models contributes more to the probability theory than the fuzzy theory. It is worth mentioning that attempts have been made to apply the tools and concepts such as H ∞ −filtering [26], Kalman filtering [21], and information-theoretic measures [3,20] for determining membership functions. However, a “self-contained” theory on the optimization of membership functions was not available. A large literature (see, e.g. [27–29]) is available in the fuzzy control area where a fuzzy mixture of linear systems is used to approximate the nonlinear system with the aim of achieving robustness and stability of the filtering errors. Our problem is different from the fuzzy filtering problem extensively studied in fuzzy control area as our aim is to identify the unknown process model. We observe that the development of powerful analytical methods for the inclusion of uncertainties into mathematical modeling and analysis of data has been a challenge for the fuzzy research community. Our research group develops the mathematical tools to facilitate the design and analysis of membership functions which take care of the uncertainties affecting signals and systems. This contributes to the so-called “Fuzzy Theoretic Approach to Signals and Systems”. “Fuzzy Theoretic Approach to Signals and Systems” assumes all system variables and parameters as uncertain (i.e. being characterized by membership functions), develops a mathematical theory for analytically determining the membership functions on system variables and parameters, derives algorithms for estimating the parameters of membership functions, and establishes robustness and convergence properties of the estimation algorithms. 
Our approach to fuzzy theory development, for a given modeling scenario, consists of following steps: 1. An initial guess is made regarding the functional representation of multivariate membership functions on model parameters (α ) and hidden variables (h). The initial guess regarding the membership functions on α and h is represented as μ(α ) and μ(h) respectively. 2. A mathematical expression of the joint membership function on input-output data (x, y) is derived taking into consideration the structure of the model and the relationships among variables. The joint membership function on input-output data, being represented as μ(x, y; h, α ), is a function of h and α . 3. The average value of a function f(x) with respect to x, where x is an uncertain variable with μ(x) as the membership function to characterize its uncertainty, is calculated as

 f (x )μ(x) = 

1 ∂ x μ (x )



∂ x μ ( x ) f ( x ).

The membership functions on α and h are determined by solving following problems: (a) The averaged logarithmic value of μ(x, y; h, α ) is maximized with respect to arbitrary membership functions having fixed integrals (i.e. with fixed area under their curves). Mathematically,



{q (α ; kα ), q (h; kh )} = arg max ∗



q(α ),q(h )



log (μ(x, y; h, α ) )q(h)q(α ) − log



q (α )

μ (α )



 − log q (α )



q (h ) μ (h )





q (h )

subjected to the fixed integral constraints:



∂α q(α ) = kα > 0,



∂ h q ( h ) = kh > 0.

Here, q(α ) and q(h) represent arbitrary membership functions on α and h respectively. The optimization process maximizes an objective functional which consists of three terms. The first term equals the average log-membership value, when average is taken over uncertain variables α and h with q(α ) and q(h) respectively as the membership functions for representing the uncertainties. The second and third terms of the objective functional regularize the solution towards the initial guess (μ(α ), μ(h)). (b) Once the analytical expressions for q∗ (α ; kα ) and q∗ (h; kh ) have been derived using variational optimization technique, the values of kα and kh are so chosen such that maximum value of q∗ (α ; kα ) and q∗ (h; kh ) w.r.t. α and h respectively

670

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

remains as equal to one. That is, two positive constants k∗α and k∗h are found such that

max q∗ (α ; k∗α ) = 1, max q∗ (h; k∗h ) = 1. α

h

(c) The optimal membership functions on α and h are given as

q∗ (α ) = q∗ (α ; k∗α ), q∗ (h ) = q∗ (h; k∗h ). 4. Finally, the data model is created by defining the so-called optimal joint membership function on (x, y) as

log (μ∗ (x, y ) ) ∝ log (μ(x, y; h, α ) )q∗ (h )q∗ (α ) . The quantity μ∗ (x, y) results from averaging over α and h (i.e. integrating out both hidden variables and model parameters from the log-membership of data) and could play a key role to choose between different models in model selection task. Remark 1 (Originality). A vast literature studies signals and systems in a stochastic setting assuming variables as random (i.e. characterized by probability distribution functions) where Bayesian theory is typically used to infer the posterior distributions of the variables given the observed data. The originality of our approach lies in considering the modeling problem in a completely deterministic framework assuming variables as uncertain (i.e. characterized by membership functions) and determining membership functions via maximizing the “over-uncertainties-averaged-log-membership” of the observed data around an initial guess. Remark 2 (Significance). “Fuzzy Theoretic Approach to Signals and Systems” would provide a mathematical framework for modeling and analysis of signals and systems where uncertainties are taken care in a sensible manner via optimally determining membership functions on uncertain variables. The approach would facilitate the development of a mathematical theory on the robustness and convergence of modeling algorithms. Remark 3 (Relationship to Stochastic Fuzzy Theory). The work presented in [14] has considered in a stochastic framework the same signal modeling problem as that of this study. Also, the both studies apply variational optimization to maximize the formulated objective functions. However, there are fundamental differences in the approaches of two studies as described below. • The approach of [14] has considered the probability distribution functions to model system variables while this study defines membership functions over system variables for a modeling. Thus, the approach of this study is completely deterministic in nature in contrast to the stochastic approach of [14]. • The filtering algorithm is derived in [14] via approximating the true posterior distributions of variables under Bayesian framework. This study derives the filtering algorithm via maximizing the “over-uncertainties-averaged-log-membership” of the observed data around an initial guess. • The signal analysis algorithm of [14] is based on the maximization of an information theoretic quantity referred to as “mutual information”. This study, however, suggest the membership functions as the instrument to calculate quantitatively the degree-of-belongingness of a signal to a model. Our aim is to develop mathematical fuzzy theory for various signal and system models including static systems, state space models, and Markov models. This paper develops the theory for the particular case of a multi-inputs-single-output static system being affected by noises. The application of our methodology on the static system results in an adaptive filtering algorithm. As the filtering algorithm is in the form of closed-form expressions, a time-domain analysis of the algorithm is facilitated. The aim of the time-domain analysis is to establish mathematically the robustness and convergence properties of the algorithm. The upper bounds on the magnitudes of errors are derived to prove the robustness of the algorithm. The robustness of the algorithm against data outliers is also verified by the simulations. 
A comparison of the algorithm with existing state-of-art methods is made using the practical biomedical applications related to the modeling and analysis of heart rate signals for assessing the physiological state of an individual. The aforementioned contribution of this study (i.e. development of analytical fuzzy theory for modeling static systems) is original to authors’ best knowledge and should encourage the fuzzy researchers to solve the model learning problem in an analytical manner. 2. Background and problem formulation 2.1. A multi-inputs-single-output static system T

A multi-inputs-single-output system, as shown in Fig. 1, is considered. With reference to Fig. 1, o = [o1 · · · on ] ∈ Rn is the T T input vector contaminated with input noise vector  = [1 · · · n ] ∈ Rn . The vector x = [x1 · · · xn ] ∈ Rn is the measured input T vector. In Fig. 1, the input vector x is mapped to a scalar (G(x))T α , where G(x ) = [G1 (x ) · · · GK (x )] ∈ RK with Gi (x ) ∈ R, ∀i ∈ {1, T 2, , K}, is a nonlinear function of x. The vector α = [α1 · · · αK ] ∈ RK collects the system parameters and thus characterizes

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

671

Fig. 1. A static system with inputs and output noises.

the system. The measured system output (y ∈ R) is being disturbed by output noise (v ∈ R). It follows from Fig. 1 that

x = o+

(1)

y = (G(x ))T α + v

(2)

Here, the input vector (x) and output scalar (y) are observed variables while o,  , α , and v are unknown. 2.2. Defining membership functions on system variables This subsection defines membership functions on the variables for their modeling. With the intent of keeping our methodology in its most basic form, we choose just following two types of membership functions for the variables: Definition 1 (Gaussian membership function). The Gaussian membership function on a vector x ∈ Rn can be defined as

1 μ(x; mx , x ) = exp − (x − mx )T x (x − mx ) , mx ∈ Rn , −1 x > 0. 2

Definition 2 (Gamma membership function). The Gamma membership function on a non-negative scalar z can be defined as



μ(z; a, b) =

b a−1

a−1

exp(a − 1 )(z )a−1 exp(−bz ), a ≥ 1, b > 0.

A few examples of this type of membership functions for different values of a and b are provided in Fig. 2. The parameters a is referred to as the shape parameter and b is referred to as the rate parameters (i.e. the reciprocal of the scale parameter). The peak of the membership function is given at (a − 1 )/b. The skewness of the membership function is inversely proportional to the value of a. The Gamma membership function can alternatively be represented as

μ(z; r, s ) = (s )r exp (r )(z )r exp (−srz ), r ≥ 0, s > 0. Definition 3 (Membership function on o). The membership function on o ∈ Rn is defined as Gaussian as

1 μ(o; mo, o ) = exp − (o − mo )T o (o − mo ) , mo ∈ Rn , o > 0. 2

Definition 4 (Membership function on  j ). The membership function on  j ∈ R, ∀j ∈ {1, , n}, is defined as zero-mean Gaussian with scaled precision as

 

λx zx j 2 μ  j ; λx , zx j = exp − j 2

where zx j > 0 is the uncertain scaling factor characterized by the following Gamma membership function:



μ zx j ; rx , sx = (sx )rx exp(rx )(zx j )rx exp(−rx sx zx j ), rx ≥ 0, sx > 0.

Here, λx > 0, rx ≥ 0, and sx > 0 are uncertain as well characterized by the following Gamma membership functions:





μ λx ; aλx , bλx =  μ ( rx ; arx , brx ) =

bλx aλx − 1

brx arx − 1

aλx −1

exp(aλx − 1 )(λx )aλx −1 exp(−bλx λx )

arx −1 exp(arx − 1 )(rx )arx −1 exp(−brx rx )

672

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

Fig. 2. A few examples of Gamma membership functions.

 μ ( sx ; asx , bsx ) =

bsx asx − 1

asx −1 exp(asx − 1 )(sx )asx −1 exp(−bsx sx )

where aλx ≥ 1, bλx > 0, arx ≥ 1, brx > 0, asx ≥ 1, bsx > 0. Remark 4. The motivation of defining membership function on  j , as in Definition 4, is derived from the fact that noise  j might be long-tailed (e.g. Student-t distributed). Being inspired from the fact that Student-t distribution is an infinite mixture of Gaussian distributions with the common mean and Gamma-density-scaled precisions, the precision λx is scaled by the uncertain factor zx j and Gamma membership is defined on zx j in Definition 4 in parallel to the case of Student-t distribution on  j . Definition 5 (Membership function on x). The membership function on x ∈ Rn , for a given (o, λx , zx = [zx1 · · · zxn ] ∈ Rn ), is defined as T

μ(x; o, zx , λx ) =

n 





exp −

j=1

λx zx j 2



(x j − o j )

2

.

Definition 6 (Membership function on α ). The Membership function on α ∈ RK is defined as Gaussian as

1 μ(α ; mα , α ) = exp − (α − mα )T α (α − mα ) , mα ∈ RK , α > 0. 2

Definition 7 (Membership function on v). The membership function on v ∈ R, is defined as zero-mean Gaussian with scaled precision as

  λz μ(v; λy , zy ) = exp − y y v2 2

where zy > 0 is the uncertain scaling factor characterized by the following Gamma membership function:

μ(zy ; ry , sy ) = (sy )ry exp(ry )(zy )ry exp(−ry sy zy ), ry ≥ 0, sy > 0. Here, λy > 0, ry ≥ 0, and sy > 0 are uncertain as well characterized by the following Gamma membership functions:



μ λy ; aλy , bλy =

μ ry ; ary , bry =





aλy −1

bλy aλy − 1

bry ary − 1

exp(aλy − 1 )(λy )aλy −1 exp(−bλy λy ),

ary −1 exp(ary − 1 )(ry )ary −1 exp(−bry ry ),

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702



μ sy ; asy , bsy =



bsy asy − 1

673

asy −1 exp(asy − 1 )(sy )asy −1 exp(−bsy sy )

where aλy ≥ 1, bλy > 0, ary ≥ 1, bry > 0, asy ≥ 1, bsy > 0. Remark 5. The motivation of defining membership function on v, as in Definition 7, is derived from the fact that noise v might be long-tailed (e.g. Student-t distributed). Being inspired from the fact that Student-t distribution is an infinite mixture of Gaussian distributions with the common mean and Gamma-density-scaled precisions, the precision λy is scaled by the uncertain factor zy and Gamma membership is defined on zy in Definition 7 in parallel to the case of Student-t distribution on v. Definition 8 (Membership function on y). The membership function on y ∈ R, for a given (x, zy , λy , α ), is defined as

 

2 λz μ(y; x, zy , λy , α ) = exp − y y y − (G(x ) )T α . 2

Definition 9 (Membership function on “x and y”). The membership function on “x and y”, for a given (o, zx , λx , zy , λy , α ), is defined as the product of membership functions on x and y as



n  1  μ(x, y; o, zx , λx , zy , λy , α ) = exp − λx zx j (x j − o j )2 − 2



j=1

λy zy 2



zx1 1 T⎣ ⎝ = exp − λx (x − o) 2

y − (G (x ) )

⎤ ..

⎦(x − o) −

. zxn

T

α

2

λy zy 2



y − (G (x ) )

T

2



α ⎠.

2.3. Membership functions identification as an optimization problem This subsection formulates the membership functions’ identification problem to a deterministic optimization problem. The identification criterion could be of maximizing log-membership of “x and y”, i.e., log (μ(x, y; o, zx , λx , zy , λy , α )). The log-membership of “x and y” is a function of uncertain variables and thus it makes sense to maximize the over-uncertaintiesaveraged log-membership. However, the uncertainties (i.e. the membership functions modeling the uncertain variables) are unknown too and thus the log-membership value can’t be averaged over uncertainties without determining the membership functions on uncertain variables. To identify the membership functions, assume that q(zx j ), q(rx ), q(sx ) q(λx ), q(o), q(zy ), q(ry ), q(sy ), q(λy ), and q(α ) are arbitrary membership functions on zx j , rx , sx , λx , o, zy , ry , sy , λy , and α respectively. For a simplicity of mathematical expressions, define a set , a functional q(), and a differential ∂  as follows

 = {zx1 , · · · , zxn , rx , sx , λx , o, zy , ry , sy , λy , α} q() = q(zx1 ) · · · q(zxn )q(rx )q(sx )q(λx )q(o)q(zy )q(ry )q(sy )q(λy )q(α ) ∂  = ∂ zx1 · · · ∂ zxn ∂ rx ∂ sx ∂ λx ∂ o∂ zy ∂ ry ∂ sy ∂ λy ∂ α . Define another functional μ() as



μ() =

n 



μ(zx j ; rx , sx ) μ(rx ; arx , brx )μ(sx ; asx , bsx )μ(λx ; aλx , bλx )μ(o; mo, o )

j=1

×μ(zy ; ry , sy )μ(ry ; ary , bry )μ(sy ; asy , bsy )μ(λy ; aλy , bλy )μ(α ; mα , α ). The arbitrary membership functions are determined by maximizing an objective functional defined as

F (q(zx j ), q(rx ), q(sx ), q(λx ), q(o), q(zy ), q(ry ), q(sy ), q(λy ), q(α ))   1 1 =  ∂  q() log (μ(x, y; o, zx , λx , zy , λy , α )) −  ∂  q() ∂  q()

 ∂  q() log

q()

μ()

 .

(3)

The objective functional F is maximized w.r.t. q(zx j ), q(dx ), q(λx ), q(o), q(zy ), q(dy ), q(λy ), and q(α ) under the following constraints: Fixed Integral Constraints:



∂ z x j q ( z x j ) = kzx j > 0,

 

∂ rx q ( rx ) = krx > 0, ∂ sx q ( sx ) = ksx > 0,

674

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702



∂λx q(λx ) = kλx > 0,

 

∂ z y q ( z y ) = kzy > 0,



∂ ry q ( ry ) = kry > 0,

 

∂ o q(o) = ko > 0,

∂ sy q ( sy ) = ksy > 0, ∂λy q(λy ) = kλy > 0,



∂α q(α ) = kα > 0.

Unity Maximum Value Constraints: The values of kzx , krx , ksx , kλx , ko , kzy , kry , ksy , kλy , and kα are so chosen such that j

maximum value of q(zx j ), q(rx ), q(sx ), q(λx ), q(o), q(zy ), q(ry ), q(sy ), q(λy ), and q(α ) remains equal to one. The first term of F computes the log-membership value after averaging over all of the uncertain variables. The second term of F regularizes the maximization problem towards pre-defined functional form of the membership functions (i.e. towards μ(zx j ; rx , sx ), μ(rx ; arx , brx ), μ(sx ; asx , bsx ), μ(λx ; aλx , bλx ), μ(o; mo , o ), μ(zy ; ry , sy ), μ(ry ; ary , bry ), μ(sy ; asy , bsy ), μ(λy ; aλy , bλy ), and μ(α ; mα , α )). 3. Variational optimization for membership functions identification 3.1. Optimal membership functions Result 1. The analytical expressions for variational membership functions, that maximize F under Fixed Integral Constraints and Unity Maximum Value Constraints, are



ˆ o (o − m ˆ o )T  ˆ o) , q∗ (o) = exp − 12 (o − m

⎡ aˆ ⎢ ⎢ ⎢ x ⎣

aˆλx bˆ λ

ˆ o = o + 

zx

⎤ 1

bˆ zx

1

..

aˆzxn bˆ z



bˆ λx aˆλx −1

aˆλx −1

(4)

xn

⎧ ⎪ ⎪ ⎪ −1 ⎨ ˆo ˆo =  m omo + ⎪ ⎪ ⎪ ⎩ q∗ (λx ) =

.

⎥ ⎥ ⎥ ⎦

⎡ aˆ ⎢ ⎢ ⎢ x ⎣

aˆλx bˆ λ

zx

1

bˆ zx

1

..

.

⎤ ⎫ ⎪ ⎪ ⎪ ⎥ ⎬ ⎥ ⎥x . ⎦ ⎪ ⎪ aˆzxn ⎪ ⎭

(5)

bˆ zxn





exp(aˆλx − 1 )(λx )aˆλx −1 exp −bˆ λx λx ,

aˆλx = aλx

(6)

n 1  aˆzx j bˆ λx = bλx + 2 ˆ j=1 bzx j

 q ( zx j ) = ∗

aˆzx j =

bˆ zx j aˆzx j − 1

aˆrx +1 bˆ rx

'

aˆzx

j

ˆo x j − Ijm

2

+ Tr



ˆo 

−1

ITj I j

−1

exp(aˆzx j − 1 )(zx j )

aˆzx −1 j

(

.



(7)



exp −bˆ zx j zx j ,

(8)

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

aˆr aˆs 1 aˆλx bˆ zx j = x x + 2 bˆ λ ˆbr bˆ s x x x

 q∗ ( rx ) =

bˆ rx aˆrx − 1

'

ˆo x j − Ijm

2

+ Tr



ˆo 

−1

aˆrx −1

ITj I j

675

(

.



(9)



exp(aˆrx − 1 )(rx )aˆrx −1 exp −bˆ rx rx ,

aˆrx = arx

(10)

'

(

( n n '

 aˆs  aˆzx j bˆ rx = brx + x − n ψ aˆsx − log bˆ sx −n− ψ aˆzx j − log bˆ zx j . bˆ sx j=1 bˆ zx j j=1

(11)

 q ( sx ) = ∗

bˆ sx aˆsx − 1

aˆsx = asx + n

aˆsx −1





exp(aˆsx − 1 )(sx )aˆsx −1 exp −bˆ sx sx ,

aˆrx bˆ r

(12)

x

n aˆr  aˆzx j bˆ sx = bsx + x . bˆ rx j=1 bˆ zx j

(13)





1 ˆ α (α − m ˆ α )T  ˆ α) , q∗ (α ) = exp − (α − m 2 ˆ α = α + 

aˆλy aˆzy T G (x ) (G (x ) ) bˆ λ bˆ z



ˆα ˆα =  m

−1





aˆλ aˆz α mα + y y G(x )y . bˆ λ bˆ z y

 q∗ (λy ) =

(14)

y

y

(15)

y

aˆλy −1

bˆ λy





exp(aˆλy − 1 )(λy )aˆλy −1 exp −bˆ λy λy ,

aˆλy − 1

aˆλy = aλy

(16)

1 aˆzy bˆ λy = bλy + 2 bˆ z

'

ˆα y − (G (x ) ) m T

2

+ Tr



ˆα 

−1

G (x ) (G (x ) )

T

(

.

(17)

y

 q ( zy ) = ∗

aˆzy =

bˆ zy aˆzy − 1

aˆzy −1





exp(aˆzy − 1 )(zy )aˆzy −1 exp −bˆ zy zy ,

aˆry +1 bˆ r

(18)

'

−1 (

2 aˆry aˆsy 1 aˆλy T T ˆα ˆ α + Tr  + y − (G (x ) ) m G (x ) (G (x ) ) . 2 bˆ λ bˆ ry bˆ sy y

(19)

y

bˆ zy =

 q∗ ( ry ) = aˆry = ary

bˆ ry aˆry − 1

aˆry −1





exp(aˆry − 1 )(ry )aˆry −1 exp −bˆ ry ry , (20)

676

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

bˆ ry = bry +



'

( '

( aˆsy aˆzy − ψ aˆsy − log bˆ sy − 1 − ψ aˆzy − log bˆ zy . bˆ sy bˆ zy aˆsy −1



(21)



q ( sy ) =

bˆ sy aˆsy − 1

aˆsy = asy +

aˆry bˆ r

(22)

aˆry aˆzy . bˆ ry bˆ zy

(23)



exp(aˆsy − 1 )(sy )aˆsy −1 exp −bˆ sy sy ,

y

bˆ sy = bsy +

Here, I j ∈ R1×n is a row matrix equal to the j−th row of the identity matrix of size n, ψ ( · ) is the digamma function, and Tr( · ) denotes the trace operator. Proof. The proof is available in Appendix A.



3.2. Optimal membership function on inputs-output data space and output noise Finally, the so-called optimal membership function on “x and y” is defined by averaging over the uncertainties such that

log (μ∗ (x, y ) ) ∝ log (μ(x, y; o, zx , λx , zy , λy , α ))q∗ (o)q∗ (zx )q∗ (λx )q∗ (zy )q∗ (λy )q∗ (α ) =− −

n 1 aˆλx  aˆzx j 2 bˆ λ bˆ zx j x j=1



μ (x, y ) ∝ exp − ∗

ˆo x j − Ijm

2

+ Tr



ˆo 

−1

ITj I j

(

'

−1 (

2 1 aˆλy aˆzy T T ˆα ˆ α + Tr  y − (G (x ) ) m G (x ) (G (x ) ) . 2 bˆ λ bˆ z y

That is,

'

y

n 1 aˆλx  aˆzx j 2 bˆ λ bˆ zx j x j=1

'

ˆo x j − Ijm

2

+ Tr



ˆo 

−1

ITj I j

( 

'

−1 (

2 1 aˆλy aˆzy T T ˆα ˆ α + Tr  − y − (G (x ) ) m G (x ) (G (x ) ) . 2 bˆ λ bˆ z y y Similarly, the membership function on noise v is defined as

log (μ∗ (v ) ) ∝ log (μ(v; λy , zy ) )q∗ (λy )q∗ (zy ) = −

1 aˆλy aˆzy 2 v. 2 bˆ λ bˆ z y

Thus,



y



1 aˆλy aˆzy 2 μ∗ (v ) ∝ exp − v . 2 bˆ λ bˆ z y y 4. Adaptive filtering algorithm The adaptive filtering algorithm recursively estimates the membership functions on o and α based on the optimal solution of log-membership maximization problem. The algorithm uses the current data pair (x(k), y(k)) to recursively estimate the parameters of q∗ (o) and q∗ (α ) based on (4–23) taking initial guess as equal to the estimates available from previous, i.e. (k − 1 )th data pair. A recursion of the filtering algorithm is as follows

aˆλx |k+1 = aλx

(24)

'

n

2

−1 T ( 1  aˆzx j |ini ˆ o |k ˆ o |k + T r  bˆ λx |k+1 = bλx + x j (k ) − I j m Ij Ij 2 ˆ j=1 bzx j |ini

(25)

aˆzx j |k+1 =

aˆrx |ini +1 bˆ rx |ini

(26)

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

aˆr |ini aˆsx |ini 1 aˆλx |k+1 bˆ zx j |k+1 = x + 2 bˆ λ |k+1 ˆbr |ini bˆ s |ini x x x

'

ˆ o |k x j (k ) − I j m

2

+ Tr



ˆ o |k 

−1

ITj I j

677

( (27)

aˆrx |k+1 = arx

(28)

'

( n

aˆs |ini  aˆzx j |k+1 bˆ rx |k+1 = brx + x − n ψ aˆsx |ini − log bˆ sx |ini bˆ sx |ini j=1 bˆ zx j |k+1 −n −

n ' 



( ψ aˆzx j |k+1 − log bˆ zx j |k+1

(29)

j=1

aˆsx |k+1 = asx + n

aˆrx |k+1 bˆ rx |k+1

(30)

n a ˆzx j |k+1  aˆr | bˆ sx |k+1 = bsx + x k+1 bˆ rx |k+1 j=1 bˆ zx j |k+1

⎡ aˆ ˆ o|k+1 =  ˆ o |k + 



aˆλx |k+1 ⎢ ⎢ bˆ λ |k+1 ⎣ x

zx

1

(31)



|k+1

bˆ zx |k+1 1

..

. aˆzxn |k+1 bˆ z |k+1

⎥ ⎥ ⎥ ⎦

(32)

xn

⎧ ⎡ aˆ | zx k+1 1 ⎪ ⎪ ⎪ bˆ zx |k+1 ⎨ ⎢

−1 aˆλ |k+1 ⎢ 1 ˆ o|k+1 ˆ o |k m ˆ o|k+1 =  ˆ o |k + x m  ⎢ ⎪ bˆ λx |k+1 ⎣ ⎪ ⎪ ⎩

⎤ ..

⎫ ⎪ ⎪ ⎪ ⎬

⎥ ⎥ ⎥x(k ) ⎪ ⎦ ⎪ aˆzxn |k+1 ⎪ ⎭ ˆ

.

(33)

bzxn |k+1

aˆλy |k+1 = aλy

(34)

1 aˆzy |ini bˆ λy |k+1 = bλy + 2 bˆ z |ini

'

ˆ α |k y(k ) − (G(x(k )) ) m T

2

+ Tr



ˆ α |k 

−1

G(x(k ))(G(x(k )) )

T

( (35)

y

aˆzy |k+1 =

aˆry |ini +1 bˆ r |ini

(36)

y

bˆ zy |k+1 =

'

2 aˆry |ini aˆsy |ini 1 aˆλy |k+1 T ˆ α |k + y(k ) − (G(x(k )) ) m 2 bˆ λ |k+1 bˆ ry |ini bˆ sy |ini y +T r



ˆ α |k 

−1

G(x(k ))(G(x(k )) )

T

(

aˆry |k+1 = ary

bˆ ry |k+1 = bry + −1 −

aˆsy |k+1 = asy +

(37)

(38)

'

(

aˆsy |ini aˆzy |k+1 − ψ aˆsy |ini − log bˆ sy |ini bˆ sy |ini bˆ zy |k+1 '

(

ψ aˆzy |k+1 − log bˆ zy |k+1

aˆry |k+1 bˆ r |k+1 y

(39)

(40)

678

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

bˆ sy |k+1 = bsy +

aˆry |k+1 aˆzy |k+1 bˆ ry |k+1 bˆ zy |k+1

ˆ α |k+1 =  ˆ α |k + 

aˆλy |k+1 aˆzy |k+1 T G(x(k ))(G(x(k )) ) bˆ λ |k+1 bˆ z |k+1

ˆ α |k+1 ˆ α |k+1 =  m

−1

(42)

y

y



(41)



aˆλy |k+1 aˆzy |k+1 ˆ α |k m ˆ α |k +  G(x(k ))y(k ) bˆ λ |k+1 bˆ z |k+1

(43)

y

y

where aˆzx |ini denotes an initial guess about aˆzx |k+1 and so is the notation also for the initial guess of other parameters. We j

j

suggest to chose the initial guess of the parameters as follows

ˆ o|0 = I, m ˆ α |0 = I, ˆ o |0 = 0,  ˆ α |0 = 0,  m aλx = bλx = 1, arx = brx = 1, asx = bsx = 1, aλy = bλy = 1, ary = bry = 1, asy = bsy = 1, aˆzx j |ini = bˆ zx j |ini = 1, aˆrx |ini = bˆ rx |ini = 1, aˆsx |ini = bˆ sx |ini = 1, aˆzy |ini = bˆ zy |ini = 1, aˆry |ini = bˆ ry |ini = 1, aˆsy |ini = bˆ sy |ini = 1. Remark 6. The adaptive filtering algorithm (24–43) has striking similarities to the stochastic filtering algorithm of [14] where variational Bayes was used to approximate the posterior distribution of random system parameters and variables. The similarity between the two algorithms lies due to the identical choices made regarding the types of distribution functions (in the stochastic case of [14]) and the types of membership functions (in the present fuzzy case). The difference between the two algorithms is attributed to the difference in the functional form of probability density functions and membership functions, as the area under a density function is unity while the maximum value of a membership function is unity. 5. Time-domain analysis for robustness and convergence analysis This section provides a time-domain analysis of the adaptive filtering algorithm (24–43). We are in-particularly interested in the temporal analysis of the estimation errors during the recursions of the algorithm. For an error analysis of the algorithm, assume that there exists a deterministic vector, o∗ ∈ Rn , such that

x(k ) = o∗ +  (k ) ˆ o|k+1 is where  (k) is the input noise vector associated to kth observed input sample x(k). The mismatch between o∗ and m ˜ o, is defined as referred to as the input estimation error. The input estimation error vector, m

ˆ o|k+1 . ˜ o (k ) = o∗ − m m Result 2. The adaptive filtering algorithm (24–43) at any time, k = t, ensures an upper bound on the magnitude of instantaneous input estimation error vector such that 1.

ˆ o|t+1 m ˜ o (t ) (m˜ o (t ) )T  ≤ 1, ˆ (m˜ o (t − 1 ) ) o|t m˜ o (t − 1 ) + ( (t ) )T Dux (t ) (t ) T

2.

  ) ˆ o|1 + tk=1 Dux (k ) m ˜ o (t ) (m˜ o (t ) )T  ˆ o |1 m ˜ o (0 ) + (m˜ o (0 ) )T 

where

⎡ ⎢ ⎢ ⎣

Dux ( k ) = ⎢

)t

k=1

( (k ) )T Dux (k ) (k )



aˆλx |k+1 aˆzx1 |k+1 bˆ λx |k+1 bˆ zx |k+1 1

..

. aˆλx |k+1 aˆzxn |k+1 bˆ λx |k+1 bˆ zxn |k+1

⎥ ⎥ ⎥. ⎦

≤ 1,

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

Proof. The proof is available in Appendix B.

679



Remark 7. Result 2 demonstrates the robustness of the filtering algorithm in the sense that if the magnitudes of input noise vector remain “small” and magnitude of the error in initial guess of input parameters is “small”, then the magnitude of instantaneous input estimation error vector remains “small”. Remark 8. The convergence of the algorithm towards true input parameters follows from Result 2 in the case of zero input noise. If  (t ) = 0, then it follows from Result 2 that



ˆ o |1 + (m˜ o (t ) )T 

t 



T ˆ ˜ o (t ) ≤ (m ˜ o (0 ) )  ˜ o ( 0 ). Dux ( k ) m o |1 m

(44)

k=1

That is,

m˜ o (t ) 2 ≤

ˆ o |1 m ˜ o (0 ) (m˜ o (0 ) )T 

)t ˆ min_eig o|1 + k=1 Dux (k )

(45)

where min_eig(· ) denotes the minimum eigenvalue. Since Dux (k ) is a positive definite diagonal matrix, 

 ) ˆ o |1 + t Dux ( k ) ˆ o |1 m ˜ o (0 ) )T  ˜ o (0 ) is a constant, it implies min_eig  is a monotonically increasing sequence. As (m k=1 from inequality (45) that there must exist a large enough T such that

∀t > T .

˜ o (t ) = 0, m That is,

ˆ o|t+1 = o∗ , m

∀t > T .

For an analysis of the system parameters estimation error, assume that there exists a vector of true system parameters,

α ∗ ∈ RK , such that

y(k ) = (G(x(k )) )

T

α ∗ + v (k )

where v(k) is the output disturbance including the data noise and modeling errors. The parameters estimation error vector, ˜ α , is defined as m

ˆ α |k+1 . ˜ α (k ) = α ∗ − m m Result 3. The adaptive filtering algorithm (24–43) at any time, k = t, ensures an upper bound on the magnitude of instantaneous parameters estimation error vector such that 1.

ˆ α |t+1 m ˜ α (t ) (m˜ α (t ) )T  ≤ 1. T ˆ (m˜ α (t − 1 ) ) α |t m˜ α (t − 1 ) + uy (t )|v(t )|2 2.

  ) ˆ α |1 + tk=1 uy (k )G(x(k ))(G(x(k )) )T m ˜ α (t ) (m˜ α (t ) )T  ˆ α |1 m ˜ α (0 ) + (m˜ α (0 ) )  T

t 

≤ 1.

uy (k )|v(k )|

2

k=1

where

aˆλy |k+1 aˆzy |k+1 . bˆ λy |k+1 bˆ zy |k+1

uy ( k ) =

Proof. The result can be proved by following the same steps as the robustness proof of the algorithm in [14]. Appendix C provides the proof.  Remark 9. Result 3 demonstrates the robustness of the filtering algorithm in the sense that if the magnitudes of output disturbance remain “small” and magnitude of the error in initial guess of system parameters is “small”, then the magnitude of instantaneous parameters estimation error vector remains “small”. Remark 10. The convergence of the algorithm towards true system parameters follows from Result 3 in the case of zero output disturbance. If v(t ) = 0, then it follows from Result 3 that



(m˜ α (t ) )

T

ˆ α |1 + 

t  k=1



uy (k )G(x(k ))(G(x(k )) )

T

T ˆ ˜ α (t ) ≤ (m ˜ α (0 ) )  ˜ α ( 0 ). m α |1 m

680

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

That is,

ˆ α |1 m ˜ α (0 ) (m˜ α (0 ) )T 

. )t ˆ α |1 + k=1 uy (k )G(x(k ))(G(x(k )) )T min_eig  '

( ) ˆ α |1 + t uy (k )G(x(k ))(G(x(k )) )T It can be noted that min_eig  is a monotonically increasing sequence, as uy (k) > 0. k=1

m˜ α (t ) 2 ≤



Thus, there must exist a large enough T such that

˜ α (t ) = 0, m

∀t > T .

That is,

ˆ α |t+1 = α ∗ , m

∀t > T .

6. Simulation studies The first study is simply a demonstration of the learning of membership functions for the simplest case of single-input single-output linear system. The system 1–(2) was simulated by choosing o from the uniform distribution on [−3, 3], definT ing G(x ) = [1x] , choosing each of two elements of α from the uniform distribution on [−3, 3], and sampling both input noise ( ) and output noise (v) from the Gaussian distribution with it’s mean as zero and variance as unity. The learning of membership functions via adaptive filtering algorithm (24–43) continued till 10 0 0 time-steps. The learning of the parameters during the recursions of the algorithm and their convergences to nearly to true values are shown in Fig. 3(a) and (b). The input-output data pairs used for the learning are displayed as white markers in Fig. 3(c). The surface plot of the optimal membership function in Fig. 3(c) shows the modeling of the observed data by the membership function. The aim of this study is to simply demonstrate that the proposed methodology models the observed data via defining an optimal membership function on the data space as shown in Fig. 3(c). The second study performs numerous experiments to demonstrate the robustness of the filtering algorithm again data T T outliers. The system 1–(2) with x = [x1 · · · xn ] ∈ Rn and G(x ) = [1x1 · · · xn ] was considered. Each element of o and α was randomly chosen from the uniform distribution on [−1, 1]. The input and output noises were sampled from the Student-t distribution and thus the data included a fair amount of outliers. The experiments were performed at different values of n to study the effect of data dimension (n) on performance. For each value of n, the level of Student-t distributed noises was varied by varying the degree-of-freedom. As the degree-of-freedom of Student-t distribution increases, the frequency of outliers in the data decreases and the distribution approaches the standard normal distribution. The simulation of the system, for a given n and noise level, continued till N time-steps where N = 100n. Further, an experiment for a given n and noise level is repeated 100 times and the average estimation performance over 100 experiments is considered. For the considered linear system, the RLS (recursive least-squares) and NLMS (normalized least mean squares) are amongst the most widely used adaptive filtering algorithms for estimating linear parameters. RLS is known to provide optimal filtering in presence of Gaussian distributed noise while NLMS is considered as a robust estimation algorithm in presence of data outliers due to its H ∞ −optimality. Therefore, we compare our algorithm with RLS and NLMS in terms of error in the estimation of α . The RLS algorithm for estimating α is defined as

αˆ (k ) = αˆ (k − 1 ) +

Pk−1 G(x(k ))[y(k ) − (G(x(k )) ) αˆ (k − 1 )] T

1 + (G(x(k )) ) Pk−1 G(x(k )) T

Pk−1 G(x(k ))(G(x(k )) ) Pk−1 T

Pk = Pk−1 −

1 + (G(x(k )) ) Pk−1 G(x(k )) T

where αˆ (k ) is the recursive estimate of α at kth time/data index. The NLMS algorithm was considered in the following form:

αˆ (k ) = αˆ (k − 1 ) +

μG(x(k ))[y(k ) − (G(x(k )) )T αˆ (k − 1 )] . T 1 + μ(G(x(k )) ) G(x(k ))

The RLS algorithm was initialized as αˆ (0 ) equal to zero vector and P0 equal to identity matrix. The NLMS algorithm was also started with αˆ (0 ) as equal to zero vector. Since the learning rate (μ) of NLMS algorithm remains constant and doesn’t decrease with increasing time-index (k), choosing a higher value of μ (such as equal to one) results in the non-converging unstable behavior of the algorithm. On the other hand, choosing too low value of μ results in the slow convergence and thus a large time (i.e. large number of data points) are required for a good estimation. Under this consideration, the learning rate of the NLMS algorithm was chosen as μ = 1/N, where N is the total number of available data points to be used for estimating α . ˜ α , is taken as a performance measure. Fig. 4 shows, for n = 5 as an example, the The norm of estimation error vector, m ˜ α (averaged over 100 independent experiments) at varying degree-of-freedom of the Student-t distributed time-plot of m ˜ α (k ) value at noise. The experiments are performed at n = 5, 10, 20, 50, 100. For a quantitative comparison, the final m k = N, averaged over 100 independent experiments, is reported in Table 1. The following facts are consistently observed in Table 1 and Fig. 4:

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

681

Fig. 3. An example of the learning of membership functions for a single-input single-output linear system.

• The proposed algorithm performs far better than RLS and NLMS in the presence of data outliers as observed in the cases corresponding to lower values of degree-of-freedom associated to the Student-t distributed noise, i.e., in the cases of degree-of-freedom as equal to 1, 2, and 3. Thus, the adaptive filtering algorithm (24–43) is robust towards data outliers. • As the amount of data outliers reduces (i.e. the degree-of-freedom associated to the Student-t distributed noise increases), the performance of RLS improves. In the extreme case of “no-outliers”, RLS algorithm, being optimal for the normal distributed noise, performs better than the proposed algorithm as observed in the case of degree-of-freedom as equal to 50. However, the proposed algorithm’s performance still remains acceptable even in this case. The third study applies the methodology of this study in the analysis of a biomedical signal to predict the state of an individual. The application example has been taken from [14] (whose algorithm has striking similarities to ours) for a direct comparison with the method of [14] and other state-of-art methods. The study involves the heartbeat intervals (i.e. the time between consecutive R waves of electrocardiogram), referred to as R-R intervals, of 42 subjects recorded under the sate of supine rest on tilting table for 10 min followed by a 10 min head-up tilt at 70°. The aim is to design a classifier to

682

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

Fig. 4. The time plots of average error magnitudes for n = 5.

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

683

Table 1 The performances of algorithms at varying data dimension and the level of noise. n

degree-of-freedom of Student-t noise

m˜ α (proposed)

m˜ α (RLS)

m˜ α (NLMS)

100 100 100 100 100 50 50 50 50 50 20 20 20 20 20 10 10 10 10 10 5 5 5 5 5

1 2 3 4 50 1 2 3 4 50 1 2 3 4 50 1 2 3 4 50 1 2 3 4 50

0.0154 0.0630 0.0980 0.1162 0.1453 0.0226 0.0643 0.1006 0.1176 0.1466 0.0310 0.0740 0.1019 0.1120 0.1432 0.05139 0.0795 0.1085 0.1194 0.1435 0.0624 0.0988 0.1119 0.1246 0.1427

13.4421 0.1205 0.1167 0.1142 0.1167 4.3229 0.1273 0.1135 0.1146 0.1164 2.8869 0.1496 0.1179 0.1201 0.1135 2.4375 0.1702 0.1297 0.1241 0.1156 3.6492 0.1649 0.1269 0.1237 0.1176

10.4784 0.4784 0.6026 1.006 2.1293 8.4719 0.4262 0.5369 0.7634 1.5623 3.3791 0.4129 0.4572 0.5853 1.0432 2.6075 0.3570 0.3915 0.4626 0.7735 3.1482 0.3448 0.3418 0.3740 0.5791

Table 2 State prediction accuracy of different classification methods. method

correct predictions

wrong predictions

decision tree [14] probabilistic neural network [14] quadratic discriminant analysis [14] k−nearest neighbor [14] support vector machine (radial basis kernel) [14] naive Bayes [14] linear discriminant analysis [14] stochastic fuzzy filtering [14] proposed method

2879 2919 2922 2928 2930 2932 2944 2948 2953

153 113 110 104 102 100 88 84 79

predict the state (whether of rest or of tilt) of an individual based on the analysis of heartbeat intervals data. Following [14], the training set consisted of 3 min long heartbeat intervals of both states and testing set consisted of another 6 min long heartbeat intervals of both states. For each of the 42 subjects, the training data is modeled by a fuzzy membership function as follows. T

T

1. Define y(k ) = RRk , x(k ) = [RRk−1 RRk−2 ] , and G(x ) = [1x1 x2 ] . 2. Determine membership function, μ∗rest (x, y ), by running algorithm (24–43) on the training data of rest state. 3. Determine membership function, μ∗ (x, y ), by running algorithm (24–43) on the training data of tilt state. tilt Now, the state associated to a data pair (x, y) is predicted as



state(x, y ) =

rest tilt

if μ∗rest (x, y ) > μ∗ (x, y ) tilt otherwise.

The membership values (i.e. μ∗rest (x, y ), μ∗ (x, y )) of data points of the testing set were averaged over non-overlapping 10 tilt seconds long data segments to predict the state associated to the data segment. There were in total 3032 data segments by accumulating of all 42 subjects. The method could accurately predict the state 2953 times while the wrong state was predicted 79 times. Table 2 compares the proposed method with other state-of-art methods. As observed from Table 2, the best performance of our method supports the claim that fuzzy theoretic approach would serve as an effective tool in signal modeling and analysis. The optimal membership function μ∗ (x, y) can’t be visualized via a surface plot in the case of multi-dimensional inputoutput data space. Thus, to visualize the membership functions μ∗rest (x, y ) and μ∗ (x, y ), 50 0 0 random samples of (x, y) tilt are generated from a uniform distribution on [0, 2] and membership values at the generated data points are calculated. Figs. 5 and 6 display the bar plots of calculated membership values at randomly generated data samples for a subject. The

684

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

Fig. 5. The bar plots of μ∗rest (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject.

Fig. 6. The bar plots of μ∗ (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject. tilt

Fig. 7. The membership functions on the output noises associated to data models of rest and tilt state for a subject.

optimal membership functions on the output noise (i.e. μ∗ (v)) associated to data models of rest and tilt state are drawn in Fig. 7 for a subject. The aforementioned application of the methodology in predicting the state of an individual based on the analysis of heartbeat intervals is further exploited to assess the different job tasks based on the physiological status of the operator. The heartbeat intervals data were collected by performing experiments on the staff of a chemical laboratory. The aim is to find out if a change in the job task results in a visible and trackable change in the heartbeat intervals of the operator. We are in-particularly interested in finding out how good the heartbeat intervals data can be classified between the job tasks, namely, “pipetting” (i.e. 4 different types of pipetting tasks with varying complexities referred to as pipetting 1, pipetting 2, pipetting 3, and pipetting 4) and “physical work” (i.e. physically carrying some laboratory related stuff from one place to another). The data corresponding to the first half of the each job task duration are used to train the classifier while that of second half are used to test the classifier. The considered classification algorithms in additional to the proposed one are

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

685

Table 3 Classification accuracies (%), averaged over data of 20 operators, achieved by different algorithms on heartbeat intervals data. algorithm

between pipetting 1 & physical work

between pipetting 2 & physical work

between pipetting 3 & physical work

between pipetting 4 & physical work

Nearest Neighbors Linear SVM RBF SVM Decision Tree Random Forest AdaBoost Naive Bayes LDA QDA stochastic fuzzy filtering proposed method

90.51 92.72 86.83 88.06 92.13 90.71 92.05 92.26 92.67 91.42 93.07

90.53 91.54 85.91 87.77 90.63 89.89 91.23 92.01 92.31 90.88 93.05

93.63 91.72 85.42 90.85 92.97 93.30 94.13 93.47 93.59 92.27 93.70

90.02 83.88 78.72 87.93 90.62 89.95 89.34 89.27 89.98 90.97 91.71

Fig. 8. The bar plots of μ∗ (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject. pipetting-1

Fig. 9. The bar plots of μ∗ (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject. pipetting-2

Fig. 10. The bar plots of μ∗ (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject. pipetting-3

686

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

Fig. 11. The bar plots of μ∗ (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject. pipetting-4

Fig. 12. The bar plots of μ∗ (x, y ) membership values at 50 0 0 randomly generated samples of (x, y) for a subject. physical-work

Fig. 13. The membership functions on the output noises associated to data models of pipetting 1, pipetting 2, pipetting 3, pipetting 4, and physical work for a subject.

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

687

1. k-nearest neighbors” (with number of neighbors = 5) implemented by “scikit-learn” [23], 2. “support vector machines” (with linear kernel, penalty parameter equal to one) implemented by “scikit-learn” [23], 3. “support vector machines” (with rbf kernel, kernel coefficient equal to 0.1, penalty parameter equal to one) implemented by “scikit-learn” [23], 4. “decision tree” implemented by “scikit-learn” [23], 5. “random forest” (with the number of trees in the forest equal to 10) implemented by “scikit-learn” [23], 6. “AdaBoost” implemented by “scikit-learn” [23], 7. “Gaussian naive Bayes” implemented by “scikit-learn” [23], 8. “linear discriminant analysis” implemented by “scikit-learn” [23], 9. “quadratic discriminant analysis” implemented by “scikit-learn” [23], 10. “stochastic fuzzy filtering” [14]. Table 3 reports the results of the independent experiments on 20 different operators. The accuracies of the considered classifiers have been calculated on the testing data of each operator and Table 3 lists the accuracies averaged over all the 20 operators. It is observed from Table 3 that the proposed method performs best amongst the considered algorithms for three classification problems and performs second best for one classification problem. Hence, the results of Table 3 further justify the fuzzy theoretic approach introduced in this study. The membership functions μ∗ (x, y ), μ∗ (x, y ), μ∗ (x, y ), μ∗ (x, y ), and pipetting-1 pipetting-2 pipetting-3 pipetting-4 μ∗ (x, y ) are visualized in Figs. 8–12 respectively using bar plots of membership values at 50 0 0 randomly physical-work generated samples of (x, y) from a uniform distribution on [0, 2]. Finally, the membership functions on the output noise associated to data models of different states are shown in Fig. 13. 7. Concluding remarks This study has introduced a new fuzzy approach to the modeling of signals and systems. The proposed approach provides a principled basis of determining the membership functions to handle uncertainties in the learning of models. The development of a “self-contained” fuzzy theory for static systems is the main contribution of this paper and future research should apply the fuzzy theoretic approach on state space and Markov models. Appendix A. Proof of Result 1 A1. Evaluation of F Let < f(x) > q(x) denotes the value of f(x) averaged over x with q(x) as the membership function on x characterizing the uncertainty of x. That is,

 f (x )q(x) = 

1 ∂ x q (x )



F can be rewritten from (3) as

∂ x q ( x ) f ( x ). 

F = log (μ(x, y; ) )q() − log



q()

 .

μ()

q()

Consider



*

1 ⎢ log (μ(x, y; ) )q() = − λx q(λx ) (x − o)T ⎣ 2



zx1 q(zx1 ) ..

⎥ ⎦(x − o)

.

zxn q(zxn )

, 1 T − λy q(λy ) zy q(zy ) |y − (G(x ) ) α|2 q (α ) 2 to express F as

* 1 F = − λx q(λx ) 2

⎡ zx1 q(zx1 ) T⎢ (x − o) ⎣

⎤ ..

⎥ ⎦(x − o)

.

zxn q(zxn )  

, 1 T − λy q(λy ) zy q(zy ) |y − (G(x ) ) α|2 − log q (α ) 2

q() μ()



+

q(o)

. q()

+

q(o)

688

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

A2. Optimization w.r.t. q(o) F can be partitioned into o−dependent and o−independent terms as follows

* 1 F = − λx q(λx ) 2





− log

⎡ zx1 q(zx1 ) T⎢ (x − o) ⎣

q(o) μ(o; mo, o )



..

⎥ ⎦(x − o)

.

+

zxn q(zxn )



q(o)

+ { /o} q(o)

where { /o} represents all o−independent terms of F. Define,

⎡ ⎢



zx1 q(zx1 ) ..

diag(zx q(zx ) ) = ⎣

⎥ ⎦

.

zxn q(zxn ) and after substituting the value of μ(o; mo , o ) from Definition 3, F becomes

, 1 1, λx q(λx ) (x − o)T diag(zx q(zx ) )(x − o) q(o) − (o − mo )T o (o − mo ) q(o) 2 2 −log(q(o))q(o) + { /o}

F =−

1, T o o + λx q(λx ) diag(zx q(zx ) ) o q(o) ,2 T  + o omo + λx q(λx ) diag(zx q(zx ) )x − log(q(o))q(o) + { /o}.

=−

q(o)

Define

ˆ o = o + λx   q(λx ) diag(zx q(zx ) ) −1   ˆo ˆo =  m omo + λx q(λx ) diag(zx q(zx ) )x to express F as

, 1, T ˆ ˆ om ˆo o oo + oT  − log(q(o))q(o) + { /o} q(o) q(o) 2  ' ( 1 ˆ 1 T ˆ ˆ o − log(q(o)) + { /o} =  ∂ o q(o) − oT  o o + o o m 2 ∂ o q(o)  ' 1 ( 1 ˆ oo + oT  ˆ om ˆ o − log(q(o)) + { /o}. = ∂ o q(o) − oT  ko 2

F =−

The constraint on q(o) can be incorporated in the optimization problem by the use of Lagrange multiplier (λ) to define the new objective functional (J) as

J=

1 ko



' 1 ( ( ' ˆ oo + oT  ˆ om ˆ o − log(q(o)) + λ ∂ o q(o) − oT  ∂ o q(o) − ko . 2

Setting the functional derivative of J w.r.t. q(o) equal to zero, we have

'

(

1 ˆ 1 T ˆ ˆ o − 1 − log(q(o)) + λ = 0. − oT  o o + o o m ko 2 That is,





1 ˆ T ˆ ˆo . q(o) = exp(λko − 1 ) exp − oT  o o + o o m 2 The optimal value of λ is obtained by solving

.

exp(λko − 1 )

1





∂ o q(o) = ko. This leads to

( 2π )n ˆ om ˆ T ˆ o = ko . exp m 2 o ˆ o| |

Thus, the optimal expression for q(o) is given as

.

q (o) = ko ∗

1 ˆ o| | T ˆ ˆ ˆ exp − ( o − m )  ( o − m ) . o o o ( 2π )n 2

Finally, the value of ko is chosen to make maxo q∗ (o) = 1. This results in





1 ˆ o (o − m ˆ o )T  ˆ o) . q∗ (o) = exp − (o − m 2

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

689

A3. Optimization w.r.t. q(λx ) F can be partitioned into λx −dependent and λx −independent terms as follows:



, 1 F = − λx q(λx ) (x − o)T diag(zx q(zx ) )(x − o) − log q(o) 2



q(λx ) μ(λx ; aλx , bλx )

 + { /λx } q(λx )

, , 1 λx q(λx ) (x − o)T diag(zx q(zx ) )(x − o) q(o) + (aλx − 1 ) log (λx ) − bλx λx q(λx ) 2 −log (q(λx ) )q(λx ) + { /λx }.

=−

Define

aˆλx = aλx

1, bˆ λx = bλx + (x − o)T diag(zx q(zx ) )(x − o) q(o) 2 to express F as

/

0

F = (aˆλx − 1 ) log (λx ) − bˆ λx λx − log (q(λx ) ) + { /λx } q(λx )  ' ( 1 = ∂λx q(λx ) (aˆλx − 1 ) log (λx ) − bˆ λx λx − log (q(λx ) ) + { /λx }. kλx The constraint on q(λx ) can be incorporated in the optimization problem by the use of Lagrange multiplier (λ) to define the new functional (J) as

' ( ( ' ∂λx q(λx ) (aˆλx − 1 ) log (λx ) − bˆ λx λx − log (q(λx ) ) + λ ∂λx q(λx ) − kλx .



1 kλx

J=

Setting the functional derivative of J w.r.t. q(λx ) equal to zero, we have

'

(

1 (aˆλx − 1 ) log (λx ) − bˆ λx λx − 1 − log (q(λx ) ) + λ = 0. kλx That is,

q(λx ) = exp



λkλx − 1 (λx )aˆλx −1 exp −bˆ λx λx .



The optimal value of λ is obtained by solving

exp



λkλx − 1

(aˆλx ) (bˆ λx )aˆλx



∂λx q(λx ) = kλx . This leads to

= kλx

where ( · ) is the Gamma function defined as

(x ) =



∞ 0

t x−1 exp(−t ) dt .

Thus, the optimal expression for q(λx ) is given as

q∗ (λx ) = kλx

(bˆ λx )aˆλx (λx )aˆλx −1 exp −bˆ λx λx .

(aˆλx )

Finally, kλx is chosen to make maxλx q∗ (λx ) = 1. This consideration results in



q∗ (λx ) =

bˆ λx aˆλx − 1

aˆλx −1





exp(aˆλx − 1 )(λx )aˆλx −1 exp −bˆ λx λx .

A4. Optimization w.r.t. q(zx j ) F can be partitioned into zx j −dependent and zx j −independent terms as follows



, - , 1 F = − λx q(λx ) |x j − o j |2 zx j − log q(o) q ( zx j ) 2 =−



q ( zx j )



μ ( zx j ; rx , sx )

, - , , 1 λx q(λx ) |x j − o j |2 q(o) zx j q(zx ) + rx q(rx ) log(zx j ) q(zx ) 2 j j , -

−rx q(rx ) sx q(sx ) zx j

q ( zx j )

,



− log q(zx j )

-

q ( zx j )

+ { /zx j }.

+ { /zx j } q ( zx j )q ( rx )q ( sx )

690

M. Kumar et al. / Information Sciences 418–419 (2017) 668–702

Define

aˆzx j = rx q(rx ) + 1

, 1 bˆ zx j = rx q(rx ) sx q(sx ) + λx q(λx ) |x j − o j |2 q(o) 2 to express F as

/





F = (aˆzx j − 1 ) log zx j − bˆ zx j zx j − log q(zx j ) =

Maximizing F w.r.t. q(zx j ) under the constraint: should result in

+ { /zx j }

q ( zx j )

'

( ∂ zx j q(zx j ) (aˆzx j − 1 ) log zx j − bˆ zx j zx j − log q(zx j ) + { /zx j }.



1 kzx j

0



q ( zx j ) = ∗

aˆzx

bˆ zx j

j



∂ zx j q(zx j ) = kzx j , and then choosing kzx j such that maxzx j q(zx j ) = 1,

−1

exp(aˆzx j − 1 )(zx j )

aˆzx j − 1

aˆzx −1 j





exp −bˆ zx j zx j .

A5. Optimization w.r.t. q(rx ) F can be partitioned into rx −dependent and rx −independent terms as follows





F = − log

q ( rx ) μ ( z x1 ; rx , sx ) · · · μ ( z xn ; rx , sx )μ ( rx ; arx , brx )

= nrx q(rx ) log (sx )q(sx ) + nrx q(rx ) + rx q(rx )



+ { /rx } q(zx1 )···q(zxn )q(rx )q(sx )

n ,  log zx j

q ( zx j )

j=1

− rx q(rx ) sx q(sx )

n ,  zx j j=1

q ( zx j )

+(arx − 1 )log(rx )q(rx ) − brx rx q(rx ) − log (q(rx ) )q(rx ) + { /rx }. Define

aˆrx = arx bˆ rx = brx + sx q(sx )

n ,  zx j j=1

q ( zx j )

− nlog (sx )q(sx ) − n −

to express F as

/

 q∗ ( rx ) =

0

bˆ rx aˆrx − 1

q ( zx j )

j=1

F = (aˆrx − 1 ) log(rx ) − bˆ rx rx − log (q(rx ) ) Maximizing F w.r.t. q(rx ) under the constraint: result in

n ,  log zx j

q ( rx )



+ { /rx }.

∂ rx q(rx ) = krx , and then choosing krx such that maxrx q(rx ) = 1, should

aˆrx −1





exp(aˆrx − 1 )(rx )aˆrx −1 exp −bˆ rx rx .

A6. Optimization w.r.t. $q(s_x)$

$F$ can be partitioned into $s_x$-dependent and $s_x$-independent terms as follows:

$$ F = -\left\langle \log\frac{q(s_x)}{\mu(z_{x1}; r_x, s_x)\cdots\mu(z_{xn}; r_x, s_x)\,\mu(s_x; a_{s_x}, b_{s_x})} \right\rangle_{q(z_{x1})\cdots q(z_{xn})q(r_x)q(s_x)} + \{\cdot\}_{/s_x} $$

$$ = n\langle r_x\rangle_{q(r_x)}\langle\log(s_x)\rangle_{q(s_x)} - \langle r_x\rangle_{q(r_x)}\langle s_x\rangle_{q(s_x)}\sum_{j=1}^n\langle z_{xj}\rangle_{q(z_{xj})} + (a_{s_x}-1)\langle\log(s_x)\rangle_{q(s_x)} - b_{s_x}\langle s_x\rangle_{q(s_x)} - \langle\log(q(s_x))\rangle_{q(s_x)} + \{\cdot\}_{/s_x}. $$

Define

$$ \hat{a}_{s_x} = a_{s_x} + n\langle r_x\rangle_{q(r_x)}, \qquad \hat{b}_{s_x} = b_{s_x} + \langle r_x\rangle_{q(r_x)}\sum_{j=1}^n\langle z_{xj}\rangle_{q(z_{xj})} $$

to express $F$ as

$$ F = \left\langle (\hat{a}_{s_x}-1)\log(s_x) - \hat{b}_{s_x} s_x - \log(q(s_x)) \right\rangle_{q(s_x)} + \{\cdot\}_{/s_x}. $$

Maximizing $F$ w.r.t. $q(s_x)$ under the constraint $\int ds_x\, q(s_x) = k_{s_x}$, and then choosing $k_{s_x}$ such that $\max_{s_x} q(s_x) = 1$, should result in

$$ q^*(s_x) = \left( \frac{\hat{b}_{s_x}}{\hat{a}_{s_x}-1} \right)^{\hat{a}_{s_x}-1}\exp(\hat{a}_{s_x}-1)\,(s_x)^{\hat{a}_{s_x}-1}\exp(-\hat{b}_{s_x} s_x). $$

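A minimal sketch of the coupled $(r_x, s_x)$ updates of A5 and A6 above (our naming; it assumes the $z$-moments have already been evaluated as in A12, and simply asserts the positivity of $\hat{b}_{r_x}$ that a Gamma-shaped $q^*(r_x)$ requires):

```python
import numpy as np

def update_rx_sx_parameters(a_rx, b_rx, a_sx, b_sx,
                            mean_zx, mean_log_zx,
                            mean_rx, mean_sx, mean_log_sx):
    """One pass of the A5-A6 updates for the variables (r_x, s_x).

    mean_zx and mean_log_zx are length-n arrays of <z_xj> and <log z_xj>.
    """
    n = mean_zx.size
    # A5: parameters of q*(r_x)
    a_hat_rx = a_rx
    b_hat_rx = (b_rx + mean_sx * mean_zx.sum()
                - n * mean_log_sx - n - mean_log_zx.sum())
    assert b_hat_rx > 0, "Gamma-shaped q*(r_x) requires a positive rate"
    # A6: parameters of q*(s_x)
    a_hat_sx = a_sx + n * mean_rx
    b_hat_sx = b_sx + mean_rx * mean_zx.sum()
    return a_hat_rx, b_hat_rx, a_hat_sx, b_hat_sx
```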
A7. Optimization w.r.t. $q(\alpha)$

$F$ can be partitioned into $\alpha$-dependent and $\alpha$-independent terms as follows:

$$ F = -\frac{1}{2}\langle\lambda_y\rangle_{q(\lambda_y)}\langle z_y\rangle_{q(z_y)}\left\langle \alpha^T G(x)(G(x))^T\alpha - 2\alpha^T G(x)y \right\rangle_{q(\alpha)} - \left\langle \log\frac{q(\alpha)}{\mu(\alpha; m_\alpha, \Lambda_\alpha)} \right\rangle_{q(\alpha)} + \{\cdot\}_{/\alpha} $$

$$ = -\frac{1}{2}\langle\lambda_y\rangle_{q(\lambda_y)}\langle z_y\rangle_{q(z_y)}\left\langle \alpha^T G(x)(G(x))^T\alpha - 2\alpha^T G(x)y \right\rangle_{q(\alpha)} - \frac{1}{2}\left\langle \alpha^T\Lambda_\alpha\alpha - 2\alpha^T\Lambda_\alpha m_\alpha \right\rangle_{q(\alpha)} - \langle\log(q(\alpha))\rangle_{q(\alpha)} + \{\cdot\}_{/\alpha}. $$

Define

$$ \hat{\Lambda}_\alpha = \Lambda_\alpha + \langle\lambda_y\rangle_{q(\lambda_y)}\langle z_y\rangle_{q(z_y)}\,G(x)(G(x))^T, \qquad \hat{m}_\alpha = \left(\hat{\Lambda}_\alpha\right)^{-1}\left( \Lambda_\alpha m_\alpha + \langle\lambda_y\rangle_{q(\lambda_y)}\langle z_y\rangle_{q(z_y)}\,G(x)y \right) $$

to express $F$ as

$$ F = \left\langle -\frac{1}{2}\alpha^T\hat{\Lambda}_\alpha\alpha + \alpha^T\hat{\Lambda}_\alpha\hat{m}_\alpha - \log(q(\alpha)) \right\rangle_{q(\alpha)} + \{\cdot\}_{/\alpha} = \frac{1}{k_\alpha}\int d\alpha\, q(\alpha)\left[ -\frac{1}{2}\alpha^T\hat{\Lambda}_\alpha\alpha + \alpha^T\hat{\Lambda}_\alpha\hat{m}_\alpha - \log(q(\alpha)) \right] + \{\cdot\}_{/\alpha}. $$

The constraint on $q(\alpha)$ can be incorporated in the optimization problem by the use of a Lagrange multiplier ($\lambda$) to define the new functional ($J$) as

$$ J = \frac{1}{k_\alpha}\int d\alpha\, q(\alpha)\left[ -\frac{1}{2}\alpha^T\hat{\Lambda}_\alpha\alpha + \alpha^T\hat{\Lambda}_\alpha\hat{m}_\alpha - \log(q(\alpha)) \right] + \lambda\left( \int d\alpha\, q(\alpha) - k_\alpha \right). $$

Setting the functional derivative of $J$ w.r.t. $q(\alpha)$ equal to zero, we have

$$ \frac{1}{k_\alpha}\left[ -\frac{1}{2}\alpha^T\hat{\Lambda}_\alpha\alpha + \alpha^T\hat{\Lambda}_\alpha\hat{m}_\alpha - 1 - \log(q(\alpha)) \right] + \lambda = 0. $$

That is,

$$ q(\alpha) = \exp(\lambda k_\alpha - 1)\exp\left( -\frac{1}{2}\alpha^T\hat{\Lambda}_\alpha\alpha + \alpha^T\hat{\Lambda}_\alpha\hat{m}_\alpha \right). $$

The optimal value of $\lambda$ is obtained by solving $\int d\alpha\, q(\alpha) = k_\alpha$. This leads to

$$ \exp(\lambda k_\alpha - 1)\sqrt{\frac{(2\pi)^K}{|\hat{\Lambda}_\alpha|}}\exp\left( \frac{1}{2}\hat{m}_\alpha^T\hat{\Lambda}_\alpha\hat{m}_\alpha \right) = k_\alpha. $$

Thus, the optimal expression for $q(\alpha)$ is given as

$$ q^*(\alpha) = k_\alpha\sqrt{\frac{|\hat{\Lambda}_\alpha|}{(2\pi)^K}}\exp\left( -\frac{1}{2}(\alpha - \hat{m}_\alpha)^T\hat{\Lambda}_\alpha(\alpha - \hat{m}_\alpha) \right). $$

Finally, the value of $k_\alpha$ is chosen to make $\max_\alpha q^*(\alpha) = 1$. This results in

$$ q^*(\alpha) = \exp\left( -\frac{1}{2}(\alpha - \hat{m}_\alpha)^T\hat{\Lambda}_\alpha(\alpha - \hat{m}_\alpha) \right). $$

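A small sketch of the $(\hat{\Lambda}_\alpha, \hat{m}_\alpha)$ update above, assuming the required moment values are already available (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def update_alpha_parameters(Lambda_a, m_a, G, y, mean_ly, mean_zy):
    """A7 update: Gaussian-shaped q*(alpha) with precision Lambda_a_hat
    and mode m_a_hat. G is the K-dimensional vector G(x); y is scalar."""
    w = mean_ly * mean_zy                    # <lambda_y> <z_y>
    Lambda_a_hat = Lambda_a + w * np.outer(G, G)
    # Solve a linear system instead of explicitly inverting the precision.
    m_a_hat = np.linalg.solve(Lambda_a_hat, Lambda_a @ m_a + w * G * y)
    return Lambda_a_hat, m_a_hat
```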
A8. Optimization w.r.t. $q(\lambda_y)$

$F$ can be separated into $\lambda_y$-dependent and $\lambda_y$-independent terms as follows:

$$ F = -\left\langle \frac{1}{2}\lambda_y\langle z_y\rangle_{q(z_y)}\left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q(\alpha)} + \log\frac{q(\lambda_y)}{\mu(\lambda_y; a_{\lambda_y}, b_{\lambda_y})} \right\rangle_{q(\lambda_y)} + \{\cdot\}_{/\lambda_y} $$

$$ = -\frac{1}{2}\langle\lambda_y\rangle_{q(\lambda_y)}\langle z_y\rangle_{q(z_y)}\left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q(\alpha)} + (a_{\lambda_y}-1)\langle\log(\lambda_y)\rangle_{q(\lambda_y)} - b_{\lambda_y}\langle\lambda_y\rangle_{q(\lambda_y)} - \langle\log(q(\lambda_y))\rangle_{q(\lambda_y)} + \{\cdot\}_{/\lambda_y}. $$

Define

$$ \hat{a}_{\lambda_y} = a_{\lambda_y}, \qquad \hat{b}_{\lambda_y} = b_{\lambda_y} + \frac{1}{2}\langle z_y\rangle_{q(z_y)}\left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q(\alpha)} $$

to express $F$ as

$$ F = \left\langle (\hat{a}_{\lambda_y}-1)\log(\lambda_y) - \hat{b}_{\lambda_y}\lambda_y - \log(q(\lambda_y)) \right\rangle_{q(\lambda_y)} + \{\cdot\}_{/\lambda_y}. $$

Maximizing $F$ w.r.t. $q(\lambda_y)$ under the constraint $\int d\lambda_y\, q(\lambda_y) = k_{\lambda_y}$, and then choosing $k_{\lambda_y}$ such that $\max_{\lambda_y} q(\lambda_y) = 1$, should result in

$$ q^*(\lambda_y) = \left( \frac{\hat{b}_{\lambda_y}}{\hat{a}_{\lambda_y}-1} \right)^{\hat{a}_{\lambda_y}-1}\exp(\hat{a}_{\lambda_y}-1)\,(\lambda_y)^{\hat{a}_{\lambda_y}-1}\exp(-\hat{b}_{\lambda_y}\lambda_y). $$

A9. Optimization w.r.t. $q(z_y)$

$F$ can be partitioned into $z_y$-dependent and $z_y$-independent terms as follows:

$$ F = \left\langle -\frac{1}{2}\langle\lambda_y\rangle_{q(\lambda_y)}\left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q(\alpha)} z_y - \log\frac{q(z_y)}{\mu(z_y; r_y, s_y)} \right\rangle_{q(z_y)q(r_y)q(s_y)} + \{\cdot\}_{/z_y} $$

$$ = -\frac{1}{2}\langle\lambda_y\rangle_{q(\lambda_y)}\langle z_y\rangle_{q(z_y)}\left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q(\alpha)} + \langle r_y\rangle_{q(r_y)}\langle\log(z_y)\rangle_{q(z_y)} - \langle r_y\rangle_{q(r_y)}\langle s_y\rangle_{q(s_y)}\langle z_y\rangle_{q(z_y)} - \langle\log(q(z_y))\rangle_{q(z_y)} + \{\cdot\}_{/z_y}. $$

Define

$$ \hat{a}_{z_y} = \langle r_y\rangle_{q(r_y)} + 1, \qquad \hat{b}_{z_y} = \langle r_y\rangle_{q(r_y)}\langle s_y\rangle_{q(s_y)} + \frac{1}{2}\langle\lambda_y\rangle_{q(\lambda_y)}\left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q(\alpha)} $$

to express $F$ as

$$ F = \left\langle (\hat{a}_{z_y}-1)\log(z_y) - \hat{b}_{z_y} z_y - \log(q(z_y)) \right\rangle_{q(z_y)} + \{\cdot\}_{/z_y}. $$

Maximizing $F$ w.r.t. $q(z_y)$ under the constraint $\int dz_y\, q(z_y) = k_{z_y}$, and then choosing $k_{z_y}$ such that $\max_{z_y} q(z_y) = 1$, should result in

$$ q^*(z_y) = \left( \frac{\hat{b}_{z_y}}{\hat{a}_{z_y}-1} \right)^{\hat{a}_{z_y}-1}\exp(\hat{a}_{z_y}-1)\,(z_y)^{\hat{a}_{z_y}-1}\exp(-\hat{b}_{z_y} z_y). $$

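The products of the A8–A9 moments are what drive the adaptive filter analyzed in Appendices B and C. A one-line sketch of the resulting gain (our naming, following the definition of $u_y(k)$ in Appendix C):

```python
def adaptive_gain(a_hat_ly, b_hat_ly, a_hat_zy, b_hat_zy):
    """u_y = <lambda_y><z_y> = (a_hat_ly / b_hat_ly) * (a_hat_zy / b_hat_zy).

    A small gain (large noise rate or small indicator moment) de-weights
    the current sample in the recursions of Appendix C, consistent with
    the robustness property established in Result 3.
    """
    return (a_hat_ly / b_hat_ly) * (a_hat_zy / b_hat_zy)
```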
A10. Optimization w.r.t. $q(r_y)$

$F$ can be partitioned into $r_y$-dependent and $r_y$-independent terms as follows:

$$ F = -\left\langle \log\frac{q(r_y)}{\mu(z_y; r_y, s_y)\,\mu(r_y; a_{r_y}, b_{r_y})} \right\rangle_{q(z_y)q(r_y)q(s_y)} + \{\cdot\}_{/r_y} $$

$$ = \langle r_y\rangle_{q(r_y)}\langle\log(s_y)\rangle_{q(s_y)} + \langle r_y\rangle_{q(r_y)} + \langle r_y\rangle_{q(r_y)}\langle\log(z_y)\rangle_{q(z_y)} - \langle r_y\rangle_{q(r_y)}\langle s_y\rangle_{q(s_y)}\langle z_y\rangle_{q(z_y)} + (a_{r_y}-1)\langle\log(r_y)\rangle_{q(r_y)} - b_{r_y}\langle r_y\rangle_{q(r_y)} - \langle\log(q(r_y))\rangle_{q(r_y)} + \{\cdot\}_{/r_y}. $$

Define

$$ \hat{a}_{r_y} = a_{r_y}, \qquad \hat{b}_{r_y} = b_{r_y} + \langle s_y\rangle_{q(s_y)}\langle z_y\rangle_{q(z_y)} - \langle\log(s_y)\rangle_{q(s_y)} - 1 - \langle\log(z_y)\rangle_{q(z_y)} $$

to express $F$ as

$$ F = \left\langle (\hat{a}_{r_y}-1)\log(r_y) - \hat{b}_{r_y} r_y - \log(q(r_y)) \right\rangle_{q(r_y)} + \{\cdot\}_{/r_y}. $$

Maximizing $F$ w.r.t. $q(r_y)$ under the constraint $\int dr_y\, q(r_y) = k_{r_y}$, and then choosing $k_{r_y}$ such that $\max_{r_y} q(r_y) = 1$, should result in

$$ q^*(r_y) = \left( \frac{\hat{b}_{r_y}}{\hat{a}_{r_y}-1} \right)^{\hat{a}_{r_y}-1}\exp(\hat{a}_{r_y}-1)\,(r_y)^{\hat{a}_{r_y}-1}\exp(-\hat{b}_{r_y} r_y). $$

A11. Optimization w.r.t. $q(s_y)$

$F$ can be partitioned into $s_y$-dependent and $s_y$-independent terms as follows:

$$ F = -\left\langle \log\frac{q(s_y)}{\mu(z_y; r_y, s_y)\,\mu(s_y; a_{s_y}, b_{s_y})} \right\rangle_{q(z_y)q(r_y)q(s_y)} + \{\cdot\}_{/s_y} $$

$$ = \langle r_y\rangle_{q(r_y)}\langle\log(s_y)\rangle_{q(s_y)} - \langle r_y\rangle_{q(r_y)}\langle s_y\rangle_{q(s_y)}\langle z_y\rangle_{q(z_y)} + (a_{s_y}-1)\langle\log(s_y)\rangle_{q(s_y)} - b_{s_y}\langle s_y\rangle_{q(s_y)} - \langle\log(q(s_y))\rangle_{q(s_y)} + \{\cdot\}_{/s_y}. $$

Define

$$ \hat{a}_{s_y} = a_{s_y} + \langle r_y\rangle_{q(r_y)}, \qquad \hat{b}_{s_y} = b_{s_y} + \langle r_y\rangle_{q(r_y)}\langle z_y\rangle_{q(z_y)} $$

to express $F$ as

$$ F = \left\langle (\hat{a}_{s_y}-1)\log(s_y) - \hat{b}_{s_y} s_y - \log(q(s_y)) \right\rangle_{q(s_y)} + \{\cdot\}_{/s_y}. $$

Maximizing $F$ w.r.t. $q(s_y)$ under the constraint $\int ds_y\, q(s_y) = k_{s_y}$, and then choosing $k_{s_y}$ such that $\max_{s_y} q(s_y) = 1$, should result in

$$ q^*(s_y) = \left( \frac{\hat{b}_{s_y}}{\hat{a}_{s_y}-1} \right)^{\hat{a}_{s_y}-1}\exp(\hat{a}_{s_y}-1)\,(s_y)^{\hat{a}_{s_y}-1}\exp(-\hat{b}_{s_y} s_y). $$

A12. Evaluation of integrals

The derived expressions for the parameters of the optimal membership functions include integral terms of the $\langle f(x)\rangle_{q(x)}$ type. Having derived the expression for $q(x)$, the integral terms can be evaluated as

$$ \langle\lambda_x\rangle_{q^*(\lambda_x)} = \frac{\hat{a}_{\lambda_x}}{\hat{b}_{\lambda_x}}, \qquad \langle\lambda_y\rangle_{q^*(\lambda_y)} = \frac{\hat{a}_{\lambda_y}}{\hat{b}_{\lambda_y}}, $$

$$ \langle z_{xj}\rangle_{q^*(z_{xj})} = \frac{\hat{a}_{z_{xj}}}{\hat{b}_{z_{xj}}}, \qquad \langle\log(z_{xj})\rangle_{q^*(z_{xj})} = \psi(\hat{a}_{z_{xj}}) - \log(\hat{b}_{z_{xj}}), $$

$$ \langle z_y\rangle_{q^*(z_y)} = \frac{\hat{a}_{z_y}}{\hat{b}_{z_y}}, \qquad \langle\log(z_y)\rangle_{q^*(z_y)} = \psi(\hat{a}_{z_y}) - \log(\hat{b}_{z_y}), $$

$$ \langle r_x\rangle_{q^*(r_x)} = \frac{\hat{a}_{r_x}}{\hat{b}_{r_x}}, \qquad \langle r_y\rangle_{q^*(r_y)} = \frac{\hat{a}_{r_y}}{\hat{b}_{r_y}}, $$

$$ \langle s_x\rangle_{q^*(s_x)} = \frac{\hat{a}_{s_x}}{\hat{b}_{s_x}}, \qquad \langle\log(s_x)\rangle_{q^*(s_x)} = \psi(\hat{a}_{s_x}) - \log(\hat{b}_{s_x}), $$

$$ \langle s_y\rangle_{q^*(s_y)} = \frac{\hat{a}_{s_y}}{\hat{b}_{s_y}}, \qquad \langle\log(s_y)\rangle_{q^*(s_y)} = \psi(\hat{a}_{s_y}) - \log(\hat{b}_{s_y}), $$

$$ \left\langle |x_j - o_j|^2 \right\rangle_{q^*(o)} = \left| x_j - I_j\hat{m}_o \right|^2 + \mathrm{Tr}\left( \hat{\Lambda}_o^{-1} I_j^T I_j \right), $$

$$ \left\langle \left| y - (G(x))^T\alpha \right|^2 \right\rangle_{q^*(\alpha)} = \left| y - (G(x))^T\hat{m}_\alpha \right|^2 + \mathrm{Tr}\left( \hat{\Lambda}_\alpha^{-1} G(x)(G(x))^T \right), $$

where $I_j \in \mathbb{R}^{1\times n}$ is a row matrix equal to the $j$-th row of the identity matrix of size $n$, $\psi(\cdot)$ is the digamma function, and $\mathrm{Tr}(\cdot)$ denotes the trace operator.

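A compact sketch of these moment evaluations (illustrative; it uses scipy's digamma for $\psi(\cdot)$ and exploits $\mathrm{Tr}(\hat{\Lambda}_\alpha^{-1}G(x)(G(x))^T) = (G(x))^T\hat{\Lambda}_\alpha^{-1}G(x)$):

```python
import numpy as np
from scipy.special import digamma

def gamma_moments(a_hat, b_hat):
    """<x> and <log x> under the Gamma-shaped q*(x) of A3-A11."""
    return a_hat / b_hat, digamma(a_hat) - np.log(b_hat)

def expected_sq_residual(y, g, m_a_hat, Lambda_a_hat):
    """<|y - G(x)^T alpha|^2>_{q*(alpha)} from A12; g is the vector G(x)."""
    Linv = np.linalg.inv(Lambda_a_hat)
    return (y - g @ m_a_hat) ** 2 + g @ Linv @ g
```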
A13. Estimation of membership functions' parameters

The set of equations for estimating the optimal membership functions' parameters, obtained after evaluating the integrals, is summarized as (4)–(23) in Result 1.

Appendix B. Proof of Result 2

Using the Matrix Inversion Lemma, it follows from (32) that

$$ \left( \hat{\Lambda}_{o|k+1} \right)^{-1} = \left( \hat{\Lambda}_{o|k} \right)^{-1} - \left( \hat{\Lambda}_{o|k} \right)^{-1}\left( \mathrm{diag}\left( \frac{\hat{b}_{\lambda_x}|_{k+1}\hat{b}_{z_{x1}}|_{k+1}}{\hat{a}_{\lambda_x}|_{k+1}\hat{a}_{z_{x1}}|_{k+1}}, \ldots, \frac{\hat{b}_{\lambda_x}|_{k+1}\hat{b}_{z_{xn}}|_{k+1}}{\hat{a}_{\lambda_x}|_{k+1}\hat{a}_{z_{xn}}|_{k+1}} \right) + \left( \hat{\Lambda}_{o|k} \right)^{-1} \right)^{-1}\left( \hat{\Lambda}_{o|k} \right)^{-1}. $$

For simplicity, define

$$ P_o(k) = \left( \hat{\Lambda}_{o|k+1} \right)^{-1}, \qquad u_{xj}(k) = \frac{\hat{a}_{\lambda_x}|_{k+1}\,\hat{a}_{z_{xj}}|_{k+1}}{\hat{b}_{\lambda_x}|_{k+1}\,\hat{b}_{z_{xj}}|_{k+1}}, \qquad D_{u_x}(k) = \mathrm{diag}\left( u_{x1}(k), \ldots, u_{xn}(k) \right) $$

to write (32) as

$$ P_o(k) = P_o(k-1) - P_o(k-1)\left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1} P_o(k-1). $$

Similarly, (33) is rewritten as

$$ \hat{m}_{o|k+1} = P_o(k)\left[ (P_o(k-1))^{-1}\hat{m}_{o|k} + D_{u_x}(k)x(k) \right] $$

$$ = \hat{m}_{o|k} + P_o(k-1)D_{u_x}(k)x(k) - P_o(k-1)\left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1} P_o(k-1)D_{u_x}(k)x(k) - P_o(k-1)\left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1}\hat{m}_{o|k} $$

$$ = \hat{m}_{o|k} + \left\{ I - P_o(k-1)\left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1} \right\} P_o(k-1)D_{u_x}(k)x(k) - P_o(k-1)\left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1}\hat{m}_{o|k}. $$

Using the matrix identities

$$ I - P_o(k-1)\left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1} = \left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}, $$

$$ \left[ (D_{u_x}(k))^{-1} + P_o(k-1) \right]^{-1} = D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}, $$

$\hat{m}_{o|k+1}$ can be expressed as

$$ \hat{m}_{o|k+1} = \hat{m}_{o|k} + \left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1} P_o(k-1)D_{u_x}(k)x(k) - P_o(k-1)D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\hat{m}_{o|k}. $$

Another matrix identity

$$ P_o(k-1)D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1} = \left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1} P_o(k-1)D_{u_x}(k) $$

leads to

$$ \hat{m}_{o|k+1} = \hat{m}_{o|k} + P_o(k-1)D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\left( x(k) - \hat{m}_{o|k} \right). \tag{B.1} $$

Similarly, $P_o(k)$ can be expressed as

$$ P_o(k) = P_o(k-1) - P_o(k-1)D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1} P_o(k-1). \tag{B.2} $$

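A direct numpy sketch of the recursions (B.1)–(B.2) (ours, for illustration; a linear solve replaces the explicit inverse of $I + P_o(k-1)D_{u_x}(k)$):

```python
import numpy as np

def result2_step(m_o, P_o, x_k, u_x):
    """One step of (B.1)-(B.2): recursive update of (m_o, P_o).

    u_x is the length-n vector of u_xj(k), so D_ux = diag(u_x).
    """
    n = m_o.size
    PD = P_o * u_x                                      # P_o(k-1) @ diag(u_x)
    M = np.linalg.solve((np.eye(n) + PD).T, PD.T).T     # PD @ inv(I + PD)
    m_new = m_o + M @ (x_k - m_o)                       # (B.1)
    P_new = P_o - M @ P_o                               # (B.2)
    return m_new, P_new
```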
It can be observed from (B.1) that

$$ \tilde{m}_o(k) = \tilde{m}_o(k-1) - P_o(k-1)D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\left[ \tilde{m}_o(k-1) + \epsilon(k) \right]. \tag{B.3} $$

Premultiplying both sides of (B.3) with $(P_o(k-1))^{-1}$, we have

$$ (P_o(k-1))^{-1}\tilde{m}_o(k) = (P_o(k-1))^{-1}\tilde{m}_o(k-1) - D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\left[ \tilde{m}_o(k-1) + \epsilon(k) \right]. $$

Define, for a vector $x \in \mathbb{R}^n$,

$$ \|x\|_A^2 = x^T A x, \qquad A \in \mathbb{R}^{n\times n}. $$

Also, note that

$$ \left( D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1} \right)^T = D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}. $$

Now, we have

$$ \|\tilde{m}_o(k)\|^2_{(P_o(k-1))^{-1}} = \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} - 2(\tilde{m}_o(k-1))^T D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\left[ \tilde{m}_o(k-1) + \epsilon(k) \right] + \left\| D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\left[ \tilde{m}_o(k-1) + \epsilon(k) \right] \right\|^2_{P_o(k-1)}. \tag{B.4} $$

Also,

$$ -2(\tilde{m}_o(k-1))^T D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}\left[ \tilde{m}_o(k-1) + \epsilon(k) \right] = \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} - \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} - \|\tilde{m}_o(k-1)+\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}}. \tag{B.5} $$

We can combine (B.4) and (B.5) as follows:

$$ \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} = \|\tilde{m}_o(k)\|^2_{(P_o(k-1))^{-1}} + \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \|\tilde{m}_o(k-1)+\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-2}}. \tag{B.6} $$

It can be observed from (B.2) that

$$ (P_o(k))^{-1} = (P_o(k-1))^{-1} + D_{u_x}(k). \tag{B.7} $$

Therefore,

$$ \|\tilde{m}_o(k)\|^2_{(P_o(k-1))^{-1}} = \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} - \|\tilde{m}_o(k)\|^2_{D_{u_x}(k)}. \tag{B.8} $$

It follows from (B.3) that

$$ \|\tilde{m}_o(k)\|^2_{D_{u_x}(k)} = \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)} + \left\| P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}[\tilde{m}_o(k-1)+\epsilon(k)] \right\|^2_{D_{u_x}(k)} - 2\|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} - 2(\tilde{m}_o(k-1))^T D_{u_x}(k)P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}\epsilon(k). $$

By substituting the value of $\|\tilde{m}_o(k)\|^2_{D_{u_x}(k)}$ in (B.8), we have

$$ \|\tilde{m}_o(k)\|^2_{(P_o(k-1))^{-1}} = \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} - \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)} - \left\| P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}[\tilde{m}_o(k-1)+\epsilon(k)] \right\|^2_{D_{u_x}(k)} + 2\|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + 2(\tilde{m}_o(k-1))^T D_{u_x}(k)P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}\epsilon(k). $$

After substituting the value of $\|\tilde{m}_o(k)\|^2_{(P_o(k-1))^{-1}}$, equality (B.6) becomes

$$ \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} = \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} - \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)} - \left\| P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}[\tilde{m}_o(k-1)+\epsilon(k)] \right\|^2_{D_{u_x}(k)} + 2\|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + 2(\tilde{m}_o(k-1))^T D_{u_x}(k)P_o(k-1)D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}\epsilon(k) + \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \|\tilde{m}_o(k-1)+\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-2}}. $$

After some algebra, it follows from the above equality that

$$ \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} = \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} + \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \|\tilde{m}_o(k-1)+\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-2}} - \left\| A\epsilon(k) - B\tilde{m}_o(k-1) \right\|^2, $$

where

$$ A = (D_{u_x}(k))^{1/2} P_o(k-1)D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}, \qquad B = (D_{u_x}(k))^{1/2}\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}, $$

and $\|\cdot\|^2$ is the squared Euclidean norm. Now, we have

$$ \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \left\| A\epsilon(k) - B\tilde{m}_o(k-1) \right\|^2 = \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} + \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \|\tilde{m}_o(k-1)+\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-2}}. $$

Therefore,

$$ \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \left\| A\epsilon(k) - B\tilde{m}_o(k-1) \right\|^2 \ge \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} + \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}}. \tag{B.9} $$

Since $P_o(k-1)$ is a positive definite matrix, it can be decomposed as

$$ P_o(k-1) = LL^T, $$

where $L$ is a lower triangular matrix. As the Euclidean norm is a non-negative quantity,

$$ \left\| \left( \sqrt{D_{u_x}(k)}\,L \right)^{-1} A\epsilon(k) + L^T\sqrt{D_{u_x}(k)}\,B\tilde{m}_o(k-1) \right\|^2 \ge 0. $$

That is,

$$ 0 \le \|A\epsilon(k)\|^2_{\left( \sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)} \right)^{-1}} + \|B\tilde{m}_o(k-1)\|^2_{\sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)}} + 2(\epsilon(k))^T A^T B\tilde{m}_o(k-1). $$

Adding $\|A\epsilon(k)\|^2 + \|B\tilde{m}_o(k-1)\|^2$ to both sides of the aforementioned inequality leads to

$$ \|A\epsilon(k)\|^2 + \|B\tilde{m}_o(k-1)\|^2 \le \|A\epsilon(k)\|^2_{\left( \sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)} \right)^{-1}+I} + \|B\tilde{m}_o(k-1)\|^2_{I+\sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)}} + 2(\epsilon(k))^T A^T B\tilde{m}_o(k-1). $$

That is,

$$ \left\| A\epsilon(k) - B\tilde{m}_o(k-1) \right\|^2 \le \|A\epsilon(k)\|^2_{\left( \sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)} \right)^{-1}+I} + \|B\tilde{m}_o(k-1)\|^2_{I+\sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)}}. $$

Since

$$ A^T\left[ \left( \sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)} \right)^{-1} + I \right]A = D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1} P_o(k-1)D_{u_x}(k), $$

$$ B^T\left[ I + \sqrt{D_{u_x}(k)}\,P_o(k-1)\sqrt{D_{u_x}(k)} \right]B = D_{u_x}(k)\left[ I + P_o(k-1)D_{u_x}(k) \right]^{-1}, $$

we have

$$ \left\| A\epsilon(k) - B\tilde{m}_o(k-1) \right\|^2 \le \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}P_o(k-1)D_{u_x}(k)} + \|\tilde{m}_o(k-1)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}}. \tag{B.10} $$

Combining (B.9) and (B.10) leads to

$$ \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} \le \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)[I+P_o(k-1)D_{u_x}(k)]^{-1}P_o(k-1)D_{u_x}(k)}. $$

That is,

$$ \|\tilde{m}_o(k)\|^2_{(P_o(k))^{-1}} \le \|\tilde{m}_o(k-1)\|^2_{(P_o(k-1))^{-1}} + \|\epsilon(k)\|^2_{D_{u_x}(k)}. $$

This proves the first part of Result 2. If the algorithm is run from $k = 1$ to $k = t$, then

$$ \frac{\|\tilde{m}_o(t)\|^2_{(P_o(t))^{-1}}}{\|\tilde{m}_o(0)\|^2_{(P_o(0))^{-1}} + \sum_{k=1}^{t}\|\epsilon(k)\|^2_{D_{u_x}(k)}} \le 1. $$

It can be seen from (B.7) that

$$ (P_o(t))^{-1} = (P_o(0))^{-1} + \sum_{k=1}^{t} D_{u_x}(k). $$

Therefore,

$$ \frac{(\tilde{m}_o(t))^T\left[ (P_o(0))^{-1} + \sum_{k=1}^{t} D_{u_x}(k) \right]\tilde{m}_o(t)}{\|\tilde{m}_o(0)\|^2_{(P_o(0))^{-1}} + \sum_{k=1}^{t}\|\epsilon(k)\|^2_{D_{u_x}(k)}} \le 1. $$

Hence, the second part of Result 2 is proved.

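The energy bound just proved can be exercised numerically. The following sketch (synthetic data, illustrative names, reusing result2_step from the sketch after (B.2)) checks the accumulated inequality of Result 2 at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 4, 200
o_true = rng.normal(size=n)              # hypothetical true value of o
m_o, P_o = np.zeros(n), np.eye(n)        # initial guess and P_o(0) = I
bound = np.linalg.norm(o_true) ** 2      # ||m_tilde_o(0)||^2 with P_o(0) = I

for k in range(t):
    eps = 0.1 * rng.standard_t(df=2, size=n)     # heavy-tailed disturbance
    x_k = o_true + eps                           # noisy observation of o
    u_x = rng.uniform(0.1, 1.0, size=n)          # stand-in for u_xj(k)
    m_o, P_o = result2_step(m_o, P_o, x_k, u_x)  # (B.1)-(B.2)
    bound += eps @ (u_x * eps)                   # + ||eps(k)||^2_{D_ux(k)}
    m_t = m_o - o_true
    assert m_t @ np.linalg.solve(P_o, m_t) <= bound + 1e-9   # Result 2
```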

Appendix C. Proof of Result 3

Using the Matrix Inversion Lemma on (42), we get

$$ \left( \hat{\Lambda}_{\alpha|k+1} \right)^{-1} = \left( \hat{\Lambda}_{\alpha|k} \right)^{-1} - \left( \hat{\Lambda}_{\alpha|k} \right)^{-1} G(x(k))\left[ \frac{\hat{b}_{\lambda_y}|_{k+1}\,\hat{b}_{z_y}|_{k+1}}{\hat{a}_{\lambda_y}|_{k+1}\,\hat{a}_{z_y}|_{k+1}} + (G(x(k)))^T\left( \hat{\Lambda}_{\alpha|k} \right)^{-1} G(x(k)) \right]^{-1}(G(x(k)))^T\left( \hat{\Lambda}_{\alpha|k} \right)^{-1}. \tag{C.1} $$

For simplicity, define

$$ P_\alpha(k) = \left( \hat{\Lambda}_{\alpha|k+1} \right)^{-1}, \qquad u_y(k) = \frac{\hat{a}_{\lambda_y}|_{k+1}\,\hat{a}_{z_y}|_{k+1}}{\hat{b}_{\lambda_y}|_{k+1}\,\hat{b}_{z_y}|_{k+1}} $$

to write (42) as

$$ P_\alpha(k) = P_\alpha(k-1) - u_y(k)\,\frac{P_\alpha(k-1)G(x(k))(G(x(k)))^T P_\alpha(k-1)}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}}. $$

Similarly, (43) can be expressed as

$$ \hat{m}_{\alpha|k+1} = \hat{m}_{\alpha|k} + u_y(k)\,\frac{P_\alpha(k-1)G(x(k))\left( y(k) - (G(x(k)))^T\hat{m}_{\alpha|k} \right)}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}}. $$

That is,

$$ \tilde{m}_\alpha(k) = \tilde{m}_\alpha(k-1) - u_y(k)\,\frac{P_\alpha(k-1)G(x(k))\left( (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right)}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}}. \tag{C.2} $$

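These two recursions have the familiar recursive-least-squares structure with the data-dependent scalar gain $u_y(k)$; a minimal numpy sketch (illustrative names, not the paper's reference code):

```python
import numpy as np

def result3_step(m_a, P_a, g_k, y_k, u_y):
    """One step of the (42)-(43) recursions in the form used by Appendix C.

    g_k is the regressor vector G(x(k)); u_y is the scalar gain u_y(k).
    """
    Pg = P_a @ g_k
    denom = 1.0 + u_y * (g_k @ Pg)              # 1 + u_y * ||G||^2_{P_alpha}
    m_new = m_a + u_y * Pg * (y_k - g_k @ m_a) / denom
    P_new = P_a - u_y * np.outer(Pg, Pg) / denom
    return m_new, P_new
```

A small $u_y(k)$ attenuates the correction of $\hat{m}_\alpha$ for samples flagged as unreliable, which is consistent with the robustness property established below.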
Thus,

$$ \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k-1))^{-1}} = \|\tilde{m}_\alpha(k-1)\|^2_{(P_\alpha(k-1))^{-1}} - 2\,\frac{u_y(k)\left( (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right)(G(x(k)))^T\tilde{m}_\alpha(k-1)}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} + \frac{|u_y(k)|^2\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right|^2\|G(x(k))\|^2_{P_\alpha(k-1)}}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2}. $$

Using

$$ -2\left( (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right)(G(x(k)))^T\tilde{m}_\alpha(k-1) = |v(k)|^2 - \left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2 - \left| (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right|^2, $$

we have

$$ \|\tilde{m}_\alpha(k-1)\|^2_{(P_\alpha(k-1))^{-1}} + \frac{u_y(k)|v(k)|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} = \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k-1))^{-1}} + \frac{u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} + \frac{u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2}. \tag{C.3} $$

It follows from (42) that

$$ (P_\alpha(k))^{-1} = (P_\alpha(k-1))^{-1} + u_y(k)G(x(k))(G(x(k)))^T, \tag{C.4} $$

and thus

$$ \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k-1))^{-1}} = \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k))^{-1}} - u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k) \right|^2. $$

It can be derived from (C.2) that

$$ \left| (G(x(k)))^T\tilde{m}_\alpha(k) \right|^2 = \left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2 + \frac{|u_y(k)|^2\|G(x(k))\|^4_{P_\alpha(k-1)}\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2} - 2\,\frac{u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} - 2\,\frac{u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}(G(x(k)))^T\tilde{m}_\alpha(k-1)v(k)}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}}. $$

Thus

$$ \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k-1))^{-1}} = \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k))^{-1}} - u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2 - \frac{(u_y(k))^3\|G(x(k))\|^4_{P_\alpha(k-1)}\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2} + 2\,\frac{|u_y(k)|^2\|G(x(k))\|^2_{P_\alpha(k-1)}\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} + 2\,\frac{|u_y(k)|^2\|G(x(k))\|^2_{P_\alpha(k-1)}(G(x(k)))^T\tilde{m}_\alpha(k-1)v(k)}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}}. $$

The equality (C.3), after substituting the value of $\|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k-1))^{-1}}$ and after some algebra, becomes

$$ \|\tilde{m}_\alpha(k-1)\|^2_{(P_\alpha(k-1))^{-1}} + \frac{u_y(k)|v(k)|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} = \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k))^{-1}} + \frac{u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} + \frac{u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) + v(k) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2} - \frac{u_y(k)\left| u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}v(k) - (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2}. $$

Therefore,

$$ \|\tilde{m}_\alpha(k-1)\|^2_{(P_\alpha(k-1))^{-1}} + \frac{u_y(k)|v(k)|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} \ge \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k))^{-1}} + \frac{u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} - \frac{u_y(k)\left| u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}v(k) - (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2}. \tag{C.5} $$

As the square is always a non-negative quantity,

$$ 0 \le \left| \sqrt{u_y(k)}\,\|G(x(k))\|_{P_\alpha(k-1)}\,v(k) + \sqrt{u_y(k)}\,\|G(x(k))\|_{P_\alpha(k-1)}\,(G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2. $$

That is,

$$ -2u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}(G(x(k)))^T\tilde{m}_\alpha(k-1)v(k) \le u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}|v(k)|^2 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2. \tag{C.6} $$

Adding $\left| u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}v(k) \right|^2 + \left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2$ to both sides of (C.6) leads to

$$ \left| u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}v(k) - (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2 \le \left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}|v(k)|^2 + \left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2. $$

That is,

$$ -\frac{u_y(k)\left| u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}v(k) - (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{\left( 1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)} \right)^2} \ge -\frac{|u_y(k)|^2\|G(x(k))\|^2_{P_\alpha(k-1)}|v(k)|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}} - \frac{u_y(k)\left| (G(x(k)))^T\tilde{m}_\alpha(k-1) \right|^2}{1 + u_y(k)\|G(x(k))\|^2_{P_\alpha(k-1)}}. \tag{C.7} $$

Combining (C.7) with (C.5) results in

$$ \|\tilde{m}_\alpha(k-1)\|^2_{(P_\alpha(k-1))^{-1}} + u_y(k)|v(k)|^2 \ge \|\tilde{m}_\alpha(k)\|^2_{(P_\alpha(k))^{-1}}. $$

This proves the first part of Result 3. If the algorithm is run from $k = 1$ to $k = t$, then

$$ \frac{\|\tilde{m}_\alpha(t)\|^2_{(P_\alpha(t))^{-1}}}{\|\tilde{m}_\alpha(0)\|^2_{(P_\alpha(0))^{-1}} + \sum_{k=1}^{t} u_y(k)|v(k)|^2} \le 1. $$

It can be observed from (C.4) that

$$ (P_\alpha(t))^{-1} = (P_\alpha(0))^{-1} + \sum_{k=1}^{t} u_y(k)G(x(k))(G(x(k)))^T, $$

and therefore

$$ \frac{(\tilde{m}_\alpha(t))^T\left[ (P_\alpha(0))^{-1} + \sum_{k=1}^{t} u_y(k)G(x(k))(G(x(k)))^T \right]\tilde{m}_\alpha(t)}{\|\tilde{m}_\alpha(0)\|^2_{(P_\alpha(0))^{-1}} + \sum_{k=1}^{t} u_y(k)|v(k)|^2} \le 1. $$

Hence, the second part of Result 3 is proved.