A back propagation algorithm to estimate the parameters of non-linear dynamic rational models


Applied Mathematical Modelling 27 (2003) 169–187 www.elsevier.com/locate/apm

Q.M. Zhu *

Faculty of Engineering, University of the West of England, Frenchay Campus, Coldharbour Lane, Bristol, BS16 1QY, UK

* Tel.: +44-117-344-2533. E-mail address: [email protected] (Q.M. Zhu).

Abstract

In this study an error back propagation (BP) parameter estimation algorithm is derived for a class of non-linear dynamic rational models. Combined with an orthogonal correlation test algorithm, the BP based parameter estimator provides an alternative for model structure detection, that is, selecting a set of significant terms over a full model set while estimating the associated parameters. A number of simulated examples are used to illustrate the efficiency of the algorithm. © 2002 Elsevier Science Inc. All rights reserved.

Keywords: Non-linear rational models; Model structure detection; Parameter estimation; Back propagation computation

1. Introduction

The non-linear autoregressive moving average with exogenous input (NARMAX) model set has been extensively studied in theory [8,11,22] and gradually adopted in applications [20,24]. There are two main streams of sub-model sets, polynomial and rational. A polynomial NARMAX model is linear in the parameters and non-linear in the regression terms, and can be used to represent a wide range of linear and non-linear systems. The advantage for identification is that the model is linear in the parameters. The rational model [1,4,23] is an extension of the polynomial model set, defined as the ratio of two polynomial expressions. The introduction of the denominator polynomial terms makes the rational NARMAX model non-linear in both the parameters and the regression terms. The justification for using the rational model is that it provides a very concise and parsimonious representation for complex non-linear systems and has excellent extrapolation properties.


However, model identification and controller design are much more difficult for the rational model than for the polynomial model [18,27]. A number of model structure detection techniques and parameter estimation algorithms have been developed for the rational model. These include an orthogonal model structure detection and parameter estimation routine [5], generalised least squares estimators [4,25,26], a prediction error estimator [1], and a genetic algorithm estimator [2]. In view of the difficulty of parameter estimation for rational models, further developments should consider introducing inductive learning mechanisms that iteratively extract the information embedded in the measured data. Straightforward and efficient computations should also be investigated. Error back propagation computation, developed for neural network training, is a potential candidate.

The error back propagation algorithm and its variations have been successfully used to train multilayer neural networks. Basically the error back propagation procedure consists of two passes through the different layers of the neural network, a forward pass and a backward pass [12]. In the forward pass the predicted model outputs are computed and then the errors, the differences between the measured outputs and the predicted outputs, are obtained. In the backward pass the error signals are used to update the weight/parameter estimates. However, back propagation algorithms have not been widely linked to classical model structure detection and parameter estimation. It is believed that the merits of BP computations can be exploited in classical model identification, to provide a marriage of the best of both the classical and the neural network approaches, and should provide a powerful toolkit for analysing an enormous range of systems. In this study, motivated by this property of neural networks, the back propagation algorithm [12] is modified to act as a rational model parameter estimator.

The main sections of the study are organised as follows. In Section 2 the general rational model structure is presented. In Section 3 the back propagation estimator (BPE) for rational model parameter estimation is derived in terms of recursive and iterative computations, and issues such as parameter initialisation, learning rate selection, stopping criteria, and model structure detection are discussed. In Section 4 the convergence of the BPE is analysed using a general convergence analysis tool for stochastic approximation algorithms. In Section 5 the statistics of the BPE, its mean and variance, are investigated to prove that the estimator is unbiased with finite variance. In Section 6 the BPE is tested with three different types of rational models.

2. The rational model

Mathematically a general dynamic non-linear rational model is defined as

$$y(k) = \hat{y}(k) + e(k) = \frac{a(k)}{b(k)} + e(k) = \frac{a\big(u(k-1), \ldots, u(k-n_{nu}),\; y(k-1), \ldots, y(k-n_{ny}),\; e(k-1), \ldots, e(k-n_{ne})\big)}{b\big(u(k-1), \ldots, u(k-n_{du}),\; y(k-1), \ldots, y(k-n_{dy}),\; e(k-1), \ldots, e(k-n_{de})\big)} + e(k) \quad (2.1)$$

where $y(k)$ and $\hat{y}(k)$ are the measured and model output respectively, $u(k)$ is the input, $e(k)$ is the model error, and $k$ ($= 1, 2, \ldots$) is the time index. Generally the numerator $a(k)$ and denominator


$b(k)$ are functions of past inputs, outputs, and errors, and can be expressed in terms of polynomials

$$a(k) = \sum_{j=1}^{num} p_{nj}(k)\,\theta_{nj}, \qquad b(k) = \sum_{j=1}^{den} p_{dj}(k)\,\theta_{dj} \quad (2.2)$$

The regression terms $p_{nj}(k)$ and $p_{dj}(k)$ are products of past inputs, outputs, and errors, such as $u(k-1)y(k-3)$, $u(k-1)e(k-2)$, $y^2(k-1)$, and $\theta_{nj}$, $\theta_{dj}$ are the associated parameters. The task of parameter estimation, for a given model structure, is to estimate the associated parameters from the measured inputs and outputs.

The non-linear rational model can be depicted, in a neural network structure, as shown in Fig. 1. Here the rational model neural network is structured with an input layer of regression terms $p_{nj}(k)$ and $p_{dj}(k)$, a hidden layer of $a(k)$ and $b(k)$ with linear activation functions, and an output layer of $y(k)$ with a ratio activation function. The activation function at the output layer is the division of the numerator polynomial by the denominator polynomial. Leung and Haykin [16] presented a rational function neural network. Its shortcomings compared with the current study are that Leung and Haykin's network does not have a generalised rational model structure and correlated errors are not accommodated. Hence Leung and Haykin's parameter estimation algorithm cannot provide unbiased estimates with noise corrupted data, and is essentially a special implementation of the procedure of Billings and Zhu [4] and Zhu and Billings [25] in the case of noise free data.

Several remarks relating to the characteristics of the rational model of (2.1) are noted below:

(i) The polynomial NARMAX model is a special case of model (2.1), obtained by setting the denominator polynomial $b(k) = 1$.
(ii) The model is non-linear in both the parameters and the regression terms; this is induced by the denominator polynomial.
(iii) The model can be much more concise than a polynomial expansion, for example

Fig. 1. Rational neural network.

$$y(k) = \frac{u(k-1)}{1 + y^2(k-1)} = u(k-1)\Big[1 - y^2(k-1) + y^4(k-1) - \cdots\Big] \quad (2.3)$$

(iv) The model can produce large deviations in the output, for example

$$y(k) = \frac{1}{1 + u(k-1)} \quad (2.4)$$

When $u(k-1)$ approaches $-1$, the model output response will be large. Hence the power to capture quick and large changes is gained by introducing the denominator polynomial.
(v) Identification is difficult; this is the challenge studied in this paper.
(vi) Rational models have been gradually adopted in applications of non-linear system modelling and control [9,10,14,19].
(vii) Many neuro-fuzzy systems can be expressed as non-linear rational models [24], for example fuzzy systems with centre defuzzifier, product inference rule, singleton fuzzifier, and Gaussian membership function. The normalised radial basis function network is also a type of rational model; when the centres and widths are estimated this becomes a rational model parameter estimation problem.
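As a minimal illustration of the structure of (2.1) and (2.2), the sketch below simulates a rational model with the regression terms $p_n = \{y(k-1), u(k-1)\}$ and $p_d = \{1, u^2(k-1)\}$, the terms used in the simulation studies of Section 6 (Python, the function name, and the particular terms are editorial choices made here for illustration):

```python
import numpy as np

def simulate_rational(u, th_n, th_d, e=None):
    """Simulate y(k) = a(k)/b(k) + e(k) with the illustrative terms
    a(k) = th_n[0]*y(k-1) + th_n[1]*u(k-1), b(k) = th_d[0] + th_d[1]*u(k-1)**2."""
    N = len(u)
    e = np.zeros(N) if e is None else e
    y = np.zeros(N)
    for k in range(1, N):
        a = th_n[0] * y[k - 1] + th_n[1] * u[k - 1]   # numerator polynomial a(k)
        b = th_d[0] + th_d[1] * u[k - 1] ** 2         # denominator polynomial b(k)
        y[k] = a / b + e[k]                           # rational model output (2.1)
    return y

# Example: the noise free system used later in Section 6, Example 1.
rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 1024)
y = simulate_rational(u, th_n=[0.3, 0.7], th_d=[1.0, 1.0])
```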

3. Back propagation computation for parameter estimation

To derive the parameter estimation algorithm, with reference to the model of (2.1) and (2.2), define the instantaneous squared error between the measured output and the model output as

$$\varepsilon(k) = \tfrac{1}{2}\big[y(k) - \hat{y}(k)\big]^2 \quad (3.1)$$

The task is to determine a set of parameters for the model (2.1) so that the squared error in (3.1) is minimised. The parameter determination can be described as parameter training when the model is interpreted as a neural network structure. To train the parameters $\theta_{nj}$ associated with the numerator polynomial $a(k)$, the local gradients are given by

$$\frac{\partial \varepsilon(k)}{\partial \theta_{nj}} = \frac{\partial \varepsilon(k)}{\partial a(k)}\,\frac{\partial a(k)}{\partial \theta_{nj}} = -\big[y(k) - \hat{y}(k)\big]\,\frac{1}{b(k)}\, p_{nj}(k) \quad (3.2)$$

To train the parameters $\theta_{dj}$ associated with the denominator polynomial $b(k)$, the local gradients are given by

$$\frac{\partial \varepsilon(k)}{\partial \theta_{dj}} = \frac{\partial \varepsilon(k)}{\partial b(k)}\,\frac{\partial b(k)}{\partial \theta_{dj}} = \big[y(k) - \hat{y}(k)\big]\,\frac{a(k)}{b^2(k)}\, p_{dj}(k) \quad (3.3)$$

To carry out recursive computations, a time index $k$ is introduced to the parameter sets $\theta_{nj}(k)$ and $\theta_{dj}(k)$ as they are updated following the time sequence. Convergence, with $\lim_{k\to\infty}\theta_{nj}(k) \to \theta_{nj}$ and $\lim_{k\to\infty}\theta_{dj}(k) \to \theta_{dj}$, is proved later in this study. The parameter variation at time $k+1$ is therefore defined as

$$\Delta\theta_{nj}(k+1) = \theta_{nj}(k+1) - \theta_{nj}(k) = -\gamma_{nj}\,\frac{\partial \varepsilon(k+1)}{\partial \theta_{nj}(k+1)}, \qquad \Delta\theta_{dj}(k+1) = \theta_{dj}(k+1) - \theta_{dj}(k) = -\gamma_{dj}\,\frac{\partial \varepsilon(k+1)}{\partial \theta_{dj}(k+1)} \quad (3.4)$$

where $\gamma_{nj}$ and $\gamma_{dj}$ are the learning rates. The integrated parameter estimation procedure can be summarised in terms of the BP computation as follows:

I. Initialisation
Set all model parameters to uniformly distributed random numbers with zero mean and small variance, or set up the initial parameters using the prior information extraction technique proposed later in this study.

II. Parameter estimation
For $l = 0$ to $L-1$ (epoch number)
  For $k = 1$ to $N$ (training data length)
    Forward computation

$$a(k+1) = \sum_{j=1}^{num} p_{nj}(k+1)\,\theta_{nj}(k), \qquad b(k+1) = \sum_{j=1}^{den} p_{dj}(k+1)\,\theta_{dj}(k), \qquad \hat{y}(k+1) = \frac{a(k+1)}{b(k+1)}, \qquad e(k+1) = y(k+1) - \hat{y}(k+1) \quad (3.5)$$

    Backward computation

$$\frac{\partial \varepsilon(k+1)}{\partial \theta_{nj}(k+1)} = -e(k+1)\,\frac{1}{b(k+1)}\, p_{nj}(k+1), \qquad \frac{\partial \varepsilon(k+1)}{\partial \theta_{dj}(k+1)} = e(k+1)\,\frac{a(k+1)}{b^2(k+1)}\, p_{dj}(k+1) \quad (3.6)$$

$$\theta_{nj}(k+1) = \theta_{nj}(k) - \gamma_{nj}\,\frac{\partial \varepsilon(k+1)}{\partial \theta_{nj}(k+1)}, \qquad \theta_{dj}(k+1) = \theta_{dj}(k) - \gamma_{dj}\,\frac{\partial \varepsilon(k+1)}{\partial \theta_{dj}(k+1)} \quad (3.7)$$

  end
end
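To make the above procedure concrete, a minimal sketch of the BPE loop follows (the function name and array layout are editorial choices; the regression matrices are assumed precomputed from measured data, as in the deterministic case, and $\theta_{d1}$ is held at 1, anticipating the normalisation of (3.16)):

```python
import numpy as np

def bpe(Pn, Pd, y, L=10, g0=0.9, gend=0.2, th_n0=None, th_d0=None, seed=0):
    """Back propagation estimator (BPE) sketch implementing (3.5)-(3.7).

    Pn, Pd -- (N, num) and (N, den) arrays whose rows hold the regression
              terms p_nj(k+1) and p_dj(k+1); y -- (N,) measured outputs.
    The learning rate follows the linearly decayed sequence (3.11)."""
    N, num = Pn.shape
    den = Pd.shape[1]
    rng = np.random.default_rng(seed)
    # Initialisation: small random values unless prior estimates are supplied.
    th_n = rng.normal(0.0, 0.01, num) if th_n0 is None else np.asarray(th_n0, float).copy()
    th_d = np.ones(den) if th_d0 is None else np.asarray(th_d0, float).copy()
    for l in range(L):                                     # epoch loop
        for k in range(N):
            g = -(g0 - gend) / (L * N) * (l * N + k) + g0  # learning rate (3.11)
            a = Pn[k] @ th_n                               # forward pass (3.5)
            b = Pd[k] @ th_d
            e = y[k] - a / b                               # prediction error e(k+1)
            grad_n = -e * Pn[k] / b                        # backward pass (3.6)
            grad_d = e * a * Pd[k] / b**2
            th_n -= g * grad_n                             # updates (3.7)
            th_d[1:] -= g * grad_d[1:]                     # th_d1 fixed at 1, cf. (3.16)
    return th_n, th_d
```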

3.1. Initialisation

The first step in the above computation is to initialise the parameters. A sensible choice helps the resultant parameter estimation tremendously. A wrong choice of the initial parameters can lead to two critical phenomena, premature saturation and divergence. In the case of premature saturation, the instantaneous sum of squared errors remains almost constant for


some period of time during the learning process. In the case of divergence, the sum of squared errors becomes very large. A general initialisation technique [12] is to uniformly randomise the initial parameters with zero mean and small variance. In this study, the initial parameters will be determined using a prior information extraction technique. Since the model structure is assumed to be known in advance, an ordinary or extended least squares algorithm can be applied to obtain initial estimates. Although these will be biased, they should not be far from the true parameters in Euclidean distance.

3.2. Learning rate

The BPE algorithm provides an approximation to the trajectory in parameter space computed in terms of steepest descent. The smaller the learning rate, the smaller the change to the model parameters from one iteration to the next and the smoother the trajectory in parameter space. However a small learning rate will possibly cause slower convergence of the parameter estimation. If the learning rate is set too large, the estimator may become unstable. The choices of learning rate are discussed below:

(i) Constant

$$\gamma_{nj} = \text{constant}_1, \qquad \gamma_{dj} = \text{constant}_2 \quad (3.8)$$

The disadvantage is that the learning rate cannot properly adjust in the early stage (large learning rate) and in the later stage (small learning rate) of the learning process.

(ii) Inversely decayed sequence

$$\gamma_{nj}(k) = \gamma_{dj}(k) = \frac{1}{k} \quad (3.9)$$

The disadvantage is that the learning rate decays too fast, which may not be suitable when long training data sequences are available.

(iii) Generalised delta rule [13,21]

$$\theta_{nj}(k+1) = \theta_{nj}(k) - \gamma_{nj}\,\frac{\partial \varepsilon(k+1)}{\partial \theta_{nj}(k+1)} + \alpha_{nj}\big[\theta_{nj}(k) - \theta_{nj}(k-1)\big], \qquad \theta_{dj}(k+1) = \theta_{dj}(k) - \gamma_{dj}\,\frac{\partial \varepsilon(k+1)}{\partial \theta_{dj}(k+1)} + \alpha_{dj}\big[\theta_{dj}(k) - \theta_{dj}(k-1)\big] \quad (3.10)$$

where $\alpha_{nj}$ and $\alpha_{dj}$ are positive numbers called momentum constants. This rule has been extensively studied and can increase the learning rate while avoiding instability. However, it involves more computation and requires know-how to set up the constants $\alpha_{nj}$ and $\alpha_{dj}$.

(iv) Linearly decayed sequence
To overcome the above disadvantages, a new learning rate sequence is proposed in this study as follows:

$$\gamma_{nj}(k) = \gamma_{dj}(k) = -\frac{\gamma_0 - \gamma_{end}}{LN}(lN + k) + \gamma_0 \quad (3.11)$$

where $L$ is the number of iteration epochs, $N$ is the training data length, $l$ ($= 0, \ldots, L-1$) indexes the $l$th epoch, and $\gamma_0$, $\gamma_{end}$ are positive constants giving the initial and end learning rates respectively.
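The two preferred choices translate directly into code; the short sketch below (names are editorial choices) implements the generalised delta rule step (3.10) and the linearly decayed rate (3.11):

```python
def linear_decay(l, k, L, N, g0=0.9, gend=0.2):
    """Linearly decayed learning rate (3.11): falls from g0 at (l, k) = (0, 0)
    to gend at the last sample of the last epoch."""
    return -(g0 - gend) / (L * N) * (l * N + k) + g0

def delta_rule_step(th, th_prev, grad, g, alpha):
    """One generalised delta rule update (3.10): a gradient step plus a
    momentum term alpha * (th(k) - th(k-1)) that smooths the trajectory."""
    return th - g * grad + alpha * (th - th_prev)
```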


Fig. 2. Linearly decayed learning rate.

Fig. 2 shows the learning rate variation.

3.3. Stopping criterion

Two criteria [15] have been used to stop neural network training. These are formulated by means of a local or global minimum of the error surface.

(i) Gradient stopping criterion
This states that the BP converges when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold. The drawbacks of this criterion are longer training and the complexity of computing the gradient vector.

(ii) Error measure stopping criterion
This states that the BP converges when the absolute rate of change in the mean squared error per epoch is sufficiently small; typically 0.01% per epoch is used. The drawback is that the error may still be correlated for non-linear models, even though the mean squared error or its variation rate is very small [3,6].

To overcome the above drawbacks, a higher order correlation test [6,7], introduced for non-linear model validation, can be used as a stopping criterion. The tests are described below:

$$\Phi_{\alpha e^2}(\tau) = \frac{\sum_{k=1}^{N}\big[\alpha(k) - \bar{\alpha}\big]\big[e^2(k-\tau) - \overline{e^2}\big]}{\sqrt{\sum_{k=1}^{N}\big[\alpha(k) - \bar{\alpha}\big]^2 \sum_{k=1}^{N}\big[e^2(k) - \overline{e^2}\big]^2}}, \qquad \Phi_{\alpha u^2}(\tau) = \frac{\sum_{k=1}^{N}\big[\alpha(k) - \bar{\alpha}\big]\big[u^2(k-\tau) - \overline{u^2}\big]}{\sqrt{\sum_{k=1}^{N}\big[\alpha(k) - \bar{\alpha}\big]^2 \sum_{k=1}^{N}\big[u^2(k) - \overline{u^2}\big]^2}} \quad (3.12)$$

where

$$\alpha(k) = y(k)e(k), \qquad \bar{\alpha} = \frac{1}{N}\sum_{k=1}^{N}\alpha(k), \qquad \overline{u^2} = \frac{1}{N}\sum_{k=1}^{N}u^2(k), \qquad \overline{e^2} = \frac{1}{N}\sum_{k=1}^{N}e^2(k) \quad (3.13)$$

When the higher order correlation functions $\Phi_{\alpha e^2}(\tau)$ and $\Phi_{\alpha u^2}(\tau)$ satisfy

$$\Phi_{\alpha e^2}(\tau) = \begin{cases} k_0 & \tau = 0 \\ 0 & \text{otherwise} \end{cases}, \qquad \Phi_{\alpha u^2}(\tau) = 0 \quad \forall \tau \quad (3.14)$$

the estimated parameters are considered to be unbiased. Otherwise the training procedure continues until the above conditions are satisfied. In practice the 95% confidence limits, about $\pm 1.96/\sqrt{N}$, are used as the stopping thresholds.
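A minimal sketch of this stopping check is given below, assuming the tests of (3.12) are evaluated only at non-negative lags for brevity (function and variable names are editorial choices):

```python
import numpy as np

def stopping_tests(u, y, e, max_lag=20):
    """Higher order correlation tests (3.12) with alpha(k) = y(k)e(k).
    Returns Phi_ae2, Phi_au2 for tau = 0..max_lag and the 95% confidence
    limit 1.96/sqrt(N) used as the stopping threshold."""
    N = len(e)
    a_c = y * e - np.mean(y * e)            # alpha(k) minus its mean
    e2_c = e**2 - np.mean(e**2)             # e^2(k) minus its mean
    u2_c = u**2 - np.mean(u**2)             # u^2(k) minus its mean

    def phi(x_c, z_c, tau):
        # normalised correlation between x(k) and z(k - tau)
        return (x_c[tau:] @ z_c[:N - tau]) / np.sqrt((x_c @ x_c) * (z_c @ z_c))

    phi_ae2 = np.array([phi(a_c, e2_c, t) for t in range(max_lag + 1)])
    phi_au2 = np.array([phi(a_c, u2_c, t) for t in range(max_lag + 1)])
    return phi_ae2, phi_au2, 1.96 / np.sqrt(N)

# Training stops when |Phi_ae2(tau)| for tau > 0 and all |Phi_au2(tau)|
# fall inside the confidence band.
```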

3.4. Model structure detection

Model structure detection selects the significant terms from a fairly large model set, normally called the full model set, so that a sub-model structured with the significant terms can be determined. When identifying systems with an unknown structure, it is important to avoid losing these significant terms in the final model. For rational model structure detection an orthogonal algorithm has been developed [5,26], in which the rational model is expanded into a linear-in-the-parameters expression as follows:

$$Y(k) = a(k) - y(k)\sum_{j=2}^{den} p_{dj}(k)\,\theta_{dj} + b(k)e(k) = \sum_{j=1}^{num} p_{nj}(k)\,\theta_{nj} - y(k)\sum_{j=2}^{den} p_{dj}(k)\,\theta_{dj} + b(k)e(k) \quad (3.15)$$

where

$$Y(k) = y(k)p_{d1}(k)\big|_{\theta_{d1} = 1} \quad (3.16)$$

Alternatively, divide all the right hand side terms of (3.15) by $\theta_{d1}$ and redefine the symbols to give, in effect, $\theta_{d1} = 1$. The orthogonal algorithm comprises two directional computations.

Forward term selection: use the revised classical Gram-Schmidt transform to form an orthogonal decomposition of the terms over a pre-set full model set, and use the error reduction ratio (err) to select significant terms and estimate the associated parameters of the orthogonalised model.

Backward parameter estimation: use backward substitutions to obtain the original model parameters.

Mathematically, term selection involves correlation tests and parameter estimation involves optimisation computations. Basically BP based algorithms are used to deal with optimisation


problems. Although the BPE cannot be directly used to detect rational model terms, it can be used jointly with the orthogonal detection algorithm to provide an alternative for both structure detection and parameter estimation. The following procedure is therefore proposed:

1. Fit a deterministic rational model (i.e. no noise model initially), predict the residual sequence, and hence obtain an estimate of the sequence $e(k)$ and its variance $\hat{\sigma}_e^2$.
2. Use the orthogonal term selection to find the significant terms $p_{nj}(k)$ and $y(k)p_{dj}(k)$ (equivalently $p_{dj}(k)$ in the model of (2.1)) for the model of (3.15) over a pre-set full model set.
3. Use the BPE to estimate the parameters associated with the selected significant terms of the model of (2.1).
4. Update the model error sequence by $e(k) = y(k) - \hat{a}(k)/\hat{b}(k)$, and repeat steps 2 and 3 until the computations converge, the pre-set maximum number of iterations is exceeded, or a specified number of significant terms has been included in the model.

In the above procedure, the BPE is used not only for parameter estimation, but also to provide an updated error sequence $e(k)$ and its variance estimate $\hat{\sigma}_e^2$ for model structure detection. The details of the orthogonal algorithm for term selection are given in Appendix A.

4. Convergence analysis of the BPE algorithm

To analyse the convergence of a class of stochastic approximation algorithms [17], consider a general parameter update equation

$$\Theta(k+1) = \Theta(k) + \gamma(k)\,h\big(\Theta(k), \mathrm{N}(k)\big) \quad (4.1)$$

where $k$ is the iteration number, $\Theta(\cdot)$ is a sequence of parameter vectors, and $\mathrm{N}(k)$ is an observation vector received at time $k$, which causes $\Theta(\cdot)$ to be updated to take account of new information. The sequence $\gamma(\cdot)$ is assumed to be a sequence of positive scalars, and the update function $h(\cdot,\cdot)$ is a deterministic function with some regularity conditions imposed on it; together these specify the complete structure of the algorithm. The convergence analysis then consists of examining whether the algorithm satisfies the following conditions:

(i) $\gamma(k)$ is a decreasing sequence of positive real numbers, such that

$$\text{(a)} \quad \sum_{k=1}^{\infty}\gamma(k) = \infty, \qquad \text{(b)} \quad \sum_{k=1}^{\infty}\gamma^p(k) < \infty \;\; \text{for} \;\; p > 1, \qquad \text{(c)} \quad \gamma(k) \to 0 \;\; \text{as} \;\; k \to \infty \quad (4.2)$$

(ii) The sequence of parameter vectors $\Theta(\cdot)$ is bounded with probability 1.
(iii) The update function $h(\cdot,\cdot)$ is continuously differentiable with respect to $\Theta$ and $\mathrm{N}$, and its derivatives are bounded in time.


(iv) The limit

$$\bar{h}(\Theta) = \lim_{k\to\infty} E\big[h(\Theta, \mathrm{N})\big] \quad (4.3)$$

exists for each $\Theta$, where the statistical expectation operator $E$ is over $\mathrm{N}$.
(v) There is a locally asymptotically stable (in the sense of Liapunov) solution to the ordinary differential equation

$$\frac{d\Theta(t)}{dt} = \bar{h}\big(\Theta(t)\big) \quad (4.4)$$

where $t$ denotes continuous time.
(vi) Let $q_0$ denote the solution to equation (4.4) with a basin of attraction $B(q_0)$. Then the parameter vector $\Theta(k)$ enters a compact subset $A$ of the basin of attraction $B(q_0)$ infinitely often, with probability 1.

hnj ðk þ 1Þ ¼ hnj ðkÞ þ gnj hð; Þ ¼ hnj ðkÞ  gnj

¼ hnj ðkÞ þ gnj eðk þ 1Þ

1 bðk þ 1Þ

oeðk þ 1Þ hdj ðk þ 1Þ ¼ hdj ðkÞ þ gdj hð; Þ ¼ hdj ðkÞ  gdj ohdj ðk þ 1Þ

aðk þ 1Þ ¼ hdj ðkÞ  gdj eðk þ 1Þ 2 pdj ðk þ 1Þ b ðk þ 1Þ

ð4:5Þ

Setting gnj ðkÞ ¼ gdj ðkÞ ¼ gðkÞ allows the use of (4.1) for both learning rates. Examining condition (i), with reference to Fig. 2, set the number of epochs equal to one and using the learning rate formulation of (3.11) gives 1 X k ðg0  gend Þ gðkÞ ¼ triangle area ¼ lim ¼1 ðaÞ k!1 2 k¼1 1 1 X X ð4:6Þ gp ðkÞ  gðkÞ ¼ 1 for p 1 0 6 gðkÞ  1 ðbÞ k¼1

k¼1

i g  gend  k þ g0  ! 0 as k ! 1 ðk ¼ NÞ ðcÞ gðkÞ ¼  0 N gend ¼0 h

Examining condition (ii), the sequence of parameters hnj ðÞ and hdj ðÞ are bounded with probability 1. Examining condition (iii), this can be directly satisfied from the expression of (4.5).

Q.M. Zhu / Appl. Math. Modelling 27 (2003) 169–187

179

Examining condition (iv), when the model error eðkÞ is reduced to an uncorrelated noise sequence with a zero mean and finite variance r2e , then

1  pnj ðk þ 1Þ hðhnj Þ ¼ lim E eðk þ 1Þ k!1 bðk þ 1Þ

1 ¼ lim E½eðk þ 1Þ lim E pnj ðk þ 1Þ ¼ 0 k!1 k!1 bðk þ 1Þ ð4:7Þ

aðk þ 1Þ  pdj ðk þ 1Þ hðhdj Þ ¼ lim E eðk þ 1Þ 2 k!1 b ðk þ 1Þ

aðk þ 1Þ ¼ lim E½eðk þ 1Þ lim E 2 pdj ðk þ 1Þ ¼ 0 k!1 k!1 b ðk þ 1Þ Examining condition (v), the locally asymptotically stable solution of the ordinary differential equation  dhnj ðtÞ  ¼ h hnj ðtÞ ¼ 0; dt

 dhdj ðtÞ  ¼ h hdj ðtÞ ¼ 0 dt

can be obtained as Z hnj ðtÞ ¼ 0 dt ¼ constant;

hdj ðtÞ ¼

Z

0 dt ¼ constant

ð4:8Þ

ð4:9Þ

Examining condition (vi), let the solutions of (4.9) be

$$\lim_{k\to\infty}\theta_{nj}(k) \to \text{constant} = \theta_{nj}, \qquad \lim_{k\to\infty}\theta_{dj}(k) \to \text{constant} = \theta_{dj} \qquad \text{infinitely often with probability 1} \quad (4.10)$$

As the six conditions are satisfied, the BPE algorithm provides a convergent estimate of the non-linear rational model parameters. □

5. Statistics of the BPE

Combining expressions (3.6) and (3.7), the bias and variance of the estimates can be determined as follows. For the bias,

$$\text{bias}_{nj} = \theta_{nj} - E\big[\theta_{nj}(k+1)\big] = E\big[e(k+1)\big]\, E\Big[\frac{1}{b(k+1)}\,p_{nj}(k+1)\Big] = 0$$
$$\text{bias}_{dj} = \theta_{dj} - E\big[\theta_{dj}(k+1)\big] = E\big[e(k+1)\big]\, E\Big[\frac{a(k+1)}{b^2(k+1)}\,p_{dj}(k+1)\Big] = 0 \quad (5.1)$$

Therefore the estimator will produce unbiased estimates of the non-linear rational model parameters.


For the variance,

$$\text{var}_{nj} = E\Big[\big(\theta_{nj}(k+1) - E\big[\theta_{nj}(k+1)\big]\big)^2\Big] = \sigma_e^2\,\sigma_{nj}^2, \qquad \text{var}_{dj} = E\Big[\big(\theta_{dj}(k+1) - E\big[\theta_{dj}(k+1)\big]\big)^2\Big] = \sigma_e^2\,\sigma_{dj}^2 \quad (5.2)$$

where

$$\sigma_{nj}^2 = E\bigg[\Big(\frac{1}{b(k+1)}\,p_{nj}(k+1)\Big)^2\bigg], \qquad \sigma_{dj}^2 = E\bigg[\Big(\frac{a(k+1)}{b^2(k+1)}\,p_{dj}(k+1)\Big)^2\bigg] \quad (5.3)$$

Therefore the variance of each parameter estimate is associated with the noise variance and the second order moment of the corresponding regression term.
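A sample version of (5.2) and (5.3) replaces the expectations by time averages over the data record; a minimal sketch (names are editorial choices):

```python
import numpy as np

def param_variances(Pn, Pd, th_n, th_d, sigma2_e):
    """Sample estimates of (5.2)-(5.3): var_nj = sigma_e^2 E[(p_nj/b)^2] and
    var_dj = sigma_e^2 E[(a p_dj / b^2)^2], with E taken as a time average."""
    a = Pn @ th_n                       # a(k+1) over the data record
    b = Pd @ th_d                       # b(k+1)
    var_n = sigma2_e * np.mean((Pn / b[:, None]) ** 2, axis=0)
    var_d = sigma2_e * np.mean((Pd * (a / b**2)[:, None]) ** 2, axis=0)
    return var_n, var_d
```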

6. Simulation studies

Three simulated examples were selected to demonstrate the efficiency of the BPE. For each simulated system, 1024 pairs of input and output data were used with the same choice of input and noise signals. The input $u(t)$ was a uniformly distributed random sequence with zero mean and variance 0.33 (equivalently, with an amplitude range of $\pm 1$) and the noise $e(t)$ was an uncorrelated Gaussian sequence with zero mean and variance 0.01. In the simulations the real noise data were assumed to be unknown; since only the input and output data were available, the estimated noise sequence was computed as the difference between the measured output and the model predicted output. To run the BPE, the linearly decayed sequence was used for the learning rates, the prior information extraction technique was used to set up the initial parameters, and the higher order correlation test procedure was used as the stopping criterion.

Example 1. A simple noise free non-linear output affine model

$$y(t) = \frac{0.3y(t-1) + 0.7u(t-1)}{1 + u^2(t-1)} \quad (6.1)$$

The learning rates were assigned as $\gamma_0 = 0.5$, $\gamma_{end} = 0.02$ with one iteration. The prior information extraction technique was a generalised least squares estimator for rational model parameter estimation, which produces unbiased estimates in the case of noise free data. The BPE was therefore tested to determine whether it diverges when the initial parameters are set to the true parameters. Table 1 shows that the BPE is convergent. The mean and variance of the resultant residuals were $7.982 \times 10^{-18}$ and $2.264 \times 10^{-32}$ respectively.

Table 1
Estimates for Example 1

Prior parameter estimates    BPE estimated parameters    True parameters
$\theta_{n1} = 0.3$          0.3                         0.3
$\theta_{n2} = 0.7$          0.7                         0.7
$\theta_{d2} = 1.0$          1.0                         1.0
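The setting of Example 1 can be reproduced with the sketches introduced earlier (`simulate_rational` from Section 2 and `bpe` from Section 3); the data protocol follows the description above:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1024
u = rng.uniform(-1.0, 1.0, N)          # zero mean, variance (2**2)/12 = 0.33
y = simulate_rational(u, th_n=[0.3, 0.7], th_d=[1.0, 1.0])   # noise free (6.1)

# Regression matrices: p_n = {y(t-1), u(t-1)}, p_d = {1, u^2(t-1)}.
Pn = np.column_stack([y[:-1], u[:-1]])
Pd = np.column_stack([np.ones(N - 1), u[:-1] ** 2])

# One epoch, g0 = 0.5, gend = 0.02, initialised at the true parameters.
th_n, th_d = bpe(Pn, Pd, y[1:], L=1, g0=0.5, gend=0.02,
                 th_n0=[0.3, 0.7], th_d0=[1.0, 1.0])
```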


Example 2. A non-linear rational model with uncorrelated noise

$$y(t) = \frac{0.3y(t-1) + 0.7u(t-1)}{1 + u^2(t-1)} + e(t) \quad (6.2)$$

It was assumed that the model structure was not known in advance and that only the input and output data sequences were measured for the model structure detection and parameter estimation. To detect the model structure, an over specified full model consisting of 26 terms was used as the model searching set. The full model was set with numerator degree = 1, denominator degree = 2, input lag = output lag = 2, and noise lag = 1. In the expansion of the rational model into its linear-in-the-parameters expression, the output was set as $Y(t) = y(t)p_{d1}(t) = y(t)$ ($p_{d1}(t) = 1$). The cut-off value was set to 0.011 and searches over the full model set were performed to select the terms with significant err values. As mentioned above, the orthogonal correlation tests were used to select the significant terms and the BPE was used to estimate the associated parameters and then to update the estimate of the noise sequence. The whole process of detecting the model structure took seven iterations. The first three selected terms are shown in Table 2.

For the BPE computations the learning rates were initially assigned as $\gamma_0 = 0.9$, $\gamma_{end} = 0.2$ with 10 iterations. The estimated parameters were then used as initial values to re-run the BPE with $\gamma_0 = 0.2$, $\gamma_{end} = 0.02$ and 50 iterations. The initial values obtained from the prior information extraction technique and the final parameter estimates are listed in Table 3. The higher order correlation tests are shown in Fig. 3. The bias induced by the noise is clearly evident in the prior estimates in Table 3. The mean and variance of the residuals were $-0.0046$ and 0.0090 respectively.

Table 2
Estimates for Example 2

Selected order    Selected term                          err
1                 $u(t-1)$ (numerator term)              0.7989
2                 $u^2(t-1)y(t)$ (denominator term)      0.0546
3                 $y(t-1)$ (numerator term)              0.0159

Table 3
Estimates for Example 2

Prior parameter estimates      BPE estimated parameters    True parameters
$\theta_{n1} = 0.1072$         0.3005                      0.3
$\theta_{n2} = 0.4203$         0.6926                      0.7
$\theta_{d2} = 0.8038$         1.0196                      1.0

Fig. 3. Higher order correlation tests for Example 2: (a) $\Phi_{\alpha e^2}(\tau)$; (b) $\Phi_{\alpha u^2}(\tau)$.

Example 3. A non-linear rational model with correlated noise

$$y(t) = \frac{0.3y^2(t-1) + 0.8u(t-1) - 0.6y(t-1)e(t-1)}{1 + u^2(t-1) + y^2(t-1)} + e(t) \quad (6.3)$$

Initially the learning rates were assigned as $\gamma_0 = 0.9$, $\gamma_{end} = 0.2$ with 10 iterations. The resulting parameters were then used as initial values and the BPE was re-run with $\gamma_0 = 0.2$, $\gamma_{end} = 0.02$ and


Table 4
Estimates for Example 3

Prior parameter estimates      BPE estimated parameters    True parameters
$\theta_{n1} = 0.3673$         0.3012                      0.3
$\theta_{n2} = 0.0827$         0.8053                      0.8
$\theta_{n3} = 0.0$            $-0.6063$                   $-0.6$
$\theta_{d2} = 0.2361$         0.9910                      1.0
$\theta_{d3} = 1.5952$         1.0523                      1.0

50 iterations. The parameter estimates obtained from the prior information extraction technique and the final parameter estimates are listed in Table 4. The null prior value of $\theta_{n3}$ arises because no knowledge of the noise is available at the beginning. The higher order correlation tests are shown in Fig. 4. The mean and variance of the residuals were 0.005 and 0.0101 respectively.

Fig. 4. Higher order correlation tests for Example 3: (a) $\Phi_{\alpha e^2}(\tau)$; (b) $\Phi_{\alpha u^2}(\tau)$.

7. Conclusions

A new back propagation parameter estimator has been introduced for the dynamic rational model, which provides a bridge between inductive computation and classical model parameter estimation. The efficiency of the algorithm has been demonstrated using three typical examples. The concept of BP estimation could be extended to various classical non-linear model identification problems. For example, a combination of the methodology with an orthogonal correlation test algorithm can be used for rational model structure detection, or regression term selection over a full model set while estimating the associated parameters. The rational model is an alternative to classical neural network and fuzzy logic systems, and therefore a general rational model identification procedure should find a wide range of applications in inductive learning systems.

Acknowledgements

The author is grateful to Professor S.A. Billings for his helpful suggestions and revisions to the original draft, and also to the anonymous reviewers for constructive suggestions with regard to the revised draft.

Appendix A

With reference to [5,26], consider an orthogonal transform of (3.15) into

$$Y(k) = \sum_{j=1}^{num} q_{nj}(k)\,g_{nj} - y(k)\sum_{j=2}^{den} q_{dj}(k)\,g_{dj} + b(k)e(k) \quad (A.1)$$


where $q_{\ast j}(k)$ ($\ast$ denotes either $n$ or $d$) are constructed to be orthogonal over the data record, such that $\sum_{k=1}^{N} q_{\ast j}(k)q_{\ast i}(k) = 0$ for $j \ne i$, and $g_{\ast j}$ are the associated unknown parameters. The terms $q_{\ast j}(k)$ are defined as

$$q_{nj}(k) = p_{nj}(k) - \sum_{i=1}^{j-1} a_{nnij}\, q_{ni}(k) - \sum_{i=2}^{dl} a_{ndij}\, q_{di}(k)$$
$$q_{dj}(k) = p_{dj}(k)y(k) - \sum_{i=1}^{nl} a_{dnij}\, q_{ni}(k) - \sum_{i=2}^{j-1} a_{ddij}\, q_{di}(k) \quad (A.2)$$

where the ordering of the numerator and denominator terms $q_{nj}(k)$ and $q_{dj}(k)$ is arbitrary, $nl$ is the number of numerator terms selected and $dl$ is the number of denominator terms selected. Squaring Eq. (A.1), with the assumption that the signals are ergodic, gives

$$1 = \sum_{j=1}^{num} err_{nj} + \sum_{j=2}^{den} err_{dj} + \sum_{j=1}^{num} \text{bias}\big(err_{nj}\big) + \sum_{j=2}^{den} \text{bias}\big(err_{dj}\big) + \frac{\sigma_e^2\,\sigma_b^2}{\sigma_Y^2} \quad (A.3)$$

Eq. (A.3) can be used as a criterion for determining the number of terms to be included in a model selected from a full model set; it therefore determines the model structure. The larger the value of err associated with a specific term, the more the ratio $\sigma_e^2/\sigma_Y^2$ would be reduced if that term were included in the selected model. Hence terms can be ordered based upon their err values. Insignificant terms can be rejected by defining a cut-off value of $1 - \sum err_j$ below which terms are deemed to be negligible, or by pre-setting the maximum number of terms to be included in the selected model. As a criterion, err attempts to balance the prediction accuracy and the complexity of the finally selected model.

The above expressions lay a basis for developing the orthogonal model term selection algorithm, which is summarised for each iteration as follows:

For $k = 1$ to $N$ (training data length)

$$q_{\ast j}(k) = D_1(k) - \sum_{i=(1,2)}^{j-1} a_{\ast ij}\, q_i(k) - \sum_{i=(2,1)}^{l} a_{\ast ij}\, q_i(k)$$
$$e_{\ast j}(k) = D_2(k) - \sum_{i=(1,2)}^{j-1} a_{\ast ij}\, e_i(k) - \sum_{i=(2,1)}^{l} a_{\ast ij}\, e_i(k)$$
$$a_{\ast ij} = \frac{q_i(k)D_1(k) - e_i(k)D_2(k)\,\hat{\sigma}_e^2}{q_i^2(k) - e_i^2(k)\,\hat{\sigma}_e^2}$$
$$g_j = \frac{q_j(k)Y(k) - e_j(k)p_{d1}(k)\,\hat{\sigma}_e^2}{q_j^2(k) - e_j^2(k)\,\hat{\sigma}_e^2}$$
$$err_j = \frac{g_j^2 q_j^2(k) - g_j^2 e_j^2(k)\,\hat{\sigma}_e^2 - 2g_j e_j(k)b(k)\,\hat{\sigma}_e^2}{Y^2(k) - b^2(k)\,\hat{\sigma}_e^2} \quad (A.4)$$

end


In the computations $\ast$ denotes either $n$ or $d$, and

$$D_1(k) = \begin{cases} p_{nj}(k) & \text{for the numerator} \\ p_{dj}(k)y(k) & \text{for the denominator} \end{cases}, \qquad D_2(k) = \begin{cases} 0 & \text{for the numerator} \\ p_{dj}(k) & \text{for the denominator} \end{cases} \quad (A.5)$$
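For orientation, in the noise free case ($\hat{\sigma}_e^2 = 0$) this selection loop reduces to classical forward orthogonal least squares over a single linear regression; a minimal sketch of that special case (names are editorial choices, and the bias compensation terms of (A.4) are omitted):

```python
import numpy as np

def ols_select(P, Y, n_select):
    """Forward orthogonal selection via classical Gram-Schmidt with the error
    reduction ratio (err) -- the noise free skeleton of (A.1)-(A.4).
    P is an (N, M) matrix of candidate terms; Y is the (N,) target."""
    N, M = P.shape
    selected, Q, errs = [], [], []
    sY = Y @ Y
    for _ in range(n_select):
        best = (None, None, -1.0)                 # (term index, q, err)
        for m in set(range(M)) - set(selected):
            q = P[:, m].astype(float)
            for qi in Q:                          # orthogonalise against chosen terms
                q -= (qi @ P[:, m]) / (qi @ qi) * qi
            d = q @ q
            if d < 1e-12:                         # skip numerically dependent terms
                continue
            err = (q @ Y) ** 2 / (d * sY)         # error reduction ratio of this term
            if err > best[2]:
                best = (m, q, err)
        if best[0] is None:                       # no independent terms remain
            break
        m, q, err = best
        selected.append(m); Q.append(q); errs.append(err)
    return selected, errs
```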

References

[1] S.A. Billings, S. Chen, Identification of non-linear rational systems using a prediction error estimation algorithm, Int. J. Sys. Sci. 20 (1989) 467–494.
[2] S.A. Billings, K.Z. Mao, Structure detection for nonlinear rational models using genetic algorithms, Int. J. Sys. Sci. 29 (1998) 223–231.
[3] S.A. Billings, W.S.F. Woon, Structure detection and model validity tests in the identification of nonlinear systems, IEE Proc. Part D 130 (1983) 190–199.
[4] S.A. Billings, Q.M. Zhu, Rational model identification using extended least squares algorithm, Int. J. Contr. 54 (1991) 529–546.
[5] S.A. Billings, Q.M. Zhu, A structure detection algorithm for nonlinear dynamic rational models, Int. J. Contr. 59 (1994) 1439–1463.
[6] S.A. Billings, Q.M. Zhu, Nonlinear model validation using correlation tests, Int. J. Contr. 60 (1994) 1107–1120.
[7] S.A. Billings, Q.M. Zhu, Model validation tests for multivariable nonlinear models including neural networks, Int. J. Contr. 62 (1995) 749–766.
[8] S. Chen, S.A. Billings, Representations of nonlinear system: the NARMAX model, Int. J. Contr. 48 (1989) 1013–1032.
[9] S.D. Dimitrov, D.I. Kamenski, A parameter estimation method for rational functions, Comput. Chem. Eng. 15 (1991) 657–662.
[10] I. Ford, D.M. Titterington, C.P. Kitsos, Recent advances in non-linear experimental design, Technometrics 31 (1989) 49–60.
[11] R. Haber, H. Unbehauen, Structure identification of nonlinear dynamic system––a survey on input/output approaches, Automatica 26 (1990) 651–677.
[12] S. Haykin, Neural Networks, Macmillan College Publishing Company, Oxford, 1994.
[13] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks 1 (1988) 295–307.
[14] C. Kambhampati, J.D. Mason, K. Warwick, A stable one-step-ahead predictive control of nonlinear systems, Automatica 36 (2000) 485–495.
[15] A.H. Kramer, A. Sangiovanni-Vincentelli, Efficient parallel learning algorithms for neural networks, in: D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems, Morgan Kaufmann, San Mateo, CA, 1989, pp. 40–48.
[16] H. Leung, S. Haykin, Rational function neural network, Neural Comput. 5 (1993) 928–938.
[17] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Autom. Contr. 22 (1977) 551–575.
[18] K.S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Trans. Neural Networks 1 (1990) 4–27.
[19] J.W. Ponton, The use of multivariable rational functions for non-linear data presentation and classification, Comput. Chem. Eng. 17 (1993) 1047–1052.
[20] T. Proll, M.N. Karim, Model predictive pH control using real time NARX approach, AIChE J. 40 (1994) 269–282.
[21] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back propagating errors, Nature (London) 323 (1986) 533–536.
[22] E.D. Sontag, Realization theory of discrete time nonlinear systems: Part I––the bounded case, IEEE Trans. Circ. Syst. 26 (1978) 342–356.
[23] E.D. Sontag, Polynomial Response Maps––Lecture Notes in Control and Information Sciences, vol. 13, Springer-Verlag, Berlin, 1979.
[24] L.X. Wang, Adaptive Fuzzy Systems and Control, Prentice Hall, Englewood Cliffs, NJ, 1994.
[25] Q.M. Zhu, S.A. Billings, Recursive parameter estimation for nonlinear rational models, J. Sys. Eng. 1 (1991) 63–67.
[26] Q.M. Zhu, S.A. Billings, Parameter estimation for stochastic nonlinear models, Int. J. Contr. 57 (1993) 309–333.
[27] Q.M. Zhu, Z. Ma, K. Warwick, Neural network enhanced generalised minimum variance self-tuning controller for nonlinear discrete time systems, IEE Proc.––Contr. Theo. Appl. 146 (1999) 319–326.