Signal Processing 80 (2000) 1597–1606
Cumulant-based training algorithms of two-layer feedforward neural networks*

Xianhua Dai

Department of Electronic Engineering, Shantou University, Guangdong 515063, People's Republic of China

Received 21 January 1999; received in revised form 21 December 1999
Abstract

It is always difficult to train multi-layer feedforward neural networks (FNN) under the cumulant match criterion, because the cumulants are nonlinear and implicit functions of the FNN parameters. In this work, two new cumulant-based training methods for two-layer FNNs are developed. In the first method, the hidden units of the two-layer FNN are approximated by multiple linear systems, and the total FNN is then modeled with a "mixture of experts" (ME) architecture. With the ME model, the FNN parameters are estimated with the expectation-maximization (EM) algorithm. The second method, in order to simplify the statistical model of the two-layer FNN, proposes a simplified two-level hierarchical ME to remodel the FNN, in which hidden variables are introduced to decompose training of the total FNN into training a set of single neurons. Built on single-neuron training, the total FNN is trained in a simplified fashion with a faster convergence speed. © 2000 Elsevier Science B.V. All rights reserved.
* This work is supported by the National Science Foundation of China (NSFC), Grant 69872021, and the Science Foundation of Guangdong Province.

E-mail address: [email protected] (X. Dai).
Keywords: ME model; EM algorithm; Cumulant-based system identification
1. Introduction

Cumulant- and kurtosis-based training has played an important role in many fields, including blind equalization of digital communication systems, reflection seismology, blind separation of sources and independent component analysis. Most traditional approaches in these application fields rest on the assumption that the real system can be approximated by a linear system or a linear mixture [2,4,9–12,14]. Many systems we encounter, however, are more or less nonlinear and can only be represented by nonlinear models. Feedforward neural networks have been used successfully to approximate nonlinear adaptive systems, since they are known to be universal function approximators. Hence, it is valuable to investigate cumulant-based training of two-layer FNNs.

The traditional approaches to supervised training of multi-layer FNNs are based on a complete training set specified by input and supervised output [13]. In the application areas mentioned above, however, it is impossible or inconvenient to obtain the complete observed sequences needed to train the nonlinear approximation. Only the output measurements and partial statistics of the input signal, such as higher-order cumulants of the input or mutual independence of the inputs, are observable and available for training the nonlinear system. In this case, training the FNN becomes system identification from the output measurements alone [2,4,9–12,14]; this is also called the blind training problem.

In blind training, there are two kinds of training algorithms. In the first kind, the system is trained based only on the output measurements and the higher-order cumulants of the input; this is the so-called cumulant-based algorithm, used for example in blind equalization and blind deconvolution [2,4,9–12,14]. The second kind is based only on the output measurements and the mutual statistical independence of multiple sources or input samples, as in blind source separation and independent component analysis; this is the so-called independence-based training [2]. This work is concerned only with the former.

Blind training of linear systems is in some ways simple, because the higher-order statistics (HOS) of the system output are not overly complicated functions of the system parameters. Blind training of complicated nonlinear systems such as two-layer FNNs, however, is more difficult than cumulant-based training of a linear system: it is in general impossible to estimate the FNN parameters directly from the HOS. This is also a main reason why most recent research is concerned only with cumulant-based identification of linear systems [2,4,9–12,14]. How to fulfill cumulant-based training of two-layer FNNs is the main task of this paper.

The paper is organized in the following manner. In Section 2, the two-layer FNN is approximated by multiple linear systems. In Section 3, this linear approximation of the two-layer FNN is remodeled by an ME model; based on the ME, FNN parameter estimation with the EM algorithm is briefly discussed. Section 4 proposes a simplified two-level hierarchical ME to remodel the two-layer FNN by introducing hidden variables that decompose training of the total FNN into training a set of single neurons; built on single-neuron training, the total FNN is trained in a simplified fashion with a faster convergence speed.
The paper concludes with experimental results and final conclusions in Sections 5 and 6.
2. Two-layer FNN and training set

For modeling a real system, such as the nonlinear channel of a digital communication system, a two-layer feedforward neural network is considered,

$$y(t) = \sum_{i=1}^{I} w(i)\, y_i(t) = \sum_{i=1}^{I} w(i)\, g\!\left[\sum_{m=0}^{M} w(i,m)\, x(t-m) - b_i\right], \tag{1}$$

where the activation function g[z] is chosen as the sigmoidal function, and y(t), x(t) are the output signal and the input signal, respectively. In the training phase, it is assumed that the output signal is observable; the input signal, however, is unknown and unobserved except for partial statistical information, such as x(t) being at least fourth-order white. w(i) and w(i,m) are the parameters of the FNN output layer and hidden layer, and M, I are the input number and the number of hidden units, respectively.

The goal of blind training is to estimate the FNN parameters given noisy measurements d(t) of the FNN output and only partial statistics of the input signal, such as x(t) being at least fourth-order white. A possible solution to blind FNN training is to estimate the FNN parameters based on higher-order cumulants [2,4,9–12,14].
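As a concrete illustration of (1), the following sketch evaluates the network output for a tapped-delay-line input. The weight arrays and the sigmoid choice are placeholders for illustration only; the paper does not fix their values.

```python
import numpy as np

def sigmoid(z):
    # the sigmoidal activation g[z] assumed in (1)
    return 1.0 / (1.0 + np.exp(-z))

def fnn_output(x, w_out, w_hid, b):
    """Two-layer FNN of Eq. (1):
    y(t) = sum_i w(i) * g( sum_m w(i,m) x(t-m) - b_i ).
    x: input sequence (T,); w_out: (I,); w_hid: (I, M+1); b: (I,)."""
    I, M1 = w_hid.shape
    T = len(x)
    y = np.zeros(T)
    for t in range(M1 - 1, T):
        x_tap = x[t - M1 + 1:t + 1][::-1]    # x(t), x(t-1), ..., x(t-M)
        y_hid = sigmoid(w_hid @ x_tap - b)   # hidden-unit outputs y_i(t)
        y[t] = w_out @ y_hid                 # linear output layer
    return y
```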
In this work, the fourth-order cumulant match criterion [2,4,9–12] will be used to train the two-layer FNN,

$$\min_{\Theta}\; \sum_{(\tau_1,\tau_2,\tau_3)\in R} \left[c_{4d}(\tau_1,\tau_2,\tau_3) - c_{4y}(\tau_1,\tau_2,\tau_3)\right]^2, \tag{2}$$

where $c_{4y}(\tau_1,\tau_2,\tau_3)$ and $c_{4d}(\tau_1,\tau_2,\tau_3)$ are the fourth-order cumulants of the network output y(t) and of the noisy measurements d(t), respectively, R represents the nonredundant support region, and the time lags $\tau_1, \tau_2, \tau_3$ take values from R. For short, $(\tau_1,\tau_2,\tau_3)$ will be denoted by $(\tau)$. The cumulants are defined as the coefficients of the Taylor expansion of the cumulant-generating function; the fourth-order cumulants take the form

$$c_{4y}(\tau) = k_{4y}(\tau) - k_{2y}(\tau_1)\,k_{2y}(\tau_3-\tau_2) - k_{2y}(\tau_2)\,k_{2y}(\tau_3-\tau_1) - k_{2y}(\tau_3)\,k_{2y}(\tau_2-\tau_1),$$

where $k_{4y}(\cdot)$ and $k_{2y}(\cdot)$ are the fourth-order and second-order central moments of the signal y(t), respectively.
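For reference, the formula above translates directly into a sample estimator of the fourth-order cumulants. The helper below is a minimal sketch, not from the paper, assuming non-negative integer lags and a stationary record:

```python
import numpy as np

def m2(y, tau):
    # second-order central moment k_2y(tau); symmetric in the lag
    tau = abs(int(tau))
    return np.mean(y[:len(y) - tau] * y[tau:]) if tau else np.mean(y * y)

def cum4(y, t1, t2, t3):
    """Sample fourth-order cumulant c_4y(t1,t2,t3) via the moment-to-cumulant
    formula quoted above; t1, t2, t3 are non-negative integer lags."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                      # work with central moments
    n = len(y) - max(t1, t2, t3)
    m4 = np.mean(y[:n] * y[t1:t1 + n] * y[t2:t2 + n] * y[t3:t3 + n])
    return m4 - m2(y, t1) * m2(y, t3 - t2) \
              - m2(y, t2) * m2(y, t3 - t1) \
              - m2(y, t3) * m2(y, t2 - t1)
```

The criterion (2) then compares `cum4` evaluated on the measurements d(t) against the model-implied cumulants over the lag region R.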
Based on the cumulant match criterion (2), cumulant-based training estimates the FNN parameters given a training set composed of the cumulants of the noisy measurements d(t) and the cumulants of the input x(t).

As formulated in (1), the FNN output is a nonlinear function of the FNN parameters, so the fourth-order cumulants of the FNN output are an implicit and nonlinear function of the FNN parameters. As a consequence, it is in general impossible to estimate the FNN parameters directly from the fourth-order cumulants. For instance, it is very difficult to evaluate the descent gradient of the cost function (2), because the fourth-order cumulants in (2) are an implicit function of the FNN parameters. This is also the main reason why cumulant-based training of two-layer FNNs is difficult and why little recent research is concerned with this subject.

To overcome this nonlinearity, the activation function of each hidden unit is linearized. Since the activation function is continuous, it can be approximated by a linear function in a sufficiently small neighborhood of its input space, and consequently by multiple linear functions over its total input space. It follows that the input–output relation of the ith hidden unit can be approximated by

$$y_i(t) \approx \sum_{k=1}^{K} D_{ik}\, y_{ik}(t), \tag{3}$$

$$y_{ik}(t) = g(y_{ik}) + [y_i(t) - y_{ik}] \left.\frac{\partial g(y_i(t))}{\partial y_i(t)}\right|_{y_i(t)=y_{ik}} = g(y_{ik}) + l(y_{ik})\,[y_i(t) - y_{ik}], \tag{4}$$

$$y_i(t) = \sum_{m=0}^{M} w(i,m)\, x(t-m) - b_i, \tag{5}$$
where $y_{ik}$, $k = 1, \ldots, K$, are a set of fixed points in the input space of the activation function g[z], and

$$D_{ik} = \begin{cases} 1, & y_i(t) \in A_{ik}, \\ 0, & \text{otherwise}. \end{cases}$$

Here $A_{ik}$ denotes a small neighborhood centered at $y_{ik}$ satisfying $\bigcup_{k=1}^{K} A_{ik} = A_i$, where $A_i$ represents the input space of the activation function g[z] of the ith hidden unit. $g(y_{ik})$ and $l(y_{ik})$ are the function value and the first-order derivative of the activation function g[z] at the fixed point $y_{ik}$, respectively. In fact, the approximations (4), (5) are the first-order Taylor expansion of the activation function g[z] at the fixed point $y_{ik}$, and K is the number of linearly approximated systems in (3)–(5).

From (3)–(5), one can see that each hidden unit of the two-layer FNN has been approximated by multiple linear systems over its input space. Each approximation in (4), (5) is in fact a linear finite impulse response (FIR) system. The fourth-order cumulants of a linear FIR system are certainly easier to evaluate than those of the original two-layer FNN, which makes it possible to train the two-layer FNN with a cumulant-based training algorithm for linear FIR systems.
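The following sketch illustrates the linearization (3)–(5) for a sigmoid. The fixed-point grid is an assumption chosen to mirror the K = 15 symmetric layout used later in Section 5; selecting the nearest fixed point realizes the indicator D_ik.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # l(y_ik): first-order derivative at the fixed point

def piecewise_linear_sigmoid(y, y_fixed):
    """First-order Taylor approximation (4) at the nearest fixed point y_k;
    the nearest-point rule plays the role of the indicator D_ik in (3)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    k = np.argmin(np.abs(y[:, None] - y_fixed[None, :]), axis=1)
    yk = y_fixed[k]
    return sigmoid(yk) + sigmoid_slope(yk) * (y - yk)

y_fixed = np.linspace(-4.0, 4.0, 15)  # K = 15 fixed points, symmetric about 0
y = np.linspace(-4.0, 4.0, 201)
err = np.max(np.abs(sigmoid(y) - piecewise_linear_sigmoid(y, y_fixed)))
print(f"max approximation error with K = 15: {err:.4f}")
```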
3. Training FNN based on the EM algorithm

The traditional approach to training a two-layer FNN is based on the back-propagation (BP) algorithm [13]. Training an FNN with the BP algorithm, however, is usually quite slow, due to the nonlinearity and the compact structure, or strong interference (coupling) effect, of the multiple hidden units. Efforts have been made to improve training time, such as adaptive expert networks and EM algorithms [1,5–8]. This work follows this recent trend and investigates FNN training based on adaptive expert networks and the EM algorithm.

3.1. ME model of the two-layer FNN

Analyzing the linear approximation (3)–(5), it can be treated as a combination of multiple linear systems; in other words, these linear systems are mixed to approximate the hidden units.
Each linear system exclusively approximates the original hidden unit, i.e., it solves a function approximation problem over a local region $A_{ik}$. The linear approximations of all hidden units are combined linearly in the output layer to produce the FNN output. According to [1,5–8], these linearly approximated systems form, in practice, an ME architecture. Thus, we can use an ME to remodel the two-layer FNN. In the ME model, each expert system is in fact a linear FIR system, and the two-layer FNN output is

$$y_n(t) = \sum_{i=1}^{I} w(i)\, y_i(t) = \sum_{i=1}^{I} w(i)\, D_{i,i(k)}\, y_{i,i(k)}(t), \tag{6}$$

where i(k) is an index function of the ith hidden unit; it takes values in $\{1, \ldots, K\}$, and i(k) = k implies that the ith hidden unit is approximated by the kth expert system as in (4) and (5). As i(k) takes values in $\{1, \ldots, K\}$, each expert system output (6) can be rewritten as

$$y_n(t) = \sum_{i=1}^{I} w(i)\, l(y_{i,i(k)}) \sum_{m=0}^{M} w(i,m)\, x(t-m) + c_n(t) = \sum_{i=1}^{I} \sum_{m=0}^{M} w(i)\, w(i,m)\, l(y_{i,i(k)})\, x(t-m) + c_n(t) = \sum_{m=0}^{M} a_n(m)\, x(t-m) + c_n(t), \tag{7}$$

where $c_n(t)$ is a constant. As i(k) takes values in $\{1, \ldots, K\}$ for $i = 1, \ldots, I$, the number of expert systems is $N = K^I$, i.e., the index n in (7) takes values from $\{1, \ldots, N\}$. In effect, the two-layer FNN has been approximated by N linear FIR systems of the form (7). Based on the expert system outputs (7), the two-layer FNN can be remodeled by an ME model [1,5–8],
$$p(c_{4d}(\tau), X \mid y(t), \Theta(t)) = \prod_{n=1}^{N} \left[\pi_n\, q\, \exp\!\left(-\lambda\,(c_{4d}(\tau) - c_{4n}(\tau))^2\right)\right]^{\delta_n(t)}, \tag{8}$$
where $\lambda$ denotes the variance parameter of the probability distribution in (8), $\Theta$ denotes all parameters of the total network, and q is a constant chosen such that (8) satisfies the probability distribution constraint. (The motivation for using Gaussian distributions, or a Gaussian mixture model, comes from the fact that they correspond to the commonly used quadratic error function of criterion (2).) Further,

$$\delta_n(t) = \begin{cases} 1, & X = n, \\ 0, & X \neq n, \end{cases}$$

where X is an indicator random variable (hidden variable) taking values in $\{1, \ldots, N\}$ with probability $\pi_n = \mathrm{prob}(X = n)$; X = n implies that the two-layer FNN is approximated by the nth expert system as in (7). The probability constraint gives $\sum_{n=1}^{N} \pi_n = 1$. $c_{4n}(\tau)$ denotes the fourth-order cumulant of the nth expert system output; according to (7), it can be evaluated by [2,4,9–12]

$$c_{4n}(\tau) = \gamma_{4x} \sum_{m=0}^{M} a_n(m)\, a_n(m+\tau_1)\, a_n(m+\tau_2)\, a_n(m+\tau_3), \tag{9}$$

where $\gamma_{4x}$ is the fourth-order cumulant of the input sequence x(t) and can be evaluated from the statistical information of the input sequence. Formula (9) is, in fact, the fourth-order cumulant formula of a linear FIR system. Compared with the fourth-order cumulants evaluated directly from the FNN output, the fourth-order cumulants evaluated by (9) are certainly simpler functions of the FNN parameters, and from (9) it is also very easy to evaluate the descent gradient for parameter estimation. Hence, the complicated and implicit function problem in cumulant-based identification of nonlinear systems is solved.
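The closed form (9) is straightforward to implement; the sketch below is a minimal rendering for a single lag triple, assuming non-negative lags and a coefficient vector a = (a_n(0), ..., a_n(M)).

```python
import numpy as np

def fir_cum4(a, gamma4x, t1, t2, t3):
    """Fourth-order output cumulant of a linear FIR expert, Eq. (9):
    c_4n(t1,t2,t3) = gamma_4x * sum_m a(m) a(m+t1) a(m+t2) a(m+t3),
    for an input that is i.i.d. (at least fourth-order white)."""
    a = np.asarray(a, dtype=float)
    M1 = len(a)
    ap = np.concatenate([a, np.zeros(max(t1, t2, t3))])  # zero outside support
    return gamma4x * np.sum(ap[:M1] * ap[t1:t1 + M1]
                            * ap[t2:t2 + M1] * ap[t3:t3 + M1])
```

Because (9) is a polynomial in the taps a_n(m), its gradient with respect to each tap is available in closed form, which is what makes the M-step discussed next tractable.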
3.2. Training through the EM algorithm

Since each expert is in fact a linear FIR system of the form (7), the parameters $a_n(m)$, $n = 1, \ldots, N$, $m = 0, \ldots, M$, can be estimated through the combination of the expectation–maximization (EM) algorithm and cumulant-based system identification of a linear FIR system [2,4,9–12]. That is, the hidden variables, i.e., the probabilities of X = n, are estimated in the E-step of the EM algorithm [1,5–8], and then the parameters $a_n(m)$ are estimated in a straightforward manner in the M-step. Here the
M-step is equivalent to cumulant-based system identification of a linear FIR system [2,4,9–12]. Because of space limitations, the details are not discussed.

After the parameters $a_n(m)$, $n = 1, \ldots, N$, have been estimated, the connection weights of the two-layer FNN can be obtained from the relation between the parameters $a_n(m)$ and the FNN weights in (7), namely

$$a_n(m) = \sum_{i=1}^{I} w(i)\, w(i,m)\, l(y_{i,i(k)}), \quad n = 1, \ldots, N,$$

where $l(y_{i,i(k)})$ is known from (4), (5). In the above equations the number of unknown parameters is $I \times (M+1) + I$. In general, $N = K^I \gg I \times (M+1) + I$, so the parameter equations are over-determined, and the unknown parameters can be estimated with a least-squares algorithm (a sketch is given after this subsection).

Due to the linear approximation and the ME model, parameter estimation from (8) is rather different from the conventional training algorithm [13]. The new algorithm uses the linear approximation of the activation function around multiple points $y_{ik}$, $k = 1, \ldots, K$, to estimate the FNN parameters in a probabilistic sense. In fact, the new algorithm based on (8) is a competitive learning scheme [1,5–8] over multiple linear systems. Only when X = n holds with probability one does the new algorithm based on (8) reduce to the conventional training algorithm [13]. Hence, it can be conjectured that the new algorithm improves considerably on the conventional algorithm in convergence speed and in escaping multiple local solutions. Because of space limitations, the details are not discussed.
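The weight-recovery step admits a compact linear-algebra rendering. The sketch below, a hypothetical helper, solves the over-determined system for the products u(i,m) = w(i)·w(i,m) by least squares; splitting each product back into w(i) and w(i,m) is then a separate, scale-ambiguous step not shown here.

```python
import numpy as np
from itertools import product as iproduct

def recover_products(a, slopes, I, K):
    """Least-squares solution of a_n(m) = sum_i l(y_{i,k_n(i)}) * u(i,m).

    a      : estimated expert coefficients, shape (K**I, M+1)
    slopes : l(y_ik) at the fixed points, shape (I, K)
    returns u(i, m), shape (I, M+1)
    """
    combos = list(iproduct(range(K), repeat=I))  # expert n <-> (k_n(1),...,k_n(I))
    L = np.array([[slopes[i, kn[i]] for i in range(I)] for kn in combos])
    # L has shape (K**I, I); one least-squares solve covers all taps m at once
    U, *_ = np.linalg.lstsq(L, a, rcond=None)
    return U
```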
4. Simplified two-level hierarchical ME algorithm

From the above discussion, one can see that the computational complexity of the training algorithm is of order $K^I$. When both the number of hidden units and the number of linear approximations of each hidden unit are chosen as sufficiently
large integers, the computational complexity becomes very high. To simplify the above training algorithm, the two-layer FNN can be further approximated by a simplified two-level hierarchical ME (HME) model [6,7]: each hidden unit is viewed as a simple ME model at the first level through the linear approximations (3)–(5), and each hidden unit is in turn taken as an expert to form the second-level ME model. The simplified two-level HME is, however, somewhat different from the standard two-level HME [6,7]: the second level of the simplified HME is a deterministic, linear combination instead of a probabilistic combination. As discussed below, the simplified two-level HME implies that all hidden unit outputs are mutually at least fourth-order white. Hence, the simplified two-level HME is only an approximation of the real two-layer FNN; but, as shown in the experiments, the approximation approaches the real two-layer FNN when the number of hidden units is sufficiently large, so the assumption of mutually fourth-order white hidden unit outputs is reasonable and valid.

4.1. The first level of the two-level HME

In the simplified two-level HME, the gating network of the second level is deterministic; only the gating network, i.e., the probabilities, of the first level needs to be estimated. In the first level of the two-level HME, each ME corresponds to a single neuron and contains K experts as in (3)–(5). For modeling the first level of the HME, the desired cumulants of all hidden unit outputs are introduced as hidden variables, denoted by $C(\tau) = [c_{41}(\tau), \ldots, c_{4I}(\tau)]^T$. With these hidden variables, each hidden unit can be modeled by an ME architecture similar to (8),

$$p(c_{4i}(\tau), H_i \mid x(t), \Theta(t)) = \prod_{k=1}^{K} \left[\pi_{ik}\, q_i\, \exp\!\left(-\lambda_i\,(c_{4i}(\tau) - c_{4ik}(\tau))^2\right)\right]^{\delta_{ik}(\tau)}, \tag{10}$$

where $\lambda_i$ denotes the variance parameter of the probability distribution in (10), $q_i$ is a constant chosen such that (10) satisfies the probability distribution constraint, and

$$\delta_{ik}(\tau) = \begin{cases} 1, & H_i = k, \\ 0, & H_i \neq k. \end{cases}$$

$H_i$ is an indicator random variable (hidden variable) taking values on the discrete set $\{1, \ldots, K\}$ with probability $\pi_{ik} = \mathrm{prob}(H_i = k)$; $H_i = k$ implies that the ith hidden unit is approximated by the kth expert system as in (4), (5). $c_{4ik}(\tau)$ represents the fourth-order cumulant of the kth expert system output of the ith hidden unit; it can be evaluated by [2,4,9–12]

$$c_{4ik}(\tau) = \gamma_{4x}\, [l(y_{ik})]^4 \sum_{m=0}^{M} w(i,m)\, w(i,m+\tau_1)\, w(i,m+\tau_2)\, w(i,m+\tau_3). \tag{11}$$

Due to the hidden variables, training the two-layer FNN has been decomposed into training a set of single neurons plus the linear output layer. Obviously, training a single neuron is simpler and more effective than training the total FNN. Given the hidden variables and the ME model (10), training a single neuron can be fulfilled with the same procedure as the algorithm proposed in Section 3.2, namely, the parameters of each single neuron are estimated with the EM algorithm [2,4,9–12]. The hidden variables $C(\tau)$, however, are unknown and unobserved; they must be estimated prior to FNN parameter estimation. The following gives a solution for estimating the hidden variables.

4.2. Estimation of the hidden variables $C(\tau)$

The hidden variables $C(\tau)$ are, in fact, intermediate variables of the two-layer FNN. They should be estimated optimally, such that the input–output behavior of the total two-layer FNN is in good agreement with the given observations of the total FNN. In the parameter estimation of each hidden unit, the observed variables are $c_{4d}(\tau)$ and $\gamma_{4x}$, which specify the input–output relation of the total two-layer FNN. Hence, hidden variable estimation amounts to finding the most likely fourth-order cumulants of the hidden unit outputs such that the total network behavior is in the best agreement with the given relation between $\gamma_{4x}$ and $c_{4d}(\tau)$.
In order to find the most likely fourth-order cumulants, one needs to form the conditional probability model $p(C(\tau) \mid c_{4d}(\tau), \gamma_{4x}, \Theta)$. First, the joint conditional density is considered,

$$p(c_{4d}(\tau), C(\tau) \mid \gamma_{4x}, \Theta(t)) = p(c_{4d}(\tau) \mid C(\tau), \gamma_{4x}, \Theta(t))\; p(C(\tau) \mid \gamma_{4x}, \Theta(t)), \tag{12}$$

where $p(c_{4d}(\tau) \mid C(\tau), \gamma_{4x}, \Theta(t))$ takes the form

$$p(c_{4d}(\tau) \mid C(\tau), \gamma_{4x}, \Theta(t)) = p(c_{4d}(\tau) \mid C(\tau), \Theta(t)) = B_1 \exp(-E_1(\tau)), \tag{13}$$

where $E_1(\tau) = \rho\, [c_{4d}(\tau) - \bar{W}^T(t)\, C(\tau)]^2$, $\rho$ represents the variance parameter, and $\bar{W}(t) = [w^4(1), \ldots, w^4(I)]^T$. $B_1$ is a constant chosen such that (13) satisfies the probability distribution constraint. (The probability model (13) is based on the assumption that all hidden unit outputs are mutually at least fourth-order white; the Gaussian form follows from the same considerations as (8).)

The conditional probability density of the hidden variables $C(\tau)$ given only the input statistics $\gamma_{4x}$ is chosen as

$$p(C(\tau) \mid \gamma_{4x}, \Theta(t)) = B_2 \exp\!\left(-[C(\tau) - \bar{C}(\tau)]^T\, \Sigma^{-1}\, [C(\tau) - \bar{C}(\tau)]\right), \tag{14}$$

where $B_2$ follows from the same consideration as $B_1$ in (13), and $\Sigma$ is a covariance matrix. $\bar{C}(\tau) = [\bar{c}_{41}(\tau), \ldots, \bar{c}_{4I}(\tau)]^T$, where $\bar{c}_{4i}(\tau)$ represents the fourth-order cumulant of the ith hidden unit output and can be evaluated by

$$\bar{c}_{4i}(\tau) = \sum_{k=1}^{K} p_{ik}(t)\, c_{4ik}(\tau), \tag{15}$$

where $p_{ik}(t)$ denotes the probability of $H_i = k$, i.e., the probability of the ith hidden unit being approximated by the kth expert system, which has been estimated in the previous E-step of the EM algorithm.
The fourth-order cumulants $c_{4ik}(\tau)$ take the form (11). In addition to the conditional densities (13), (14), the conditional density of the fourth-order cumulants $c_{4d}(\tau)$ given only $\gamma_{4x}$ is chosen as

$$p(c_{4d}(\tau) \mid \gamma_{4x}, \Theta(t)) = B_3 \exp\!\left(-\rho\,[c_{4d}(\tau) - \bar{c}_{4y}(\tau)]^2\right). \tag{16}$$
Substituting (13), (14) into (12), we have

$$p(C(\tau) \mid c_{4d}(\tau), \gamma_{4x}, \Theta(t)) = \frac{p(c_{4d}(\tau), C(\tau) \mid \gamma_{4x}, \Theta(t))}{p(c_{4d}(\tau) \mid \gamma_{4x}, \Theta(t))} = B_4 \exp(-E_C(\tau)), \tag{17}$$
where both $B_4$ in (17) and $B_3$ in (16) are constants chosen to satisfy the probability constraint, and $E_C(\tau) = [C(\tau) - \hat{C}(\tau)]^T\, \Sigma_C^{-1}\, [C(\tau) - \hat{C}(\tau)]$. The vector $\hat{C}(\tau) = [\hat{c}_{41}(\tau), \ldots, \hat{c}_{4I}(\tau)]^T$ is in practice the statistical expectation of the hidden variables $C(\tau)$, and can be evaluated by

$$\hat{C}(\tau) = \bar{C}(\tau) + \left[\Sigma^{-1} + \rho\, \bar{W}(t)\, \bar{W}^T(t)\right]^{-1} \rho\, \bar{W}(t)\, \left[c_{4d}(\tau) - \bar{W}^T(t)\, \bar{C}(\tau)\right]. \tag{18}$$

According to the conditional probability density (17), the estimate $\hat{C}(\tau)$ is in effect the conditional expectation of the hidden variables $C(\tau)$ given the observations $c_{4d}(\tau)$, $\gamma_{4x}$, $\Theta(t)$. Hence, $\hat{C}(\tau)$ is an optimal estimate of the hidden variables $C(\tau)$ such that the behavior of the total two-layer FNN is in the best agreement with the given relation between the input statistics $\gamma_{4x}$ and the desired output $c_{4d}(\tau)$. In (18), the variance parameter $\rho$ and the covariance matrix $\Sigma$ can be estimated by time averaging over a P-length window.
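Equation (18) is the standard conditional mean obtained by Gaussian conditioning of the prior (14) on the linear observation model (13). A minimal sketch, with all inputs assumed already estimated:

```python
import numpy as np

def posterior_cumulants(c_bar, Sigma, w_bar, rho, c4d):
    """Conditional-mean estimate (18) of the hidden cumulants C(tau):
    C_hat = C_bar + (Sigma^-1 + rho w w^T)^-1 rho w (c4d - w^T C_bar).
    c_bar: prior mean C_bar(tau), shape (I,); Sigma: prior covariance (I, I);
    w_bar: fourth-power weight vector W_bar(t), shape (I,); rho: scalar."""
    w = np.asarray(w_bar, dtype=float).reshape(-1, 1)
    A = np.linalg.inv(Sigma) + rho * (w @ w.T)
    gain = np.linalg.solve(A, rho * w).ravel()
    return c_bar + gain * (c4d - w.ravel() @ c_bar)
```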
As an example of the latter, $\rho$ can be set to the inverse of the time-averaged squared residual,

$$\hat{\rho}^{-1}(t) = \frac{1}{P} \sum_{t'=t-P+1}^{t} \left[c_{4d}(\tau) - \bar{W}^T(t')\, \bar{C}(\tau)\right]^2,$$

where P is a large integer.

From the above discussion, the probability model (13) in fact specifies the input–output relation of the linear output layer of the two-layer FNN; thus, the parameters of the FNN output layer can easily be estimated according to (13). It is worth mentioning that the above training algorithm is an iterative process alternating between the parameter estimation of Section 4.1 and the hidden variable estimation of Section 4.2. The iterative process is briefly expressed as follows. Given the initial guess $\Theta(0)$, $t = 0$, repeat the following:

Step 1. Based on $\Theta(t)$, estimate the hidden variables $\hat{C}(\tau)$ by (18).

Step 2. With the hidden variable estimates $\hat{C}(\tau)$, estimate the single-neuron parameters using the probability model (10) and the FNN output-layer parameters using the probability model (13); this yields $\Theta(t+1)$.
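Putting the pieces together, the alternation reads as the skeleton below. The three helpers are hypothetical stand-ins for the estimators of Sections 4.1 and 4.2 (their bodies are left open), so the skeleton fixes only the control flow, not the implementations.

```python
def estimate_hidden_cumulants(theta, c4d_obs):
    """Step 1: evaluate C_hat(tau) from (18) under the current parameters."""
    ...

def train_single_neurons(theta, c_hat):
    """Step 2a: EM update of each neuron's weights from the ME model (10)."""
    ...

def train_output_layer(theta, c_hat, c4d_obs):
    """Step 2b: update the linear output layer from the model (13)."""
    ...

def train_fnn_hme(c4d_obs, theta0, n_iter=50):
    """Alternating scheme of Section 4 (hypothetical driver)."""
    theta = theta0
    for _ in range(n_iter):
        c_hat = estimate_hidden_cumulants(theta, c4d_obs)   # Step 1
        theta = train_single_neurons(theta, c_hat)          # Step 2a
        theta = train_output_layer(theta, c_hat, c4d_obs)   # Step 2b
    return theta
```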
5. Simulation

A blind equalization experiment for digital communication has been carried out to demonstrate the performance of the above algorithms. In the experiment, the channel output measurement is assumed to be

$$d(t) = G\!\left[\sum_{m} b_m(t)\, x(t-m) - 1.5\right] + n(t), \tag{19}$$

where $b_m(t)$ are the channel parameters and $G(s) = 2s/(1+s^2)$ is an amplitude distortion operator. The additive noise n(t) is chosen as Gaussian noise. The excitation, or transmitted, sequence x(t) is unknown and chosen as an independent, identically distributed (i.i.d.) random variable taking values in $\{0, 1, 2, 3\}$ with equal probability, which is in fact the four-level pulse amplitude modulation (four-PAM) signal of digital communication. The nonlinear model (19) can be found in satellite communications, magnetic recording channels and physiological processes.
Fig. 1. The transmitted sequence and its recovered signal: (A) the transmitted sequence; (B) the channel output; (C) the estimate of the transmitted sequence produced by the inverse operation of the two-layer FNN; (D) the recovered sequence from (C).
Although the transmitted sequence x(t) is unknown, its cumulant $\gamma_{4x}$ can easily be evaluated from the facts that x(t) is an i.i.d. random variable taking values in $\{0, 1, 2, 3\}$ with equal probability.

In the experiment, a two-layer FNN of the form (1) is used to approximate the nonlinear channel (19). The input number of the two-layer FNN is chosen as M = 4. The number of linear approximations (3)–(5) is chosen as K = 15, with all fixed points symmetrically distributed about zero. The interval between two neighboring fixed points is chosen as $|y_{ik} - y_{i,k-1}| = u_{\max}/14$, where $u_{\max}$ represents the maximum amplitude of the input of the activation function, i.e., $u_{\max} = \max_t |\sum_{m} w(i,m)\, x(t-m) - b_i|$, which can be either estimated adaptively or guessed a priori during the two-layer FNN learning.

The experimental results shown in Fig. 1 are the recovered signal of the transmitted sequence. In Fig. 1, curve (C) is produced by $\hat{G}^{-1}[d(t)]$, where $\hat{G}^{-1}[z]$ denotes the inverse operation of the two-layer FNN and d(t) is the noisy measurement of the channel output in (19). Curve (D) represents the sequence recovered from curve (C) according to

$$\hat{x}(t) = \arg\min_{x(t) \in \{0,1,2,3\}} \left|\hat{G}^{-1}[d(t)] - x(t)\right|,$$

where $\hat{x}(t)$ represents the recovered sequence in curve (D).
Figs. 2 and 3 show the learning curves of the training algorithms, obtained by averaging the squared output error of the two-layer FNN over 40 independent experiments. From the experimental results, one can see that the new algorithms are successful in training the two-layer FNN, and that the new algorithm based on adaptive expert networks and the EM algorithm is at least 10–15 times faster than the BP algorithm. The experimental results also verify that the assumption of mutually independent hidden units in the simplified two-level hierarchical ME is valid when the FNN has a sufficiently large number of hidden units.

Fig. 2. The learning curves of cumulant-based FNN training. From the experimental results, one can see that the assumption of all hidden units being independent in the simplified two-level hierarchical ME is valid when the FNN has a sufficiently large number of hidden units.

Fig. 3. The learning curves of the new algorithm based on the cumulant match criterion and of the BP algorithm; the BP algorithm is run with known input and supervised output signals.

In order to further verify the effectiveness of the new training method, it has also been used to train other nonlinear systems. The nonlinear system output is given by

$$d(t) = \sum_{m} b_m(t)\, \bar{d}(t-m) + n(t), \tag{20}$$

$$\bar{d}(t) = N[u(t)] = [u(t) + |u(t)|]/2, \tag{21}$$

$$u(t) = H(z)[x(t)] = x(t) - 1.4\,x(t-1) + 0.65\,x(t-2). \tag{22}$$

Obviously, the nonlinear system (20)–(22) is a sandwich system [3]; the nonlinear operator $N[\cdot]$ is a half-square operator. Following the same procedure as for (19), a two-layer FNN of the form (1) is used to approximate the nonlinear system (20)–(22). The input number of the two-layer FNN is chosen as M = 10. The number of linear approximations (3)–(5) is chosen as K = 15, with all fixed points symmetrically distributed about zero.
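The sandwich system (20)–(22) can be simulated along the same lines; again, the output-stage taps b_m and the noise level are illustrative assumptions not given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 5000
x = rng.integers(0, 4, size=T).astype(float)   # four-PAM input, as before
u = np.convolve(x, [1.0, -1.4, 0.65])[:T]      # u(t) = H(z)[x(t)], Eq. (22)
d_mid = 0.5 * (u + np.abs(u))                  # half-square operator N, Eq. (21)
b = np.array([1.0, 0.4, 0.1])                  # illustrative taps b_m
d = np.convolve(d_mid, b)[:T] + 0.05 * rng.standard_normal(T)  # Eq. (20)
```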
Fig. 4. The learning curves of cumulant-based FNN training, corresponding to the sandwich system (20)–(22).

Fig. 4 shows the learning curves of the training algorithms. From the experimental results in Fig. 4, one can see that the two-layer FNN is effective in modeling a sandwich system, and the new training algorithm is also successful in training the two-layer FNN.
6. Conclusion and remarks

Two cumulant-based training algorithms for two-layer FNNs have been developed. Through linear approximation of the hidden units, the two-layer FNN has been modeled with an ME or a simplified two-level HME. Hence, cumulant-based FNN training can be fulfilled with the EM algorithm, with faster convergence than the BP algorithm. In particular, the M-step of the EM algorithm is a cumulant-based training algorithm for a linear FIR system; thus, the cumulant-based training algorithms for linear systems can be used directly to solve cumulant-based training of two-layer FNNs. The new cumulant-based training algorithms proposed in this paper can also be extended to train other nonlinear systems, such as block-based nonlinear systems, sandwich systems, Wiener systems and Hammerstein systems.
References

[1] S.I. Amari, Information geometry of the EM and em algorithms for neural networks, Neural Networks 8 (9) (1995) 1379–1408.
[2] J.A. Cadzow, Blind deconvolution via cumulant extrema, IEEE Signal Process. Mag. (May 1996) 24–42.
[3] H.-W. Chen, Modeling and identification of parallel nonlinear systems: structural classification and parameter estimation methods, Proc. IEEE 81 (1) (1995) 37–65.
[4] D. Hatzinakos, C.L. Nikias, Blind equalization using a tricepstrum-based algorithm, IEEE Trans. Commun. 39 (May 1991) 669–681.
[5] R.A. Jacobs, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87.
[6] M.I. Jordan, Hierarchical mixtures of experts and the EM algorithm, Neural Comput. 6 (1994) 181–224.
[7] M.I. Jordan, L. Xu, Convergence results for the EM approach to mixtures of experts architectures, Neural Networks 8 (9) (1995) 1409–1431.
[8] S. Ma, J. Farmer, An efficient EM-based training algorithm for feedforward neural networks, Neural Networks 10 (2) (1997) 243–256.
[9] J.M. Mendel, Tutorial on higher-order statistics (spectra) in signal processing and system theory: theoretical results and some applications, Proc. IEEE 79 (March 1991) 278–305.
[10] B. Porat, B. Friedlander, Blind equalization of digital communication channels using higher-order moments, IEEE Trans. Acoust. Speech Signal Process. 39 (February 1991) 522–526.
[11] J.K. Tugnait, Identification of linear stochastic systems via second- and fourth-order cumulant matching, IEEE Trans. Inform. Theory 33 (May 1987).
[12] J.K. Tugnait, Blind equalization and channel estimation with partial response input signals, IEEE Trans. Commun. 45 (9) (1997) 1025–1031.
[13] B. Widrow, 30 years of adaptive neural networks: perceptron, madaline, and backpropagation, Proc. IEEE 78 (September 1990) 1415–1442.
[14] V. Zivojnovic, Minimum Fisher information of moment-constrained distributions with application to robust blind identification, Signal Processing 65 (1998) 297–313.