Estimating the number of components in a mixture of multilayer perceptrons

M. Olteanu, J. Rynkiewicz

SAMOS-MATISSE-CES, Université Paris 1, UMR 8174, 90 Rue de Tolbiac, 75013 Paris, France

Available online 26 January 2008

Abstract

The Bayesian information criterion (BIC) is widely used by the neural-network community for model selection tasks, although its convergence properties are not always theoretically established. In this paper we focus on estimating the number of components in a mixture of multilayer perceptrons and on proving the convergence of the BIC criterion in this framework. The penalized marginal-likelihood approach for mixture models and hidden Markov models introduced by Keribin [Consistent estimation of the order of mixture models, Sankhya Indian J. Stat. 62 (2000) 49–66] and, respectively, Gassiat [Likelihood ratio inequalities with applications to various mixtures, Ann. Inst. Henri Poincaré 38 (2002) 897–906] is extended to mixtures of multilayer perceptrons, for which a penalized-likelihood criterion is proposed. We prove its convergence under hypotheses which essentially involve the bracketing entropy of the generalized score-function class, and illustrate it by some numerical examples.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Penalized likelihood; Bayesian information criterion (BIC); Mixture models; Multilayer perceptrons

1. Introduction

Although linear models have been the standard tool for time series analysis for a long time, their limitations have been underlined during the past 20 years. Real data often exhibit characteristics that are not taken into account by linear models. Financial series, for instance, alternate strong and weak volatility periods, while economic series are often related to the business cycle and switch from recession to normal periods. Several solutions such as heteroscedastic ARCH and GARCH models, threshold models, multilayer perceptrons or autoregressive switching Markov models were proposed to overcome these problems. In this paper, we consider models which allow the series to switch between regimes and, more particularly, we study the case of mixtures of multilayer perceptrons. In this framework, rather than using a single global model, we estimate several local models from the data. For the moment, we assume that switches between different models occur


independently, the next step of this approach being also to learn how to split the input space and to consider the more general case of gated experts or mixtures-of-experts models [5]. The problem we address here is how to select the number of components in a mixture of multilayer perceptrons. This is typically a problem of non-identifiability, which leads to a degenerate Fisher information matrix, so that the classical chi-square theory on the convergence of the likelihood ratio fails to apply. One possible way to address this problem is to consider penalized criteria. The consistency of the Bayesian information criterion (BIC) was recently proven for non-identifiable models such as mixtures of densities or hidden Markov models [6,4]. We extend these results to mixtures of nonlinear autoregressive models and prove the consistency of a penalized estimate for the number of components under suitable regularity conditions. The rest of the paper is organized as follows: in Section 2 we give the definition of the general model and state sufficient conditions for regularity. Then, we introduce the penalized-likelihood estimate for the number of components and state the consistency result. Section 3 is concerned with applying the main result to mixtures of


multilayer perceptrons. Some open questions, as well as some possible extensions are discussed in the conclusion.

2. Penalized-likelihood estimate for the number of components in a mixture of nonlinear autoregressive models

This section is devoted to the setting of the general theoretical frame: model, definition and consistency of the penalized-likelihood estimate for the number of components.

2.1. The model—definition and regularity conditions

Throughout the paper, we shall consider that the number of lags is known and, for ease of writing, we shall set the number of lags equal to one, the extension to l time-lags being immediate. Let us consider the real-valued time series Y_t which verifies the following model:

Y_t = F^0_{θ_{X_t}}(Y_{t-1}) + ε_{θ_{X_t}}(t),   (1)

where

- X_t is an iid sequence of random variables valued in a finite space {1, ..., p_0} and with probability distribution π^0;
- for every i ∈ {1, ..., p_0}, F^0_{θ_i}(y) ∈ F and
  F = {F_θ : θ ∈ Θ, Θ ⊂ R^l compact set}
  is the family of possible regression functions. We suppose throughout the rest of the paper that the F^0_{θ_i} are sublinear, that is, they are continuous and there exist (a^0_i, b^0_i) ∈ R^2_+ such that |F^0_{θ_i}(y)| ≤ a^0_i |y| + b^0_i for all y ∈ R;
- for every i ∈ {1, ..., p_0}, (ε_{θ_i}(t))_t is an iid noise such that ε_{θ_i}(t) is independent of (Y_{t-k})_{k≥1}. Moreover, ε_{θ_i}(t) has a centered Gaussian density f^0_{θ_i}.

The sublinearity condition on the regression functions is quite general and the consistency for the number of components holds for various classes of processes, such as mixtures of densities, mixtures of linear autoregressive functions or mixtures of multilayer perceptrons. Let us also remark that, besides its necessity in the proof of the theoretical result, the compactness hypothesis for the parameter space is also useful in practice. Indeed, one needs to bound the parameter space in order to avoid numerical problems in the multilayer perceptrons such as hidden-unit saturation. In our case, 10^6 seems to be an acceptable bound for the computations. On the other hand, mixture probabilities are naturally bounded.

The next example of a linear mixture illustrates the model introduced by (1). The hidden process X_t is a sequence of iid variables with Bernoulli(0.5) distribution. We define Y_t as follows, using X_t and a standard Gaussian noise ε_t:

Y_t = 0.5 Y_{t-1} + ε_t   if X_t = 1,
Y_t = −0.5 Y_{t-1} + ε_t   if X_t = 0.
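For concreteness, the following minimal simulation sketch of this two-regime example is ours and not part of the original text; it assumes numpy and takes the autoregressive coefficients 0.5 and −0.5 from the display above.

```python
import numpy as np

def simulate_mixture_ar1(n, p1=0.5, coeffs=(0.5, -0.5), seed=0):
    """Sketch of the linear example: Y_t = 0.5*Y_{t-1} + e_t if X_t = 1,
    Y_t = -0.5*Y_{t-1} + e_t if X_t = 0, with X_t iid Bernoulli(p1)."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p1, size=n)              # hidden regime X_t
    y = np.zeros(n)
    for t in range(1, n):
        coef = coeffs[0] if x[t] == 1 else coeffs[1]
        y[t] = coef * y[t - 1] + rng.standard_normal()
    return y, x

y, x = simulate_mixture_ar1(1000)                # one sample path of length 1000
```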

The penalized-likelihood estimate which we introduce in the next subsection converges in probability to the true number of components of the model under some regularity conditions on the process Y_t. More precisely, we need the following hypothesis, which implies, according to [15], strict stationarity and geometric ergodicity for Y_t:

(HS)   ∑_{i=1}^{p_0} π^0_i |a^0_i|^s < 1.

Let us remark that hypothesis (HS) does not request every component to be stationary and that it allows nonstationary "regimes" as long as they do not appear too often. Since multilayer perceptrons are bounded functions, this hypothesis will be naturally fulfilled.

2.2. Construction of the penalized-likelihood criterion

Let us consider an observed sample {y_1, ..., y_n} of the time series Y_t. Then, for every observation y_t, the conditional density with respect to the previous y_{t-1} and marginally in X_t is

g^0(y_t | y_{t-1}) = ∑_{i=1}^{p_0} π^0_i f^0_{θ_i}(y_t − F^0_{θ_i}(y_{t-1})).

As the goal is to estimate p_0, the number of regimes of the model, let us consider all possible conditional densities up to a maximal number of regimes P, a fixed positive integer. We shall consider the class of functions

G_P = ∪_{p=1}^P G_p,   G_p = { g | g(y_1, y_2) = ∑_{i=1}^p π_i f_{θ_i}(y_2 − F_{θ_i}(y_1)) },

where π_i ≥ η > 0, ∑_{i=1}^p π_i = 1, F_{θ_i}(y) ∈ F and f_{θ_i} is a centered Gaussian density. For every g ∈ G_P we define the number of regimes as p(g) = min{p ∈ {1, ..., P} : g ∈ G_p} and let p_0 = p(g^0) be the true number of regimes. We can now define the estimate p̂ as the argument p ∈ {1, ..., P} maximizing the penalized criterion

T_n(p) = sup_{g ∈ G_p} l_n(g) − a_n(p),   (2)

where

l_n(g) = ∑_{t=2}^n ln g(y_{t-1}, y_t)

is the log-likelihood marginal in X_t and a_n(p) is a penalty term.
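To illustrate how the criterion is used, here is a small selection sketch of ours (not from the paper): it assumes a user-supplied routine fit_mixture(y, p) returning the maximized log-likelihood of a p-component mixture, and uses the BIC-type penalty a_n(p) = ½ p log n proposed later in Section 3 as an example choice.

```python
import numpy as np

def penalized_criterion(loglik, p, n):
    """T_n(p) = sup_g l_n(g) - a_n(p), here with a BIC-type penalty a_n(p) = 0.5*p*log(n)."""
    return loglik - 0.5 * p * np.log(n)

def select_number_of_components(y, P, fit_mixture):
    """Return the p in {1, ..., P} maximizing the penalized criterion T_n(p).

    fit_mixture(y, p) is a hypothetical estimation routine (e.g. EM) returning
    the maximized log-likelihood sup_{g in G_p} l_n(g) on the sample y.
    """
    n = len(y)
    scores = {p: penalized_criterion(fit_mixture(y, p), p, n) for p in range(1, P + 1)}
    return max(scores, key=scores.get)
```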


2.3. Convergence of the penalized-likelihood estimate

Several statistical and probabilistic notions such as mixing processes, bracketing entropy or Donsker classes will be used hereafter. For the sake of parsimony we shall not recall them, but the reader may refer to [2,12] for complete monographs on the subject. The consistency of p̂ is given by the next result, which is an extension of [4]:

Theorem 1. Consider the model (Y_k, X_k) defined by (1) and the penalized-likelihood criterion introduced in (2). Let us introduce the next assumptions:

(A1) a_n(·) is an increasing function of p, a_n(p_1) − a_n(p_2) → ∞ when n → ∞ for every p_1 > p_2, and a_n(p)/n → 0 when n → ∞ for every p;

(A2) the model (Y_k, X_k) verifies the weak identifiability assumption

(HI)   ∑_{i=1}^p π_i f_i(y_2 − F_i(y_1)) = ∑_{i=1}^{p_0} π^0_i f^0_i(y_2 − F^0_i(y_1))  ⟺  ∑_{i=1}^p π_i δ_{θ_i} = ∑_{i=1}^{p_0} π^0_i δ_{θ^0_i};

(A3) the parameterization θ_i → f_i(y_2 − F_i(y_1)) is continuous for every (y_1, y_2) and there exists m(y_1, y_2), an integrable map with respect to the stationary measure of (Y_k, Y_{k-1}), such that |log(g)| < m;

(A4) Y_k is strictly stationary and geometrically β-mixing, and the family of generalized score functions associated to G_P,

S = { s_g : s_g(y_1, y_2) = (g(y_1, y_2)/f(y_1, y_2) − 1) / ‖g/f − 1‖_{L²(μ)},  g ∈ G_P,  ‖g/f − 1‖_{L²(μ)} ≠ 0 },

where μ is the stationary measure of (Y_k, Y_{k-1}), verifies for every ε > 0

H_[](ε, S, ‖·‖_2) = O(|log ε|),

H_[](ε, S, ‖·‖_2) being the bracketing entropy of S with respect to the L²-norm.

Then, under hypotheses (A1)–(A4), (HS) and (HC), p̂ → p_0 in probability.

Proof of Theorem 1. First, let us show that p̂ does not overestimate p_0. We shall need the following likelihood ratio inequality, which is an immediate generalization of [4] to multivariate dependent data. Let G ⊂ G_P be a parametric family of conditional densities containing the true model g^0 and consider the generalized score functions

s_g(y_1, y_2) = (g(y_1, y_2)/g^0(y_1, y_2) − 1) / ‖g/g^0 − 1‖_{L²(μ)},

where μ is the stationary measure of (Y_{k-1}, Y_k). Then,

sup_{g ∈ G} (l_n(g) − l_n(g^0)) ≤ (1/2) sup_{g ∈ G} (∑_{k=2}^n s_g(y_{k-1}, y_k))² / ∑_{k=2}^n (s_g)_−²(y_{k-1}, y_k),

with (s_g)_−(y_{k-1}, y_k) = min(0, s_g(y_{k-1}, y_k)). Then we have

P(p̂ > p_0) ≤ ∑_{p=p_0+1}^P P(T_n(p) > T_n(p_0))
≤ ∑_{p=p_0+1}^P P( sup_{g ∈ G_p} l_n(g) − a_n(p) > sup_{g ∈ G_{p_0}} l_n(g) − a_n(p_0) )
≤ ∑_{p=p_0+1}^P P( (1/2) sup_{g ∈ G_p} (∑_{k=2}^n s_g(Y_{k-1}, Y_k))² / ∑_{k=2}^n (s_g)_−²(Y_{k-1}, Y_k) > a_n(p) − a_n(p_0) ).

Under the hypothesis (HS), there exists a unique strictly stationary solution Y_k which is also geometrically ergodic, and this implies that Y_k is in particular geometrically β-mixing. Then, by remarking that

β_n^{(Y_{k-1}, Y_k)} = β_{n-1}^{Y},

we obtain that the bivariate series (Y_{k-1}, Y_k) is also strictly stationary and geometrically β-mixing. This fact, together with the assumption on the ε-bracketing entropy of S with respect to the ‖·‖_{L²(μ)} norm and the condition that S ⊂ L²(μ), ensures that Theorem 4 in [3] holds and

{ (1/√(n−1)) ∑_{k=2}^n s_g(Y_{k-1}, Y_k) : g ∈ G_p }

is uniformly tight and verifies a functional central limit theorem. Then,

sup_{g ∈ G_p} (1/(n−1)) (∑_{k=2}^n s_g(Y_{k-1}, Y_k))² = O_P(1).

On the other hand, S ⊂ L²(μ), thus S² ⊂ L¹(μ) and, using the L²-entropy condition, {(s_g)_−² : g ∈ G_p} is Glivenko–Cantelli. Since (Y_{k-1}, Y_k) is ergodic and strictly stationary, we obtain the following uniform convergence


in probability:

inf_{g ∈ G_p} (1/(n−1)) ∑_{k=2}^n (s_g)_−²(Y_{k-1}, Y_k) →_{n→∞} inf_{g ∈ G_p} ‖(s_g)_−‖²_2.

To finish the first part, let us prove that

inf_{g ∈ G_p} ‖(s_g)_−‖_2 > 0.

If we suppose, on the contrary, that inf_{g ∈ G_p} ‖(s_g)_−‖_2 = 0, then there exists a sequence of functions (s_{g_n})_{n≥1}, g_n ∈ G_p, such that ‖(s_{g_n})_−‖_2 → 0. The L²-convergence implies that (s_{g_n})_− → 0 in L¹ and a.s. for a subsequence s_{g_{n,k}}. Since ∫ s_{g_n} dμ = 0 and s_{g_n} = (s_{g_n})_− + (s_{g_n})_+, where (s_{g_n})_+ = max(0, s_{g_n}), we obtain that ∫ (s_{g_n})_+ dμ = −∫ (s_{g_n})_− dμ = ∫ |(s_{g_n})_−| dμ and thus (s_{g_n})_+ → 0 in L¹ and a.s. for a subsequence s_{g_{n,k'}}. The hypothesis (A4) ensures the existence of a square-integrable dominating function for S and, finally, we get that a subsequence of s_{g_n} converges to 0 a.s. and in L², which contradicts the fact that ∫ s_g² dμ = 1 for every g ∈ G_p, so that

sup_{g ∈ G_p} (∑_{k=2}^n s_g(Y_{k-1}, Y_k))² / ∑_{k=2}^n (s_g)_−²(Y_{k-1}, Y_k) = O_P(1).

Then, by the uniform tightness above and the hypothesis (A1),

P(p̂ > p_0) →_{n→∞} 0.

Let us now prove that p̂ does not underestimate p_0:

P(p̂ < p_0) ≤ ∑_{p=1}^{p_0−1} P(T_n(p) > T_n(p_0))
≤ ∑_{p=1}^{p_0−1} P( sup_{g ∈ G_p} (l_n(g) − l_n(g^0)) / (n−1) > (a_n(p) − a_n(p_0)) / (n−1) ).

Now,

l_n(g) − l_n(g^0) = ∑_{k=2}^n ln( g(Y_{k-1}, Y_k) / g^0(Y_{k-1}, Y_k) )

and, under the hypothesis (A3), the class of functions {ln g/g^0 : g ∈ G_p} is P-Glivenko–Cantelli (the general proof for a parametric family can be found in [12]); since (Y_{k-1}, Y_k) is ergodic and strictly stationary, we obtain the following uniform convergence in probability:

sup_{g ∈ G_p} (1/(n−1)) (l_n(g) − l_n(g^0)) → sup_{g ∈ G_p} ∫ ln(g/g^0) g^0 dμ.

Since p < p_0 and using assumption (A2), the limit is negative. By hypothesis (A1), (a_n(p) − a_n(p_0))/(n−1) converges to 0 when n → ∞, so we finally have that P(p̂ < p_0) → 0 and the proof is done. □

3. Mixtures of multilayer perceptrons

In this section, we consider the model defined in (1) such that, for every i ∈ {1, ..., p_0}, F^0_i is a multilayer perceptron. Since non-identifiability problems also arise in multilayer perceptrons (see, for instance, [10]), we shall simplify the problem by considering one hidden layer and a fixed number k of units on every layer. Then, we have that for every i ∈ {1, ..., p_0}

F^0_i(y) = a^{0,i}_0 + ∑_{j=1}^k a^{0,i}_j φ(b^{0,i}_{0,j} + b^{0,i}_{1,j} y),

where φ is the hyperbolic tangent and

θ^0_i = (a^{0,i}_0, a^{0,i}_1, ..., a^{0,i}_k, b^{0,i}_{0,1}, b^{0,i}_{1,1}, ..., b^{0,i}_{0,k}, b^{0,i}_{1,k}, s^{0,i})

is the true parameter, including the true variance. Let us check whether the hypotheses of the main result of Section 2 apply in the case of mixtures of multilayer perceptrons.

Hypothesis (HS): The stationarity and ergodicity assumption (HS) is immediately verified since the output of every perceptron is bounded, by construction. Thus, every regime is stationary and the global model is also stationary.

Let us consider the class of all possible conditional densities up to a maximum number of components P > 0:

G_P = ∪_{p=1}^P G_p,   G_p = { g | g(y_1, y_2) = ∑_{i=1}^p π_i f_i(y_2 − F_i(y_1)) },

where

- ∑_{i=1}^p π_i = 1 and we may suppose quite naturally that, for every i ∈ {1, ..., p}, π_i ≥ η > 0;
- for every i ∈ {1, ..., p}, F_i is a multilayer perceptron

  F_i(y) = a^i_0 + ∑_{j=1}^k a^i_j φ(b^i_{0,j} + b^i_{1,j} y),

  where θ_i = (a^i_0, a^i_1, ..., a^i_k, b^i_{0,1}, b^i_{1,1}, ..., b^i_{0,k}, b^i_{1,k}, s^i) belongs to a compact set.
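To fix ideas, a minimal evaluation sketch of such a mixture of one-hidden-layer perceptrons is given below; it is ours (not part of the paper), assumes Gaussian component noise as above, and uses illustrative parameter names only.

```python
import numpy as np

def mlp_expert(y_prev, a0, a, b0, b1):
    """One-hidden-layer perceptron F_i(y) = a0 + sum_j a[j] * tanh(b0[j] + b1[j] * y)."""
    return a0 + np.dot(a, np.tanh(b0 + b1 * y_prev))

def mixture_density(y_prev, y_curr, pis, experts, sigmas):
    """Conditional density g(y_prev, y_curr) = sum_i pi_i * N(y_curr; F_i(y_prev), sigma_i^2).

    `experts` is a list of (a0, a, b0, b1) tuples, one per component; `pis` and
    `sigmas` hold the mixture weights and the component noise standard deviations.
    """
    dens = 0.0
    for pi_i, (a0, a, b0, b1), s in zip(pis, experts, sigmas):
        mean = mlp_expert(y_prev, a0, a, b0, b1)
        dens += pi_i * np.exp(-0.5 * ((y_curr - mean) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return dens
```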

Hypothesis (A1): a_n(·) may be chosen, for instance, equal to the BIC penalizing term, a_n(p) = (1/2) p log(n).

Hypotheses (A2)–(A3): Since the noise is normally distributed, the weak identifiability hypothesis is verified according to the result of [11], while assumption (A3) is a regularity condition verified by Gaussian densities.

Hypothesis (A4): We consider the class of generalized score functions

S = { s_g : s_g = (g/f − 1) / ‖g/f − 1‖_{L²(μ)},  g ∈ G_P,  ‖g/f − 1‖_{L²(μ)} ≠ 0 }.

The difficult part will be to show that H_[](ε, S, ‖·‖_2) = O(|log ε|) for all ε > 0, which, since we are on a functional space, is equivalent to proving that "the dimension" of S can be controlled. For g ∈ G_p, let us denote θ = (θ_1, ..., θ_p) and π = (π_1, ..., π_p), so that the global parameter will be


Φ = (θ, π) and the associated generalized score function s_Φ := s_g. Proving that a parametric family like S verifies the condition on the bracketing entropy is usually immediate under good regularity conditions (see, for instance, [12]). A sufficient condition is that the bracketing number grows as a polynomial function of 1/ε. In this particular case, the problems arise when g → f, and the limits in L²(μ) of s_g have to be computed.

Let us split S into two classes of functions. We shall consider F_0 ⊂ G_P a neighborhood of f such that F_0 = {g ∈ G_p : ‖g/f − 1‖_{L²(μ)} ≤ ε} and S_0 = {s_g : g ∈ F_0}. On S \ S_0, it can be easily seen that

‖ (g_1/f − 1)/‖g_1/f − 1‖_{L²(μ)} − (g_2/f − 1)/‖g_2/f − 1‖_{L²(μ)} ‖_{L²(μ)} ≤ (2/ε) ‖ g_1/f − g_2/f ‖_{L²(μ)}.

Hence, on S \ S_0, it is sufficient that ‖ g_1/f − g_2/f ‖_2 < ε²/2 for ‖ (g_1/f − 1)/‖g_1/f − 1‖_2 − (g_2/f − 1)/‖g_2/f − 1‖_2 ‖_2 < ε.

Now, S \ S_0 is a parametric class. Since the derivatives of the transfer functions are bounded, according to Example 19.7 of [12] there exists a constant K such that the bracketing number of S \ S_0 is lower than

K (diam G_p / ε²)^{3(k+1)P} = K (√(diam G_p) / ε)^{6(k+1)P},

where diam G_p is the diameter of the smallest sphere of R^{3(k+1)P} including the set of possible parameters. So, we get that N_[](ε, S \ S_0, ‖·‖_2) = O((1/ε)^{6(k+1)P}), where N_[](ε, S \ S_0, ‖·‖_2) is the number of ε-brackets necessary to cover S \ S_0, and the bracketing entropy is computed as log N_[](ε, S \ S_0, ‖·‖_2).

As for S_0, the idea is to reparameterize the model in a convenient manner which will allow a Taylor expansion around the identifiable part of the true value. For that, we shall use a slight modification of the method proposed by Liu and Shao [7]. Let us remark that when g/f − 1 = 0, the weak identifiability hypothesis (A2) and the fact that for every i ∈ {1, ..., p}, π_i ≥ η > 0, imply that there exists a vector t = (t_i)_{0≤i≤p_0} such that 0 = t_0 < t_1 < ... < t_{p_0} = p and, modulo a permutation, Φ can be rewritten as follows:

θ_{t_{i-1}+1} = ... = θ_{t_i} = θ^0_i,   ∑_{j=t_{i-1}+1}^{t_i} π_j = π^0_i,   i ∈ {1, ..., p_0}.

With this remark, one can define in the general case s = (s_i)_{1≤i≤p_0} and q = (q_j)_{1≤j≤p} so that, for every i ∈ {1, ..., p_0} and j ∈ {t_{i-1}+1, ..., t_i},

s_i = ∑_{j=t_{i-1}+1}^{t_i} π_j − π^0_i,   q_j = π_j / ∑_{l=t_{i-1}+1}^{t_i} π_l,

and the new parameterization will be (φ_t, ψ_t), with φ_t = ((θ_j)_{1≤j≤p}, (s_i)_{1≤i≤p_0−1}) and ψ_t = (q_j)_{1≤j≤p}; φ_t contains all the identifiable parameters of the model and ψ_t the non-identifiable ones. Then, for g = f, we will have that

φ^0_t = (θ^0_1, ..., θ^0_1, ..., θ^0_{p_0}, ..., θ^0_{p_0}, 0, ..., 0)^T,

where each block θ^0_i is repeated t_i − t_{i-1} times and the final block of p_0 − 1 zeros corresponds to the (s_i).

This reparameterization allows one to write a second-order Taylor expansion of g/f − 1 at φ^0_t. For ease of writing, we shall first denote

g_j(y_1, y_2) = g_{θ_j}(y_1, y_2) = f_j(y_2 − F_j(y_1)) / ∑_{i=1}^{p_0} π^0_i f^0_i(y_2 − F^0_i(y_1)) − 1.

Then, the density ratio becomes

g/f − 1 = ∑_{i=1}^{p_0−1} (s_i + π^0_i) ∑_{j=t_{i-1}+1}^{t_i} q_j g_j + ( π^0_{p_0} − ∑_{i=1}^{p_0−1} s_i ) ∑_{j=t_{p_0−1}+1}^{t_{p_0}} q_j g_j.

By remarking that when φ_t = φ^0_t, g/f does not vary with ψ_t, we will study the variation of this ratio in a neighborhood of φ^0_t and for fixed ψ_t. Let us denote by ∂g_j/∂θ_j the vector of derivatives of g_j with respect to each component of θ_j and by ∂²g_j/∂θ_j² the vector of second derivatives of g_j with respect to each component of θ_j. Assuming that (g_j)_{1≤j≤p}, (g'_j)_{1≤j≤p} and (g''_j)_{1≤j≤p}, where

g'_j := (∂g_j/∂θ_j)(φ^0_t, ψ_t),   g''_j := (∂²g_j/∂θ_j²)(φ^0_t, ψ_t),

are linearly independent in L²(μ), one can prove the following:

Proposition 1. Let us denote D(φ_t, ψ_t) = ‖ g_{(φ_t, ψ_t)}/f − 1 ‖_{L²(μ)}. For any fixed ψ_t, there exists the second-order Taylor expansion at φ^0_t:

g/f − 1 = (φ_t − φ^0_t)^T g'_{(φ^0_t, ψ_t)} + (1/2) (φ_t − φ^0_t)^T g''_{(φ^0_t, ψ_t)} (φ_t − φ^0_t) + o(D(φ_t, ψ_t)),

with

(φ_t − φ^0_t)^T g'_{(φ^0_t, ψ_t)} = ∑_{i=1}^{p_0} π^0_i ( ∑_{j=t_{i-1}+1}^{t_i} q_j θ_j − θ^0_i )^T g'_i + ∑_{i=1}^{p_0} s_i g_{θ^0_i}


and

(φ_t − φ^0_t)^T g''_{(φ^0_t, ψ_t)} (φ_t − φ^0_t) = ∑_{i=1}^{p_0} [ 2 s_i ( ∑_{j=t_{i-1}+1}^{t_i} q_j θ_j − θ^0_i )^T g'_i + π^0_i ∑_{j=t_{i-1}+1}^{t_i} q_j (θ_j − θ^0_i)^T g''_i (θ_j − θ^0_i) ].

Moreover,

(φ_t − φ^0_t)^T g'_{(φ^0_t, ψ_t)} + (1/2) (φ_t − φ^0_t)^T g''_{(φ^0_t, ψ_t)} (φ_t − φ^0_t) = 0  ⟺  φ_t = φ^0_t.

Proof of Proposition 1. The first term in the development can be computed easily by remarking that the gradient of g/g^0 − 1 at (φ^0_t, ψ_t) is:

- for i ∈ {1, ..., p_0} and j ∈ {t_{i-1}+1, ..., t_i},
  (∂(g/g^0 − 1)/∂θ_j)(φ^0_t, ψ_t) = π^0_i q_j g'_i;
- for i ∈ {1, ..., p_0 − 1},
  (∂(g/g^0 − 1)/∂s_i)(φ^0_t, ψ_t) = ∑_{j=t_{i-1}+1}^{t_i} q_j g_{θ^0_i} − ∑_{j=t_{p_0−1}+1}^{t_{p_0}} q_j g_{θ^0_{p_0}} = g_{θ^0_i} − g_{θ^0_{p_0}}.

The term of second order can be obtained by direct computations once the Hessian is computed at (φ^0_t, ψ_t):

- (∂²(g/g^0 − 1)/∂θ_j²)(φ^0_t, ψ_t) = π^0_i q_j g''_i, for i = 1, ..., p_0 and j = t_{i-1}+1, ..., t_i;
- (∂²(g/g^0 − 1)/∂θ_j ∂θ_l)(φ^0_t, ψ_t) = 0, for j, l = 1, ..., p and j ≠ l;
- (∂²(g/g^0 − 1)/∂s_i ∂s_k)(φ^0_t, ψ_t) = 0, for i, k = 1, ..., p_0 − 1;
- (∂²(g/g^0 − 1)/∂s_i ∂θ_j)(φ^0_t, ψ_t) = q_j g'_i, for i = 1, ..., p_0 − 1 and j = t_{i-1}+1, ..., t_i;
- (∂²(g/g^0 − 1)/∂s_i ∂θ_j)(φ^0_t, ψ_t) = −q_j g'_{p_0}, for i = 1, ..., p_0 − 1 and j = t_{p_0−1}+1, ..., t_{p_0};
- the other crossed derivatives of s_i and θ_j are zero.

It remains to prove that the rest is o(‖φ_t − φ^0_t‖), but this follows directly from [14] and the fact that, since the noise is normally distributed, Y_t has moments of any order.

Using the Taylor expansion above, for s_g belonging to S_0, g/f − 1 is the sum of a linear combination of

V(z) := (g_1, ..., g_p, g'_1, ..., g'_p, g''_1, ..., g''_p)

and of a term whose L² norm is negligible compared to the L² norm of this combination when ε goes to 0. By assumption (A3), a strictly positive number m exists so that, for any vector of norm 1 with components

C = (c_1, ..., c_{p_0(3k+1)}, d_1, ..., d_{p_0(3k+1)}, e_1, ..., e_{p_0(3k+1)})

and ε sufficiently small,

‖C^T V(z)‖_2 > m + ε.

Since any function (g/f − 1)/‖g/f − 1‖_2 can be written as

( C^T V(z) + o(‖C^T V(z)‖_2) ) / ‖ C^T V(z) + o(‖C^T V(z)‖_2) ‖_2,

S_0 belongs to the set of functions

{ D^T V(z) + o(1) : ‖D‖_2 ≤ 1/m } ⊂ { D^T V(z) + γ : ‖D‖_2 ≤ 1/m, |γ| < 1 },

whose bracketing number is smaller than or equal to O((1/ε)^{3 p_0 (3k+1) + 1}), and the assumptions of Theorem 1 are verified. □

4. Some numerical examples

The theoretical result proven above may be applied in practice to compute the number of components in a mixture model on simulated or real data. Some examples are presented below, illustrating the stability and the speed of convergence of the algorithm.

4.1. Mixtures of linear models

Let us first consider the particular case of linear models, corresponding to zero-hidden-unit perceptrons. The examples are mixtures of two autoregressive models in which we vary the leading coefficients and the weights of the discrete mixing distribution. For every example, we simulate 20 samples of lengths n = 200, 500, 1000, 1500, 2000 and we fix P = 3 as the upper bound for the number of regimes. The likelihood is maximized via the EM algorithm (see, for instance, [1,8]). This algorithm is well suited to find a sequence of parameters which increases the likelihood at each step and thus converges to a local maximum, for a very wide class of models and for our model in particular. The idea of the EM algorithm is to replace the latent variables of the mixture by their conditional expectation. A brief recall of the main steps of the algorithm is given below:

1. Let X be the vector containing the components of the mixture, considered as a latent variable, and let y = (y_1, ..., y_n) be the vector of observations.
2. Initialization: Set k = 0 and choose θ_0.
3. E-step: Set θ* = θ_k and compute Q(·, θ*) with
   Q(θ, θ*) = E_{θ*}[ ln( L_θ(y, X) / L_{θ*}(y, X) ) ],
   where L_θ(y, X) is the likelihood of the observations and of the mixture vector X for the parameter θ. This step computes the probabilities of the mixture conditionally to the observations and with respect to the parameter θ*.
4. M-step: Find θ̂ = argmax_θ Q(θ, θ*).
5. Replace θ_{k+1} by θ̂ and go back to step 3 until a stopping criterion is satisfied (i.e. when the parameters do not seem to change anymore).

The sequence (θ_k) gives non-decreasing values of the likelihood function up to a local maximum. Q(θ, θ*) is called the conditional pseudo-log-likelihood. To avoid local maxima, the procedure is initialized several times with different starting values: in our case, 10 different initializations provided good results. The stopping criterion applies when there is either no improvement in the likelihood value, or a maximum number of iterations, fixed at 200 here for reasonable computation time, is reached.
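A compact sketch of these steps for the two-component linear case of this section is given below; it is our own simplification (not the authors' implementation), assuming Gaussian noise and AR(1) experts F_i(y) = a_i y + b_i, so that the M-step reduces to responsibility-weighted least squares.

```python
import numpy as np
from scipy.stats import norm

def em_two_ar1(y, n_iter=200, seed=0):
    """EM for a two-component mixture Y_t = a_i*Y_{t-1} + b_i + e_t, e_t ~ N(0, s_i^2)."""
    rng = np.random.default_rng(seed)
    y_prev, y_curr = y[:-1], y[1:]
    X = np.column_stack([y_prev, np.ones_like(y_prev)])
    pi = 0.5
    a, b = rng.normal(scale=0.1, size=2), rng.normal(scale=0.1, size=2)
    s = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: posterior probability of component 1 given each observation
        d1 = pi * norm.pdf(y_curr, a[0] * y_prev + b[0], s[0])
        d2 = (1.0 - pi) * norm.pdf(y_curr, a[1] * y_prev + b[1], s[1])
        r = d1 / (d1 + d2)
        # M-step: responsibility-weighted least squares for each component
        for i, w in enumerate((r, 1.0 - r)):
            sw = np.sqrt(w)
            beta = np.linalg.lstsq(X * sw[:, None], y_curr * sw, rcond=None)[0]
            a[i], b[i] = beta
            resid = y_curr - X @ beta
            s[i] = np.sqrt(np.sum(w * resid ** 2) / np.sum(w))
        pi = r.mean()
    return pi, a, b, s
```

Running this routine from several random starting points and keeping the best likelihood mimics the multi-start strategy described above.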


The true conditional density is

g^0(y_1, y_2) = π^0_1 f^0_1(y_2 − F^0_1(y_1)) + (1 − π^0_1) f^0_2(y_2 − F^0_2(y_1)),

with F^0_i(y_1) = a^0_i y_1 + b^0_i and f^0_i ~ N(0, (s^0_i)²) for i ∈ {1, 2}. For every example, we pick equal standard errors s^0_1 = s^0_2 = 0.5, b^0_1 = 0.5 and b^0_2 = −0.5, and let the rest of the coefficients vary: π^0_1 ∈ {0.5, 0.7, 0.9}, a^0_1, a^0_2 ∈ {0.1, 0.5, 0.9}.

Let us focus on one particular example. Fig. 1 illustrates one out of the 20 samples in the case n = 1000, π^0_1 = 0.7, a^0_1 = 0.1 and a^0_2 = 0.5. The observed values of the series Y_t are plotted on the first graph. On the second graph, we represent the convergence of the estimates for the mixture probabilities (solid and dashed lines) to the true values (dotted lines). Only the best result from the different initializations of the EM algorithm was drawn.

[Fig. 1. Series Y_t and estimates for mixture probabilities π^0_i. Top panel: the simulated series Y(t); bottom panel: estimated mixture probabilities versus the number of EM iterations.]

Table 1
Number of components selected (out of 20 samples) for b^0_1 = 0.5 and b^0_2 = −0.5

                                    π^0_1 = 0.5        π^0_1 = 0.7        π^0_1 = 0.9
                             n      p̂=1  p̂=2  p̂=3    p̂=1  p̂=2  p̂=3    p̂=1  p̂=2  p̂=3
a^0_1 = 0.1, a^0_2 = 0.1    200     20    0    0      20    0    0      20    0    0
                            500     18    2    0      18    2    0      20    0    0
                           1000     14    6    0       9   11    0      11    9    0
                           1500      6   14    0       4   16    0       5   15    0
                           2000      5   15    0       0   20    0       1   19    0
a^0_1 = 0.1, a^0_2 = 0.5    200     12    8    0      13    7    0      20    0    0
                            500     11    9    0       6   14    0      18    2    0
                           1000      0   20    0       1   19    0      14    6    0
                           1500      0   20    0       0   20    0       8   12    0
                           2000      0   20    0       0   20    0       7   13    0
a^0_1 = 0.1, a^0_2 = 0.9    200      0   20    0       4   16    0      17    3    0
                            500      0   20    0       0   20    0       9   11    0
                           1000      0   20    0       0   20    0       9   11    0
                           1500      0   20    0       0   20    0       4   16    0
                           2000      0   20    0       0   20    0       0   20    0
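Each cell of Table 1 is simply a count of how often each p̂ was selected over the 20 replications of a configuration. A sketch of that bookkeeping is given below; it is ours, and `simulate` and `select` stand for hypothetical simulation and selection routines such as the ones sketched earlier, not for the authors' code.

```python
from collections import Counter

def selection_frequencies(configs, lengths, n_rep=20, P=3, simulate=None, select=None):
    """For every (config, n), count how often each estimated p_hat is selected.

    simulate(config, n) should return one sample of length n drawn under `config`;
    select(y, P) should return the estimated number of components p_hat in 1..P.
    """
    table = {}
    for config in configs:
        for n in lengths:
            counts = Counter(select(simulate(config, n), P) for _ in range(n_rep))
            table[(config, n)] = counts
    return table
```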


[Fig. 2. 1000 last patterns of the laser series (observations 11 500 to 12 500).]

The summary of the results for all the examples is given in Table 1. In almost every case, convergence is reached for the samples containing 2000 inputs. In practice, the results will thus be more or less accurate, depending on the size of the sample, but also on the proximity of the components and on their frequency.

4.2. Laser time series

A second example studies the complete laser series of the "Santa Fe time series prediction and analysis competition". The level of noise in this series is very low, the main source of noise being measurement errors. We use the 12 500 patterns for estimation. Fig. 2 shows the last 1000 patterns. The series is supposed to be stationary and, recall from Section 2, a piecewise stationary time series is globally stationary if every component is stationary itself. The mixture-of-experts model is an example of a piecewise stationary and globally stationary time series. We want to choose the number of components of the mixture by minimizing the BIC criterion. Based on a previous study [9], we choose to use experts with 10 inputs, five hidden units, one linear output, and hyperbolic tangent activation functions. We want to know the number of experts to use with such a series. As this is a real application, it is impossible to check the main assumption of our theory, namely that the true model belongs to the set of possible models. However, we want to know if the developed theory can give an insight for choosing the number of experts. The parameters are estimated using the standard EM algorithm. In order to avoid bad local maxima, estimation is performed with 100 different initializations of the model parameters. For each estimation we proceed with 200 iterations of the EM algorithm and for each M-step we

optimize the parameters of the MLPs until their prediction error does not improve anymore. The estimated BIC values and the estimated probabilities of the mixture are the following:

Number of experts    BIC        Probabilities of the mixture
1                    32.16894   1
2                    25.92      (0.8, 0.2)
3                    38.42      (0.7, 0.21, 0.09)

The results are clear: for our model, the best choice is the model with two experts. It is difficult to give an interpretation of the regimes because the mixing probabilities remain constant over time. However, if we look at the predictions made by each expert, we can see that one expert seems to be specialized in the general regime of the series and the second one in the collapse regime. The proposed method gives an insight on how to choose the number of experts in a mixture model for the laser time series. However, since the probabilities of the mixture are constant, it would be better to choose probabilities depending on the previous values of the time series, as in the gated experts of Weigend et al. [13], or on time, as in hybrid hidden Markov/MLP models [10]. The prediction error of this simple mixture model is not competitive with such more complex models; however, the theory still needs to be improved to deal with such complex modeling.

5. Conclusion and future work

We have proven the consistency of the BIC criterion for estimating the number of components in a mixture of multilayer perceptrons. In our opinion, two important


directions are to be studied in the future. The case of mixtures should be extended to the general case of gated experts which allow the probability distribution of the multilayer perceptrons to depend on the input and thus, to learn how to split the input space. The second possible extension should remove the hypothesis of a fixed number of units on the hidden layer. The problem of estimating the number of hidden units in one multilayer perceptron was solved in [10], but it would be interesting to mix the two results and prove the consistency of a penalized criterion when there is a double non-identifiability problem: number of experts and number of hidden units.

References

[1] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. (B) 39 (1) (1977) 1–38.
[2] P. Doukhan, Mixing: Properties and Examples, Springer, New York, 1995.
[3] P. Doukhan, P. Massart, E. Rio, Invariance principles for absolutely regular empirical processes, Ann. Inst. Henri Poincaré (B) Probab. Stat. 31 (2) (1995) 393–427.
[4] E. Gassiat, Likelihood ratio inequalities with applications to various mixtures, Ann. Inst. Henri Poincaré 38 (2002) 897–906.
[5] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1991) 79–87.
[6] C. Keribin, Consistent estimation of the order of mixture models, Sankhya Indian J. Stat. 62 (2000) 49–66.
[7] X. Liu, Y. Shao, Asymptotics for likelihood ratio tests under loss of identifiability, Ann. Stat. 31 (3) (2003) 807–832.
[8] R.A. Redner, H.F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev. 26 (2) (1984) 195–239.


[9] J. Rynkiewicz, Hybrid HMM/MLP models for time series prediction, in: ESANN'1999 Proceedings, d-side publi., 1999, pp. 455–462.
[10] J. Rynkiewicz, Consistent estimation of the architecture of multilayer perceptrons, in: ESANN'2006 Proceedings, d-side publi., 2006, pp. 149–154.
[11] H. Teicher, Identifiability of finite mixtures, Ann. Math. Stat. 34 (2) (1963) 1265–1269.
[12] A.W. Van der Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge, 2000.
[13] A.S. Weigend, M. Mangeas, A.N. Srivastava, Nonlinear gated experts for time series: discovering regimes and avoiding overfitting, Int. J. Neural Syst. 6 (1995) 373–399.
[14] J.F. Yao, On least-square estimation for stable nonlinear AR processes, Ann. Inst. Math. Stat. 52 (2000) 316–331.
[15] J.F. Yao, J.G. Attali, On stability of nonlinear AR processes with Markov switching, Adv. Appl. Probab. 32 (2) (2000) 394–407.

Madalina Olteanu was born on September 14, 1978. She graduated from the University of Bucharest in 2001 and obtained her PhD in Applied Mathematics in December 2006 at Université Paris 1. Since September 2007 she has been Assistant Professor at Université Paris 1 Panthéon-Sorbonne and a member of the SAMOS-CES research center. Her main research interests are in the field of nonlinear time series, model selection, clustering and artificial neural networks.

Joseph Rynkiewicz was born on February 10, 1968 in Antony (France). He obtained his PhD in Applied Mathematics from the Université Paris 1 Panthéon-Sorbonne in 2000. He is presently "Maître de Conférences" at the Université Paris 1 and a member of the SAMOS research center. His research activities cover artificial neural networks, time series and hidden Markov models.