The Population-Based Incremental Learning Algorithm converges to local optima


Neurocomputing 69 (2006) 1772–1775 www.elsevier.com/locate/neucom

Letters

Reza Rastegar, Arash Hariri

Iran Telecommunication Research Center, Tehran, Iran

Received 16 August 2005; received in revised form 3 December 2005; accepted 7 December 2005. Available online 21 February 2006. Communicated by R.W. Newcomb.

Abstract

We propose a convergence proof for the Population-Based Incremental Learning (PBIL) algorithm. First, we model the PBIL as a Markov process and approximate its behavior by an Ordinary Differential Equation (ODE). Then we prove that the corresponding ODE has no stable stationary points in the configuration space other than the local maxima of the function being optimized. Finally, we show that the ODE, and consequently the PBIL, converges to one of these stable points.
© 2006 Elsevier B.V. All rights reserved.

Keywords: PBIL; Markov process; ODE; Stationary points; Stability; Convergence

1. Introduction

Recently, a new class of evolutionary algorithms called Estimation of Distribution Algorithms (EDAs) has been developed, with the main aim of preventing the disruption of partial solutions. Based on the interdependencies between the variables of solutions, EDAs are classified into three classes: (1) the no-dependencies model, such as the Population-Based Incremental Learning (PBIL) algorithm [1]; (2) the bivariate-dependencies model, such as MIMIC [2]; and (3) the multiple-dependencies model, such as the FDA [7]. Though less efficient on difficult problems, the no-dependencies model benefits from low memory usage and simple computation, unlike the bivariate- and multiple-dependencies models. The PBIL is one of the simplest algorithms in the no-dependencies model. In a first attempt, Höhfeld and Rudolph [5] studied it for unimodal optimization problems; they modeled it by a martingale and showed that, when the learning step tends to zero, it surely converges to the global maximum. In [4], González et al. showed that the PBIL strongly depends on its initial parameters, and in [3] they proved that when the learning step tends to zero, the local optima of the search space become asymptotically stationary points of the PBIL; they then claimed that, since only the local optima of the search space are stable stationary points, the PBIL can only converge to the local optima. However, they left two questions unanswered: (1) does the PBIL have other stable points in the probabilistic space? (2) can the PBIL exhibit limit cycles or chaotic behavior? Here, to answer these questions, we approximate the PBIL by an Ordinary Differential Equation (ODE) and analyze it.

This work is organized as follows. Section 2 describes the PBIL precisely and states some required definitions and lemmas. Section 3 presents the analysis of the PBIL. Finally, Section 4 concludes.

2. The problem formulation

Using the PBIL (Fig. 1), we would like to find $\max\{g(x) : x \in \Omega\}$ over the finite search space $\Omega = \{0,1\}^n$, where $g : \Omega \to \mathbb{R}_{>0}$ is an injective pseudo-Boolean function; we call g injective if $g(x) = g(y)$ implies $x = y$ for $x, y \in \Omega$. The PBIL is a combination of evolutionary optimization and hill climbing [1]. The goal of the algorithm is to create a real-valued Probability Vector (PV), $p = (p_1, \ldots, p_m, \ldots, p_n)$, which, when sampled, reveals high-quality solutions with high probability. Note that $p_m$ is the probability of obtaining "1" in the mth variable. Initially, the PV is set to $p(0)$. As the search progresses, the values in the PV gradually shift to represent high-quality solutions. This is done as follows: at iteration k, N solutions, $x^1(k), \ldots, x^N(k)$, are generated according to the probabilities specified in $p(k)$. Then, using a selection method (often truncation selection), M solutions, $x^{1:N}(k), \ldots, x^{M:N}(k)$, are selected from the population, and the PV is pushed towards the selected solutions according to the update rule

$$p(k+1) = (1 - \alpha)\, p(k) + \alpha\, \frac{1}{M} \sum_{i=1}^{M} x^{i:N}(k), \quad (1)$$

where $\alpha$ is the learning step. After the PV is updated, sampling from it generates a new set of solutions, and the cycle continues. In the remainder of this section, we introduce our definitions and derive some results that will be used later to analyze the PBIL.

Definition 1. A solution y is called a local maximum of the function g if and only if, for each solution z whose Hamming distance to y is one, i.e. $d_H(y, z) = 1$, we have $g(y) \ge g(z)$.

    k := 0; initialize the PV p(0)
    do
        using p(k), obtain N solutions x^1(k), ..., x^N(k)
        evaluate them with respect to g
        select the M best solutions x^{1:N}(k), ..., x^{M:N}(k),
            so that g(x^{1:N}(k)) >= ... >= g(x^{M:N}(k))
        update the PV according to (1); k := k + 1
    while the termination condition is not met

Fig. 1. The pseudocode of the PBIL.
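To make the loop in Fig. 1 concrete, the following is a minimal Python sketch of the PBIL under the update rule (1). It is our illustration, not the authors' code; the function names, default parameters, and the toy injective objective are all assumptions.

```python
import numpy as np

def pbil(g, n, N=50, M=1, alpha=0.05, iters=500, seed=None):
    """Minimal PBIL loop following Fig. 1; names and defaults are ours."""
    rng = np.random.default_rng(seed)
    p = np.full(n, 0.5)                        # p(0): maximally uncertain PV
    for _ in range(iters):
        # sample N solutions x^1(k), ..., x^N(k) from the current PV
        X = (rng.random((N, n)) < p).astype(float)
        # truncation selection: keep the M best with respect to g
        best = X[np.argsort([-g(x) for x in X])[:M]]
        # update rule (1): pull the PV towards the selected solutions
        p = (1 - alpha) * p + alpha * best.mean(axis=0)
    return p

if __name__ == "__main__":
    # toy injective objective: a distinct positive value per bit string
    vals = np.random.default_rng(0).permutation(2 ** 4) + 1.0
    g = lambda x: vals[int("".join(str(int(b)) for b in x), 2)]
    print(pbil(g, n=4, seed=1))                # PV typically ends near a vertex of [0,1]^4
```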

Definition 2. The configuration space of the PBIL is $K = [0,1]^n$, where $[\,\cdot\,]$ denotes a closed interval, and $p(k)$ belongs to K for each k. Moreover, $K^* = \Omega = \{0,1\}^n$ and $K - K^*$ are called the deterministic and non-deterministic subspaces of K, respectively.

Definition 3. $p(k)$ is called a deterministic configuration if $p_i(k) = 0$ or 1 for each $i = 1, \ldots, n$, i.e. $p(k) \in K^*$; otherwise, $p(k)$ is called a non-deterministic configuration, i.e. $p(k) \in K - K^*$.

Lemma 1. Let $\Pr(x^{1:N}(k) = y \mid p(k))$ be the probability of obtaining y as the best solution at the kth iteration of the algorithm. Then

$$\Pr(x^{1:N}(k) = y \mid p(k)) = \Pr(y \mid p(k)) \sum_{j=1}^{N} \Bigl[ \sum_{z :\, g(z) < g(y)} \Pr(z \mid p(k)) \Bigr]^{j-1} \Bigl[ \sum_{z :\, g(z) \le g(y)} \Pr(z \mid p(k)) \Bigr]^{N-j}, \quad (2)$$

where $\Pr(y \mid p(k))$ denotes the probability of sampling the solution y.
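Lemma 1 is easy to check numerically for small n: evaluate eq. (2) by enumerating $\Omega$ and compare it with a Monte Carlo estimate of the probability that y comes out best among N samples. The sketch below is ours, not the paper's; pr_sample, pr_best, and the toy objective are hypothetical names.

```python
import itertools
import numpy as np

def pr_sample(y, p):
    """Eq. (3): probability of sampling bit string y from the PV p."""
    y = np.asarray(y)
    return float(np.prod(np.where(y == 1, p, 1 - p)))

def pr_best(y, p, g, N):
    """Eq. (2): probability that y is the best of N independent samples."""
    omega = list(itertools.product([0, 1], repeat=len(p)))
    s_lt = sum(pr_sample(z, p) for z in omega if g(z) < g(y))
    s_le = sum(pr_sample(z, p) for z in omega if g(z) <= g(y))
    return pr_sample(y, p) * sum(s_lt ** (j - 1) * s_le ** (N - j)
                                 for j in range(1, N + 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, N = 3, 5
    vals = rng.permutation(2 ** n) + 1.0              # injective positive g
    w = 2 ** np.arange(n)[::-1]
    g = lambda z: vals[int(np.dot(z, w))]
    p, y = np.array([0.3, 0.6, 0.5]), (1, 0, 1)
    # Monte Carlo estimate of Pr(x^{1:N} = y | p) for comparison
    T = 100_000
    X = (rng.random((T, N, n)) < p).astype(int)
    scores = vals[X @ w]                              # g of every sample, shape (T, N)
    best = X[np.arange(T), scores.argmax(axis=1)]     # best of each batch of N
    mc = (best == y).all(axis=1).mean()
    print(round(pr_best(y, p, g, N), 5), round(mc, 5))  # should agree within MC error
```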

Lemma 2. Assume that $p_m$ and $y_m$ are the mth positions of p and y, respectively. Then we have

$$\Pr(y \mid p) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}, \quad (3)$$

$$\frac{\partial \Pr(y \mid p)}{\partial p_m}\Big|_{p=y} = \begin{cases} +1 & \text{if } y_m = 1, \\ -1 & \text{if } y_m = 0, \end{cases} \quad (4)$$

$$\frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p=y} = \begin{cases} +1 & \text{if } d_H(z, y) = 1,\ z_m = 1,\ y_m = 0, \\ -1 & \text{if } d_H(z, y) = 1,\ z_m = 0,\ y_m = 1, \end{cases} \quad (5)$$

$$\Pr(y \mid p) = \begin{cases} 0 & \text{for } p \in K^*,\ p \ne y, \\ 1 & \text{for } p \in K^*,\ p = y, \end{cases} \quad (6)$$

$$\frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p=y} = 0 \quad \text{for all } z :\ d_H(z, y) \ge 2, \quad (7)$$

$$\frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p=y} = 0 \quad \text{if } d_H(z, y) = 1,\ y_m = z_m, \quad (8)$$

where y and z belong to $\Omega$, $\Pr(y \mid p)$ denotes the probability of sampling the solution y, and the derivatives in (4), (5), (7) and (8) are evaluated at $p = y$.

Proof. Eq. (3) follows directly from the fact that the $y_i$'s are sampled independently, and the other identities are easily obtained from (3) [3]. □
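The identities (4), (5), (7) and (8) can likewise be confirmed with central finite differences of $\Pr(z \mid p)$ taken at $p = y$. A small sketch of ours, repeating the eq. (3) helper:

```python
import itertools
import numpy as np

def pr_sample(z, p):
    """Eq. (3)."""
    z = np.asarray(z)
    return float(np.prod(np.where(z == 1, p, 1 - p)))

def d_pr(z, p, m, h=1e-6):
    """Central finite difference of Pr(z|p) with respect to p_m."""
    hi, lo = p.copy(), p.copy()
    hi[m] += h
    lo[m] -= h
    return (pr_sample(z, hi) - pr_sample(z, lo)) / (2 * h)

if __name__ == "__main__":
    y, m = np.array([1.0, 0.0, 1.0]), 1            # evaluate at p = y; here y_m = 0
    print(round(d_pr(y, y, m), 6))                 # eq. (4): expect -1
    for z in itertools.product([0, 1], repeat=3):
        dh = int(np.abs(np.array(z) - y).sum())    # Hamming distance to y
        # eq. (5) gives +1/-1 when d_H = 1 and z, y differ exactly at m;
        # eqs. (7) and (8) give 0 in all remaining cases
        print(z, dh, round(d_pr(z, y, m), 6))
```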

3. Analysis of the PBIL

Under the algorithm specified by (1), $\{p(k);\ k \ge 0\}$ is a Markov process. The analysis of this process is done in two stages. In the first stage, we derive an ODE whose solutions approximate the asymptotic behavior of $p(k)$ for a sufficiently small learning step $\alpha$ in (1). In the second stage, we characterize the solutions of the ODE and obtain the long-term behavior of $p(k)$. We represent (1) as

$$p(k+1) = p(k) + \alpha\, G(p(k), x^{1:N}(k), \ldots, x^{M:N}(k)), \qquad G(p(k), x^{1:N}(k), \ldots, x^{M:N}(k)) = \frac{1}{M} \sum_{i=1}^{M} x^{i:N}(k) - p(k). \quad (9)$$

In this paper, we assume that M equals one, i.e., in each iteration only one solution is selected. Now define $\Delta p(k) = E\{p(k+1) \mid p(k)\} - p(k)$. Since $\{p(k);\ k \ge 0\}$ is Markovian and $x^{1:N}(k)$ depends only on $p(k)$, not on k, $\Delta p(k)$ can be written as $\Delta p(k) = \alpha f(p(k))$, where $f : K \to K$ and

$$f(p) = E\{G(p(k), x^{1:N}(k)) \mid p(k) = p\} = E\{x^{1:N}(k) - p(k) \mid p(k) = p\}. \quad (10)$$

Now, $p^\alpha(\cdot)$ is a sequence of continuous-time interpolations of (9), defined by $p^\alpha(t) = p(k)$ for $t \in [k\alpha, (k+1)\alpha)$. The objective is to study the limiting behavior of $\{p^\alpha(t) :\ t \ge 0\}$ as $\alpha$ tends to zero. For a sufficiently small $\alpha$, the difference equation $\Delta p(k) = \alpha f(p(k))$ can be written as the ODE $dp/dt = f(p)$.
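For small n, the right-hand side $f(p)$ of eq. (10) can be computed exactly by enumerating $\Omega$ (with the law of $x^{1:N}$ given by Lemma 1) and the ODE $dp/dt = f(p)$ integrated by forward Euler. The sketch below is our illustration under that setup, repeating the hypothetical helpers from the earlier sketch; it shows a trajectory started in $K - K^*$ approaching a vertex.

```python
import itertools
import numpy as np

def pr_sample(z, p):
    z = np.asarray(z)
    return float(np.prod(np.where(z == 1, p, 1 - p)))

def pr_best(y, p, g, N):
    omega = list(itertools.product([0, 1], repeat=len(p)))
    s_lt = sum(pr_sample(z, p) for z in omega if g(z) < g(y))
    s_le = sum(pr_sample(z, p) for z in omega if g(z) <= g(y))
    return pr_sample(y, p) * sum(s_lt ** (j - 1) * s_le ** (N - j)
                                 for j in range(1, N + 1))

def f(p, g, N):
    """Right-hand side of the ODE, eq. (10) with M = 1: E{x^{1:N} | p} - p."""
    omega = itertools.product([0, 1], repeat=len(p))
    return sum(np.array(y) * pr_best(y, p, g, N) for y in omega) - p

if __name__ == "__main__":
    n, N = 3, 10
    vals = np.random.default_rng(0).permutation(2 ** n) + 1.0
    g = lambda z: vals[int(np.dot(z, 2 ** np.arange(n)[::-1]))]
    p = np.full(n, 0.5)                  # a non-deterministic start in K - K*
    for _ in range(2000):                # forward Euler with step 0.01
        p = np.clip(p + 0.01 * f(p, g, N), 0.0, 1.0)
    print(np.round(p, 3))                # approaches a vertex: a local maximum of g
```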

We are interested in characterizing the long-term behavior of $p(k)$ and hence the asymptotic behavior of the ODE $dp/dt = f(p)$. We now show that $\{p^\alpha(\cdot)\}$ converges weakly to the solution of $dp/dt = f(p)$ with initial configuration $p(0)$; this implies that the asymptotic behavior of $p(k)$ can be obtained from the solution of $dp/dt = f(p)$.

Theorem 1. Consider the sequence of interpolated processes $\{p^\alpha(\cdot)\}$ and let $X_0 = p^\alpha(0) = p(0)$. As $\alpha \to 0$, the sequence converges weakly to $X(\cdot)$, the solution of the ODE $dX/dt = f(X)$ with $X(0) = X_0$.

Proof. The theorem is a particular case of a general result of Kushner ([6], Theorem 3.2). We note the following about the PBIL given by (1): (1) $\{p(k), x^{1:N}(k-1);\ k \ge 1\}$ is a Markov process and $x^{1:N}(k)$ takes values in a compact metric space; (2) the function $G(\cdot, \cdot)$ is bounded, continuous, and independent of $\alpha$; (3) for $p(k) = p$ held constant, $\{x^{1:N}(k);\ k \ge 0\}$ is an independent identically distributed (i.i.d.) sequence with law $\pi_p$; (4) the ODE $dX/dt = f(X)$ has a unique solution for each initial condition. So, by [6], Theorem 3.2, as $\alpha \to 0$ the sequence $\{p^\alpha(\cdot)\}$ converges weakly to the solution of the ODE $dX/dt = \hat{G}(X)$ with $X(0) = X_0$, where $\hat{G}(p) = E_\pi G(p, x^{1:N}(k))$ and $E_\pi$ denotes expectation with respect to the invariant measure $\pi_p$. Since $x^{1:N}(k)$ depends only on $p(k) = p$ and on g, we have $\hat{G}(p) = f(p)$, and hence the theorem. □
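Theorem 1 can also be observed empirically: over a fixed time horizon t, the interpolated PBIL path should track the ODE solution more and more closely as $\alpha$ shrinks. The following rough sketch is ours (the error measure and all names are assumptions, and the helpers pr_sample, pr_best and f from the previous sketches are repeated implicitly); it averages the deviation over a few independent runs.

```python
import numpy as np
# assumes pr_sample, pr_best, f and the toy objective g from the sketch above

def pbil_path(g, p0, alpha, N, steps, rng):
    """Discrete PBIL path (M = 1) of the given length, per update (1)."""
    p = p0.copy()
    for _ in range(steps):
        X = (rng.random((N, len(p))) < p).astype(float)
        p = (1 - alpha) * p + alpha * X[np.argmax([g(x) for x in X])]
    return p

if __name__ == "__main__":
    n, N, t = 3, 10, 2.0
    vals = np.random.default_rng(0).permutation(2 ** n) + 1.0
    g = lambda z: vals[int(np.dot(z, 2 ** np.arange(n)[::-1]))]
    p0 = np.full(n, 0.5)
    ode = p0.copy()
    for _ in range(2000):                        # reference ODE solution at time t
        ode = ode + (t / 2000) * f(ode, g, N)
    for alpha in (0.1, 0.01, 0.001):
        rng = np.random.default_rng(7)
        runs = [pbil_path(g, p0, alpha, N, int(t / alpha), rng)
                for _ in range(20)]
        err = np.mean([np.abs(r - ode).max() for r in runs])
        print(alpha, round(err, 4))              # deviation should shrink as alpha -> 0
```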

We now study the solutions of the ODE and obtain the long-term behavior of $p(k)$.

Theorem 2. If the learning step is sufficiently small, the following holds for the PBIL: (a) all deterministic configurations are stationary configurations; (b) all non-deterministic configurations are non-stationary configurations; and (c) all local maxima of g are asymptotically stable, and the other points of $\Omega$ are unstable.

Proof. (a) By inspecting (10), if p is a deterministic configuration, i.e. $p \in K^*$, then by Lemma 2 we get $f(p) = 0$; therefore, p is a stationary configuration.

(b) Because g is an injective function, there is a set $S = \{y^u : y^u \in \Omega,\ \Pr(y^u \mid p) > 0,\ 1 \le u \le N_S\}$ such that for all $y^i, y^j \in S$ with $i < j$ we have $g(y^i) > g(y^j)$. Clearly, if p is a non-deterministic configuration, then $N_S$, the cardinality of S, is an even number greater than one. Letting $|x|$ denote the absolute value of x, we have

$$\sum_{i=1}^{n} \Bigl| \frac{dp_i}{dt} \Bigr| = \sum_{i=1}^{n} |f_i(p)|, \quad (11)$$

where each $f_i(p)$ expands via Lemma 1 into a sum over the $N_S$ elements of S. By inspection of (11), and taking into account that $N_S > 1$, we have $\sum_{i} |dp_i/dt| > 0$. Therefore, there is at least one i for which $dp_i/dt \ne 0$, and consequently p is not a stationary configuration.

(c) Based on the discussion in [3], Theorem 3, we conclude the following. (I) If $p^0$ is a local maximum, then all eigenvalues of $J(f(p^0))$, the Jacobian matrix of $f(\cdot)$ at $p^0$, equal $-1$, and by Lyapunov's indirect method [8], $p^0$ is an asymptotically stable stationary configuration of $f(\cdot)$. (II) If $p^0$ is not a local maximum, then there is at least one $v \in \Omega$ with $g(v) > g(p^0)$ and $v_q \ne p^0_q$ for some q. In this case, at least one of the eigenvalues is positive and, by Lyapunov's indirect method, $f(\cdot)$ is unstable at $p^0$. □

So, we can conclude that the PBIL never stays at a configuration that is not a local maximum of g. We now answer the question: is it possible that $p(k)$ does not converge to a local maximum of g? To this end, we study the convergence of the PBIL for four kinds of initial configuration $p(0)$, which together cover all points of K (the sketch after this list illustrates case (2) numerically):

(1) $p(0)$ is very close to a local maximum $p^0$. By Theorem 2, part (c), there is a neighborhood around $p^0$ in which the PBIL is absorbed by $p^0$. Thus, the PBIL converges to a local maximum of g.

(2) $p(0)$ is close to a deterministic configuration $p^0$ that is not a local maximum. By Theorem 2, part (c), no matter how close $p(0)$ is to $p^0$, the PBIL leaves that neighborhood and enters $K - K^*$.

(3) $p(0)$ is a deterministic configuration. Whether or not $p(0)$ is a local maximum, the PBIL remains at $p(0)$.

(4) For initial configurations in $K - K^*$, the convergence results are stated in Theorem 3 below.
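As a numerical illustration of case (2), the sketch below (ours, with hypothetical names) starts the PBIL near the deterministic configuration with the smallest g value, which cannot be a local maximum; a typical run leaves its neighborhood and settles at another vertex.

```python
import numpy as np

def pbil_run(g, p0, alpha=0.02, N=20, iters=4000, seed=None):
    """PBIL with M = 1, started from a given PV p0."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        X = (rng.random((N, len(p))) < p).astype(float)
        best = X[np.argmax([g(x) for x in X])]   # truncation selection, M = 1
        p = (1 - alpha) * p + alpha * best       # update rule (1)
    return p

if __name__ == "__main__":
    n = 3
    vals = np.random.default_rng(0).permutation(2 ** n) + 1.0
    g = lambda x: vals[int(np.dot(x, 2 ** np.arange(n)[::-1]))]
    # the globally worst vertex: every Hamming neighbour improves g,
    # so it is certainly not a local maximum
    worst = np.array(list(np.binary_repr(int(vals.argmin()), n)), dtype=float)
    p0 = np.clip(worst + 0.05 * (1 - 2 * worst), 0, 1)  # nudged off that vertex
    print(p0, "->", np.round(pbil_run(g, p0, seed=1), 3))
```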

Theorem 3. For every initial configuration in $K - K^*$, the PBIL converges to a local maximum of g.

Proof. The function $f(\cdot)$ is bounded and continuous on K, so there is a differentiable function $F : \mathbb{R}^n \to \mathbb{R}$ such that $(\partial F / \partial p_i)(p) = f_i(p)$ for $1 \le i \le n$ and $p \in K$. Now consider the variation of F along the trajectories of the ODE:

$$\frac{dF}{dt}(p) = \sum_{i=1}^{n} \frac{\partial F}{\partial p_i}(p)\, \frac{dp_i}{dt} = \sum_{i=1}^{n} f_i(p)\, \frac{dp_i}{dt} = \sum_{i=1}^{n} \Bigl( \frac{dp_i}{dt} \Bigr)^2 \ge 0. \quad (12)$$

Thus, F is non-decreasing along the trajectories of the ODE. Also, by the nature of the algorithm given by (1), for each initial configuration in $K - K^*$ the solution of the ODE $dX/dt = f(X)$ remains confined to K, which is a compact subset of $\mathbb{R}^n$. Hence, by LaSalle's Theorem ([8], Theorem 2.7), all trajectories asymptotically enter the set $K_1 = \{p \in K : (dF/dt)(p) = 0\}$. By (12), for every $p \in K_1$ and each i we have $dp_i/dt = 0$, so p is a stationary configuration of the ODE. Thus, all solutions converge to stationary configurations. Since, by Theorem 2, all stationary configurations that are not local maxima are unstable, the theorem follows. □


Theorems 2 and 3 together characterize the long-term behavior of the PBIL when the function g is injective. Theorem 2 states that only the local maxima of g are asymptotically stable stationary configurations of the algorithm. In addition, Theorem 3 shows that the PBIL cannot converge to any point of K that is not a local maximum. If the function g is not injective, then Theorem 2, part (b), may fail, and we can no longer be sure that the local maxima of g are the only stable stationary configurations of the PBIL. In this case, the PBIL may converge to some non-deterministic configuration and stay there.

4. Conclusion

In this paper, based on weak convergence and nonlinear systems theories, we have proven that the local maxima of an injective function are asymptotically stable stationary points of the PBIL and shown that the PBIL converges to one of these local maxima.

References

[1] S. Baluja, Population based incremental learning: a method for integrating genetic search based function optimization and competitive learning, Technical Report, Carnegie Mellon University, 1994.

[2] J.S. De Bonet, C.L. Isbell, P. Viola, MIMIC: finding optima by estimating probability densities, in: Proceedings of NIPS'97, MIT Press, Cambridge, MA, 1997, pp. 424–431.

[3] C. González, J.A. Lozano, P. Larrañaga, Analyzing the PBIL algorithm by means of discrete dynamical systems, Complex Systems 12 (4) (2000) 465–479.


[4] C. González, J.A. Lozano, P. Larrañaga, The convergence behavior of the PBIL algorithm: a preliminary approach, in: Proceedings of ICANNGA'01, Prague, Czech Republic, 2001, pp. 228–231.

[5] M. Höhfeld, G. Rudolph, Towards a theory of population based incremental learning, in: Proceedings of IEEE CEC'97, IEEE Press, Indianapolis, USA, 1997, pp. 1–5.

[6] H.J. Kushner, Approximation and Weak Convergence Methods for Random Processes, MIT Press, Cambridge, MA, 1984.

[7] H. Mühlenbein, T. Mahnig, The factorized distribution algorithm for additively decomposed functions, in: Proceedings of IEEE CEC'99, IEEE Press, Washington, DC, USA, 1999, pp. 752–759.

[8] K.S. Narendra, A. Annaswamy, Stable Adaptive Systems, Prentice-Hall, Englewood Cliffs, NJ, 1989.

Reza Rastegar received his BS degree in computer engineering in 2002 and his MS degree in artificial intelligence in 2005, both from Amirkabir University of Technology, Iran. He is currently studying for an MS in mathematics at Southern Illinois University at Carbondale, USA. His main research interests include quantum computing, evolutionary computing, combinatorics, and optimization. He is a student member of the IEEE, SIAM, and ACM.

Arash Hariri received his BS degree in computer engineering from Amirkabir University of Technology, Iran, in 2003, and is currently studying for an MS degree in computer engineering at Shahid Beheshti University, Iran. His research interests are primarily in the areas of VLSI, reconfigurable computing, and machine learning. He is a student member of the IEEE.