An optimal Q-state neural network using mutual information


13 May 2002

Physics Letters A 297 (2002) 156–161 www.elsevier.com/locate/pla

D. Bollé, T. Verbeiren *

Instituut voor Theoretische Fysica, KU Leuven, B-3001 Leuven, Belgium

Received 21 December 2001; received in revised form 26 March 2002; accepted 27 March 2002
Communicated by C.R. Doering

* Corresponding author. E-mail address: [email protected] (T. Verbeiren).

Abstract

Starting from the mutual information we present a method to find a Hamiltonian for a fully connected neural network model with an arbitrary, finite number of neuron states, Q. For small initial correlations between the neurons and the patterns it leads to optimal retrieval performance. For binary neurons, Q = 2, and biased patterns we recover the Hopfield model. For three-state neurons, Q = 3, we recover the recently introduced Blume–Emery–Griffiths network Hamiltonian. We derive its phase diagram and compare it with those of related three-state models; we find that its retrieval region is the largest. © 2002 Elsevier Science B.V. All rights reserved.

PACS: 87.18.Sn; 05.20.-y; 87.10.+e

Keywords: Spin-glass neural networks; Phase diagram; Mutual information; Multi-state neurons

1. Introduction

One of the challenging problems in the statistical mechanics approach to associative memory neural networks is the choice of the Hamiltonian and/or learning rule leading to the best retrieval properties, including, e.g., the largest retrieval overlap, loading capacity, basin of attraction and convergence time. Recently, it has been shown that the mutual information is the most appropriate concept to measure the retrieval quality, especially for sparsely coded networks but also in general ([1,2] and references therein).

A natural question is then whether one could use the mutual information in a systematic way to determine a priori an optimal Hamiltonian guaranteeing the properties described above for an arbitrary scalar-valued neuron (spin) model. Optimal means, especially, that although the network might start initially far from the embedded pattern it is still able to retrieve it. In the following we answer this question by presenting a general scheme to express the mutual information as a function of the relevant macroscopic parameters (e.g., the overlap with the embedded patterns, the activity, ...) and by constructing a Hamiltonian from it for general Q-state neural networks. For Q = 2, we recover the Hopfield model for biased patterns [3], ensuring that this Hamiltonian is optimal in the sense described above. For Q = 3, we obtain a Blume–Emery–Griffiths-type Hamiltonian, providing a firm basis for the result found in [4]. In that paper, however, the properties of this Hamiltonian were not discussed; only the dynamics of an extremely diluted version of the model was treated. Hence, we derive the thermodynamic phase diagram for the fully connected network modeled by this Hamiltonian and show, e.g., that it has the largest retrieval region compared with the other three-state models known in the literature.

2. Optimal Hamiltonian

Consider a network of N neurons Σ_i, i = 1, ..., N, taking different values, σ_i, from a discrete set of Q states, S, with a certain probability distribution. In this network we want to store p = αN patterns Ξ_i^μ, μ = 1, ..., p, taking different values, ξ_i^μ, out of the same set S with a certain probability distribution. Both sets of random variables are chosen to be independent identically distributed with respect to i.

We want to study the mutual information between the neurons and the patterns, a measure of the correlations between them. At this point we note that, since the interactions are of infinite range, the neural network system is mean-field, such that the probability distributions of all the neurons and all the patterns are of product type, e.g., p({σ_i}) = ∏_i p(σ_i). In a statistical mechanical treatment of these mean-field models any order parameter O^μ can be written, in the thermodynamic limit N → ∞, as

\langle O^{\mu}\rangle \equiv \bigl\langle O_i(\sigma_i,\xi_i^{\mu})\bigr\rangle_{\{\sigma_i\},\{\xi_i\}} = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} O_i(\sigma_i,\xi_i^{\mu}) = \sum_{\sigma\in S}\sum_{\xi\in S^{p}} p_{\Sigma\Xi}(\sigma,\xi)\,O(\sigma,\xi^{\mu}),    (1)

where ξ = {ξ^μ} and where p_{ΣΞ}(σ, ξ) is the joint probability distribution of the neurons and the patterns. In the last equality we have used the mean-field character of the system in order to obtain single-site expressions. We do not write this single-site index i.

The mutual information I for the random variables Σ and Ξ is then given by (see, e.g., [5])

I(\Sigma,\Xi) = \sum_{\sigma\in S}\sum_{\xi\in S^{p}} p_{\Sigma\Xi}(\sigma,\xi)\,\ln\frac{p_{\Sigma\Xi}(\sigma,\xi)}{p_{\Sigma}(\sigma)\,p_{\Xi}(\xi)}.    (2)
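As a minimal numerical sketch (not part of the original Letter; the function name and the example table are purely illustrative), definition (2) can be evaluated directly from a joint probability table over single-site neuron and pattern states:

```python
import numpy as np

def mutual_information(p_joint):
    """I(Sigma, Xi) of Eq. (2) for a joint probability table p_joint[s, x] (in nats)."""
    p_sigma = p_joint.sum(axis=1)              # marginal of the neuron,  p_Sigma(sigma)
    p_xi = p_joint.sum(axis=0)                 # marginal of the pattern, p_Xi(xi)
    p_indep = np.outer(p_sigma, p_xi)          # product distribution p_Sigma * p_Xi
    mask = p_joint > 0                         # zero-probability cells contribute nothing
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / p_indep[mask])))

# Example: Q = 3 states with a weak correlation between neuron and pattern.
p = np.array([[0.12, 0.10, 0.11],
              [0.10, 0.14, 0.10],
              [0.11, 0.10, 0.12]])
print(mutual_information(p / p.sum()))         # a small positive number
```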


A good network would be one that starts initially far from a pattern but is still able to retrieve it. Far in this context means that the random variables, neurons and patterns, are almost independent. When Σ and Ξ are completely independent, then p_{ΣΞ} = p_Σ p_Ξ. Consequently, when they are almost independent we can write

p_{\Sigma\Xi} = p_{\Sigma}\,p_{\Xi} + \Delta_{\Sigma\Xi},    (3)

with ∆_{ΣΞ} small pointwise. We remark that

\sum_{\sigma,\xi}\Delta_{\Sigma\Xi}(\sigma,\xi) = 0.    (4)

Plugging the relation (3) into the definition (2) and expanding the logarithm up to second order in the small correlations, ∆, we find using (4)

I(\Sigma,\Xi) = \frac{1}{2}\sum_{\sigma,\xi}\frac{\bigl(\Delta_{\Sigma\Xi}(\sigma,\xi)\bigr)^{2}}{p_{\Sigma}(\sigma)\,p_{\Xi}(\xi)} + O\bigl(\Delta^{3}\bigr).    (5)

The r.h.s. of (5) can be rewritten as

I(\Sigma,\Xi) = \frac{1}{2}\Bigl\langle\Bigl\langle\Bigl(\frac{\Delta_{\Sigma\Xi}(\sigma,\xi)}{p_{\Sigma}(\sigma)\,p_{\Xi}(\xi)}\Bigr)^{2}\Bigr\rangle_{\sigma}\Bigr\rangle_{\xi} + O\bigl(\Delta^{3}\bigr).    (6)

This expression is the average of the square of the scaled difference between the correlated and uncorrelated probability distributions, showing that the approximation (5) is in fact natural. We remark that all the patterns are still contained in (5). As usual, only one pattern is considered to be condensed, i.e., to have an overlap of order 1 with the initial state of the system. The overlaps with the remaining patterns are of order 0 and responsible for the retrieval noise. We do not write the condensed pattern index.

Next, we want to express ∆_{ΣΞ} in terms of macroscopic, physical quantities of the system (order parameters). Referring to (1) we write down the following Q² moments

m^{cd} = \lim_{N\to\infty}\frac{1}{N}\sum_{i}\sigma_i^{c}\,\xi_i^{d} = \bigl\langle\sigma^{c}\xi^{d}\bigr\rangle_{\{\sigma,\xi\}} = \sum_{\sigma,\xi} p_{\Sigma\Xi}(\sigma,\xi)\,\sigma^{c}\xi^{d},    (7)

with c, d = 0, ..., Q − 1, and using the notation 0⁰ = 1. We remark that m⁰⁰ = 1, such that we have in general Q² − 1 independent parameters specifying p_{ΣΞ}(σ, ξ).
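The expansion (5) is easy to check numerically. The following sketch (again illustrative only, with arbitrary probabilities) builds a weakly correlated joint table p_{ΣΞ} = p_Σ p_Ξ + ∆ with zero-sum ∆ as in (3)-(4) and compares the exact mutual information (2) with the quadratic expression (5); the two agree up to O(∆³):

```python
import numpy as np

def exact_mi(p_joint):
    """Mutual information of Eq. (2) computed directly from the joint table."""
    ps, px = p_joint.sum(axis=1), p_joint.sum(axis=0)
    return float(np.sum(p_joint * np.log(p_joint / np.outer(ps, px))))

def quadratic_mi(p_sigma, p_xi, delta):
    """Second-order approximation, Eq. (5): 1/2 * sum delta^2 / (p_Sigma * p_Xi)."""
    return float(0.5 * np.sum(delta**2 / np.outer(p_sigma, p_xi)))

rng = np.random.default_rng(0)
Q = 3
p_sigma = np.array([0.25, 0.50, 0.25])           # neuron state probabilities (illustrative)
p_xi = np.full(Q, 1.0 / Q)                       # pattern state probabilities (illustrative)

delta = rng.normal(size=(Q, Q))
delta -= delta.mean(axis=1, keepdims=True)       # zero row sums ...
delta -= delta.mean(axis=0, keepdims=True)       # ... and zero column sums, so Eq. (4) holds
delta *= 1e-3                                    # "small pointwise", Eq. (3)

p_joint = np.outer(p_sigma, p_xi) + delta        # weakly correlated joint distribution
print(exact_mi(p_joint), quadratic_mi(p_sigma, p_xi, delta))   # agree up to O(delta^3)
```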



Up to now the derivation is valid for general Q-state scalar-valued neurons. To fix the ideas we choose the neuron states as

\sigma_c = -1 + \frac{2(c-1)}{Q-1}, \qquad c = 1, \dots, Q.    (8)

This choice corresponds to a Q-state Ising-type architecture, leading to

m^{cd} = \sum_{x,y=1}^{Q} T^{cd}_{xy}\, p_{\Sigma\Xi}(\sigma_x, \xi_y),    (9)

with

T^{cd}_{xy} = \Bigl(-1 + \frac{2(x-1)}{Q-1}\Bigr)^{c} \Bigl(-1 + \frac{2(y-1)}{Q-1}\Bigr)^{d}.    (10)

In a similar way we introduce A by

m^{c0} = \sum_{x} A^{c}_{x}\, p_{\Sigma}(\sigma_x), \qquad m^{0c} = \sum_{x} A^{c}_{x}\, p_{\Xi}(\xi_x), \qquad A^{c}_{x} = \Bigl(-1 + \frac{2(x-1)}{Q-1}\Bigr)^{c}.    (11)

Finally, by introducing the inverse transformations S and B,

\sum_{x,y} T^{cd}_{xy}\, S^{c'd'}_{xy} = \delta_{c,c'}\,\delta_{d,d'}, \qquad \sum_{x} A^{c}_{x}\, B^{d}_{x} = \delta_{c,d},    (12)

we can write the approximation of the mutual information up to order ∆² as

\tilde{I} = \frac{1}{2}\sum_{x,y} \frac{\bigl(\sum_{cd} S^{cd}_{xy}\, m^{cd} - \sum_{cd} B^{c}_{x} B^{d}_{y}\, m^{c0} m^{0d}\bigr)^{2}}{\sum_{cd} B^{c}_{x} B^{d}_{y}\, m^{c0} m^{0d}},    (13)

where we have left out the dependence on Σ and Ξ. In (13), Ĩ is expressed as a function of the macroscopic variables of the system. Employing (1), it can be rewritten in terms of configurational averages of the system and, hence, for large N, as a function of the microscopic variables σ_i and ξ_i. Using (13) for every pattern μ, summing over μ and multiplying by N, we get an extensive quantity, denoted by Ĩ_N, which grows monotonically as a function of the correlation between spins and patterns. Therefore, H = −Ĩ_N is a good candidate for a Hamiltonian. Configurational averages also enter into the denominator of (13). Since we are mainly interested in the correlations between spins and patterns, rather than in the respective single probability distributions, we assume that the latter are equal, such that we use the known distribution of the patterns in the denominator.
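The scheme above is straightforward to set up numerically. The sketch below (illustrative; index conventions and variable names are ours) constructs the states (8), the transformations T of (9)-(10) and A of (11) as matrices, takes their inverses S and B as in (12), and checks that the Q² moments m^{cd} determine the joint distribution and the marginals:

```python
import numpy as np

Q = 3
s = -1.0 + 2.0 * np.arange(Q) / (Q - 1)          # equidistant Ising-type states, Eq. (8)
powers = np.arange(Q)

A = s[None, :] ** powers[:, None]                # A[c, x] = (sigma_x)^c, Eq. (11); 0**0 = 1
T = np.einsum("cx,dy->cdxy", A, A).reshape(Q * Q, Q * Q)   # T^{cd}_{xy}, Eqs. (9)-(10)

S = np.linalg.inv(T)                             # inverse transformations of Eq. (12)
B = np.linalg.inv(A)

rng = np.random.default_rng(1)
p_joint = rng.random((Q, Q))                     # some joint distribution p(sigma_x, xi_y)
p_joint /= p_joint.sum()

m = T @ p_joint.reshape(-1)                      # the Q^2 moments m^{cd}, Eq. (9)
print(np.isclose(m[0], 1.0))                     # True: normalisation, m^{00} = 1
print(np.allclose(S @ m, p_joint.reshape(-1)))   # True: joint distribution recovered from moments

p_sigma = p_joint.sum(axis=1)                    # marginal distribution of the neurons
m_c0 = m.reshape(Q, Q)[:, 0]                     # the moments m^{c0} of Eq. (11)
print(np.allclose(B @ m_c0, p_sigma))            # True: marginal recovered via B, Eq. (12)
```

For Q = 2 and Q = 3 these are exactly the inversions used in Section 3 to write Ĩ in terms of a few moments.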

3. Results for Q = 2 and Q = 3

What we have presented up to now is a scheme to calculate a Hamiltonian for a general Q-state network with an Ising-type architecture using mutual information. Next, we discuss some specific examples, Q = 2 and Q = 3, in detail.

We start with Q = 2 states. Given the probabilities associated with each state, the inversion of the transformations T (10) and A (11) leads to

p_{\Sigma\Xi}(\sigma,\xi) = \frac{1-m^{10}-m^{01}+m^{11}}{4}\,\delta_{\sigma,-1}\delta_{\xi,-1} + \frac{1-m^{10}+m^{01}-m^{11}}{4}\,\delta_{\sigma,-1}\delta_{\xi,1} + \frac{1+m^{10}-m^{01}-m^{11}}{4}\,\delta_{\sigma,1}\delta_{\xi,-1} + \frac{1+m^{10}+m^{01}+m^{11}}{4}\,\delta_{\sigma,1}\delta_{\xi,1}.    (14)

The distributions p_Σ(σ) and p_Ξ(ξ) can be found by summing out ξ and σ, respectively. Using these distributions, Ĩ becomes

\tilde{I} = \frac{\bigl(m^{11}-m^{01}m^{10}\bigr)^{2}}{2\,\bigl(1-(m^{01})^{2}\bigr)\,\bigl(1-(m^{10})^{2}\bigr)}.    (15)

Substituting the averages over the probability distributions by configurational averages and putting m¹⁰ = m⁰¹ = b in the denominator, where b is the bias of the patterns, we get

H = -\frac{1}{2(1-b^{2})^{2}}\,\frac{1}{N}\sum_{\mu}\Bigl(\sum_{i}\bigl(\sigma_i\,\xi_i^{\mu} - b\,\sigma_i\bigr)\Bigr)^{2}.    (16)


This Hamiltonian can be written as

H = -\frac{1}{2}\sum_{ij} J_{ij}\,\sigma_i\sigma_j, \qquad J_{ij} = \frac{1}{N(1-b^{2})^{2}}\sum_{\mu}\bigl(\xi_i^{\mu}-b\bigr)\bigl(\xi_j^{\mu}-b\bigr),    (17)

and this is precisely the Hopfield Hamiltonian with [3] and without [6] bias. We remark that a particularly nice aspect of this treatment is that the adjustment of the learning rule due to the bias enters in a natural way. Furthermore, we learn that the Hopfield Hamiltonian is the optimal two-state Hamiltonian in the sense that we started from the mutual information calculated for an initial state having a small overlap with the condensed pattern. This confirms a well-known fact in the literature.
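A quick numerical check (illustrative code, not from the Letter; the sizes N, p and the bias b are arbitrary choices) that the quadratic form (16) and the pairwise form (17) define the same Hamiltonian:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, b = 200, 10, 0.3                                      # sizes and bias are illustrative
xi = np.where(rng.random((p, N)) < (1 + b) / 2, 1.0, -1.0)  # +-1 patterns with bias <xi> = b

# Bias-corrected Hebbian couplings of Eq. (17); self-couplings kept for the comparison below
J = (xi - b).T @ (xi - b) / (N * (1.0 - b**2) ** 2)

sigma = np.where(rng.random(N) < (1 + b) / 2, 1.0, -1.0)    # an arbitrary network state

H_pair = -0.5 * sigma @ J @ sigma                                       # Eq. (17)
H_info = -np.sum(((xi - b) @ sigma) ** 2) / (2 * (1 - b**2) ** 2 * N)   # Eq. (16)
print(np.isclose(H_pair, H_info))                           # True: (16) and (17) coincide
```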

For Q = 3 we focus, without loss of generality, on the case where the distributions are taken symmetric around zero, meaning that all the odd moments vanish. Following the scheme proposed above we arrive at

\tilde{I} = \frac{1}{2}\,\frac{(m^{11})^{2}}{m^{02}m^{20}} + \frac{1}{2}\,\frac{\bigl(m^{22}-m^{02}m^{20}\bigr)^{2}}{m^{02}m^{20}\,(1-m^{02})(1-m^{20})}.    (18)

Identifying m⁰² = a as the activity of the patterns, m²⁰ = q as the activity of the neurons, m¹¹ = am as the overlap, m²² = an as the activity overlap [2], and defining l = (n − q)/(1 − a), we arrive at

\tilde{I} = \frac{1}{2}\bigl(m^{2}+l^{2}\bigr).    (19)

This leads to a Hamiltonian

H = -\frac{1}{2}\sum_{i,j} J_{ij}\,\sigma_i\sigma_j - \frac{1}{2}\sum_{i,j} K_{ij}\,\sigma_i^{2}\sigma_j^{2},    (20)

with

J_{ij} = \frac{1}{a^{2}N}\sum_{\mu=1}^{p} \xi_i^{\mu}\xi_j^{\mu}, \qquad K_{ij} = \frac{1}{(a(1-a))^{2}N}\sum_{\mu=1}^{p} \eta_i^{\mu}\eta_j^{\mu}, \qquad \eta_i^{\mu} = \bigl(\xi_i^{\mu}\bigr)^{2} - a.    (21)
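The identity H = −Ĩ_N behind (19)-(21) can be verified directly. The sketch below (illustrative only; sizes are arbitrary) builds the couplings (21) from uniform three-state patterns and checks that the Hamiltonian (20), with self-couplings kept, equals −N Σ_μ (m_μ² + l_μ²)/2:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, a = 300, 15, 2.0 / 3.0                              # sizes illustrative; a = <xi^2> = 2/3
xi = rng.choice([-1.0, 0.0, 1.0], size=(p, N))            # uniform three-state patterns
eta = xi**2 - a                                           # eta_i^mu of Eq. (21)

J = xi.T @ xi / (a**2 * N)                                # couplings of Eq. (21),
K = eta.T @ eta / ((a * (1.0 - a)) ** 2 * N)              # self-couplings kept on purpose

sigma = rng.choice([-1.0, 0.0, 1.0], size=N)              # an arbitrary network state

H_direct = -0.5 * sigma @ J @ sigma - 0.5 * (sigma**2) @ K @ (sigma**2)   # Eq. (20)

m = xi @ sigma / (a * N)                                  # overlaps m_mu
l = eta @ sigma**2 / (a * (1.0 - a) * N)                  # activity-overlap combinations l_mu
H_info = -0.5 * N * np.sum(m**2 + l**2)                   # -I_N, with I of Eq. (19) per pattern

print(np.isclose(H_direct, H_info))                       # True
```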

This Hamiltonian resembles the Blume–Emery–Griffiths (BEG) Hamiltonian [7]. The derivation above provides a firm basis for, and generalizes, the results found in [4]. In that paper the dynamics has been studied for an extremely diluted asymmetric version of this model. Here we want to discuss the fully connected architecture and derive the thermodynamic phase diagram, which has not been done in the literature, in order to compare it with the other Q = 3 state models known.

In order to calculate the free energy we use the standard replica method [8]. Starting from the replicated partition function and assuming replica symmetry we obtain

f = \frac{1}{2}\sum_{\nu}\bigl(m_\nu^{2}+l_\nu^{2}\bigr) + \frac{\alpha}{2\beta}\log(1-\chi) + \frac{\alpha}{2\beta}\,\frac{\chi}{1-\chi} + \frac{\alpha}{2\beta}\log(1-\phi) + \frac{\alpha}{2\beta}\,\frac{\phi}{1-\phi} + \frac{\alpha}{2}\,\frac{A q_1 \chi}{(1-\chi)^{2}} + \frac{\alpha}{2}\,\frac{B p_1 \phi}{(1-\phi)^{2}} - \frac{1}{\beta}\Bigl\langle \int Ds\,Dt\, \log \mathrm{Tr}_{\sigma}\, \exp\bigl(\beta\tilde{H}\bigr) \Bigr\rangle_{\{\xi^{\nu}\}},

with ν denoting the condensed patterns and

\tilde{H} = A\sigma\Bigl(\sum_{\nu} m_\nu\,\xi^{\nu} + \sqrt{\alpha r}\,s\Bigr) + B\sigma^{2}\Bigl(\sum_{\nu} l_\nu\,\eta^{\nu} + \sqrt{\alpha u}\,t\Bigr) + \frac{\sigma^{2}\alpha A\chi}{2(1-\chi)} + \frac{\sigma^{4}\alpha B\phi}{2(1-\phi)},    (22)

and A = 1/a, B = 1/(a(1 − a)), Ds and Dt the Gaussian measures Ds = ds (2π)^{-1/2} exp(−s²/2), and

\chi = A\beta(q_0-q_1), \qquad \phi = B\beta(p_0-p_1), \qquad r = \frac{q_1}{(1-\chi)^{2}}, \qquad u = \frac{p_1}{(1-\phi)^{2}}.    (23)

For Q = 3 (σ² = σ⁴), the order parameters are defined as follows:

m_\nu = A\,\Bigl\langle \xi^{\nu} \int Ds\,Dt\, \langle\sigma\rangle_{\beta} \Bigr\rangle_{\{\xi^{\nu}\}}, \qquad l_\nu = B\,\Bigl\langle \eta^{\nu} \int Ds\,Dt\, \langle\sigma^{2}\rangle_{\beta} \Bigr\rangle_{\{\xi^{\nu}\}},

q_0 = p_0 = \Bigl\langle \int Ds\,Dt\, \langle\sigma^{2}\rangle_{\beta} \Bigr\rangle_{\{\xi^{\nu}\}}, \qquad q_1 = \Bigl\langle \int Ds\,Dt\, \langle\sigma\rangle_{\beta}^{2} \Bigr\rangle_{\{\xi^{\nu}\}}, \qquad p_1 = \Bigl\langle \int Ds\,Dt\, \langle\sigma^{2}\rangle_{\beta}^{2} \Bigr\rangle_{\{\xi^{\nu}\}},    (24)

with the small brackets ⟨···⟩_β denoting the usual thermal average.
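Numerically, the fixed-point equations (24) are solved by iterating single-site thermal averages under H̃ of (22), integrated over the Gaussian measures Ds and Dt of (23). A minimal sketch of these two ingredients (illustrative only; the quadrature order and all parameter values are arbitrary, and the t-integral and the reaction terms of (22) are omitted here for brevity):

```python
import numpy as np

STATES = np.array([-1.0, 0.0, 1.0])                      # Q = 3, hence sigma^2 = sigma^4

def thermal_averages(h, theta, beta):
    """<sigma>_beta and <sigma^2>_beta for the weight exp(beta*(h*sigma + theta*sigma^2))."""
    w = np.exp(beta * (h * STATES + theta * STATES**2))
    w /= w.sum()
    return w @ STATES, w @ STATES**2

# Quadrature for the Gaussian measure Ds = ds (2*pi)^(-1/2) exp(-s^2/2) of Eq. (23)
nodes, weights = np.polynomial.hermite_e.hermegauss(40)
weights = weights / np.sqrt(2.0 * np.pi)

def gaussian_average(f):
    return float(np.sum(weights * np.array([f(s) for s in nodes])))

a, alpha, beta = 2.0 / 3.0, 0.05, 2.0                    # trial values, not a solved fixed point
A, B = 1.0 / a, 1.0 / (a * (1.0 - a))                    # as in Eq. (23)
m, l, r = 0.8, 0.8, 1.0

# <sigma>_beta averaged over the retrieval noise s, for a site with xi = 1 (so eta = 1 - a);
# the t-integral and the reaction terms of Eq. (22) are omitted in this simplified example.
print(gaussian_average(
    lambda s: thermal_averages(A * (m + np.sqrt(alpha * r) * s), B * l * (1.0 - a), beta)[0]))
```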

We recall that m_ν is the overlap, l_ν is related to the activity overlap, q₀ is the activity of the neurons, and q₁ and p₁ are Edwards–Anderson parameters. For one condensed pattern the index ν can be dropped.

Solving the fixed-point equations for the order parameters and considering uniform patterns (a = 2/3), we obtain a rich T–α phase diagram. The phases that are important from a neural network point of view are presented in Fig. 1. The border of the retrieval phase (m > 0, l > 0) is denoted by a thick full line.

Fig. 1. Q = 3 T–α phase diagram for uniform patterns. The meaning of the lines is explained in the text.

The most important result is that the capacity of the BEG neural network is much greater than that of other Q = 3 models. Compared with the Q = 3-Ising model [9], e.g., it is almost twice as large at T = 0. Moreover, the overlap m also stays closer to its maximal value, 1, in the whole retrieval region. A study of the dynamics of this model, which is in progress, confirms these results.

Another new feature in the phase diagram, compared with other models, is the so-called quadrupolar phase (m = 0 but l > 0), which lies below the thin full line. Simulations for T = 0 indicate that this phase is not stable. However, for T greater than 0 stability may occur, depending on the specific parameters of the model. This phase has also been seen for the extremely diluted BEG network [4]. In a different context it is also present in BEG spin models (e.g., [7,10–13] and references therein). In this phase the overlap is zero but the activity overlap is not, meaning that the active neurons (±1) coincide with the active patterns but the signs are not correlated. This implies that the information content of the system is non-zero. One might imagine an example in pattern recognition where, looking at a black and white picture on a grey background, this phase would tell us the exact location of the picture with respect to the background without finding the details of the picture itself.

Besides these phases one also has a spin-glass phase and a paramagnetic phase (separated by the broken line in Fig. 1). The latter coexists with the retrieval phase in a region near the T-axis. This coexistence region arises because the transition at α = 0 from the retrieval to the paramagnetic phase is first order, giving rise to a tricritical point. This has also been seen in the Potts neural network model [14].

In the case of non-uniform patterns, preliminary results indicate that the quadrupolar phase extends outside the retrieval phase for a ≳ 0.7, especially in the high-temperature regime. Moreover, the capacity is a non-monotonic function of a with a maximum for a ∼ 0.65, and the tricritical point moves to the α = 0 axis for a = 1/2. An extended version of the work on the BEG fully connected neural network is in progress.
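As a rough illustration of how the retrieval (m > 0, l > 0) and quadrupolar (m = 0, l > 0) phases show up at the microscopic level, the sketch below (our own illustration, not the simulation code referred to above; the system size, loading and sequential zero-temperature update rule are arbitrary choices) runs the Hamiltonian (20)-(21) at T = 0 and measures the overlap m and activity overlap l of a noisy initial state:

```python
import numpy as np

STATES = np.array([-1.0, 0.0, 1.0])
rng = np.random.default_rng(4)

N, p, a = 400, 8, 2.0 / 3.0                              # small, illustrative sizes
xi = rng.choice(STATES, size=(p, N))                     # uniform three-state patterns
eta = xi**2 - a

J = xi.T @ xi / (a**2 * N)                               # couplings of Eq. (21)
K = eta.T @ eta / ((a * (1.0 - a)) ** 2 * N)
np.fill_diagonal(J, 0.0)                                 # no self-couplings in the dynamics
np.fill_diagonal(K, 0.0)

def overlaps(sigma, mu=0):
    """Overlap m and activity overlap l with pattern mu, cf. the identifications below (18)."""
    m = xi[mu] @ sigma / (a * N)
    q = np.mean(sigma**2)
    n = (xi[mu] ** 2) @ (sigma**2) / (a * N)
    return m, (n - q) / (1.0 - a)

# Start near pattern 0 and relax with zero-temperature sequential updates: each neuron takes
# the state minimising its local energy -sigma*h_i - sigma^2*theta_i under Hamiltonian (20).
sigma = xi[0].copy()
flip = rng.random(N) < 0.2                               # corrupt 20% of the sites
sigma[flip] = rng.choice(STATES, size=int(flip.sum()))

for _ in range(10):                                      # a few sweeps
    for i in rng.permutation(N):
        h, theta = J[i] @ sigma, K[i] @ sigma**2
        sigma[i] = STATES[np.argmax(h * STATES + theta * STATES**2)]

print(overlaps(sigma))   # (m, l) should be close to (1, 1) well inside the retrieval phase
```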

4. Concluding remarks

In conclusion, we have presented a method, starting from the mutual information between the neurons and the patterns, to derive an optimal Hamiltonian for a general Q-state neural network. The derivation assumes that the correlations between the neurons and the patterns are initially small, and thus guarantees optimal retrieval properties (loading capacity, basin of attraction) for the model. For Q = 2 we recover the Hopfield Hamiltonian for biased patterns, while for Q = 3 we find the Blume–Emery–Griffiths Hamiltonian. We have derived the phase diagram for this fully connected BEG model, confirming that its capacity is larger than that of related models. We believe that similar results can be obtained for vector models and other architectures.

Acknowledgements

We would like to thank D. Dominguez, R. Erichsen Jr., M. Idiart, G.M. Shim, A. Theumann and W.K. Theumann for useful discussions. This work was supported in part by the Fund for Scientific Research, Flanders (Belgium).

References

[1] D.R.C. Dominguez, D. Bollé, Phys. Rev. Lett. 80 (1998) 2961.


[2] D. Bollé, D. Dominguez Carreta, Physica A 286 (2000) 401.
[3] D.J. Amit, H. Gutfreund, H. Sompolinsky, Phys. Rev. A 35 (1987) 2293.
[4] D. Carreta Dominguez, E. Korutcheva, Phys. Rev. E 62 (2000) 2620.
[5] R.E. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1990, Chapter 5.
[6] J.J. Hopfield, Proc. Natl. Acad. Sci. USA 79 (1982) 2554.
[7] M. Blume, V.J. Emery, R.B. Griffiths, Phys. Rev. A 4 (1971) 1071.
[8] M. Mézard, G. Parisi, M.A. Virasoro, Spin Glass Theory and Beyond, World Scientific, Singapore, 1987.
[9] D. Bollé, H. Rieger, G.M. Shim, J. Phys. A 27 (1994) 3411.
[10] M. Sellitto, M. Nicodemi, J. Arenzon, J. Physique I 7 (1997) 945.
[11] S. Schreiber, Eur. Phys. J. B 9 (1999) 471.
[12] B.S. Branco, Phys. Rev. B 60 (1999) 1033.
[13] A. de Caudia, cond-mat/0110029.
[14] D. Bollé, P. Dupont, J. Huyghebaert, Physica A 185 (1992) 363.