
Neural Networks, Vol. 6, pp. 681-688, 1993. Printed in the USA. All rights reserved. 0893-6080/93 $6.00 + .00. Copyright © 1993 Pergamon Press Ltd.

ORIGINAL CONTRIBUTION

Strongly Diluted Networks With Selfinteraction

HARALD ENGLISCH,* YEGAO XIAO, AND KAILUN YAO

Huazhong University of Science and Technology, P.R. China

(Received 18 January 1991; revised and accepted 15 October 1992)

Acknowledgment: We thank the editor Prof. S.-I. Amari for drawing our attention to Yanai and Sawada (1990).

* On leave of absence from Universität Leipzig, Institut für Informatik, PSF 920, D-O-7010 Leipzig, Germany. Requests for reprints should be sent to Prof. Dr. H. Englisch, Universität Leipzig, Institut für Informatik, PSF 920, D-O-7010 Leipzig, Germany.

Abstract—The retrieval quality for the strongly diluted network with selfinteraction is calculated rigorously for any time step and arbitrary distribution of the stabilities. For the network with a delta and a Gaussian distribution of the stabilities, respectively, the optimal strength of the selfinteraction for one time step is given by an explicit expression. Different definitions for the critical storage capacity of the net with constant stabilities are analyzed. According to the most usual definition, the net with optimal selfinteraction has an infinite storage capacity. The approximated optimal retrieval quality found by neglection of correlations is larger than the exact optimal retrieval quality.

Keywords—Neural networks, Autoassociative memory, Optimal threshold, Selfinteraction, Strong dilution, Rigorously solvable model, Critical storage capacity, Distribution of stabilities.

1. INTRODUCTION

In the Little-Hopfield model with parallel updating, a perturbed version S(0) ∈ {±1}^N of one of p given patterns ξ^μ ∈ {±1}^N should be mapped onto an improved version S(t) (usually t = 1 or t = ∞ is taken) of this pattern via the dynamics S(t + 1) = sgn(JS(t) + g(t)). The matrix of the synaptic couplings J contains the information about the set {ξ^μ}, and g(t) is the threshold vector. Usually the threshold vector does not appear in the Little-Hopfield model and the diagonal elements J_ii are zero. But as was shown in Englisch and Herrmann (1989), input-dependent thresholds (also called persistent stimuli (Engel, Bouten, Komoda, & Serneels, 1990), external stimuli (Rau & Sherrington, 1990), or external fields (Engel, Englisch, & Schütte, 1989)) improve the performance of the net. The selfinteraction (also called hysteresis (Yanai & Sawada, 1990) or nonzero-diagonal term in the memory matrix (Wang, Krile, Walkup, & Jong, 1991)), i.e., J_ii ≠ 0, can also have a positive effect. It is surprising that the selfinteraction is discussed in only a few papers (cf., e.g., Krauth, Mezard, & Nadal (1988); Fontanari & Köberle (1988); Yanai & Sawada (1990)), though it appears in a natural way. If the Hebb rule

J = Σ_{μ=1}^p ξ^μ ξ^{μT}/N   (1)

(ξ^{μT} is the transpose of ξ^μ) is applied consequently for the storage of independent identically distributed random patterns with mean Eξ = 0 in a fully connected network, then one automatically gets a model with selfinteraction, i.e., J_ii ≠ 0, of size J_ii = α (α = p/N). The same holds for the projection rule

J = P_{Lin{ξ^μ}}   (2)

(P_{Lin{ξ^μ}} is the orthoprojector onto the subspace {Σ_ν x_ν ξ^ν, x_ν ∈ ℝ}) for N → ∞ and α < 1. (The last condition ensures that the set {ξ^μ} is linearly independent with probability 1, i.e., the trace fulfills Tr J = p (Odlyzko, 1988).) A few authors tried to optimize the size of the selfinteraction numerically (Krauth et al., 1988; Fontanari & Köberle, 1988) or to find approximate formulae for the retrieval quality (Yanai & Sawada, 1990). The approximations are necessary since the correlations in the fully connected network cause difficulty in determining analytically the retrieval quality

M(t) = ξ^{νT} S(t)/N   (3)

(we assume throughout the paper that S(0) is a random perturbation of ξ^ν, independent of the ξ^μ, μ ≠ ν) even in the case of small t and without selfinteraction (Patrick & Zagrebnov, 1991). We want to study the effect of selfinteraction in one of the few rigorously solvable models, the strongly diluted model (Derrida, Gardner, & Zippelius, 1987), also in the limit t → ∞. (Since the other rigorously solvable model, the multilayer feedforward net of Domany, Kinzel, and Meir (1989), is a heteroassociative memory, a selfinteraction cannot be defined there.) Rigorously solvable models are important as a test for the quality of simulations and approximations. Further, time-consuming simulations can be avoided for such models. In the strongly diluted model, the matrix elements J_ij are zero with probability 1 − C/N, 1 ≪ C ≪ N^{1/2} (cf. Kleinz, 1989, App. 1), where the dilution is done independently of the patterns {ξ^μ} and of the connections J_kl (k ≠ i or l ≠ j). The storage capacity α has to be defined as α = p/C and the Hebb rule redefined as

J = Σ_{μ=1}^p ξ^μ ξ^{μT}/C,   (4)

such that for the fully connected model with C = N, the original formula is included. Instead of only considering the analogies of eqns (1) and (2) in the strongly diluted net, we adopt Gardner's notion of stabilities (Gardner, 1988): Calculate the distribution P(Δ) of the stabilities

Δ_i^μ = ξ_i^μ Σ_{j≠i} J_ij ξ_j^μ / (Σ_{j≠i} J_ij²)^{1/2},   (5)

which yields, e.g., for eqn (4) a Gaussian distribution with mean 1/√α and unit variance, and for eqn (2), under the assumption that J_ii is selfaveraging, a delta distribution at √((1 − α)/α) (van Hemmen & Kühn, 1991). Let us assume that the rows of the J matrix are normalized, i.e., for every i, Σ_{j≠i} J_ij² equals 1. Then the matrix elements of eqn (1) have to be divided by √α and those of eqn (2) by √(J_ii − J_ii²) (van Hemmen & Kühn, 1991), which should be √(α − α²). For given α, the strongly diluted net with constant stabilities is defined by Δ_i^μ = Δ for all i and μ with the maximally possible Δ. The replica ansatz (Abbott & Kepler, 1989; eqn 3.3 for k → k′) yields again Δ = √((1 − α)/α). In the next section, the derivation of the retrieval quality for the Hebb rule (1) (Englisch, Xiao, & Yao, 1991) is generalized to nets with arbitrary distribution of the stabilities. Section 3 is devoted to the special cases of a delta distribution and a Gaussian distribution, and Section 4 contains a discussion of different definitions of the critical storage capacity.

2. RECURSIVE DETERMINATION OF THE RETRIEVAL QUALITY

In generalization to Englisch, Xiao, and Yao (1991), we introduce the probabilities p^{ρ;Δ}_{ρ1,...,ρt} (ρ, ρ_k = ±1, k = 1, 2, ..., t) of the event {ξ_i = ρS_i(0), S_i(1) = ρ1 S_i(0), ..., S_i(t) = ρ_t S_i(0)} provided that Δ_i^μ = Δ. Since M(t) = ∫ M^Δ(t) P(Δ) dΔ with M^Δ(t) = Σ_{ρ,ρ1,...,ρt} ρ ρ_t p^{ρ;Δ}_{ρ1,...,ρt}, the determination of these probabilities allows one to calculate the retrieval quality M(t). Since the selfinteraction has to be treated separately, we shift it into the threshold. Let us denote with h(k) the control parameter in the threshold g_i(k) = h(k)S_i(k), i.e., the strength of the selfinteraction can depend on time, and with f_k = ξ_i (Σ_{j≠i} J_ij S_j(k) + g_i(k)), k = 0, 1, ..., t, the argument of the sign function multiplied with ξ_i. Then

f_t = M(t)Δ + h(t)ξ_i S_i(t) + r_t,   (6)

where r_t = ξ_i Σ_{j≠i} J_ij S_j(t) − M(t)Δ is a Gaussian random variable (if we take random initial conditions with fixed retrieval quality M(0)) with mean zero and variance

E r_t² = ∫ (1 − M^Γ(t)²) P(Γ) dΓ.   (7)

For fixed r_t, formula (7) is well-known (Krauth et al., 1988).
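The signal-to-noise structure of eqns (6) and (7) can be checked against a direct simulation of the strongly diluted net. The sketch below is ours, not from the paper: it samples independent neighbourhoods (the tree-like situation that strong dilution produces), performs one parallel step with the selfinteraction shifted into the threshold g_i(0) = h(0)S_i(0), and compares the measured M(1) with the Gaussian prediction for Hebb stabilities Δ ~ N(1/√α, 1). All parameter values are illustrative, and we read the paper's erf as erf(x) = 2Φ(x) − 1 (Φ the standard normal distribution function), which is an assumption about its notation.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
C, alpha, M0, h0 = 200, 0.4, 0.6, 0.4      # illustrative parameters
p, n = int(alpha * C), 5000                # p = alpha*C patterns, n sampled neurons

pm1 = lambda *s: (rng.integers(0, 2, s, dtype=np.int8) * 2 - 1)
corrupt = lambda b, m: b * np.where(rng.random(b.shape) < (1 - m) / 2, -1, 1)

xi_i = pm1(n, p)                           # pattern bits of the sampled neurons
xi_j = pm1(n, C, p)                        # pattern bits of their C neighbours
S_i = corrupt(xi_i[:, 0], M0)              # noisy initial states with overlap M(0)
S_j = corrupt(xi_j[:, :, 0], M0)           # relative to the retrieved pattern nu = 1

# diluted Hebb couplings, rows normalized as in Section 1 (division by sqrt(alpha))
J = np.einsum('np,ncp->nc', xi_i, xi_j, dtype=np.float64) / (C * sqrt(alpha))
field = np.einsum('nc,nc->n', J, S_j) + h0 * S_i
M1_sim = np.mean(xi_i[:, 0] * np.sign(field))

erfp = lambda x: erf(x / sqrt(2))          # assumed convention: erf(x) = 2*Phi(x)-1
D0 = 1 / sqrt(alpha)                       # mean stability of the Hebb rule
M1_th = sum((1 + r * M0) / 2 * erfp(M0 * D0 + h0 * r) for r in (1, -1))
print(f"M(1): simulated {M1_sim:.3f}, Gaussian theory {M1_th:.3f}")
```

For the values above the Gaussian theory gives M(1) ≈ 0.74; the simulation should agree up to corrections of order 1/√C and a sampling error of order 1/√n.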

In order to integrate over the variable r_t, the knowledge of the covariances with r_k, k = 0, 1, ..., t − 1, is required:

E r_t r_k = E S_j(t)S_j(k) = ∫ Σ_{ρ,ρ1,...,ρt} ρ_t ρ_k p^{ρ;Γ}_{ρ1,...,ρt} P(Γ) dΓ.   (8)

Now we are ready to express the desired probability as an integral over the t + 1 Gaussian variables r_0, ..., r_t:

p^{ρ;Δ}_{ρ1,...,ρ_{t+1}} = ∫ ⋯ ∫ χ{ρρ1 f_0 ≥ 0, ρρ2 f_1 ≥ 0, ..., ρρ_{t+1} f_t ≥ 0} × [(1 + ρM(0))/2] e^{−r̄ᵀR⁻¹r̄/2} dr_0 ⋯ dr_t / ((2π)^{t+1} det R)^{1/2},   (9)

where χ is the characteristic function, r̄ is the vector with components r_0, ..., r_t, and R is the (t + 1) × (t + 1) covariance matrix with elements given by eqns (7) and (8). The variables ξ_i S_i(k) occurring in f_k are replaced by ρρ_k (and by ρ for k = 0, respectively), such that eqn (9) contains terms which depend only on ρ, ρ1, ..., ρ_t. The introduction of the Gaussian variables r_k, k = 0, ..., t, has the advantage that we integrate only over t + 1 variables instead of 2^t variables according to the approach in Englisch, Xiao, and Yao (1991). The retrieval quality can be calculated even for the combined threshold g_i(k) = h(k)S_i(k) + h′(k)S_i(0). For this combined case, eqn (6) has to be replaced by f_t = M(t)Δ + [h(t)ξ_i S_i(t) + h′(t)ξ_i S_i(0)] + r_t, where ξ_i S_i(t) = ρρ_t, ξ_i S_i(0) = ρ, and r_t = ξ_i Σ_{j≠i} J_ij S_j(t) − M(t)Δ.
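For small t, the characteristic-function integral in eqn (9) is just an orthant probability of a shifted Gaussian vector and can be evaluated with standard multivariate-normal routines. The sketch below is our illustration, not the authors' code; it assumes the covariance matrix R of r_0, ..., r_t has already been obtained from eqns (7) and (8), and the off-diagonal value in the example is a placeholder rather than a computed covariance.

```python
import numpy as np
from scipy.stats import multivariate_normal

def p_path(rho, rhos, fbar, R, M0):
    """Eqn (9): probability of the path rho_1, ..., rho_{t+1} given xi = rho*S(0).

    rhos -- (rho_1, ..., rho_{t+1}); fbar -- deterministic parts of f_0, ..., f_t
    (signal M(k)*Delta plus threshold terms); R -- covariance of r_0, ..., r_t.
    """
    signs = rho * np.asarray(rhos, dtype=float)
    # P(signs_k * (fbar_k + r_k) >= 0 for all k) as a multivariate normal cdf
    cov = np.outer(signs, signs) * R
    upper = signs * np.asarray(fbar, dtype=float)
    orthant = multivariate_normal(mean=np.zeros(len(signs)), cov=cov).cdf(upper)
    return (1 + rho * M0) / 2 * orthant

# example with t = 1: two Gaussian variables instead of a 2^t spin sum
D0, M0, M1, h = 1.2, 0.5, 0.7, (0.4, 0.3)             # illustrative values
R = np.array([[1 - M0**2, 0.15], [0.15, 1 - M1**2]])  # off-diagonal: placeholder
rho, path = 1, (1, 1)
fbar = (M0 * D0 + h[0] * rho, M1 * D0 + h[1] * rho * path[0])
print(p_path(rho, path, fbar, R, M0))
```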

3. NETS WITH CONSTANT AND GAUSSIAN DISTRIBUTED STABILITIES

The formulae in the preceding section obviously get simpler if P(Δ) is a delta distribution. But also for Gaussian distributions the formulae simplify in such a way that an explicit expression for the parameter h(0) maximizing M(1) can be derived. For the ansatz g(t) = h(t)S_i(0), with a threshold being a multiple of the input configuration, even the optimal sequence {h(t), t = 0, 1, ...} maximizing every M(t) can be written down explicitly (Englisch, Engel, Schütte, & Stcherbina, 1989b). Section 3.1 is devoted to the explicitly solvable cases for the parameter h(0) and {h(t)}, whereas in Section 3.2, for Gaussian distributed stabilities and g(t) = h(t)S_i(t), the optimal M(2) is determined from the partial derivatives with respect to h(0) and h(1), i.e., two coupled nonlinear equations, and compared with the approximation from Engel et al. (1990) for it.

3.1. Explicit Expressions for the Optimal Threshold

For a distribution P(Δ) degenerated to a delta distribution at Δ0, the formulae in the preceding section can be simplified as follows:

M(t) = Σ_{ρ,ρ1,...,ρt} ρ ρ_t p^ρ_{ρ1,...,ρt}  with  p^ρ_{ρ1,...,ρt} = p^{ρ;Δ0}_{ρ1,...,ρt},

E r_t² = 1 − M(t)²,  E(r_t r_k) = Σ_{ρ,ρ1,...,ρt} ρ_t ρ_k p^ρ_{ρ1,...,ρt},

and in eqn (9) one has to calculate the right-hand side only for Δ = Δ0. For one time step the retrieval quality is nothing else than the well-known formula (Engel et al., 1990; Krauth et al., 1988)

M(1) = Σ_ρ [(1 + ρM(0))/2] erf{(M(0)Δ0 + h(0)ρ)/√(1 − M(0)²)}.   (10)

The derivative of M(1) with respect to h(0) shows us that for

h_opt(0) = [(1 − M(0)²)/(2Δ0 M(0))] ln[(1 + M(0))/(1 − M(0))]   (11)

the maximum of M(1) is attained. h_opt(0) is a decreasing function of M(0) (cf. Appendix 3), with h_opt(0) = 1/Δ0 for M(0) = 0 and h_opt(0) = 0 for M(0) = 1, cf. Figure 1. As a multiple of the function (1 + Δ0²)⁻¹, h_opt(0)√(α − α²) is a decreasing function of Δ0, which contrasts with Figure 2 in Krauth et al. (1988). The monotonicity of h_opt(0)√(α − α²) with respect to Δ0 is a fact everybody should expect: the stronger the stability, the more unessential is the input configuration. For the fully connected network, h_opt(0) is for any M(0) ≠ 0 less than the matrix element J_ii = 1/Δ0 of the normalized orthoprojector P_{Lin{ξ^μ}}/√(α − α²). (The normalized orthoprojector with constant reduced diagonal terms J_ii = h_opt(0) is no longer positively semidefinite; hence the corresponding parallel dynamics can possess limit cycles of length 2 instead of limit points (Englisch, 1991).) The choice h(0) = 1/Δ0 ensures that M(1) > M(0) for all M(0), since ∂M(1)/∂h(0) has only one zero less than 1/Δ0; for large h(0), ∂M(1)/∂h(0) < 0 and lim_{h(0)→∞} M(1) = M(0). By the way, let us look for the minimal constant C such that, independently of M(0) and Δ0, the ansatz h(0) = C/Δ0 yields M(1) > M(0). Then the limit M(0) → 0 and Δ0 → 0 shows that C = 1, corresponding to J_ii in the normalized orthoprojector, is indeed the optimal value.

[FIGURE 1. The optimal parameter h_opt vs M(0). (1) and (1)′ stand for h_opt(0) and h_opt(1), respectively, according to formula (12), and (2) and (2)′ stand for h_opt(0) and h_opt(1), respectively, according to formula (18) with σ² = 1 (α = 0.4).]
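A quick numerical cross-check of eqns (10) and (11), our sketch with illustrative parameters and the erf convention erf(x) = 2Φ(x) − 1 assumed above: a grid search over h(0) should reproduce the closed-form maximizer.

```python
import numpy as np
from math import erf, log, sqrt

erfp = lambda x: erf(x / sqrt(2))

def M1(h0, M0, D0):                        # eqn (10), delta-distributed stabilities
    s = sqrt(1 - M0 ** 2)
    return sum((1 + r * M0) / 2 * erfp((M0 * D0 + h0 * r) / s) for r in (1, -1))

def h_opt0(M0, D0):                        # eqn (11)
    return (1 - M0 ** 2) / (2 * D0 * M0) * log((1 + M0) / (1 - M0))

M0, D0 = 0.6, 1.5
hs = np.linspace(0.0, 2.0, 4001)
h_grid = hs[np.argmax([M1(h, M0, D0) for h in hs])]
print(h_grid, h_opt0(M0, D0))              # agree to grid resolution (~5e-4)
```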

Formula (11) should be compared with the optimal value

h_opt(t) = [(1 − M(t)²)/(2Δ0 M(t))] ln[(1 + M(0))/(1 − M(0))]   (12)

for the h-parameter in the net with threshold g_i(t) = h(t)S_i(0), a multiple of the input configuration. Equation (12) can be derived easily: correlations between the threshold h(t)S_i(0) and Σ_{j≠i} J_ij S_j(t) do not occur, such that for arbitrary t the retrieval quality is given by the rigorous expression (cf. Engel et al. (1990), eqn 3.1)

M(t + 1) = Σ_ρ [(1 + ρM(0))/2] erf{(M(t)Δ0 + ρ h_opt(t))/√(1 − M(t)²)}.   (13)

The corresponding M(t) is an increasing and h(t) a decreasing function of t; cf. Figure 1 for a comparison of h_opt(1) with h_opt(0). Since the thresholds h(t)S_i(t) and h(t)S_i(0) coincide for t = 0, eqns (11) and (12) have to coincide for the first time step. As was already pointed out in Engel et al. (1990), M(t) tends to 1 for α < α_crit(M(0)). The critical α-values are calculated
from the solution Δ0 of the two coupled nonlinear equations

M(∞) = F(M(∞), Δ0, M(0)),
1 = ∂F(M(∞), Δ0, M(0))/∂M(∞),   (14)

with F(M(∞), Δ0, M(0)) = Σ_ρ [(1 + ρM(0))/2] erf{(M(∞)Δ0 + ρ h_opt(∞))/√(1 − M(∞)²)} and h_opt(∞) = [(1 − M(∞)²)/(2Δ0 M(∞))] ln[(1 + M(0))/(1 − M(0))], via α = 1/(1 + Δ0²) (cf. §1). The critical α-value is, of course, an increasing function of M(0) with α_crit(1) = 1 and α_crit(0) = 2/(2 + π) ≈ 0.389, cf. Figure 2 and §4.

[FIGURE 2. The critical storage capacity α_crit vs M(0). α_crit is determined numerically from the two equations (14).]

It is not obvious that a Gaussian distribution P(Δ) also leads to an explicit formula for h_opt(0). The special cases of the Gaussian distribution occur for important classes of neural nets:
1. The distribution with unit variance belongs to the Hebb rule (1) as well as to its nonlinear versions (Englisch & Herrmann, 1989) J_ij = (1/√C) Φ(Σ_μ ξ_i^μ ξ_j^μ/√p) with Φ(−x) = −Φ(x); cf. also Derrida and Nadal (1987) for further examples.
2. The delta distribution is the limit of Gaussian distributions with vanishing variance; cf. also Abbott and Kepler (1989). Distributions with variance less than 1 appear for nets with bias Eξ ≠ 0 (Englisch & Herrmann, in press) and for relearned patterns, cf. Appendix 2.

Let us assume that P(Δ) is the Gaussian distribution with mean Δ0 and variance σ². Then formula (10) has to be replaced by (cf. also Krauth et al., 1988; Engel et al., 1990)

M(1) = Σ_ρ [(1 + ρM(0))/2] ∫ erf{(M(0)Δ + h(0)ρ)/√(1 − M(0)²)} e^{−(Δ−Δ0)²/(2σ²)}/√(2πσ²) dΔ,   (15)

which can be simplified to (cf. Kleinz, 1989, Appendix 4)

M(1) = Σ_ρ [(1 + ρM(0))/2] erf{(M(0)Δ0 + h(0)ρ)/√(1 + M(0)²(σ² − 1))}.   (16)

For σ = 0 we recognize eqn (10), whereas for σ = 1 it coincides with the results from Englisch et al. (1989b). The optimal parameter h_opt is now given by

h_opt(0) = [(1 + M(0)²(σ² − 1))/(2Δ0 M(0))] ln[(1 + M(0))/(1 − M(0))],   (17)

where for σ = 1 the results in Englisch et al. (1989b) are confirmed.
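The collapse of the Gaussian average (15) into the closed form (16), and the optimality of (17), can both be verified numerically. A minimal sketch (ours, with illustrative parameters and the same assumed erf convention):

```python
import numpy as np
from scipy.special import erf as erf_vec
from math import log, sqrt

erfp = lambda x: erf_vec(x / np.sqrt(2.0))

def M1_integral(h0, M0, D0, sig):          # eqn (15) by direct quadrature
    d = np.linspace(D0 - 8 * sig, D0 + 8 * sig, 4001)
    w = np.exp(-(d - D0) ** 2 / (2 * sig ** 2)) / sqrt(2 * np.pi * sig ** 2)
    return sum((1 + r * M0) / 2 *
               np.trapz(erfp((M0 * d + h0 * r) / sqrt(1 - M0 ** 2)) * w, d)
               for r in (1, -1))

def M1_closed(h0, M0, D0, sig):            # eqn (16)
    s = sqrt(1 + M0 ** 2 * (sig ** 2 - 1))
    return sum((1 + r * M0) / 2 * erfp((M0 * D0 + h0 * r) / s) for r in (1, -1))

M0, D0, sig = 0.5, 1.3, 0.8
h17 = (1 + M0 ** 2 * (sig ** 2 - 1)) / (2 * D0 * M0) * log((1 + M0) / (1 - M0))
print(M1_integral(h17, M0, D0, sig), M1_closed(h17, M0, D0, sig))  # should agree

hs = np.linspace(0.01, 2.0, 4000)
print(h17, hs[np.argmax(M1_closed(hs, M0, D0, sig))])  # eqn (17) vs grid maximum
```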

Yanai and Sawada (1990) defined for σ = 1, with some arbitrariness, the quantity h*_opt(0) = ∫ h_opt(0, M(0)) dM(0) and claimed h*_opt(0) > 1/Δ0. But this follows immediately from Englisch et al. (1989b). For σ² ≥ 2/3, h_opt(0) is an increasing function of M(0) (cf. Appendix 3), in contrast to the case σ = 0, cf. Figure 1. Surprisingly, for σ² > 2/3 and any M(0) > 0, h_opt(0) is larger than the diagonal term J_ii = 1/Δ0 in the original normalized Hebb matrix J = Σ_μ ξ^μ ξ^{μT}/(C√α) with selfinteraction, again in contrast to the original projection operator. For the threshold g_i(t) = h(t)S_i(0), eqn (17) can be generalized analogously to eqn (12) to

h_opt(t) = [(1 + M(t)²(σ² − 1))/(2Δ0 M(t))] ln[(1 + M(0))/(1 − M(0))],   (18)

belonging to the equation

M(t + 1) = Σ_ρ [(1 + ρM(0))/2] erf{(M(t)Δ0 + ρ h_opt(t))/√(1 + M(t)²(σ² − 1))}.   (19)

For σ ≤ 1, h_opt(t) is, as it should be, decreasing in t. Of course, for σ > 0, independently of Δ0 and M(0), M(t) cannot tend to 1. The simplified version of eqn (19) for h(t) = 0 was already studied in Englisch and Herrmann (in press) and independently in Kleinz (1989) (eqn (4.5)), where it was shown that σ = √(1 − π/6) ≈ 0.690 is a critical point such that for smaller σ and sufficiently large Δ0, besides M(∞) = 0 there exists a second stable fixed point M(∞) > 0. Finally, we want to decide whether the Hebb or the projection rule yields the better retrieval quality M(t + 1). If we put h(t) = 0, the condition

α > M(t)²   (20)

is obviously necessary and sufficient for the Hebb rule to have a better performance than the projection rule for the same α and M(t), cf. eqns (10) and (16). This
condition is even necessary and sufficient for h(t) = h_opt(t) according to eqns (11) and (17), respectively. Hence the limit retrieval quality M(∞) is better for the Hebb rule if the retrieval quality for the projection rule satisfies

α > M(∞)².   (21)
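The comparison (20)-(21) between the Hebb and the projection rule is easy to reproduce by iterating eqn (19) with h(t) = 0 (the ρ-sum then collapses). A sketch with illustrative values; for σ = 1 the overlap saturates below 1, whereas for the delta distribution it can reach 1:

```python
from math import erf, sqrt

erfp = lambda x: erf(x / sqrt(2))

def M_limit(M0, D0, sig, steps=500):
    """Iterate eqn (19) with h(t) = 0 towards its stable fixed point."""
    M = M0
    for _ in range(steps):
        s = max(sqrt(1 + M ** 2 * (sig ** 2 - 1)), 1e-12)
        M = erfp(M * D0 / s)
    return M

alpha, M0 = 0.3, 0.4
M_hebb = M_limit(M0, 1 / sqrt(alpha), 1.0)             # Hebb: sigma = 1
M_proj = M_limit(M0, sqrt((1 - alpha) / alpha), 0.0)   # projection: sigma = 0
# eqn (21): Hebb wins exactly when alpha > M(inf)^2 for the projection rule
print(M_hebb, M_proj, M_hebb > M_proj, alpha > M_proj ** 2)
```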

3.2. Numerical Determination of the Maximal Retrieval Quality for t = 2

For the model with threshold g_i(t) = h(t)S_i(t) and Gaussian distribution of the stabilities (with mean Δ0 and variance σ² = 1), one gets for t = 1, according to eqn (16),

M(1) = Σ_ρ [(1 + ρM(0))/2] erf{M(0)Δ0 + h(0)ρ}   (22)

with

h_opt(0) = [1/(2Δ0 M(0))] ln[(1 + M(0))/(1 − M(0))].

For t > 1, g_i(t) and r_t are correlated. Neglecting this correlation, one has for t = 2 the approximate formula (Engel et al., 1990)

M(2) = Σ_ρ [(1 + ρM(1))/2] erf{M(1)Δ0 + h(1)ρ}   (23)

with

h_opt(1) = [1/(2Δ0 M(1))] ln[(1 + M(1))/(1 − M(1))].   (24)
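Iterating eqns (22)-(24) gives the correlation-free approximation M_appr(2). A short sketch (ours; α and M(0) illustrative, erf convention as assumed above):

```python
from math import erf, log, sqrt

erfp = lambda x: erf(x / sqrt(2))

def step(M, D0):                   # one step of (22)/(23) at its optimal threshold
    h = log((1 + M) / (1 - M)) / (2 * D0 * M)      # eqns (17)/(24) with sigma = 1
    return sum((1 + r * M) / 2 * erfp(M * D0 + h * r) for r in (1, -1))

M0, D0 = 0.4, 1 / sqrt(0.4)        # alpha = 0.4, Hebb stabilities
M1 = step(M0, D0)
M2_appr = step(M1, D0)             # eqn (23): correlations between steps neglected
print(M1, M2_appr)
```

For α = 0.4 and M(0) = 0.4 this gives M_appr(2) ≈ 0.72, noticeably above the exact values around 0.65-0.67 quoted at the end of this section.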

But the correct formula is

M(2) = Σ_ρ [(1 + ρM(0))/2] ∫ erf{[M(1)Δ0 + h(1) sgn(M(0)Δ0 + h(0)ρ + r) + βr]/γ} e^{−r²/2} dr/√(2π),   (25)

where β and γ are determined by the covariances (7) and (8) of r_0 and r_1 through the one-step probabilities p^ρ_{ρ1} (Englisch, Xiao, & Yao, 1991). It seems that no explicit formulae for the optimal values h(0) and h(1) exist yielding the maximal M(2). The optimal h(0) is no longer given by formula (17), since that value maximizes only M(1). But the optimal h(0) for M(2) gives a value M(1) which is (slightly) lower than the maximal possible one. Nevertheless, formulae (17) and (24) provide good approximations for the optimal h(0) and h(1) values, as is shown in Figure 3. The optimal value M_opt(2) is only for large Δ0 distinguishable from the semioptimal value M_semiopt(2) calculated from (17) and (24). Both curves are remarkably lower than that calculated from the crude approximation M_appr(2) from eqn (23), but nevertheless higher than the optimal M(1)-value.

[FIGURE 3. The retrieval qualities M vs 1/Δ0. (1), (2), (3), and (4) stand for M_appr(2) according to (23), M_opt(2) and M_semiopt(2) according to (25), and M(1) according to (22), respectively.]

Also for the Hebb rule, the three different thresholds g′_i = h_opt(t)S_i(t), g″_i = h_opt(t)S_i(0), and the convex combination g‴_i = λh(t)S_i(t) + λ′h′(t)S_i(0) (λ + λ′ = 1) were compared with respect to the retrieval quality for t = 2. This indicates that g″_i(t) yields a better improvement of the retrieval quality than g′_i(t) (e.g., for α = 0.4 and M(0) = 0.4, M(2) = 0.671 versus M(2) = 0.650), and g‴_i(t) shows nearly the same improvement (M(2) = 0.674) as g″_i(t) for the optimal parameters λ and λ′.
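The exact two-step overlap can also be estimated by simulating a whole diluted net, giving an independent check of eqn (25) and of the numbers quoted above. The sketch below is ours (illustrative sizes, random graph drawn with replacement) and uses the semioptimal thresholds (17)/(24) with g_i(t) = h(t)S_i(t); for α = 0.4 and M(0) = 0.4 the measured M(2) should come out near the quoted 0.650, i.e., clearly below the approximation (23), up to finite-size corrections.

```python
import numpy as np
from math import log, sqrt

rng = np.random.default_rng(2)
N, C, alpha, M0 = 20000, 100, 0.4, 0.4
p = int(alpha * C)

xi = (rng.integers(0, 2, (N, p), dtype=np.int8) * 2 - 1)
nbr = rng.integers(0, N, (N, C))               # C random incoming neighbours each
# diluted Hebb couplings, rows normalized as in Section 1
J = np.einsum('np,ncp->nc', xi, xi[nbr], dtype=np.float32) / (C * sqrt(alpha))

S = xi[:, 0] * np.where(rng.random(N) < (1 - M0) / 2, -1, 1)   # noisy input
D0, M = 1 / sqrt(alpha), M0
for t in (1, 2):                               # two parallel steps, g = h*S_i(t)
    h = log((1 + M) / (1 - M)) / (2 * D0 * M)  # semioptimal thresholds (17)/(24)
    S = np.sign(np.einsum('nc,nc->n', J, S[nbr]) + h * S).astype(np.int8)
    M = float(np.mean(xi[:, 0] * S))
    print(f"M({t}) = {M:.3f}")
```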

4. DEFINITIONS FOR THE CRITICAL STORAGE CAPACITY

Several definitions for the critical storage capacity, especially of the model with constant stabilities, are discussed, whereby the usually accepted one by Kanter and Sompolinsky (1987) is revealed in a different light. Since there is a large zoo of definitions for critical α-values, let us try a taxonomy. The superscripts denote the following models:

H = Hebb (Gaussian distribution of stabilities with σ = 1)
P = projection (delta distribution of stabilities)
Z = zero threshold (g_i(t) = 0)
I = optimal input threshold (g_i(t) = h_opt(t)S_i(0))
O = optimal selfinteraction (g_i(t) = h_opt(t)S_i(t))
U = usual selfinteraction (g_i(t) = S_i(t)/Δ0)
W = without bias (Eξ = 0)
B = bias of the patterns (Eξ = B ∈ (0, 1)), cf. eqn (16) and Englisch and Herrmann (in press)
S = strongly diluted net
F = fully connected net
M = multilayer perceptron of Domany, Kinzel, and Meir (1989); a bias (if contained) should be the same in any layer

An exclamation mark behind the reference denotes rigorously proven results. It is missing for the index combination SP, since for the diluted net with constant stabilities the relation Δ = √((1 − α)/α) (Abbott & Kepler, 1989) is only shown with the help of the replica method. If not the full chain of superscripts is written (e.g., PO), then the given value corresponds to all models (e.g., POS and POF) containing these characterizations:

1. The most fundamental question is whether a synaptic matrix exists fulfilling given constraints. Extending the work of Gardner (1988), this question was posed especially in Abbott and Kepler (1989). α^H = α^M = ∞ (Abbott & Kepler, 1989), α^{PS} = 1, α^{PFZ} = 1 (Abbott & Kepler, 1989), α^{PFO} = ∞. All critical values defined later cannot be larger.

2. The next natural problem reads: Does the network improve the retrieval quality for at least some small interval of initial values M(0) > 0? Depending on which retrieval quality one is interested in, the following cases can be distinguished.
a. Is there some M(0) > 0 with M(1) > M(0)? α^{MW} = α^{HWZ} = 2/π ≈ 0.637 (Kinzel, 1985), α^{MB} = α^{HBZ} = 2/π for B ≤ √(π/6) ≈ 0.724 and α^{MB} = α^{HBZ} > 2/π for B > √(π/6), with asymptotics α^{HBZ} ∼ −1/[(1 − B) ln(1 − B)] for B → 1 (Englisch & Herrmann, in press); α^{HI} = α^{HO} = α^{HU} = ∞ (Englisch, Engel, & Schütte, 1989a; Appendix 1); α^P = 1 (§3.1).
b. Is there some M(0) > 0 with M(2) > M(0)? α^{HWZF} = 0.67 (Patrick & Zagrebnov, 1991; Gardner, Derrida, & Mottishaw, 1987; Amari & Maginu, 1988), in contrast to α^{HWZS} = 2/π (Derrida, Gardner, & Zippelius, 1987), is an important result; α^{HWOS} = α^{HWIS} = ∞ (§3.2!); α^{PS} = α^{PF} = 1 (§3.1; Opper, Klein, Köhler, & Kinzel, 1989).
c. Is there some M(0) > 0 with M(∞) > M(0)? α^{HWZF} ≈ 0.14 (Amit, Gutfreund, & Sompolinsky, 1985) is the famous conjecture; α^{HWI} = ∞ (Englisch et al., 1989a); α^P = 1; α^{MW} ≈ 0.27 (Domany et al., 1989; Patrick & Zagrebnov, 1990); α^{HWZS} = 2/π (Derrida et al., 1987).

3. The other extreme to the mild question is the following, which means automatically that all critical values cannot be larger than those of the previous part: Does the network improve the retrieval quality for all M(0) > 0? Again, one can ask for M(1) > M(0) or M(∞) > M(0), but we prefer the easier question concerning M(1). α^{HZ} = α^{HU} = α^M = 0 (§3.1; Domany et al., 1989); α^{PZ} = 2/(2 + π) ≈ 0.389 (cf. Engel et al. (1990); consider eqn (10) for M(0) → 0); α^{HO} (= α^{HI}) = ∞ (§3.1); α^{PO} (= α^{PI}) = α^{PU} = 1 (§3.1). The distinction between "wide" and "narrow" retrieval (Amit et al., 1990) corresponds to ours between case 2 and 1, respectively.

4. Between these two extreme cases lies the question whether certain initial values M(0) are improved. Again, we only consider one time step.
a. Various variants exist to ask for M(0) ≈ 1.
1. Is M(1) > M(0) for sufficiently large M(0)? The critical α has to lie between those of (2) and (3), i.e., if both values coincide, then the new one has to coincide with the two previous ones. α^M = α^{HZ} = α^{HU} = 0, α^{PZ} = 1. Exotic variants of this question are now following, taking into account effects of order 1/N.

2. If 1 − M(0) ≪ 1, is then 1 − M(1) = 0? α^{PZ} = α^{PO} = 1, α^{PWU} = 0.5 (Kanter & Sompolinsky, 1987). It is surprising that this unusual definition is widely accepted for the projection operator. The good performance of the zero-threshold network according to this definition is in agreement with eqn (10), which yields an optimal threshold 0 for M(0) → 1. But the superior behavior of the zero-threshold net in the case M(0) ≈ 1 should be contrasted with definition (3), where the natural selfinteraction leads to a larger critical storage capacity.
3. If 1 − M(0) = 0, is then 1 − M(1) = 0 with probability 1? α^{HWZ} = 1/(2 ln N) (McEliece, Posner, Rodemich, & Venkatesh, 1987); α^{HWO} = ∞ (Yanai & Sawada, 1990), with the identity as trivial synaptic matrix, shows the drawback of this definition.
4. Is 1 − M(1) = 0 for every pattern if 1 − M(0) = 0, with probability 1? α^{HWZ} = 1/(4 ln N) (McEliece et al., 1987).
b. Some more information about the fixed points of the dynamics can be gained from the other end of the M(0)-interval: Is M(1) > M(0) for sufficiently small M(0) > 0? α^{MB} = α^{HBZ} = 2/π for any B (but cf. 1.1).

This overview shows impressively that one has to be cautious if critical capacities are compared for different models. Often the definitions for these capacities disagree.

5. CONCLUSIONS

The retrieval quality for the strongly diluted network with selfinteraction can be calculated recursively with Gaussian integrals for any time step and arbitrary distribution of the stabilities. For the network with a delta and a Gaussian distribution of the stabilities, i.e., the projection matrix and the Hebb matrix with its
nonlinear variants or its variant for biased patterns, respectively, the optimal strength of the selfinteraction for one time step is given by an explicit expression. The critical storage capacity for the net with optimal selfinteraction is infinite. There is no contradiction to entropy considerations; the infinity appears due to the widely accepted weak definition of this capacity. The analysis of the different definitions for the critical storage capacity shows that especially the usual definition for the net with projection matrix is suspicious. The approximated optimal retrieval quality found by neglection of correlations is larger than the exact optimal retrieval quality. As can be seen from a numerical comparison, optimal input-dependent thresholds are more effective than an optimal selfinteraction.

REFERENCES

Abbott, L. F., & Kepler, T. B. (1989). Universality in the space of interactions for network models. Journal of Physics, A22, 2031-2038.
Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63-73.
Amit, D. J., Evans, M. R., Horner, H., & Wong, K. Y. M. (1990). Retrieval phase diagrams for attractor neural networks with optimal interactions. Journal of Physics, A23, 3361-3381.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55, 1530-1533.
Derrida, B., Gardner, E., & Zippelius, A. (1987). An exactly solvable asymmetric neural network model. Europhysics Letters, 4, 167-173.
Derrida, B., & Nadal, J. P. (1987). Learning and forgetting on asymmetric, diluted neural networks. Journal of Statistical Physics, 49, 993-1009.
Domany, E., Kinzel, W., & Meir, R. (1989). Layered neural networks. Journal of Physics, A22, 2081-2102.
Engel, A., Bouten, M., Komoda, A., & Serneels, R. (1990). Enlarged basin of attraction in neural networks with persistent stimuli. Physical Review, A42, 4998-5005.
Engel, A., Englisch, H., & Schütte, A. (1989). Improved retrieval in neural networks with external fields. Europhysics Letters, 8, 393-397.
Englisch, H. (1991). Limit points and limit cycles in neural nets with nonlinear synapses. In F. Pasemann & H.-D. Doebner (Eds.), Neurodynamics (Series in Neural Networks, Vol. 1). Singapore: World Scientific.
Englisch, H., Engel, A., & Schütte, A. (1989a). Improved retrieval in neural networks by means of input-dependent thresholds. Journal
Gardner, E. (1988). The space of interactions in neural network models. Journal of Physics, A21, 257-270.
Gardner, E., Derrida, B., & Mottishaw, P. (1987). Zero temperature parallel dynamics for infinite range spin glasses and neural networks. Journal de Physique, 48, 741-755.
Kanter, I., & Sompolinsky, H. (1987). Associative recall of memory without errors. Physical Review, A35, 380-392.
Kinzel, W. (1985). Learning and pattern recognition in spin glass models. Zeitschrift für Physik, B60, 205-213.
Kinzel, W., & Opper, M. (1991). Dynamics of learning. In J. L. van Hemmen, E. Domany, & K. Schulten (Eds.), Physics of neural networks. Berlin: Springer-Verlag.
Kleinz, S. (1989). Attraktionsgebiete extrem verdünnter neuronaler Netzwerke. Diploma thesis. Gießen: Justus-Liebig-Universität, Physikalisches Institut.
Krauth, W., Mezard, M., & Nadal, J. P. (1988). Basins of attraction in a perceptron-like neural network. Complex Systems, 2, 387-408.
McEliece, R. J., Posner, E. C., Rodemich, E. R., & Venkatesh, S. S. (1987). The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, IT-33, 461-482.
Odlyzko, A. M. (1988). On subspaces spanned by random selections of ±1 vectors. Journal of Combinatorial Theory, A47, 124-133.
Opper, M. (1989). Learning in neural networks. Europhysics Letters, 8, 389-392.
Opper, M., Klein, J., Köhler, H., & Kinzel, W. (1989). Basins of attraction near the critical storage capacity for neural networks with constant stabilities. Journal of Physics, A22, L407-L411.
Patrick, A. E., & Zagrebnov, V. A. (1990). Dynamics of a multilayered perceptron model: A rigorous result. Journal de Physique, 51, 1129-1137.
Patrick, A. E., & Zagrebnov, V. A. (1991). On the parallel dynamics for the Little-Hopfield model. Journal of Statistical Physics, 63, 59-71.
Rau, A., & Sherrington, D. (1990). Retrieval enhancement due to external stimuli in diluted neural networks. Europhysics Letters, 11, 499-504.
van Hemmen, J. L., & Kühn, R. (1991). Collective phenomena in neural networks. In J. L. van Hemmen, E. Domany, & K. Schulten (Eds.), Physics of neural networks (pp. 1-106). Berlin: Springer-Verlag.
Wang, J.-H., Krile, T. F., Walkup, J. F., & Jong, T.-L. (1991). On the characteristics of the autoassociative memory with nonzero-diagonal terms in the memory matrix. Neural Computation, 3, 428-439.
Yanai, H., & Sawada, Y. (1990). Associative memory network composed of neurons with hysteretic property. Neural Networks, 3, 223-228.

APPENDIX 1: CALCULATION OF α^{HU} IN CASE 2a

The critical case is a small Δ0 > 0; let us choose M(0) < √3 Δ0. Then

M(1) − M(0) = Σ_{ρ=±1} [(1 + ρM(0))/2] erf(M(0)Δ0 + ρ/Δ0) − M(0)
  ≈ M(0)[erf(1/Δ0) − 1] + (1/2)[erf((1 + M(0)Δ0²)/Δ0) − erf((1 − M(0)Δ0²)/Δ0)]
  ≈ (2/√(2π)) e^{−1/(2Δ0²)} [(−M(0)Δ0 + M(0)Δ0³) + (M(0)Δ0 − M(0)³Δ0/3)] > 0

for Δ0 > M(0)/√3.
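A numerical illustration of this appendix, our sketch with the erf convention assumed earlier: for small Δ0, the gain M(1) − M(0) under the usual selfinteraction h(0) = 1/Δ0 is positive inside the regime M(0) < √3 Δ0 and negative outside (the margins are tiny, of order e^{−1/(2Δ0²)}).

```python
from math import erf, sqrt

erfp = lambda x: erf(x / sqrt(2))

def gain(M0, D0):        # M(1) - M(0) for h(0) = 1/D0, Hebb stabilities (sigma = 1)
    return sum((1 + r * M0) / 2 * erfp(M0 * D0 + r / D0) for r in (1, -1)) - M0

for D0 in (0.25, 0.2):
    for c in (0.5, 2.0):                       # inside / outside M(0) < sqrt(3)*D0
        M0 = c * sqrt(3) * D0
        print(f"D0={D0}, M0/(sqrt(3)*D0)={c}: gain positive: {gain(M0, D0) > 0}")
```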

APPENDIX 2: THE HEBB RULE AFTER ONE ADALINE LEARNING STEP

Opper (1989) (cf. Englisch & Pastur, 1991) combined the Hebb rule with the ADALINE learning algorithm and got the projection operator in the limit. After one learning step, the synaptic matrix J is given by J_ii = 0 and, for i ≠ j, by

J_ij = ((1 + γ)J(0) − γJ(0)²)_ij,   (A1)

where J(0) is the unnormalized Hebb matrix with J_ii = 0. The square Σ_j J_ij² = Σ_j [(1 + γ)²(J(0)_ij)² − 2γ(1 + γ)(J(0)²)_ij J(0)_ij + γ²((J(0)²)_ij)²] of the norm for the rows of J contains terms of size (1 + γ)²α, −2(1 + γ)γα, and γ²(α + α²), respectively. In order to calculate (Jξ^ν)_i, the elements of the Hebb matrix are represented as J(0)_ij = ξ_i^ν ξ_j^ν/N + K_ij with K_ii = J(0)_ii = 0. Then

(Jξ^ν)_i = ξ_i^ν + Σ_j K_ij ξ_j^ν − γ Σ_{j,k} K_ij K_jk ξ_k^ν   (A2)

is, up to terms of lower order, the sum of the same signal term ξ_i^ν as for the Hebb matrix and two noise terms. The first one, which appears also for the Hebb matrix, has a variance α, the second one a variance γ²(α + α²), and the correlation between both terms produces a reduction −2γα of the sum of the variances. After normalization, the signal is ξ_i^ν/√(α + α²γ²) and the noise is normally distributed with variance

σ² = [(1 − γ)² + αγ²]/(1 + αγ²).   (A3)

The optimal convergence rate corresponds to γ = 1/(1 + α), i.e.,

σ² = (α + α²)/(1 + 3α + α²),   (A4)

whereas the minimal variance

σ² = (√(1 + 4α) − 1)/(√(1 + 4α) + 1) = 2α/(1 + 2α + √(1 + 4α))   (A5)

belongs to γ = 2/(√(1 + 4α) + 1). For α ∈ (0, ∞), the variances (A4) and (A5) are both increasing functions with range (0, 1). Though the ADALINE algorithm is not converging for α = 1, the variances after one learning step reduce, nevertheless, to σ² = 2/5 and σ² = (3 − √5)/2 ≈ 0.382, respectively. Unfortunately, with increasing γ, the stability Δ0 is reduced. Formula (16) shows that for small M(0) a large stability, i.e., γ ≈ 0, is deciding, but for M(0) ≈ 1 a large ratio Δ0/σ is important. This ratio is maximal just for the optimal learning rate, i.e.,

γ = 1/(1 + α).   (A6)

Formula (A4) was derived independently by Kleinz (1989), cf. Kinzel and Opper (1991). (In their formula (53), [1 − M^Γ(t)]² has to be replaced by [1 − M^Γ(t)²].)
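The variance formulae of this appendix are easy to confirm numerically. A short sketch (ours) scans the learning rate γ in eqn (A3) and recovers (A4)-(A6):

```python
import numpy as np

def sigma2(g, alpha):                 # eqn (A3): noise variance after one step
    return ((1 - g) ** 2 + alpha * g ** 2) / (1 + alpha * g ** 2)

alpha = 1.0
g = np.linspace(0.0, 1.5, 150001)
v = sigma2(g, alpha)

print(g[np.argmin(v)], 2 / (np.sqrt(1 + 4 * alpha) + 1))             # minimizer
print(v.min(), (np.sqrt(1 + 4 * alpha) - 1) / (np.sqrt(1 + 4 * alpha) + 1))  # (A5)
print(sigma2(1 / (1 + alpha), alpha),
      (alpha + alpha ** 2) / (1 + 3 * alpha + alpha ** 2))           # (A4) = 2/5

D0 = 1 / np.sqrt(alpha * (1 + alpha * g ** 2))      # stability after the step
print(g[np.argmax(D0 / np.sqrt(v))], 1 / (1 + alpha))                # (A6)
```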

APPENDIX 3: MONOTONICITY OF h_opt(0)

In order to show that [(1 − M²)/M] ln[(1 + M)/(1 − M)] is decreasing and that [(1 + M²(σ² − 1))/M] ln[(1 + M)/(1 − M)] is increasing for σ² ≥ 2/3 with respect to M, we have to prove for their derivatives

(2/M) − [(1 + M²)/M²] ln[(1 + M)/(1 − M)] < 0

and

{[M²(σ² − 1) − 1]/M²} ln[(1 + M)/(1 − M)] + 2[1 + M²(σ² − 1)]/[M(1 − M²)] > 0

for σ² ≥ 2/3. Both inequalities follow from

2M/(1 + M²) < ln[(1 + M)/(1 − M)] < 2M(3 − M²)/[(1 − M²)(3 + M²)],

which can be derived by differentiating again. For 2/3 > σ² > 0, h_opt(0) is decreasing for all sufficiently small M(0) and increasing for all sufficiently large M(0).

REMARK

After submitting the manuscript, we obtained the papers by Kinzel and Opper (1991) and Kleinz (1989), which enabled us to delete an appendix.
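The threshold σ² = 2/3 in Appendix 3 can also be located numerically. A final sketch (ours):

```python
import numpy as np

M = np.linspace(0.01, 0.99, 500)
L = np.log((1 + M) / (1 - M))

print(np.all(np.diff((1 - M ** 2) / M * L) < 0))      # sigma = 0: decreasing
for s2 in (0.5, 2 / 3, 0.8):       # h_opt(0) up to the factor 1/(2*Delta_0)
    increasing = np.all(np.diff((1 + M ** 2 * (s2 - 1)) / M * L) > 0)
    print(s2, "monotone increasing:", increasing)
```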