Mathematical Approaches to Neural Networks J.G. Taylor (Editor) 0 1993 Elsevier Science Publishers B.V. All rights reserved.
26 I
Neural networks: the spin glass approach David Sherrington Department of Physics, University of Oxford, Theoretical Physics, 1 Keble Road, Oxford, OX1 3NP Abstract A brief overview is given of the conceptual basis for and the mathematical formulation of the fruitful transfer of techniques developed for the theory of spin glasses to the analysis of the performance, potential and training of neural networks. 1. INTRODUCTION
Spin glasses are disordered magnetic systems. Their relevance to neural networks lies not in any physical similarity, but rather in conceptual analogy and in the transfer of mathematical techniques developed for their analysis to the quantitative study of several aspects of neural networks. This chapter is concerned with the basis and application of this transfer. A brief introduction to spin glassses in their conventional manifestation is appropriate to set the scene - for a fuller consideration the reader is referred to more specialist reviews (MBzard et. al. 1987, Fischer and Hertz 1991, Binder and Young 1986, Sherrington 1990, 1992). At a microscopic level spin glasses consist of many elementary atomic magnets (spins), fixed in location but free to orient, interacting strongly but randomly with one another through pairwise forces. Individually these forces try to orient their spin pairs either parallel or antiparallel, but collectively they lead to conflicts, or frustration, with regard to the global orientations. The consequence is a system with many non-equivalent metastable global states and consequentially many interesting physical properties. Most of the latter will not concern us here, but the many-state structure has relevance for analogues in neural memory and the mathematical techniques devised to analyze spin glasses have direct applicability. Neural networks also involve the cooperation of many relatively simple units, the neurons, under the influence of conflicting interactions, and they possess many different global asymptotic behaviours in their dynamics. In this case the conflicts arise from a mixture of excitatory and inhibitory synapses, respectively increasing and decreasing the tendency of a post-synaptic neuron to fire if the pre-synaptic neuron fires. The recognition of a conceptual relationship between spin glasses and recurrent neural networks, together with a mathematical mapping between idealizations of each (Hopfield 1982), provided the first hint of what has turned out to be a fruitful transplantation. In fact, there are now two main respects in which spin glass analysis has been of value in considering neural networks for storing and interpreting static data. The first concerns the macroscopic asymptotic behaviour of a neural network of given architecture
262
and synaptic efficacies. The second concerns the choice of efficacies in order to optimize various performance measures. Both will be discussed in this chapter. We shall discuss networks suggested as idealizations of neurobiological structures and also those devised for applied decision making. We shall not, however, dwell on the extent to which these idealizations are faithful, or otherwise, to nature. Although neural networks can also be employed to store and analyse dynamical information, and techniques of non-equilibrium statistical mechanics are being applied to their analysis, we shall restrict discussion in this chapter to static information, albeit stored in dynamic networks. An accompanying chapter (Coolen and Sherrington 1992) gives a brief introduction to dynamics. 2. TYPES OF NEURAL NETWORK
There are two principal types of neural network architecture which have been the subject of active study. The first is that of layered feedforward networks in which many input neurons drive various numbers of hidden units eventually to one or few output neurons, with signals progressing only forward from layer to layer, never backwards or sideways within a layer. This is the preferred architecture of many artificial neural networks for application as expert systems, with the interest lying in training and operating the networks for the deduction of appropriate few-state conclusions from the simultaneous input of many, possibly corrupted, pieces of data. The second type is of recurrent networks where there is no simple feedforward-only or even layered operation, but rather the neurons drive one another collectively and repetitively without particular directionality. In these networks the interest is in the global behaviour of all the neurons and the associative retrieval of memorized states from initialisations in noisy representations thereof. These networks are often referred to as attractor neural networks’. They are idealizations of parts of the brain, such as cerebral cortex. Both of the above can be considered as made up from simple ‘units’ in which a single neuron receives input from several other neurons which collectively determine its output. That output may then, depending upon the architecture considered, provide part of the inputs to other neurons in other units. Many specific forms of idealized neuron are possible, but here we shall concentrate on those in which the neuron state (activity) can be characterized by a single real scalar. Similarly, many types of rule can be envisaged relating the output state of a neuron to those of the neurons which input directly to it. We shall concentrate, however, on those in which the efferent (post-synaptic) behaviour is determined from the states of the afferent (pre-synaptic) neurons via an ‘effective field’ hi =
C J ; j ~-j W;, j#i
‘They are often abbreviated as ANN, but we shall avoid this notation since it is also common for artificial neural networks.
263
where aj measures the firing state of neuron j , J;j is the synaptic weight from j to i and W; is a threshold. For example, a deterministic perceptron obeys the output-input relation
where a,!is the output state of the neuron. More generally one has a stochastic rule, where f(h;)is modified in some random fashion at each step. Specializing/approximately further to binary-state (McCulloch-Pitts) neurons, taken to have a;= fl denoting firing/non-firing, the standard deterministic perceptron rule is u: = sgn(h;).
(3)
Typical stochastic extensions modify (3) to a random update rule, such as the Glauber rule, 1 2
a;+ u,!with probability -[1+ tanh(ph;u:)],
(4)
or the Gaussian rule ai -+
a: = sgn(h;
+ Tz),
(5)
where z is a Gaussian-distributed random variable of unit variance and T = p-' is a measure of the degree of stochasticity, with T = O(p = m) corresponding to determinism. In a network of such units, updates can be effectuated either synchronously (in parallel) or randomly asynchronously. More generally, a system of binary neurons satisfies local rules of the form
where the u ; ~..., aiC= +l are the states of the neurons feeding neuron i, Rj and Ri are independent tunable stochastic operators randomly changing the signs of their operands, and F; is a Boolean function of its arguments (Aleksander 1988, Wong and Sherrington 1988, 1989). The linearly-separable synaptic form of (2)-(5) is just a small subset of possible Boolean forms. 3. ANALOGY BETWEEN MAGNETISM AND NEURAL NETWORKS
In order to prepare for later transfer of mathematical techniques from the theory of spin glasses to the analysis of neural networks, in this section we give a brief outline of the relevant physical and conceptual aspects of disordered magnets which provide the stimulus for that transfer. 3.1 Magnets
A common simple model magnet idealizes the atomic magnetic moments to have only two states, spin up and spin down, indicated by a binary (Ising) variable a;= f l , where i
264
labels the location and u the state of the spin. A global microstate is a set { u ; } ; i= 1,...N where N is the number of spins. The energy of such a state is typically given by
1
E({u;}) = - - ~ J ; ~ u ; u-, C b;u; 2
ij
(7)
I
where the J;j (known as exchange interactions) correspond to contributions from pairwise forces and the b; t o local magnetic fields. The prime indicates exclusion of i = j . The set of d microstates is referred t o as ‘phase space’. The standard dynamics of such a system in a thermal environment at temperature T is a random sequential updating of the spin states according t o rule (4) with W; -+ -b;. Thus, with this identification, there is a mathematical mapping between the statistical thermodynamics of the spin system and the dynamics of a corresponding recurrent neural network. The converse is not necessarily true since the above spin model has J;j = Jj;, whereas no such restriction need apply t o a general neural network. However, for developmental purposes, we shall assume this symmetry initially, lifting the restriction later in our discussion. Magnetic systems of this kind have been much studied. Let us first concentrate on their asymptotic behaviour. This leads t o a thermodynamic state in which the system randomly passes through the microstates with a Gibbs probabalistic distribution
and the system can be viewed as equivalent to an ensemble of systems with this distributiona. At high temperatures T, effectively all the microstates of any energy are equally likely t o be accessed in a finite time and there are no serious barriers to a change of microstate. At low enough temperatures, however, there is a spontaneous breaking of the phase space symmetry on finite timescales and only a sub-set of microstates is effectively available in a physical measurement on a very large (N -+ co)system. The onset of such a spontaneous separation of phase space is known as a ‘phase transition’. A common example is the onset of ferromagnetism in a system with positive exchange intGactions {J;j} and b = 0. Beneath a critical temperature T,,despite the probabalistic symmetry between a microstate and its mirror image with every spin reversed, as given by (8), there is an effective barrier between the sub-sets of microstates with overall spin up and those with overall spin down which cannot be breached in a physical time, and hence the thermal dynamics is effectively confined t o one or other of these subsets. The origin of this effect lies in the fact that for T < T, the most probable microstates have non-zero values of the average magnetization m = N-’ u;,strongly peaked around two values f m ( T ) , while the probability of a state of different Im( is exponentially smaller. To go from m = m ( T ) t o m = -m(T) would require the system to pass through intermediate states, such as m = 0, of probability which is vanishingly small as N ---f 00. For T = 0 the dynamics (3) leads t o a minimum of E ( a ) , which would have m = fl for the ferromagnet, with no means of further change.
xi
ZFora further, more complete, discussion of equivalences between temporal and ensemble averages see the subject of ‘ergodicity’ in texts on statistical mechanics
265
A useful picture within which to envisage the effect of spontaneous symmetry-breaking is of an effective energy landscape which incorporates the randomizing tendencies of temperature as well as the ordering tendencies of the energy terms of (7). This is known as a free energy landscape and it evolves with temperature. At high temperature it is such that all energetically equivalent states are equally accessible, but at low temperature it splits into disconnected regions separated by insurmountable ridges. If the system under consideration is the ferromagnet with only positive and zero Jij and without magnetic fields, b = 0, the phase space is thus split into two inversion symmetry related parts. If, however, the Jij are of random sign, but frozen (or quenched), then the resultant low temperature state can have many non-equivalent disconnected regions, or basins, in its free-energy structure; this is the case for spin glasses. Thus, if one starts the dynamical system in a microstate contained within one of the disconnected sub-sets, then in a physical time its evolution will be restricted to that subspace. The system will iterate towards a distribution as given by (8) but restricted to microstates within the sub-space.
3.2 Neural networks Thus one arrives at a potential scenario for a recurrent neural network capable of associatively retrieving any of several patterns { Q } ; p = 1,...p. This is to choose a system in which the J i j are such that, beneath an appropriate temperature (stochasticity) T , there are p disconnected basins, each having a macroscopic overlap3 with just one of the patterns and such that if the system is started in a microstate which is a noisy version of a pattern it will iterate towards a distribution with a macroscopic overlap with that pattern and perhaps, for T -+ 0, to the pattern itself. To store many non-equivalent patterns clearly requires many non-equivalent basins and therefore requires competition among the synaptic weights/exchange interactions {J;j}4. The mathematical machinery devised to study ordering in random magnets is thus a natural choice to consider for adaptation for the analysis of retrieval in the corresponding neural networks. An introduction to this adaptation is the subject of the next section. However, before passing to that analysis a further analogy and stimulus for mathematical transfer will be mentioned. This second area for transfer concerns the choice of {Jij}to achieve a desired network performance. Provided that performance can be quantified, the problem of choosing the optimal { J ; j } is equivalent to one of minimizing some effective energy function in the space of all { J ; j } . The performance requirements, such as which patterns are to be stored and with what quality, impose ‘costs’ on the J;j combinations, much as the exchange interactions do on the spins in (7), and there are normally conflicts in matching local (few-J;j) with global (all-J;j)optimization. Thus, the global optimization problem is conceptually isomorphic with that of finding the ground state of a spin glass, and again a conceptual and mathematical transfer has proved valuable. 8A precise definition of overlap is given later in eqn (9). With the normalization used there an overlap
is macroscopic if it is of order 1.
4Note that this concept applies even if there is no Lyapunov or energy function. The expression ‘basin’ refers to a restricted microscopic phase space of the {u},even in a purely dynamical context.
266
4. STATISTICAL PHYSICS OF RETRIEVAL In this section we consider the use of techniques of statistical physics, particularly as developed for the study of spin glasses, for the analysis of the retrieval properties of simple recurrent neural networks. Let us consider such a network of N binary-state neurons, characterized by state variables n; = f l , i = 1,...N, interacting via stochastic synaptic operations (as discussed in section 2) and storing, or attempting to store, p patterns {(f} = {fl};p = 1,...p. Interest is in the global state of the network. Its closeness to a pattern can be measured in terms of the corresponding (normalized) overlap I
or in terms of the (complementary) fractional Hamming distance
which measures the average number of differing bits. To act as a retrieving memory the phase space of the system must separate so as to include effectively non-communicating sub-spaces, each with macroscopic O ( N o )overlap with a single pattern. 4.1 The Hopfield model
A particularly interesting example for analysis was proposed by Hopfield. It employs symmetric synapses Jij = Jj; and randomly asynchronously updating dynamics, leading to the asymptotic activity distribution (over all microstates)
where E ( a ) has the form of eqn. (7). This permits the applications of the machinery of equilibrium statistical mechanics to study retrieval behaviour. In particular, one studies the resultant thermodynamic phase structure with particular concern for the behaviour of the m w . 4.2 Statistical Mechanics
Basic statistical mechanics for the investigation of equilibrium thermodynamics proceeds by introducing the partition function
Several other quantities of thermodynamic interest, such as the average thermal energy and the entropy, follow immediately; for example
( E ) = 2-l
{W
a aP
E(u)exp ( - P E ( u ) ) = --enZ.
267
Others can be obtained by the inclusion of small generating fields; for example, for any observable O(u),
In particular, the average overlap with pattern p follows from
where
Spontaneous symmetry breaking is usually monitored implicitly, often signalled by divergencies of appropriate response functions or fluctuation correlations in the highsymmetry phase. However, it can be made explicit by the inclusion of infinitesimal symmetry-breaking fields; for example k! = &” will pick out the v t h sub-space if the phase space is disconnected, even for h --t Of, but will be inconsequential for k --t O+ if phase space is connected. 4.3 H e b b i a n Synapses
Hopfield proposed the simple synaptic form .7;j
= N-’
cg y ( r
1 - bij),
inspired by the observations of Hebb; we shall refer to this choice as Hebbian. Let us turn to the analysis and implications of this choice, with all the {Wi} taken t o be zero and for random uncorrelated patterns {tP}. For a system storing just a single pattern, the problem transforms immediately, under u; --t u;&, to a pure ferromagnetic Ising model with J;j = N - ’ . The solution is well known and m satisfies the self-consistency equation m = tanh
(pm),
(19)
with the physical solution rn = 0 for T > 1 (P < 1) and a symmetry-breaking phase transition to two separated solutions f l m ( , with m # 0, for T < 1. For general p one may express exp(-PE(u)) for the Hopfield-Hebb model in a separable form,
268 =
/cfi
d f i p ( P N / 2 ~ ) ) e) ~ p [ ~ ( - N P ( f i ” ) ~-/pfhpc 2 u;
p=l
p=l
L
where we have employed the useful identity for exponentials of complete squares exp
( ~ a 2= )
(27r-f
j dz exp (-z2/2 + ( 2 ~ ) t a r ) ,
(22)
u;
2=
4 (
dfip(PN/27r))) exp (-NPf({fh”})) D
where
f({mP})
(23) D
+ i”)
= c ( f i p ) ’ / 2 - (NP)-l c t n [ 2 cosh ,f3 c ( m P
(24)
4.4 Intensive numbers of patterns
If p is finite (does not scale with N ) , f({+‘}) is intensive in the thermodynamic limit + m) and the integral is e x t r e m d y dominated. The minima of f give the selfconsistent solutions which correspond t o the stable dynamics, although only the lowest minima are relevant t o a thermodynamic calculation. The self-consistency equations are
(N
Furthermore, it is straightforward to show from (16) and (23) that m p = f i p at these extremal solutions. = 0 yields the result: The subsequent analysis of the equations (25) in the limit {k} 1. T=O (i) All embedded patterns { ( p } are solutions; ie. there are p solutions (all of equal probability) mr = 1;one p = 0;rest (ii) There are also mixed solutions in which more than one m p is non-zero in the limit
N
4 00.
Solutions of type (i) correspond t o memory retrieval, while hybrids of type (ii) are normally not wanted. 2. 0.46 > T > 0 There are solutions (i) m p = m # 0; one p = 0;rest (ii) mixed states. In the case of type (i) solutions, we may again speak of retrieval, but now it is imperfect.
269
3. 1 > T > 0.46 Only type (i) solutions remain, each equally stable and with extensive barriers between them. 4.
T>1
Only the paramagnetic solution (all m” = 0) remains. Thus we see that retrieval noise can serve a useful purpose in eliminating or reducing spurious hybrid solutions in favour of unique retrieval. 4.5 Extensive numbers of patterns
The analysis of the last section shows no dependence of the critical temperatures on p. This is correct for p independent of N (and N + co). However, even simple signal-to-noise arguments demonstrate that interference between patterns will destroy retrieval, even at T = 0, for p large enough and scaling appropriately with N. Geometrical, informationtheoretical and statistical-mechanical arguments (to be discussed later) in fact show that the maximum pattern storage allowing retrieval scales as p = aN, where a is an N independent storage capacity. Thus we need to be able to analyse retrieval for p of order N, which requires a different method than that used in (23) - (25). One is available from the theory of spin glasses. This is the so called replica theory (Edwards and Anderson 1975, Sherrington and Kirkpatrick 1975, Kirkpatrick and Sherrington 1978, MCzard et al. 1987). As noted earlier, physical quantities of interest are obtained from ln2. This will depend on the specific set of { J ; j } , which will itself depend on the patterns {t”} to be stored. Statistically, however, one is interested not in a particular set of { J i j } or {t:} but in relevant averages over generic sets, for example over all sets of p patterns drawn randomly from the 2N possible pattern choices. Furthermore, the pattern averages of most interest are self-averaging5,strongly peaked around their most probable values. Thus, we may ignore fluctuations of en2 over nominally equivalent sets of pattern choices and hence consider ( l n Z ) { ~where ~ , ( ){(I means an average over the specific pattern choices. Although in principle one might envisage the calculation first of ln2 for a particular pattern choice and then its average, in practice this would be essentially impossible for large p without some other approximation‘. Rather, one would like to average formally over the patterns {<”} first and then perform the {a}summation. Unfortunately, however, averaging the logarithm of a sum of exponentials is not easy, whereas averaging a sum of exponentials themselves is relatively straightforward. The replica ‘trick’ provides a procedure to transform the difficult average to the easier one, albeit at some cost in complexity and subtlety. 51u a system with quenched disorder a quantity is said to be self-averaging if it overwhelmingly takes the same value (or form) for all specific choices of the disorder. In fact, non self-averaging of certain (usually somewhat subtle) quantities is one of the conceptually important discoveries of the statistical physics of systems with quenched disorder; see e.g. MCeard et. al. 1987. eBut see, for example, the cavity-field approach of MCzard et. al. 1987, or its precursor due to Thouless et. al. 1977.
270
The replica procedure is based on the identity 1 en2 = lim -(Z" - l), n-0 n
together with the identification of 2" as the partition function of an n-replicated system, n
where the a";a = 1,...n are (dummy) replica spins/neuron states7. Given an algorithm for J;j in terms of the {Q}, an average over 2" can be performed to yield
where F({a"}) is some effective 'energy' function of the {a"}, now independent of the particular patterns but involving interaction/cross-terms between a" with different values of the replica label a;i.e. with interaction between replicas. For a system in which there is no short-range structure (for example, is such that the probability that any two i, j have a particular d u e of synaptic connection is independent of the choice of i , j ) F scales as N and, in principle, this problem is exactly soluble, effectively through an exact mean field theory. In practice, however, the solution may be highly non-trivial. If there is short-range structure one may need to resort to approximation or simulation.
{t}
4.6 Replica analysis of a spin glass
Before discussing the neural network problem, let us, for orientational purposes, digress to a discussion of the replica procedure for a spin glass problem where 2 is given by (12) with E(a)as given by (7), with the b; = 0, and the J;j independent random quenched parameters drawn from a Gaussian distribution
(Sherrington and Kirkpatrick 1975). Averaging over the { J ; j } one obtains
7We follow convention and use a and 0 as replica labels. They occur only as superscript labels and summations, which it is hoped will be sufficient to distinguish them trom a , P used for the storage ratio and inverse temperature respectively (again conventional symbols); the latter never appear as subscripts nor as elements to be summed over.
27 1
where (aP) refers t o pairs of diflerent indices and we have deliberately indicated the occurrence of complete squares so that eqn. ( 2 2 ) can be further employed t o write
where g has the N-independent form
g ( { m a } , {q"'}) = ( m a ) ' / 2
+ (qap)'/2 - Ten
toP)
exp { P J o ~ u a m a 0
+(PJ)"n
+ 2 c aaUPqap)} (4
(33)
where the {u"} summation is over n replicas, each with u" = f l , but the ua carry no site labels. Because g is N-independent, the integral in ( 3 2 ) is overwhelmingly dominated by the values of { m a } ,{ q a P } for which g is minimized. At these extremal values ma,qap are identifiable as ma = (uq)F qap
(34)
= (uq<)F
where the (
)F
(35) refers to a thermodynamic average against the effective energy F({u"}),
However, it is still difficult to evaluate ( 3 2 ) for arbitrary n, needed t o perform analytic continuation to n + 0, without some further ansatz. Since the a,p replicas are artificial, one such as ansatz suggests itself - the replica-symmetric (RS) ansatz
m a = m ; alla
(37)
which effectively treats all replicas as equivalent. With this ansatz g is readily evaluated, in ( 3 3 ) being transformed into a single summation through the double summation C(a.p) another application of eqn ( 2 2 ) t o exp((PJ)'q(E, aa)'). In the limit n + 0, g ( m , q ) then has its extrema at the solutions of the self-consistent equations m=
Q
=
1
dhP(h)tanh p h
(39)
dhP(h)(tanhph)'
(40)
212
where
P ( h ) = (2nJ2q) exp ( - ( h - J0m)’/2J2q).
(41)
These solutions can be interpreted as
where ( u ; )is~ the thermally averaged magnetization of the original system at site i with a particular set of exchange interactions { J ; j } ,
As T, J , / J are varied, one fmds three sorts of solutions (Sherrington and Kirkpatrick 1975); (i) paramagnetic; m = 0 , q = 0, for T > max ( 1 ,J , / J ) (ii) ferromagnetic, m # 0 , q # 0, for T < J o / J and J o / J greater than the T-dependent value of 0 ( 1 ) , and (iii) spin glass, m = 0 , q # 0 for T < 1 and J,/J less than a T-dependent value of O(1). Within the spin glass problem the greatest interest is the third of these, interpreted as frozen order without periodicity, but for neural networks the interest is in an analogue of the second, ferromagnetism. 4.7. Replica analysis of the Hopfield model
Let us now turn to the neural network problem. In place of the order parameter m one now has all the overlap parameters m”. However, since we are principally interested in retrieving symmetry-breaking solutions, we can concentrate on extrema with only one, or a few, rn” macroscopic ( O ( N o ) )and the rest microscopic ( 5 O(N-f)). This enables one to obtain self-consistent equations for the overlaps with the nominated (potentially macroscopically overlapped or condensed) patterns
where the 1 , ...a label the nominated patterns and ( )T denotes the thermal (symmetrybroken) average at fixed {(}, coupled with a spin-glass like order parameter
and a mean-square average of the overlaps with the un-nominated patterns (itself expressible in terms of 9). Retrieval corresponds to a solution with just one m” non-zero. For the case of the Hopfield-Hebb model the analysis follows readily from an extension of (21). Averaging over random patterns yields ‘in fact, within the RS ansats the physical extremum is found from mazimizing the substituted g with respect to q; this is because the number of (ap) combinations n(n - 1)/2 becomes negative in the limit n
+ 0.
273
(Zn) = exp (-np/3/2)
{ma}
1fi
fi{dmp(/3N/2n)! exp [ - N ~ ( / 3 ~ ( r n a ) ' / 2
p=la=l
a
-1
en cosh
+N-' i
(BEmpu:))]}.
(47)
P
To proceed further we separate out the condensed and non-condensed patterns and carry out a sequence of manipulations to obtain an extremally dominated form analagous to eqn (32). Details are deferred to Appendix A, but yield a form
(Z"){,> = (@N/2x)"/'
1 fi
p,a=l
where
drnw
1n
dqapdrape-Np*
(4)
is intensive. (48) is thus extremally dominated. At the extremum
Within a replica-symmetric ansatz m p = mp, qap = q, rap = r , self-consistency equations follow relatively straightforwardly. For the retrieval situation in which only one m P is macroscopic (and denoted by m below) they are
1 15 dz
m= q=
exp ( - z 2 / 2 ) tanh [p(z&
exp (-.'/a)
+ m)]
tanh2p(z&G+ m)]
(52) (53)
where T
= q ( l - p(1 - q ) ) - 2
(54)
Retrieval corresponds to a solution m # 0. There are two types of non-retrieval solution, (i) m = 0 , q = 0, called paramagnetic, in which the system samples all of phase space, (ii) m = 0,q # 0, the spin glass solution, in which the accessible phase space is restricted but not correlated with a pattern. Fig. 1 shows the phase diagram (Amit et. al. 1985); retrieval is only possible provided the combination of fast (stochastic) noise T and slow (pattern interference) noise a is not too great. There are also (spurious) solutions with more than one m p # 0, but these are not displayed in the figure. In the above analysis, replica symmetry was assumed. This can be checked for stability against small fluctuations by expanding the effective free energy functional
214
F({m"}, { q " P } ) to second order in E" = m" - m, quo = q"P - q and studying the resultant normal mode spectra (de Almeida and Thouless 1978). In fact, it turns out to be unstable in the spin glass region and in a small part of the retrieval region of ( T , a ) space near the maximum a for retrieval. A methodology for going beyond this ansatz has been developed (Parisi 1979) but is both subtle and complicated and is beyond our present scope. However, it might be noted as (i) corresponding to a further hierarchical disconnectedness of phase space (Parisi 1983), and (ii) giving rise to only relatively small changes in the critical retrieval capacity. For the example of section 4.6 replica-symmetry breaking changes the critical boundary between spin-glass and ferromagnet to J,/J = 1. A similar procedure may be used, at least in principle, to analyze retrieval in other networks with J;j = Jj;. Transitions between non-retrieval (m = 0) and retrieval m # 0 may be either continuous or discontinuous; for the fully connected Hopfield-Hebb model the transition is discontinuous but for its dilutely, but still symmetrically, connected counterpart the transition is continuouss (Kanter 1989, Watkin and Sherrington 1991).
10
+
05 -
0
0
0 05
0 10
a
f
I
015
a c = 0.138
Figure 1. Phase diagram of the Hopfield model (after Amit et. al. 1985). T, indicates the limit of retrieval solutions, between T, and Tgthere are spin-glass like non-retrieval solutions, above Tgonly paramagnetic non-retrieval.
4.8 Dilute asymmetric connectivity W e might note that a second type of network provides for relatively straightforward analysis of retrieval, including not only that of the asymptotic retrieval overlap (the m
obtained in the last section) but also the size of the basin from which retrieval is possible (i.e. the minimum initial overlap permitting asymptotic retrieval). This is the dilute 'The dilute case also has greater RS-breaking effects (Watkin and Sherrington 1991)
275
asymmetric network (Derrida et. al 1987) in which synapses are only present with a probability C / N and C is sufficiently small compared with N that self-correlation via synaptic loops is inconsequential. C <( l n N suffices for this to apply, albeit that both C and N might be individually large. Note that Jij and Jji are considered separately and thus effectively are never both present in this limit. For this problem retrieval can be considered via iterative maps for the instantaneous overlaps”, such as
m’(t
+ 1) = f”({m”(t)})
(55)
for synchronous dynamics or
dm’ dt
-= f”({m”(t))) - mP(t)
for asynchronous dynamics. For uncorrelated patterns, if the system is started with a finite overlap with just one pattern, the maps simplify; for the synchronous case to
where m is the overlap with the relevant pattern. The asymptotic overlaps are given by the stable fixed points of the map, while the unstable fixed points give the basin boundaries. Fig. 2a illustrates iterative retrieval. A non-zero stable fixed point therefore implies retrieval, albeit imperfect if the corresponding m* is not unity. In principle, there can be several stable fixed points separated by unstable fixed points, corresponding to several possible retrieval basins for the same pattern, having different qualities of retrieval accessible from different ranges of quality of initialization. Varying the retrieval temperature T or the capacity a causes f ( m ) to vary and hence affects retrieval. There is a phase transition when a non-zero stable fixed point and an associated unstable fixed point are eliminated under such a variation. This is illustrated in Fig. 2b; in this example the transition is first order (discontinuous reduction of m*),but it is also possible to have a second order (continuous) transition if the relevant unstable fixed point is at mg = 0 throughout the variation of T or a as m* 0. Given any algorithm relating { J i j } to the (0, together with a retrieval dynamics, f ( m ( t ) )follows directly and hence so do the retrieval and basin boundary overlaps. In fact, f ( m ( t ) ) can be usefully expressed in terms of a simple measure, the local stability field” distribution --f
p ( A ) = (Np)-’
&(A - A:); i
A: =
r
c
J$)i.
Jij(T/(c
j#i
a#;
For the case of the dynamics of (46) f ( m )= / d A p ( A ) erf [mA/(2(1- m2
+T2))i]
(59)
l0This simplification is a consequence of the fact that, with no self-correlations, the only relevant global state measure is the probability that any neuron is in a state corresponding to a pattern. “A: is called a stability field since pattern p is stably iterated at neuron i if A: > 0.
216
(Wong and Sherrington 1990a). Thus (50) and (52) can be used to determine the retrieval of any such network, given p ( A ) . In particular, this provides a convenient measure for assessing different algorithms for { J i j } . Of course, for a given relationship between { J ; j } and { t } , p ( h ) follows directly; for the Hebb rule (18), p ( A ) is a Gaussian of mean a-f and standard deviation unity.
1
7 1
f (m), m
/ I
f (m). m
0 mB
mo
m'
1
0
m
1
Figure 2 Schematic illustrations of (a) iterative retrieval; 0, m* are stable fixed points, asymptotically reached respectively from initial states in 0 _< m < m ~mg , < m _< 1, (b) variation of f ( m )with capacity or retrieval temperature, showing the occurrence of a phase transition between retrieval and non-retrieval.
5. STATISTICAL MECHANICS OF LEARNING In the last section we considered the problem of assessing the retrieval capability of a system of given architecture, local update rule and algorithm for {Jjj}. Another important issue is the converse; how to choose/train the { J i j } , and possibly the architecture, in order to achieve the best performance as assessed by some measure. Various such performance measures are possible; for example, in a recurrent network one might ask for the best overlap improvement in one sweep, or the best asymptotic retrieval, or the largest size of attractor basin, or the largest storage capacity, or the best resistance to damage; in a feedforward network trying to learn a r d e from examples one might ask for the best performance on the examples presented, or the best ability to generalize. Statistical mechanics, again largely as originally developed for spin glasses, has played an important role in assessing what is achievable in such optimization and also provides a possible mechanism for achieving such optima (although there may be other algorithms which are quicker to attain the goals which have been shown to be accessible). Thus in this section we discuss the statistical physics of optimization, as applied to neural networks.
211
5.1 Statistical physics of optimization Consider a problem specifiable as the minimization of a function Eia)({b}) where the {a} are quenched parameters and the {b} are the variables to be adjusted, and furthermore, the number of possible values of {b} is very large. In general such a problem is hard. One cannot try all combinations of {b} since there are too many. Nor can one generally find a successful iterative improvement scheme in which one chooses an initial value of {b} and gradually adjusts the value so as to accept only moves reducing E. Rather, if the set {a} imposes conflicts, the system is likely to have a ‘landscape’ structure for E as a function of {b} which has many valleys ringed by ridges, so that a downhill start from most starting points is likely to lead one to a secondary higher-E (local) minimum and not a true (global) minimum or even a close approximation to it. To deal with such problems computationally the technique of simulated annealing was invented (Kirkpatrick et. al. 1983). In this technique one simulates the probabalistic energy-increase (hill-climbing) procedure used by a metallurgist to anneal out the defects which typically result from rapid quenches (downhill only). Specifically, one treats E as a microscopic ‘energy’, invents a complementary ‘temperature’, the annealing temperature TA,and simulates a stochastic thermal dynamics in {b} which iterates to a distribution of the Gibbs form
Then one reduces TA gradually to zero. The actual dynamics has some freedom - for example for discrete variables Monte Car10 simulations with a heat bath algorithm (Glauber 1963), such as (4), or with a Metropolis algorithm (Metropolis et. al. 1953), both lead to (60). For continuous variables Langevin dynamics with = -vbE(b) ~ ( t )where , ~ ( tis) white noise of strength TA, would also be appropriate. Computational simulated annealing is used to determine specific {J}to store specific pattern sets with specific performance measures (sometimes without the limit TA -+ 0 in order to further simulate noisy data). It is also of interest, however, to consider the generic results on what is achievable and its consequences, averaged over all equivalently chosen pattern sets. An additional relevance lies in the fact that there exist algorithms which can be proven to achieve certain performance measures if they are achiewabk (and the analysis indicates if this is the case). The analytic equivalent of simulated annealing defines a generalized partition function
+
where we use C to denote an appropriately constrained sum or integral, from which the average Lenergy’at temperature TAfollows from
218
and the minimum E from the zero ‘temperature’ limit, Em,,,= lim ( E ) T ~ . TA-0
As noted earlier, we are often interested in typicallaverage behaviour, as characterized by averaging the result over a random choice of {a} from some distribution. Hence we require to study (!nZn)(,,},which naturally suggests the use of replicas again. In fact, the replica procedure has been used to study several hard combinatorial optimization problems, such as various graph partitioning (Fu and Anderson 1986, Kanter and Sompolinsky 1987, Wong and Sherrington 1987) and travelling salesman (M6zard and Parisi 1986) problems. Here, however, we shall concentrate on neural network applications. 5.2. Cost functions ddpendent on stability fields
One important class of training problems for pattern-recognition neural networks is that in which the objective can be defined as minimization of a cost function dependent on patterns and synapses only through the stability fields; that is, in which the ‘energy’ to be minimized can be expressed in the form
E&({JH = -
cC 9 ( A 3 P
i
(64)
The reason for the minus sign is that we are often concerned with maximizing performance functions, here the g(A). Before discussing general procedure, some examples of g(A) might be in order. The original application of this technique to neural networks concerned the maximum capacity for stable storage of patterns in a network satisfying the local rule =
sgn
(CJijuj) i#i
(66)
(Gardner 1988). Stability is determined by the A:; if A: > 0, the input of the correct bits of pattern p to site i yields the correct bit as output. Thus a pattern p is stable under the network dynamics if
A: > 0;all i.
(67)
A possible performance measure is therefore given by (64) with g ( A ) = -@(-A)
where @(x)
= 1;z > 0
0 ; x < 0.
(68)
279
g(At) is thus non-zero (and negative) when pattern p is not stably stored at site i. Choosing the { J ; j } such that the minimum E is zero ensures stability. The maximum
capacity for stable storage is the limiting value for which stable storage is possible. An extension is to maximal stability (Gardner and Derrida, 1988). In this case the performance measure employed is g(A) = -O(K
-
A)
(70)
and the search is for the maximum value of n for which Em;,, can be held to zero for any capacity a, or, equivalently, the maximum capacity for which Em;,,= 0 for any n. All patterns are then stored with stability fields greater than IC. In fact, for synapses restricted only by the spherical constraint"
C J;"j= N , j#i
with J;j and Jj; independent, the stability field n and the storage ratio a = p / N at criticality can be shown to be related by
For n = 0, the conventional problem of minimal stability, this reduces to the usual a, = 2 (Cover 1965). Yet another example is to consider a system trained to give the greatest increase in overlap with a pattern in one step of the dynamics, when started in a state with overlap mt. In this case, for the update rule (5) the appropriate performance function, averaged over all specific starting states of overlap mt, is (Wong and Sherrington 1990a)
where
This performance function is also that which would result from a modification of eqn (68) in which A: is replaced by
(r
is the result of randomly distorting ,$ with a (training) noise dt = (1 - mt)/2 where and E A is averaged over all implementations of the randomization (with fixed m t ) . This is referred to as 'training with noise' and is based on the physical concept of the use of such "Note that this is a different normalization than that used in eqn (18).
280
noise to spread out information in a network, perhaps with an aim towards producing better generalization, association or stability. 5.3 Methodology
Let us now turn to the methodology. For specific pattern sets we could proceed by computational simulated annealing, as discussed in the first part of section 5.1. Analytically, we require ( l n Z A { ( } ) { , , , where
from which the average minimum cost is given by
( P n Z A ) ( o is obtained via the replica procedure, (26), averaging over the {tp} to yield a replica-coupled effective pure system which is then analyzed and studied in the limit n + 0. The detailed calculations are somewhat complicated and are deferred to Appendix B. However we note here that the general procedure is analagous to those of sections (4.5) - (4.7) but with the {J} as the annealed variables, the as the quenched ones and the retrieval temperature replaced by the annealing temperature. For systems with many neurons the relevant integrals are again extremally dominated, permitting steepest descent analysis. New order parameters are again introduced, including an analogue of the spin glass order parameter q"@;here
{t}
where ( )eg is an average against the effective system resulting from averaging over the patterns; cf eqn (35). Again a mathematical simplification is consequential upon a replica-symmetric ansatz. The net result of such an analysis is that the local field distribution p ( A ) in the optimal configuration is given", within RS theory, for synapses obeying the spherical rule (64) by (Wong and Sherrington 1990)
where Dt = dt exp(-t2/2)
/&
(80)
laNote that when the expression for the partition function is extremally dominated, any other thermal measure is similarly dominated and is often straightforward to evaluate; this is the case here with
( p ( A ) ) { € l= ( N p ) - ' ( x x 6 ( A - A:)){€),
as demonstrated in Appendix B.
28 1 and X ( t ) is the value of X giving the largest value of [g(X) implicitly by a-l =
1Dt(X(t)
-q
- (A
- t ) ’ / 2 ~ where ] 7 is given
2 .
The same expressions apply to single perceptrons storing random input-output associations, where the index i can be dropped and AP = q” CjJj(,”/(CJ:): where {(,”}; j = 1,...N are the inputs and 7’’ the output of pattern p, and for dilute networks where N is replaced by the connectivity C. Immediately, one gets the one-step update of any network optimized as above. Thus, for the dynamics of (5) m’ = / d A p ( A ) erf [mA/(2(1 - m2
+T2))i].
(82)
For a dilute asymmetric network this applies to each update step, as in (57). Wong and Sherrington (1990a,b) have used the above method to investigate how the p(A) and the resultant retrieval behaviour depend on training noise, via (66). They have demonstrated that infinitesimal training noise yields the same p ( A ) as the maximum stability rule, while the limit of very strong training noise yields that of the Hebb rule. The former gives perfect retrieval for T = 0 and a < 2 but has only narrow basins of attraction for a > 0.42, while the Hebb rule has only imperfect retrieval, and that only for a < 0.637, but has wide basins of attraction. Varying mt gives a method of tuning performance between these limits. Similarly, for general T one can determine the optimal mt for the best retrieval overlap or basin size and the largest capacity for retrieval with any mt (Wong and Sherrington 1990b); for example for maximum capacity it is better to use small training noise for low T,high training noise for higher T. Just as in the replica analysis of retrieval, the assumption of replica symmetry for qap of (71) needs to be checked and a more subtle ansatz employed when it is unstable against small fluctuations q p p q 9”p;9pp small. In fact, it should also be tested even when small fluctuations are stable (since large ones may not be). Such effects, however, seem to be absent or small for many cases of continuous { J ; j } , while for discrete {Jij} they are more important’*. --f
+
5.4 Learning a rule
So far, our discussion of optimal learning has concentrated on recurrent networks and on training perceptron units for association of given patterns. Another important area of practical employment of neural networks is as expert systems, trained to try to give correct few-option decisions on the basis of many observed pieces of input data. More precisely, one tries to train a network to reproduce the results of some usually-unknown rule relating many-variable input to few-variable output, on the basis of training with a few examples of input-output sets arising from the operation of the rule (possibly with error in this training data). “For discrete { J , l } there is first order replica-symmetry breaking (Krauth and MCnard 1989) and small fluctuation analysis is insufficient.
282
To assess the potential of an artifical network of some structure to reproduce the output of a rule on the basis of examples, one needs to consider the training of the network with examples of input-output sets generated by known rules, but without the student network receiving any further information, except perhaps the probability that the teacher rule makes an error (if it is allowed to do so). Thus let us consider first a deterministic teacher rule 9 = V({€I),
(83)
relating N elements of input data deterministic student network
((:,ti
..(:)
to a single output B”, being learned by a
9 = B({€)).
(84)
B is known whereas V is not. Training consists of modifying B on the basis of examples drawn from the operation of V. Problems of interest are to train B to give (i) the best possible performance on the example set, (ii) the best possible performance on any random sample drawn from the operation of V, irrespective of whether it is a member of the training set or not. The first of these refers to the ability of the student to learn what he is taught, the second t o his ability to generalise from that training. Note that the relative structures of teacher and student can be either such that the rule is learnable, or not (for example, a perceptron is incapable of learning a parity rule (Minsky and Papert 1969)). The performance on the training set p = 1, ...p can be assessed by a training error
where e(z,y) is zero if z = y, positive otherwise. We shall sometimes work with the fractional training error Et
(86)
= Et/p.
The corresponding average generalisation error is
A common choice for
e
is quadratic in the difference (z - y). With the scaling
e(z,y) = (2 - YI2/4
(88)
one has for binary outputs, 9 = 6 1 , e(z,y) = @(-ZY),
so that if E l ( { ( } ) is a perceptron,
(89)
283 then e’ =
@(-A’)
where now.
A” = qJ”1J j ( r / ( CJ:)i, i
i
making Et analagous t o E A of eqn (64) with ‘performance function’ (68). This we refer t o as minimal stability learning. Similarly t o section 4, one can extend the error definition to e” = @(n
- A”)
(93)
and, for learnable rules, look for the solution with the maximum K for zero training error. This is maximal stability learning. Minimizing Et can proceed as discussed above, either simulationally or analytically. Note, however, that for the analytic study of average performance the ( 7 , t ) combinations are now related by the rule V, rather than being completely independent. eg follows from the resultant distribution p ( A ) . For the case in which the teacher is also a perceptron, the rule is learnable and therefore the student can achieve zero training error. The resultant generalization error is however, not necessarily zero. For continuous weights Jj the generalization error with the above two training error formulations scales as 1/a for large a,where p = a N , with the maximal stability form (Opper et. al. 1990) yielding a smaller multiplicative factor than the minimal stability form (Gyorgy and Tishby 1990). Note, however, that maximal stability training does not guarantee the best generalization; that has been obtained by Watkin (1992), on the basis of Bayesian theory, as the ‘centre of gravity’ of the possible J space permitted by minimal stability. For a perceptron student with binary weights { J j = fl}, learning from a similarly constrained teacher, there is a transition from imperfect t o perfect generalization a t a critical number of presented patterns p = a , N . This is because, in order for the system t o have no training error, beyond critical capacity it must have exactly the same weights as the teacher. Just as in the case of recurrent networks for association, it can be of interest t o consider rule-learning networks trained with randomly corrupted data or with unreliable (noisy) teachers or students. Another possibility is to train a t finite temperature; that is t o keep the annealing temperature finite rather than allowing it to tend to zero. Analysis for PA small is straightforward and shows that for a student perceptron learning to reproduce a teacher perceptron the generalization error scales as eg 1/(/3~a), so that increasing a leads to qualitatively similar performance as a zero-temperature optimized network with p/N =& = (Sompolinsky et. al. 1990). There are many other rules which can be analyzed for possible reproduction by a singlelayer perceptron, some learnable, some not, and attention is now also turning towards the analysis of multilayer perceptrons, but for further details the reader is referred elsewhere (Seung et. al. 1992, Watkin et. al. 1992, Barkai et. al. 1992, Engel et. al. 1992).
-
284 6. CONCLUSION
In this chapter we have tried to introduce the conceptual and mathematical basis for the transfer of techniques developed for spin glasses to the quantitative analysis of neural networks. The emphasis has been on the underlying theme and the principles behind the analysis, rather than the presentation of all the intricacies and the applications. For further details the reader is referred to texts such as those of Amit (1989), Miiller and Reinhardt (1990), Hertz et. al. (1991), the review of Watkin et. al. (1992) and to the specialist research literature. We have restricted discussion to networks whose local dynamics is determined by pairwise synaptic forces and in applications have employed binary neurons and zero thresholds, but all of these restrictions can be lifted in principle and mostly in practice. For example, with binary neurons the synaptic update rule includes only a small subset of all Boolean rules and it is possible to extend the analysis of retrieval in a dilute network and the optimization of rules for pattern association to the general set of Boolean rules (Wong and Sherrington 1989a, 198913). Binary neurons can be replaced by either continuous-valued or multi-state discrete ones. Thresholds can be considered as arising from extra neurons in a fixed activity state. We have discussed only a rather limited set of the types of problems which can be envisaged for neural networks. In particular we have discussed only the storage and retrieval of static data and only one-step or asymptotic retrieval (but see also the accompanying chapter on dynamics: Coolen and Sherrington 1992). Statistical mechanical techniques have been applied t o networks with temporally structured attractors and to the issue of competition between such attractors and ones associated with static associations. Indeed, the study of more sophisticated aspects of dynamics is an active and growing one. Also, we have discussed only supervised learning whilst it is clear that unsupervised learning is also of major biological and engineering relevance - again, there has been and continues to be statistical mechanical transfer to this area also. We have not discussed the problem of optimizing architecture, except insofar as this is implicit in the inclusion of the possibility of Jij = 0. Nor have we discussed training algorithms other than that of simulated annealing, but we note again that there exist other algorithms for certain problems which are known to work zf a solution ezists, while the analytic theory can show if one does. Similarly, we have not discussed the rate of convergence of any algorithms, either to optimum or specified sub-optimal performance. However, overall, it is hoped that it will be apparent that the statistical physics developed for spin glasses has already brought to the subject of neural networks both new conceptual viewpoints and new techniques, particularly oriented towards the quantitative study of typical rather than worst cases and allowing for the consideration of imprecise information and assessment of the resilience of solutions. 
There is much further potential for the application of the statistical physics of disordered systems to neural networks and possibly also for the converse, where we note in conclusion that the corresponding investigations of spin glasses, started almost two decades ago, led to a major reconsideration of both tenets and techniques of statistical physics, and neural networks could provide an interesting sequel to the fascinating developments which unfolded in that study.
285
Appendix A Here we consider in more detail the derivation of eqns (48)-(54), starting from eqn (47). For the non-condensed patterns, p > s, only small r n p contribute and the corresponding en cosh can be expanded to second order to approximate
The resultant Gaussian form in r n p is inconvenient for direct integration since it would yield an awkward function of u's. However, C usuf may be effectively decoupled by the introduction of a spin-glass like order parameter qa' via the identities
In eqn(47) the m w ; p > s integrations now yield the u-independent result (2n/NP) t ( p - " ) ( det A)- i(p-a), where
A"' = (1 - P)S,p - Pf',
(A-5)
while the a; contributions now enter in the form
which is separable in i. Further anticipating the result that the relevant r' scales as p = a N , (A.6) has the consequence that (Z"){,l is extremally dominated, as in (48). Re-scaling
yields eqn (48) with
286
(-4.8) (4)
P=1
{OP}
where the {u”} summations are now single-site. Minimizing @ with respect to {ma}, {qaa}, {T”@} yields the dominant behaviour, with (49)-(51) providing an interpretation of the extremal values, as follows from an analagous evaluation of the right hand sides of those equations, which are again extremally dominated. Explicit evaluation and the limit n --t 0 are facilitated by the replica symmetric ansatz m p = mP7 TaB
(-4.9)
q,
(A.lO)
= T.
(A.ll)
With these assumptions the exponential in the last term of (A.8) may be written as (A.12)
where we have re-expressed the exp(C Cn cosh) to obtain a form with a linear u dependence in an exponential argument. The second term of (A.12) can similarly be transformed t o a linear form by the use of (22), thereby permitting the separation and straightforward execution of the sums of {u“} in (A.7). Also, with the use of (A.lO) the evaluation of CnA is straightforward and in the limit n + 0 yields (A.13)
Thus $\{m^{\mu}\}$, $q$ and $r$ are given as the extrema of

$$\Phi(\{m^{\mu}\},q,r) = \frac{\alpha}{2} + \sum_{\mu=1}^{s}\frac{(m^{\mu})^{2}}{2} + \frac{\alpha\beta r}{2}(1-q) + \frac{\alpha}{2\beta}\Big(\ln\big(1-\beta(1-q)\big) - \frac{\beta q}{1-\beta(1-q)}\Big) - \beta^{-1}\Big\langle\int\frac{dz\,e^{-z^{2}/2}}{\sqrt{2\pi}}\;\ln\Big[2\cosh\beta\Big(z\sqrt{\alpha r} + \sum_{\mu=1}^{s} m^{\mu}\xi^{\mu}\Big)\Big]\Big\rangle_{\xi^{\mu}=\pm 1,\ \mu=1,\ldots,s}. \qquad (\mathrm{A.14})$$

Specializing to the case of retrieval, with only one $m^{\mu}$ macroscopic (i.e. $s = 1$), there result the self-consistency equations (52)-(54).
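To indicate how self-consistency equations of this type are handled in practice, the following sketch iterates them numerically, assuming that eqns (52)-(54) take the standard replica-symmetric retrieval form $m = \int Dz\,\tanh\beta(z\sqrt{\alpha r}+m)$, $q = \int Dz\,\tanh^{2}\beta(z\sqrt{\alpha r}+m)$, $r = q/(1-\beta(1-q))^{2}$; the function name, parameter values and damping scheme are illustrative choices and are not taken from the chapter.

```python
import numpy as np

# Gauss-Hermite nodes/weights for integrals over Dz = dz exp(-z^2/2)/sqrt(2*pi)
z, w = np.polynomial.hermite_e.hermegauss(80)
w = w / np.sqrt(2.0 * np.pi)

def rs_retrieval(alpha, beta, m=1.0, q=1.0, r=1.0, iters=1000, damp=0.5):
    """Damped iteration of the assumed fixed-point equations for m, q, r."""
    for _ in range(iters):
        h = beta * (np.sqrt(alpha * r) * z + m)          # argument of tanh
        m_new = np.sum(w * np.tanh(h))
        q_new = np.sum(w * np.tanh(h) ** 2)
        r_new = q_new / (1.0 - beta * (1.0 - q_new)) ** 2
        m = damp * m + (1.0 - damp) * m_new
        q = damp * q + (1.0 - damp) * q_new
        r = damp * r + (1.0 - damp) * r_new
    return m, q, r

# Example: well below the zero-temperature capacity (alpha_c ~ 0.138) and at
# low temperature a retrieval solution with m close to 1 should be found.
print(rs_retrieval(alpha=0.10, beta=10.0))
```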
Appendix B

In this appendix we demonstrate how to perform the analytic minimization of a cost function of the general form

$$E_{\{\xi\}}(\{J\}) = -\sum_{\mu} g(\lambda^{\mu}), \qquad (\mathrm{B.1})$$

where

$$\lambda^{\mu} = C^{-\frac{1}{2}}\,\xi_{0}^{\mu}\sum_{j=1}^{C} J_{j}\,\xi_{j}^{\mu}, \qquad (\mathrm{B.2})$$

with respect to the $J_{j}$, which satisfy the spherical constraint

$$\sum_{j=1}^{C} J_{j}^{2} = C. \qquad (\mathrm{B.3})$$

The $\{\xi\}$ are random quenched $\pm 1$. The solution of this problem also solves that of eqn (57), since the $J_{ij}$ of (57) are uncorrelated in $i$ and thus the problem separates in the $i$ label; note that $J_{ij}$ and $J_{ji}$ are optimized independently. The method we employ is the analytic simulated annealing discussed in section 5.1, with the minimum with respect to the $\{J\}$ then averaged over the choice of $\{\xi\}$. Thus we require $\langle\ln Z_{\{\xi\}}\rangle_{\{\xi\}}$, where

$$Z_{\{\xi\}} = \int\prod_{j} dJ_{j}\;\delta\Big(\sum_{j} J_{j}^{2} - C\Big)\exp\big(-\beta_{A}\,E_{\{\xi\}}(\{J\})\big). \qquad (\mathrm{B.4})$$

In order to evaluate the average we separate out the explicit $\xi$ dependence via delta functions $\delta\big(\lambda^{\mu} - \xi_{0}^{\mu}\sum_{j} J_{j}\xi_{j}^{\mu}/C^{\frac{1}{2}}\big)$ and express all the delta functions in exponential integral representation,

$$\delta\Big(\lambda^{\mu} - \xi_{0}^{\mu}\sum_{j} J_{j}\xi_{j}^{\mu}/C^{\frac{1}{2}}\Big) = \frac{1}{2\pi}\int d\phi^{\mu}\,\exp\Big(i\phi^{\mu}\Big(\lambda^{\mu} - \xi_{0}^{\mu}\sum_{j} J_{j}\xi_{j}^{\mu}/C^{\frac{1}{2}}\Big)\Big). \qquad (\mathrm{B.5})$$
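To make concrete what the analytic treatment computes, the following sketch performs an ordinary numerical simulated annealing of a cost of the form (B.1) over $J$ on the sphere (B.3). The choice of $g$ (an error-counting function), the system sizes and the cooling schedule are all illustrative assumptions and are not taken from the chapter; the analytic theory predicts the typical minimum of exactly this kind of quantity in the limit of large $C$ at fixed $\alpha = p/C$.

```python
import numpy as np

rng = np.random.default_rng(1)

C, p = 50, 100                                  # illustrative sizes, alpha = p/C = 2
xi0 = rng.choice([-1, 1], size=p)               # output bits xi_0^mu
xi = rng.choice([-1, 1], size=(p, C))           # input bits xi_j^mu

def g(lam, kappa=0.0):
    # Illustrative error-counting choice: g(lambda) = -Theta(kappa - lambda),
    # so that minimizing E = -sum_mu g counts the patterns with lambda < kappa.
    return -(lam < kappa).astype(float)

def cost(J):
    lam = xi0 * (xi @ J) / np.sqrt(C)           # aligned local fields lambda^mu
    return -np.sum(g(lam))

# Simulated annealing over J restricted to the sphere sum_j J_j^2 = C
J = rng.normal(size=C)
J *= np.sqrt(C) / np.linalg.norm(J)
E = cost(J)
steps = 20000
for step in range(steps):
    T = 1.0 * (1.0 - step / steps) + 1e-3       # simple linear cooling schedule
    J_try = J + 0.1 * rng.normal(size=C)        # small random move
    J_try *= np.sqrt(C) / np.linalg.norm(J_try) # project back onto the sphere
    E_try = cost(J_try)
    if E_try <= E or rng.random() < np.exp(-(E_try - E) / T):
        J, E = J_try, E_try

print("fraction of wrongly stabilized patterns:", E / p)
```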
Replica theory requires $\langle Z^{n}\rangle_{\{\xi\}}$ and therefore the introduction of a dummy replica index on $J$, $E$, $\lambda$ and $\phi$; we use $\alpha = 1,\ldots,n$. For the case in which all the $\xi$ are independently distributed and equally likely to be $\pm 1$, the $\xi$ average involves, for each $\mu$, the product over $j$ of $\cos\big(\sum_{\alpha}\phi^{\mu\alpha}J_{j}^{\alpha}/C^{\frac{1}{2}}\big)$. For large $C$ (and $\phi^{\alpha}J_{j}^{\alpha}$ of $O(1)$) the cosine can be approximated (after expansion and re-exponentiation) by

$$\prod_{j}\cos\Big(\sum_{\alpha}\phi^{\mu\alpha}J_{j}^{\alpha}/C^{\frac{1}{2}}\Big) \simeq \exp\Big(-\frac{1}{2C}\sum_{\alpha\beta}\phi^{\mu\alpha}\phi^{\mu\beta}\sum_{j} J_{j}^{\alpha}J_{j}^{\beta}\Big),$$
where the remaining exponent is conveniently re-expressed using the fact that only $\sum_{j}(J_{j}^{\alpha})^{2} = C$ contributes to $Z$. In analogy with the procedure of Appendix A, we eliminate the term in $\sum_{j} J_{j}^{\alpha}J_{j}^{\beta}$ in favour of a spin-glass-like order parameter/variable $q^{\alpha\beta}$, introduced via the identities

$$1 = \int dq^{\alpha\beta}\,\delta\Big(q^{\alpha\beta} - C^{-1}\sum_{j} J_{j}^{\alpha}J_{j}^{\beta}\Big) \qquad (\mathrm{B.11})$$
$$\;\; = (C/2\pi)\int dz^{\alpha\beta}\,dq^{\alpha\beta}\,\exp\Big(i z^{\alpha\beta}\Big(C q^{\alpha\beta} - \sum_{j} J_{j}^{\alpha}J_{j}^{\beta}\Big)\Big). \qquad (\mathrm{B.12})$$
The $j = 1,\ldots,C$ and $\mu = 1,\ldots,p$ contribute only multiplicatively in the relevant terms and $\langle Z^{n}\rangle_{\{\xi\}}$ can be written as in (B.13), where

$$\exp G_{J}(\{E^{\alpha}\},\{z^{\alpha\beta}\}) = \int\prod_{\alpha} dJ^{\alpha}\,\exp\Big(-\sum_{\alpha} E^{\alpha}\big((J^{\alpha})^{2} - 1\big) + \sum_{\alpha\beta} z^{\alpha\beta} J^{\alpha}J^{\beta}\Big) \qquad (\mathrm{B.14})$$

and there are now no $j$ or $\mu$ labels. Since $G_{J}$ and its $\{\phi,\lambda\}$ counterpart are intensive, as is $p/C = \alpha$ for the situation of interest, (B.13) is dominated by the maximum of its integrand.
In the replica-symmetric ansatz $E^{\alpha} = E$, $z^{\alpha\beta} = z$ and $q^{\alpha\beta} = q$. In the limit $n \to 0$, elimination of $E$ and $z$ at the saddle point $\partial\Phi/\partial E = \partial\Phi/\partial z = \partial\Phi/\partial q = 0$ yields, up to terms independent of $q$,

$$\langle\ln Z\rangle_{\{\xi\}} = C\,\underset{q}{\mathrm{ext}}\bigg(\tfrac{1}{2}\ln(1-q) + \big(2(1-q)\big)^{-1} + \alpha\int Dt\,\ln\Big[\big(2\pi(1-q)\big)^{-\frac{1}{2}}\int d\lambda\,\exp\big(\beta_{A}g(\lambda) - (\lambda - t\sqrt{q})^{2}/2(1-q)\big)\Big]\bigg), \qquad (\mathrm{B.16})$$

where

$$Dt = dt\,\exp(-t^{2}/2)/\sqrt{2\pi}. \qquad (\mathrm{B.17})$$
In the low-temperature limit, $\beta_{A} \to \infty$, $q \to 1$ and $\beta_{A}(1-q) \to \gamma$, independent of $T_{A}$ to leading order. The integration over $\lambda$ can then be simplified by steepest descent, so that

$$\int d\lambda\,\exp\big(\beta_{A}g(\lambda) - (\lambda - t)^{2}/2(1-q)\big) \;\to\; \exp\Big(\beta_{A}\big(g(\hat{\lambda}(t)) - (\hat{\lambda}(t) - t)^{2}/2\gamma\big)\Big), \qquad (\mathrm{B.18})$$

where $\hat{\lambda}(t)$ is the value of $\lambda$ which maximizes $\big(g(\lambda) - (\lambda - t)^{2}/2\gamma\big)$; i.e. the inverse function of $t(\lambda)$ given by

$$t(\lambda) = \lambda - \gamma\,g'(\lambda). \qquad (\mathrm{B.19})$$
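The construction of $\hat{\lambda}(t)$ via (B.19) can be carried out numerically for any smooth $g$. The sketch below does this by one-dimensional root finding for a hypothetical soft cost $g(\lambda) = -\ln(1+e^{-2\lambda})$, chosen purely for illustration and not taken from the chapter.

```python
import numpy as np
from scipy.optimize import brentq

def g_prime(lam):
    # Derivative of the illustrative cost g(lambda) = -ln(1 + exp(-2*lambda))
    return 2.0 / (1.0 + np.exp(2.0 * lam))

def lam_hat(t, gamma):
    # Invert t(lambda) = lambda - gamma*g'(lambda), cf. (B.19) as reconstructed.
    # Since 0 < g' < 2 here, the root is bracketed by [t, t + 2*gamma].
    return brentq(lambda lam: lam - gamma * g_prime(lam) - t, t, t + 2.0 * gamma)

gamma = 0.5                             # illustrative value of beta_A*(1 - q)
for t in [-2.0, 0.0, 2.0]:
    lh = lam_hat(t, gamma)
    print(t, lh, lh - gamma * g_prime(lh))   # last column should reproduce t
```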
Extremizing the expression in (B.16) with respect to $q$, or equivalently with respect to $\gamma$, gives the (implicit) determining equation for $\gamma$,

$$\int Dt\,\big(\hat{\lambda}(t) - t\big)^{2} = \alpha^{-1}. \qquad (\mathrm{B.20})$$
The average minimum cost follows from

$$\langle E_{\min}\rangle_{\{\xi\}} = -\lim_{\beta_{A}\to\infty}\frac{\partial}{\partial\beta_{A}}\langle\ln Z\rangle_{\{\xi\}}, \qquad (\mathrm{B.21})$$

which can be obtained straightforwardly from (B.16). Similarly, any measure $\langle\langle f(\lambda^{\mu})\rangle_{T_{A}}\rangle_{\{\xi\}}$ may be obtained by means of the generating-functional procedure of eqn (14). Alternatively, such measures follow from the local field distribution $\rho(\lambda)$ defined by

$$\rho(\lambda) = \Big\langle\Big\langle p^{-1}\sum_{\mu=1}^{p}\delta(\lambda - \lambda^{\mu})\Big\rangle_{T_{A}}\Big\rangle_{\{\xi\}}, \qquad (\mathrm{B.22})$$

which is given by

$$\rho(\lambda) = \int Dt\;\delta\big(\lambda - \hat{\lambda}(t)\big). \qquad (\mathrm{B.23})$$
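As a worked illustration of (B.18)-(B.23) as reconstructed above, the following sketch uses the simple quadratic cost $g(\lambda) = -(\kappa-\lambda)^{2}/2$, for which $\hat{\lambda}(t)$ is available in closed form, determines $\gamma$ from the integral equation (B.20), and evaluates the resulting minimum cost per pattern; the choice of cost function, the parameter values and the closed-form cross-check are illustrative assumptions, not results quoted in the chapter.

```python
import numpy as np
from scipy.optimize import brentq

# Gauss-Hermite quadrature for integrals over Dt = dt exp(-t^2/2)/sqrt(2*pi)
t, w = np.polynomial.hermite_e.hermegauss(80)
w = w / np.sqrt(2.0 * np.pi)

kappa, alpha = 0.5, 2.0                 # illustrative stability parameter and loading

def lam_hat(tt, gamma):
    # Maximizer of g(lambda) - (lambda - t)^2/(2*gamma) for the quadratic example
    # cost g(lambda) = -(kappa - lambda)^2 / 2 (available in closed form)
    return (tt + gamma * kappa) / (1.0 + gamma)

def gamma_equation(gamma):
    # alpha * int Dt (lam_hat(t) - t)^2 - 1 = 0, equivalent to (B.20) above
    return alpha * np.sum(w * (lam_hat(t, gamma) - t) ** 2) - 1.0

gamma = brentq(gamma_equation, 1e-6, 1e3)
gamma_closed = 1.0 / (np.sqrt(alpha * (1.0 + kappa ** 2)) - 1.0)
print(gamma, gamma_closed)              # numerical and closed-form values should agree

# Minimum cost per pattern, <E_min>/p = -int Dt g(lam_hat(t)); the local-field
# distribution itself follows by histogramming lam_hat(t) with t ~ N(0,1), cf. (B.23)
e_min_per_pattern = np.sum(w * 0.5 * (kappa - lam_hat(t, gamma)) ** 2)
print("minimum cost per pattern:", e_min_per_pattern)
```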
A convenient derivation of (B.23) proceeds as follows. The thermal average for fixed $\{\xi\}$ is given by

$$\Big\langle p^{-1}\sum_{\mu=1}^{p}\delta(\lambda - \lambda^{\mu})\Big\rangle_{T_{A}} = \sum_{\{J\}}\Big(p^{-1}\sum_{\mu}\delta(\lambda - \lambda^{\mu})\Big)\exp\Big(\beta_{A}\sum_{\mu} g(\lambda^{\mu})\Big)\Big/ Z. \qquad (\mathrm{B.24})$$
Multiplying numerator and denominator by $Z^{n-1}$ and taking the limit $n \to 0$ gives

$$\Big\langle p^{-1}\sum_{\mu}\delta(\lambda - \lambda^{\mu})\Big\rangle_{T_{A}} = \lim_{n\to 0}\sum_{\{J^{\alpha}\};\,\alpha=1,\ldots,n}\Big(p^{-1}\sum_{\mu}\delta(\lambda - \lambda^{\mu 1})\Big)\exp\Big(\beta_{A}\sum_{\mu\alpha} g(\lambda^{\mu\alpha})\Big), \qquad (\mathrm{B.25})$$

which permits straightforward averaging over $\{\xi\}$. There then results the same extremal domination as above, so that the remaining single-pattern factor takes the form

$$\exp\Big(\sum_{\alpha}\big[\beta_{A}\,g(\lambda^{\alpha}) + i\lambda^{\alpha}\phi^{\alpha} - (\phi^{\alpha})^{2}/2\big] - q\sum_{\alpha<\beta}\phi^{\alpha}\phi^{\beta}\Big), \qquad (\mathrm{B.26})$$

whence the replica-symmetric ansatz and the limit $\beta_{A} \to \infty$ yield the result (B.23). In the corresponding problems associated with learning a rule, $\xi_{0}^{\mu}$ in (B.2) is replaced by the teacher output, while certain aspects of noise may be incorporated in the form of $g(\lambda^{\mu})$. Minimization of the corresponding training error may then be effected analogously to the above treatment of random patterns, but with the teacher output related to the $\{\xi_{j}^{\mu}\};\ j = 1,\ldots,C$ by the teacher rule (76).
REFERENCES
Aleksander I.; 1988, in "Neural Computing Architectures", ed. I. Aleksander (North Oxford Academic), 133
Amit D.J.; 1989, "Modelling Brain Function" (Cambridge University Press)
Amit D.J., Gutfreund H. and Sompolinsky H.; 1985, Ann. Phys. 173, 30
de Almeida J.R.L. and Thouless D.J.; 1978, J. Phys. A11, 983
Binder K. and Young A.P.; 1986, Rev. Mod. Phys. 58, 801
Coolen A.C.C. and Sherrington D.; 1992, "Dynamics of Attractor Neural Networks", this volume
Derrida B., Gardner E. and Zippelius A.; 1987, Europhys. Lett. 4, 167
Edwards S.F. and Anderson P.W.; 1975, J. Phys. F5, 965
Fischer K.H. and Hertz J.A.; 1991, "Spin Glasses" (Cambridge University Press)
Gardner E.; 1988, J. Phys. A21, 257
Gardner E. and Derrida B.; 1988, J. Phys. A21, 271
Glauber R.; 1963, J. Math. Phys. 4, 294
Györgyi G. and Tishby N.; 1990, in "Neural Networks and Spin Glasses", eds. W.K. Theumann and R. Koberle (World Scientific)
Hertz J.A., Krogh A. and Palmer R.G.; 1991, "Introduction to the Theory of Neural Computation" (Addison-Wesley)
Hopfield J.J.; 1982, Proc. Natl. Acad. Sci. USA 79, 2554
Kanter I. and Sompolinsky H.; 1987, Phys. Rev. Lett. 58, 164
Kirkpatrick S., Gelatt C.D. and Vecchi M.P.; 1983, Science 220, 671
Kirkpatrick S. and Sherrington D.; 1978, Phys. Rev. B17, 4384
Krauth W. and Mézard M.; 1989, J. Physique (France) 50, 3057
Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H. and Teller E.; 1953, J. Chem. Phys. 21, 1087
Mézard M. and Parisi G.; 1986, J. Physique (Paris) 47, 1285
Mézard M., Parisi G. and Virasoro M.A.; 1987, "Spin Glass Theory and Beyond" (World Scientific)
Minsky M.L. and Papert S.A.; 1969, "Perceptrons" (MIT Press)
Müller B. and Reinhardt J.; 1990, "Neural Networks: an Introduction" (Springer-Verlag)
Opper M., Kinzel W., Kleinz J. and Nehl R.; 1990, J. Phys. A23, L581
Parisi G.; 1979, Phys. Rev. Lett. 43, 1754
Parisi G.; 1983, Phys. Rev. Lett. 50, 1946
Seung H.S., Sompolinsky H. and Tishby N.; 1992, Phys. Rev. A45, 6056
Sherrington D.; 1990, in "1989 Lectures on Complex Systems", ed. E. Jen (Addison-Wesley), p. 415
Sherrington D.; 1992, in "Electronic Phase Transitions", ed. W. Hanke and Yu.V. Kopaev (North-Holland), p. 79
Sherrington D. and Kirkpatrick S.; 1975, Phys. Rev. Lett. 35, 1792
Thouless D.J., Anderson P.W. and Palmer R.G.; 1977, Phil. Mag. 35, 593
Watkin T.L.H.; 1992, "Optimal Learning with a Neural Network", to be published in Europhys. Lett.
Watkin T.L.H., Rau A. and Biehl M.; 1992, "The Statistical Mechanics of Learning a Rule", to be published in Revs. Mod. Phys.
Watkin T.L.H. and Sherrington D.; 1991, Europhys. Lett. 14, 791
Wong K.Y.M. and Sherrington D.; 1987, J. Phys. A20, L793
Wong K.Y.M. and Sherrington D.; 1988, Europhys. Lett. 7, 197
Wong K.Y.M. and Sherrington D.; 1989, J. Phys. A22, 2233
Wong K.Y.M. and Sherrington D.; 1990a, J. Phys. A23, L175
Wong K.Y.M. and Sherrington D.; 1990b, J. Phys. A23, 4659
Wong K.Y.M. and Sherrington D.; 1990c, Europhys. Lett. 10, 419