
Neural Networks, Vol. 7, No. 9, pp. 1379-1385, 1994
Copyright © 1994 Elsevier Science Ltd. Printed in the USA. All rights reserved.
0893-6080/94 $6.00 + .00

Pergamon 0893-6080(94)E0014-C

CONTRIBUTED ARTICLE

Improving Recall in Associative Memories by Dynamic Threshold

TAO WANG
Zhejiang University

(Received 9 October 1992; revised and accepted 16 June 1993)

Requests for reprints should be sent to the author at P.O. Box 175, Department of Computer Science and Engineering, Zhejiang University, Hangzhou 310027, P.R. China.

Abstract--In this paper, a simple learning method and a dynamic threshold concept for associative memories (AMs) are presented. The learning approach is designed to store all training patterns with basins of attraction as large as possible. After the learning process stops, the dynamic threshold introduces a gradually decreasing threshold into the recall phase, which reduces the probability of converging to spurious states. A large number of computer simulations demonstrate the improved recall.

Keywords--Neural network, Associative memory, Learning algorithm, Basin of attraction, Recall phase, Dynamic threshold, Spurious state, Hamming distance.

1. INTRODUCTION

Artificial neural networks have recently received considerable attention from various research fields. In particular, considerable effort has been devoted to understanding associative memories (AMs), which are used for information storage and recall. These models offer the important attribute of recalling a stored pattern from a partial or noisy input.

The associative memory is composed of N interconnected neurons. At time t, the state of the ith neuron S_i(t) takes on the value +1 (active) or -1 (inactive), and the state of the network is denoted by S(t) = [S_1(t), S_2(t), ..., S_N(t)]. There is a connection weight W(i,j) from neuron j to neuron i. It is assumed that the connection matrix W = [W(i,j)] is symmetric with zero diagonal elements, W(i,j) = W(j,i) and W(i,i) = 0.

In the recall phase, the network starts from the initial state, S(t = 0), and evolves in time according to the recall rule,

S_i(t+1) = sgn[ Σ_j W(i,j) S_j(t) ]    (1)

where sgn(x) = +1 if x ≥ 0, and sgn(x) = -1 if x < 0. This paper uses the asynchronous updating mode (Hopfield, 1982), where the neuron states are updated one at a time. It can be shown that, for a symmetric connection matrix W, the associative memory converges from any initial state S(0) to a stable state under the recall rule (1). The proof follows from the construction of an energy function (i.e., a Lyapunov function),

E(S) = -(1/2) Σ_i Σ_j W(i,j) S_i S_j    (2)

whose value is always reduced or remains constant during the recall procedure.

In the learning phase, one seeks the connection matrix W such that M training patterns, X_i(u) (u = 1, 2, ..., M; i = 1, 2, ..., N), are stored in the network. The Hebbian learning rule defines the weight W(i,j) as

W(i,j) = Σ_u X_i(u) X_j(u)   if i ≠ j,
W(i,j) = 0                   if i = j.    (3)
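To make eqns (1)-(3) concrete, the following minimal Python/NumPy sketch (ours, not from the original paper; all function names are illustrative) builds the Hebbian matrix of eqn (3), evaluates the energy of eqn (2), and runs the asynchronous recall rule (1).

import numpy as np

def hebbian_weights(X):
    # eqn (3): W(i,j) = sum_u X_i(u) X_j(u), with zero diagonal;
    # X is an M x N array of +/-1 training patterns
    W = (X.T @ X).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, S):
    # eqn (2): E(S) = -(1/2) sum_i sum_j W(i,j) S_i S_j
    return -0.5 * S @ W @ S

def recall(W, S0, max_sweeps=100):
    # eqn (1), asynchronous mode: neurons are updated one at a time;
    # sgn(0) is taken as +1, as in the text
    S = S0.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(S)):
            s_new = 1 if W[i] @ S >= 0 else -1
            if s_new != S[i]:
                S[i], changed = s_new, True
        if not changed:          # stable state: the energy can decrease no further
            break
    return S

For instance, recall(hebbian_weights(X), noisy_input) returns a stable state whose energy is no larger than that of noisy_input.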

The associative memory, trained by the Hebbian learning method (3), has been thoroughly investigated in terms of statistical neurodynamics (Amari, 1990; Amari & Maginu, 1988; Blum, 1990), storage capacity (Hopfield, 1982; McEliece et al., 1987), convergence properties (Bruck & Goodman, 1988), and so on. Experiments showed (Hopfield, 1982) that the storage capacity of a system of N neurons is less than 0.15N. In theory, it was proven (McEliece et al., 1987) that the storage capacity is of the order of N/log(N). For certain cases, the number of spurious states grows exponentially with the number of training patterns (Bruck & Roychowdhury, 1990). A spurious state is a stable state that is not in the training set.

Although the Hebbian prescription for the connection matrix is popular for its simplicity, it does not guarantee that correlated training patterns are accurately stored. The sufficient condition (Abbott & Kepler, 1989; Diederich & Opper, 1987; Krauth & Mezard, 1987) for M training patterns X(u) (u = 1, 2, ..., M) to be stored (i.e., to be stable states) in the network is that

F_i(u) = [ Σ_j W(i,j) X_j(u) ] X_i(u) > 0,   u = 1, 2, ..., M;  i = 1, 2, ..., N.    (4)
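Condition (4) is easy to test numerically: a pattern is a stable state exactly when all of its margins F_i(u) are positive. A short sketch under the same conventions as above (X holds one ±1 pattern per row):

def margins(W, X):
    # eqn (4): F_i(u) = (sum_j W(i,j) X_j(u)) X_i(u); result is M x N
    return (X @ W.T) * X

def all_patterns_stable(W, X):
    return bool((margins(W, X) > 0).all())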

For the special case where F_i(u) = 1, the projection rule (Kohonen, 1984; Personnaz, Guyon, & Dreyfus, 1986) determines the connection matrix W using a Moore-Penrose pseudoinverse solution. The pseudoinverse allows the storage of N linearly independent patterns, and it can also be computed efficiently by a local iterative scheme (Amari, 1977; Diederich & Opper, 1987). As argued by Michel and Farrell (1990), however, the projection approach does not take basins of attraction seriously into account. In fact, what one really wants from the associative memory is the capability of recalling a stored pattern from a noisy input. This requires that the M training patterns have nontrivial basins of attraction around them.

THEOREM 1. Let H[X(u), S] be the Hamming distance between a network state S and X(u). If

H[X(u), S] ≤ H_m = F/(2W_m)    (5)

where F_i(u) ≥ F > 0 (i = 1, 2, ..., N), and W_m = max_{i,j} |W(i,j)|, then S will converge to X(u) after each neuron state is updated once according to the recall rule (1).

Proof. See Krauth and Mezard (1987), and Wang, Xing, and Zhuang (1992). A more general form including multistep state transitions has been derived by Amari (1972). •

Let us define a hyperball of radius H_m centered at X(u) in the Hamming space as

A(X(u), H_m) = { S | H[X(u), S] ≤ H_m }.    (6)
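Theorem 1 turns the margins into a computable lower bound on the basin radius: with F = min F_i(u) over all i and u, and W_m = max |W(i,j)|, every input within Hamming distance H_m = F/(2W_m) of a stored pattern is corrected in one sweep. A sketch reusing the margins helper above (our own code):

def guaranteed_radius(W, X):
    # H_m of eqn (5); meaningful only if all margins are positive
    F = margins(W, X).min()
    Wm = np.abs(W).max()
    return F / (2.0 * Wm)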

The meaning of Theorem 1 is that, if the initial state falls into A[X(u), H_m], that is, S(0) ∈ A[X(u), H_m], it converges to X(u) in one step. An initial state outside A[X(u), H_m] may or may not converge to X(u). Thus, the basin of attraction around X(u) at least includes A[X(u), H_m]. According to Theorem 1, to form nontrivial basins of attraction around the M training patterns X(u) (u = 1, 2, ..., M), one needs to find the connection matrix W such that

F_i(u) = [ Σ_j W(i,j) X_j(u) ] X_i(u) ≥ F > 0,   u = 1, 2, ..., M;  i = 1, 2, ..., N,    (7a)

subject to

|W(i,j)| ≤ W_m.    (7b)

In addition to eqn (7b), another commonly used constraint is

Σ_j W²(i,j) = N.    (8)
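If the normalization (8) is preferred over the bound (7b), it can be imposed by rescaling each row of W after an update. A one-line sketch (our own; note that row-wise rescaling need not preserve the symmetry of W):

def normalize_rows(W):
    # eqn (8): enforce sum_j W(i,j)^2 = N for every row i (rows assumed nonzero)
    N = W.shape[0]
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return np.sqrt(N) * W / norms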

The purpose of the constraint (7b) or (8) is to limit the magnitude of any weight W(i,j), and hence to prevent the trap of scaling up the size of the connections (Forrest, 1988). Based on eqns (7a) and (8), Gardner (1988) proposed a local iterative learning approach: set the initial weights W(i,j) = 0, and modify the weights after a training pattern is presented to the network. With respect to the currently presented pattern X(u), the weights W(i,j) are updated as follows,

ΔW(i,j) = [ e_i(u) + e_j(u) ] X_i(u) X_j(u),   i ≠ j,    (9a)

and the mask e_i(u) is defined at each neuron i of X(u),

e_i(u) = T[F - F_i(u)],    (9b)

where T(x) = 1 if x > 0, and T(x) = 0 if x ≤ 0.

The minimum-overlap learning rule (Krauth & Mezard, 1987) first seeks the training pattern with the minimal value F_i(u) = min_v F_i(v) (v = 1, 2, ..., M), and then updates the connection weights W(i,j) by eqn (9) using that pattern X(u), until every F_i(u) (i = 1, 2, ..., N) is larger than a prespecified positive value. Its weakness is that, to obtain the minimal value of F_i(u), it has to visit all M training patterns before updating the weights W(i,j). This expends a lot of computation, especially when M is large. Both iterative error-correcting rules permit the storage of up to 2N different patterns (Cover, 1965), and are guaranteed to converge after a finite number of iterations if a solution exists (Gardner, 1988; Krauth & Mezard, 1987).

2. SIMPLE LEARNING ALGORITHM

Theorem 1 suggests that, to form basins of attraction as large as possible, one has to select a maximal value of F. How large should the value of F be chosen? As noted by Muller and Reinhardt (1990), if F is taken too large, no solution to eqn (7a) with the constraint (7b) or (8) may exist, and the learning rules will not converge. If F is taken too small, the basins of attraction are small and the network may not recognize even a slightly perturbed pattern. The minimum-overlap learning method avoids guessing this value by imposing no constraint upon the weights W(i,j) during iteration. It can form basins of attraction as large as possible if F is taken sufficiently large, but a great number of iterations are then required. Although the value of F has been obtained for the learning rule (9) by averaging over a huge ensemble of uncorrelated patterns (Forrest, 1988; Kepler & Abbott, 1988), the value of F for a specific choice of training patterns is still unknown.

An efficient way to solve this problem is to start from a small value of F and gradually increase it until the learning method is unable to terminate. This is suitable for any set of training patterns because there is no need to guess F. To avoid computing the constraint (8) in eqn (9), one may use the constraint (7b) instead. A simple learning rule, modified from eqn (9), is as follows: for the currently presented training pattern X(u), the weights W(i,j) are updated by

W(i,j)(t+1) = G[ W(i,j)(t) + ΔW(i,j) ],    (10a)
ΔW(i,j) = [ e_i(u) + e_j(u) ] X_i(u) X_j(u),   i ≠ j,    (10b)

and the mask e_i(u) is of the form,

e_i(u) = T[F - F_i(u)],    (11)

where G(x) is a hard limiter-type function,

G(x) = -W_m   if x < -W_m,
G(x) = x      if -W_m ≤ x ≤ W_m,    (12)
G(x) = W_m    if x > W_m,

whose goal is to hold the constraint eqn (7b) exactly. Now, we summarize this learning procedure as follows:

Step 1: Initialization. Encode the M training patterns, and set the initial weights W(i,j)(0) = 0, F = 0, and W_m (e.g., W_m = 10).

Step 2: Iteration with fixed value of F. Sequentially present the M training patterns X(v) (v = 1, 2, ..., M). For the presented pattern X(u), update the connection weights W(i,j) according to eqn (10). Record the number of iterations n(F); one iteration involves presenting every training pattern once to the network.

Step 3: If the learning rule (10) converges with n(F) less than a prespecified number (e.g., 50), then increase F by ΔF, that is, F = F + ΔF, and go to step 2. By convergence it is meant that no connection weight W(i,j) changes, that is, ΔW(i,j) = 0 in eqn (10a), after every training pattern has been presented. Otherwise, one may conclude that the learning rule is unable to converge for the current value of F; then stop and output the maximal value F = F - ΔF.
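The three steps above translate almost line for line into code. The sketch below is our reading of the procedure (with the paper's example settings W_m = 10 and a cap of 50 iterations per value of F; the increment ΔF is left to the caller); it returns the weights together with the last value of F for which learning converged.

def T(x):
    # threshold function: T(x) = 1 if x > 0, else 0
    return 1.0 if x > 0 else 0.0

def train_simple(X, Wm=10.0, dF=5.0, max_iter=50):
    # simple learning rule (10)-(12) with the dynamic stability target F
    M, N = X.shape
    W = np.zeros((N, N))
    F = 0.0
    while True:
        W_try, converged = W.copy(), False
        for _ in range(max_iter):                 # at most n(F) iterations
            changed = False
            for u in range(M):                    # present each pattern once
                x = X[u]
                e = np.array([T(F - f) for f in (W_try @ x) * x])  # mask (11)
                dW = np.outer(e * x, x) + np.outer(x, e * x)       # eqn (10b)
                np.fill_diagonal(dW, 0.0)
                W_new = np.clip(W_try + dW, -Wm, Wm)               # G(x), eqn (12)
                if not np.array_equal(W_new, W_try):
                    changed = True
                W_try = W_new
            if not changed:                       # one clean sweep: converged
                converged = True
                break
        if not converged:                         # step 3: output F = F - dF
            return W, F - dF
        W, F = W_try, F + dF                      # raise the target and retry

The returned pair (W, F) feeds directly into the dynamic threshold recall of Section 3.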

THEOREM 2. If F_i(u) ≥ F > 0 (i = 1, 2, ..., N), then

E[X(u)] ≤ -NF/2.    (13)

Proof. According to the energy function eqn (2), we obtain

E[X(u)] = -(1/2) Σ_i Σ_j W(i,j) X_i(u) X_j(u) = -(1/2) Σ_i F_i(u) ≤ -NF/2.

Therefore, the energy at X(u) is no more than -NF/2. •

THEOREM 3. Let H[X(u), S] be the Hamming distance between a network state S and X(u). If

H[X(u), S] < 2H_m = F/W_m,    (14)

where F_i(u) ≥ F > 0 (i = 1, 2, ..., N) and W_m = max_{i,j} |W(i,j)|, then

E[X(u)] < E(S).    (15)

Proof. Without loss of generality, we assume for simplicity that the bits on which S and X(u) differ are the first h = H[X(u), S] components,

S_i = -X_i(u)   if i ≤ h,
S_i = X_i(u)    if i > h.    (16a)

Then

E(S) = -(1/2) Σ_i Σ_j W(i,j) S_i S_j
     = E[X(u)] + 2 Σ_{i=1}^{h} Σ_{j=h+1}^{N} W(i,j) X_i(u) X_j(u)
     = E[X(u)] + 2 Σ_{i=1}^{h} Σ_{j=1}^{N} W(i,j) X_i(u) X_j(u)
       - 2 Σ_{i=1}^{h} Σ_{j=1}^{h} W(i,j) X_i(u) X_j(u)
     ≥ E[X(u)] + 2hF - 2 Σ_{i=1}^{h} Σ_{j=1}^{h} W(i,j) X_i(u) X_j(u),    (16b)

where the symmetry of the connection matrix W is utilized and Σ_{j=1}^{N} W(i,j) X_j(u) X_i(u) = F_i(u) ≥ F. The sufficient condition for E[X(u)] < E(S) is that

hF > Σ_{i=1}^{h} Σ_{j=1}^{h} W(i,j) X_i(u) X_j(u).    (16c)

From the assumptions of this theorem, one can obtain

| Σ_{i=1}^{h} Σ_{j=1}^{h} W(i,j) X_i(u) X_j(u) | ≤ Σ_{i=1}^{h} Σ_{j=1}^{h} |W(i,j)| ≤ h²W_m.    (16d)

Obviously, eqn (16c) holds if h²W_m < hF, that is, h < F/W_m = 2H_m, which is given by eqn (14). •
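Theorems 2 and 3 admit a direct numerical check once training has produced W and F. The following sketch (ours, reusing the energy helper and conventions above) asserts the energy bound (13) and the strict inequality (15) for every admissible number of flipped bits h < F/W_m:

def check_energy_theorems(W, X, F):
    N = X.shape[1]
    Wm = np.abs(W).max()
    for x in X:
        assert energy(W, x) <= -N * F / 2.0 + 1e-9     # Theorem 2, eqn (13)
        h_max = int(np.ceil(F / Wm)) - 1               # largest h with h < F/W_m
        for h in range(1, h_max + 1):
            S = x.copy()
            S[:h] = -S[:h]                             # flip h bits, as in (16a)
            assert energy(W, S) > energy(W, x)         # Theorem 3, eqn (15)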

3. DYNAMIC THRESHOLD

In this section, a dynamic threshold is proposed to reduce the probability of converging to spurious states in the recall phase.

Now we can associate Theorems 2 and 3 with the learning method (10). When this learning method stops, one has found a solution to the constrained problem (7), where F is the value output in step 3 of the learning process, and the constraint (7b) is ensured by the function G(x). Indeed, if there existed some F_i(u) ≤ F, then e_i(u) = 1 and ΔW(i,j) ≠ 0, which contradicts the meaning of convergence defined in step 3. The conditions of both theorems therefore hold: the energy values at the M training patterns are no more than -NF/2, and less than those of the states around them.

One can divide the state space {-1, +1}^N around X(u) into three regions, shown in Figure 1. In region A, there are no spurious states, and any input S(0) ∈ A[X(u), H_m] converges to X(u) after all neuron states are updated once. In region B, where

B(X(u), H_m) = { S | H_m < H[X(u), S] ≤ 2H_m },    (17)

an input S(0) ∈ B[X(u), H_m] may or may not converge to X(u). Spurious states may appear in this region, but their energy values are larger than E[X(u)]. In the remaining region, beyond 2H_m, the theorems impose no constraint on the energies.

FIGURE 1. The state space around X(u).

Theorem 3 suggests that, if the network only changes those neuron states that cause a large energy decrement, some spurious states with high energy values may be skipped during the recall process, and the probability of converging to spurious states may thereby be reduced. Figure 1 gives a simple example, where S^a and S^b are spurious states. Under the recall rule (1), S(0) will converge to S^b. However, if the network only changes those neuron states that cause an energy decrement larger than ΔE, S^b will be skipped and S(0) may converge to X(u). This is realized by introducing a threshold θ (≥ 0) into the recall rule (1),

S_i(t+1) = +1        if h_i(t) > θ,
S_i(t+1) = S_i(t)    if -θ ≤ h_i(t) ≤ θ,    (18a)
S_i(t+1) = -1        if h_i(t) < -θ,

where h_i(t) is the total input to neuron i,

h_i(t) = Σ_j W(i,j) S_j(t).    (18b)

THEOREM 4. Starting from any initial state S(0), the network terminates at a stable state in a finite number of steps using the recall rule (18). The amount of the energy decrement for any neuron state change is no less than 2θ.

Proof. Because the neuron states are updated in the asynchronous mode, let ΔS_i(t) = S_i(t+1) - S_i(t) and S_j(t+1) = S_j(t) (j ≠ i). The change of the energy E(S) with respect to ΔS_i(t) is

ΔE_i = -h_i(t) ΔS_i(t).    (19)

According to ΔS_i(t), one can enumerate three cases as follows:


1. if ΔS_i(t) = +2, that is, S_i(t+1) = +1 and S_i(t) = -1, then h_i(t) > θ, so ΔE_i ≤ -2θ;
2. if ΔS_i(t) = -2, that is, S_i(t+1) = -1 and S_i(t) = +1, then h_i(t) < -θ, so ΔE_i ≤ -2θ;
3. if ΔS_i(t) = 0, that is, S_i(t+1) = S_i(t), then ΔE_i = 0.

In addition, the energy E(S) is bounded,

|E(S)| = | -(1/2) Σ_i Σ_j W(i,j) S_i S_j | ≤ (1/2) Σ_i Σ_j |W(i,j)|.    (20)

Therefore, the network converges to a stable state in a finite number of steps, and the energy decrement satisfies ΔE_i ≤ -2θ whenever a neuron state is altered. •

If the network employs eqn (18) with a fixed θ during the whole recall procedure, X(u) may not be retrieved. For example, suppose F_i(u) = F, S_i(t) ≠ X_i(u), and S_j(t) = X_j(u) (j ≠ i); then

h_i(t) = Σ_j W(i,j) S_j(t) = Σ_j W(i,j) X_j(u) = F_i(u) X_i(u) = F X_i(u),

because W(i,i) = 0. Thus |h_i(t)| = F; if θ > F, then -θ < h_i(t) < θ and S_i(t+1) = S_i(t) ≠ X_i(u). This implies that X(u) may never be recalled. One way to tackle this problem is to let the network start from a large θ and gradually reduce its value. In summary, the recall phase is carried out by the following procedure:

Step 1: Set the initial state S(0) and the initial value of θ.

Step 2: Update the neuron states according to eqn (18) until a stable state S' is reached.

Step 3: If θ equals 0, then stop; otherwise, reduce θ by some method, use S' as the initial state for the decreased θ, and go to step 2.

Here, the initial value of θ is equal to the maximal value of F output by the learning method (10). Two simple schemes for reducing θ, used in this paper, are as follows:

(S1) θ := θ - F;
(S2) θ := θ - F/2.

The dynamic threshold is based on the heuristic that the energy landscape around a spurious state may be shallower than that around a training pattern; this is true for the spurious states near the training patterns (Theorem 3). From this viewpoint, first updating those neuron states that produce a larger energy decrement may skip some spurious states. The main difference between (S1) and (S2) is that the former requires a larger energy reduction in a single stage (i.e., θ = F), whereas the latter does so in two stages (i.e., θ = F and θ = F/2). Other methods of reducing θ may also be used. Note that, according to eqn (18), the role of the dynamic threshold differs from that of a network with a self-connection at each neuron (Yanai & Sawada, 1990).
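A sketch of the dynamic threshold recall (our own code): the inner loop is the thresholded rule (18), and the outer loop lowers θ by scheme (S1) or (S2), starting from the value F returned by the learning method.

def recall_dynamic(W, S0, F, scheme="S2", max_sweeps=100):
    S = S0.copy()
    theta = F                                  # initial theta = maximal F
    while True:
        for _ in range(max_sweeps):            # run rule (18) to a stable state S'
            changed = False
            for i in range(len(S)):
                h = W[i] @ S                   # total input, eqn (18b)
                if h > theta and S[i] != 1:
                    S[i], changed = 1, True
                elif h < -theta and S[i] != -1:
                    S[i], changed = -1, True
                # when -theta <= h <= theta, S_i(t+1) = S_i(t)
            if not changed:
                break
        if theta <= 0:                         # step 3: theta has reached 0, stop
            return S
        step = F if scheme == "S1" else F / 2.0    # schemes (S1)/(S2)
        theta = max(theta - step, 0.0)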

In this section, a large number of simulations are carried out to evaluate the performance of the proposed simple learning method and the dynamic threshold. The following five approaches are compared:

(La) Gardner's learning rule;
(Lb) the minimum-overlap learning rule;
(Lc) learning rule (10) with the recall rule (1);
(Ld) learning rule (10) with the dynamic threshold (S1);
(Le) learning rule (10) with the dynamic threshold (S2).

In each simulation, W_m = 10, n(F) = 50, and ΔF = W_m/2 = 5 for the simple learning algorithm, and the initial value of θ equals the maximal value of F obtained by the learning algorithm (step 3).

4.1. Histogram of Successful Recalls for Random Patterns

In this part, we use randomly generated training patterns whose components take on the value +1 or -1 with equal probability, with N = 32 neurons and M = 8 or 12 training patterns. For each parameter setting (N, M), 50 sets of different training patterns are tested, and for each set the network is trained by the three learning rules, respectively. Noisy inputs are produced by randomly inverting some bits from +1 to -1 and vice versa according to an inversion ratio R < 0.5, so that the Hamming distance from an input S(0) to a training pattern X(u) equals RN. For each inversion ratio R, 1000 noisy inputs are tested for every training pattern. A trial is declared a success if the noisy input S(0) converges to X(u), and a failure otherwise. The percents of successful recalls are given in Figures 2 and 3.

FIGURE 2. Percents of successful recalls where N = 32 and M = 8.

FIGURE 3. Percents of successful recalls where N = 32 and M = 12.

From the experimental results, one can find that (a) Gardner's learning rule, the minimum-overlap learning rule, and Lc have similar performance; and (b) the dynamic threshold schemes (S1) and (S2) improve the average rate of convergence of a noisy input to a training pattern. For example, when M = 8, Gardner's learning method achieves more than 90% successful recalls if R < 0.10, and the dynamic threshold schemes (S1) and (S2) extend this to R < 0.15 and R < 0.20, respectively.
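The protocol of this subsection can be reproduced in outline as follows (our harness; the paper uses 1000 noisy inputs per pattern, reduced here for brevity):

def noisy_input(x, R, rng):
    # invert a fraction R of the bits, so that H(S(0), x) = R*N
    S = x.copy()
    k = int(round(R * len(x)))
    flip = rng.choice(len(x), size=k, replace=False)
    S[flip] = -S[flip]
    return S

def success_rate(W, X, F, R, trials_per_pattern=100, seed=0):
    rng = np.random.default_rng(seed)
    hits = total = 0
    for x in X:                                # every training pattern
        for _ in range(trials_per_pattern):
            S = recall_dynamic(W, noisy_input(x, R, rng), F)   # scheme (S2)
            hits += int(np.array_equal(S, x))
            total += 1
    return 100.0 * hits / total                # percent of successful recalls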

4.2. Histogram of Successful Recalls for Specific Patterns

In a character recognition problem, one wants to recognize M = 10 characters, shown in Figure 4. The network is trained in advance with these specific patterns and retrieves them from noisy inputs. The percents of successful recalls with respect to the inversion ratio R are given in Figure 5. The dynamic threshold again demonstrates improved recall.

FIGURE 4. Training patterns.

FIGURE 5. Percents of successful recalls for specific training patterns.

Finally, two points need to be emphasized. First, the three methods Lc, Ld, and Le are based on the same connection matrix, but Ld and Le have higher percentages of successful recalls. This implies that the dynamic threshold improves the recall reliability of the network. Second, the dynamic threshold does not actually reduce the number of spurious states; it only decreases the probability of converging to them.

5. CONCLUSION

In this paper, we have described a simple learning approach and a dynamic threshold concept for associative memories. The learning rule tends to store training patterns with basins of attraction as large as possible. The dynamic threshold rule aims at reducing the probability of converging to spurious states by using a gradually decreased threshold in the recall process. Future work is to develop a learning algorithm that minimizes the number of spurious states.

REFERENCES

Abbott, L. F., & Kepler, T. B. (1989). Optimal learning in neural network memories. Journal of Physics A: Mathematical and General, 22, L711-L717.

Amari, S. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, 21, 1197-1206.

Amari, S. (1977). Neural theory of association and concept formation. Biological Cybernetics, 26, 175-185.

Amari, S. (1990). Mathematical foundations of neurocomputing. Proceedings of the IEEE, 78, 1443-1463.

Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63-73.

Blum, E. K. (1990). Mathematical aspects of outer-product asynchronous content-addressable memories. Biological Cybernetics, 62, 337-348.

Bruck, J., & Goodman, J. W. (1988). A generalized convergence theorem for neural networks. IEEE Transactions on Information Theory, 34, 1089-1092.

Bruck, J., & Roychowdhury, V. P. (1990). On the number of spurious memories in the Hopfield model. IEEE Transactions on Information Theory, 36, 393-397.

Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326-334.

Diederich, S., & Opper, M. (1987). Learning of correlated patterns in spin-glass networks by local learning rules. Physical Review Letters, 58, 949-952.

Forrest, B. M. (1988). Content-addressability and learning in neural networks. Journal of Physics A: Mathematical and General, 21, 245-255.

Gardner, E. (1988). The space of interactions in neural network models. Journal of Physics A: Mathematical and General, 21, 257-270.

Hassoun, M. H., & Youssef, A. M. (1989). High performance recording algorithm for Hopfield model associative memories. Optical Engineering, 28, 46-54.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554-2558.

Kepler, T. B., & Abbott, L. F. (1988). Domains of attraction in neural networks. Journal de Physique, 49, 1657-1662.

Kohonen, T. (1984). Self-organization and associative memory. New York: Springer-Verlag.

Krauth, W., & Mezard, M. (1987). Learning algorithms with optimal stability in neural networks. Journal of Physics A: Mathematical and General, 20, L745-L752.

McEliece, R. J., Posner, E. C., Rodemich, E. R., & Venkatesh, S. S. (1987). The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory, 33, 461-482.

Michel, A. N., & Farrell, J. A. (1990). Associative memories via artificial neural networks. IEEE Control Systems Magazine, April, 6-17.

Muller, B., & Reinhardt, J. (1990). Neural networks: An introduction. Berlin: Springer-Verlag.

Personnaz, L., Guyon, I., & Dreyfus, G. (1986). Collective computational properties of neural networks: New learning mechanisms. Physical Review A, 34, 4217-4228.

Wang, T., Xing, X., & Zhuang, X. (1992). Characterizing one-layer associative neural networks with optimal noise-reduction ability. International Journal of Pattern Recognition and Artificial Intelligence, 6, 1009-1025.

Yanai, H., & Sawada, Y. (1990). Associative memory network composed of neurons with hysteretic property. Neural Networks, 3, 223-228.