INFORMATION SCIENCES 57-58, 171-180 (1991)
Distributed Associative Memory and the Computation of Membership Functions

EROL GELENBE
École des Hautes Études en Informatique, Université René Descartes (Paris V), 45 rue des Saints-Pères, 75006 Paris, France
ABSTRACT

The purpose of this paper is to introduce extensions and generalizations of the sparse distributed memory introduced by Kanerva [7], which we shall call distributed associative memory (DAM), concerning both the manner in which information is stored and techniques for improving the reading mechanism. Its effectiveness as a learning tool is discussed using a statistical model. We then show how distributed associative memory can be used to compute membership functions for decision-making under uncertainty.
1. INTRODUCTION
The storage and data processing capacities of the human brain appear to be enormous in comparison with the possibilities offered by computers, even though the amount of “hardware” in computers is comparable to that hypothetically available in the brain, while the basic speed of computers is orders of magnitude greater than the speed at which natural information processing systems carry out their basic operations. Thus, in order to perform tasks as difficult as visual or auditory recognition, or even the understanding of sentences in natural language, research has been widely conducted on parallel algorithms and systems inspired by models of the human brain [11]. The programming of such systems is necessarily very different from that of classical sequential algorithms, and different also from the usual parallel algorithms based on a relatively small number of parallel processing components. Therefore, many recent ideas in this area are based on neural network models of learning [4].

The main “architectural” advantage of the human brain over computers is still, for the time being, its very highly parallel organization. Even if each neuron is no more than a simple binary automaton whose behavior is stochastic and imperfect, assemblies of billions of neurons and of dendrites are
obviously more powerful in the scope of the problems they can handle, and in terms of their resiliency and robustness, than any existing computer system. Problem-solving approaches based on algorithms inspired by the structure of neural networks are called connectionist because the information is recorded by connections between the neurons. Contrary to von Neumann computers, artificial neural networks or connectionist algorithms provide a highly distributed mode of data processing [4] in which memory is stored in the connections and computations are carried out within the neurons.

On the other hand, the random access memory of a classical von Neumann computer consists of a set of storage locations. Each location is identified by a number (address) giving its position, and contains a data item which is also a number. The address and the data are both in general represented by binary vectors. Reading a conventional memory can be considered as providing an input to the system (an address) and obtaining a response (the data contents of the address). To process realistic visual or auditory information, both as “input” (address) and “output” (data contents of the address), such a system would have to accept very long binary address vectors. A small 20-by-20 bit “artificial retina,” for example, can code 2^400 different visual patterns. Unfortunately, existing random access memories possess on the order of one to several million storage locations (2^20 to 2^24), which can be conveniently addressed by virtual memory systems using 32- to 64-bit addresses.

Content-addressable (or associative) memories constitute another alternative for storing and retrieving data relative to artificial perception. However, the physical main memory sizes of conventional machines are obviously insufficient if one wishes to deal with content-addressable memories efficiently, since the address size must now be able to code as many objects as there are possible values of the memory contents.

Sparse distributed memory (SDM), the associative memory technique introduced by Kanerva [7], which we investigate and develop in this paper, provides a practically and theoretically appealing solution to the problem of implementing or simulating an associative memory capable of storing a very large number of items for a set of interesting applications. Its basic idea is to choose a few addresses among all the possible ones in order to address a very large associative memory space, and to store information redundantly by superimposing data concerning different items in partially identical locations, so that the effect of other data items on a given item of information is perceived as “noise.” It can be a very effective approach when the information which is stored and retrieved is of an approximate nature, as in many realistic applications such as pattern recognition or computer-assisted decision [6]. We also believe that it is a means of computing the membership functions used in fuzzy logic [10].
In this paper we first discuss SDM and then present some extensions which concern the manner in which information is stored and read. The generalizations pertain in particular to the iteration of the read operation in order to improve the quality of the information being retrieved, to the introduction of a “strength” or intensity of the stimuli being stored, and to the possibility of “forgetting” past information. We also modify the principle of information storage in SDM by assuming that it is carried out in a deterministic manner, contrary to the random choice of a storage area suggested for SDM. These modifications and extensions constitute a system we call distributed associative memory. In Section 3 we present a mathematical model of the performance of DAM [4], and in Section 4 we indicate how it can be used to compute membership functions which are useful for decision-making under uncertainty.

DAM can be simulated on conventional hardware. It can also be implemented on a highly parallel machine, since all operations are carried out in parallel on a set of counters. It could also obviously be implemented on specialized hardware.

2. SPARSE DISTRIBUTED MEMORY AND DISTRIBUTED ASSOCIATIVE MEMORY

Let us suppose we wish to store binary k-vectors, denoted by v, in a memory composed of the set of locations 𝒜. The addresses of locations in 𝒜 are themselves k-vectors. Each vector v is stored in a set of locations called the neighborhood of v, D(v):

$$D(v) = \{ a \in \mathcal{A} : \|a - v\| < d \} \qquad (1)$$
Here || · || is some appropriate distance function (for instance the Hamming distance) and d is a fixed value, for instance an integer. In [7], it is suggested that the set D(v) may be generated at random rather than in the deterministic manner indicated above. Each storage location a contains a set of k counters (K(a, 1), ..., K(a, k)). Let us now introduce the operations of writing and reading in this memory.

Writing. Suppose we want to store the vector v = (v(1), ..., v(k)), v(i) ∈ {0, 1}, in the memory. We simply carry out the operation

$$K(a, i) = K(a, i) + (2v(i) - 1), \qquad i = 1, \ldots, k,$$

for each a ∈ D(v).
Thus, +1 is added to the i-th counter if v(i) = 1, whereas −1 is added to the i-th counter if v(i) = 0, for each a in the neighborhood D(v) of v.

Reading. In order to read from the memory, a read vector v is used as the search vector. The output vector o(v) corresponding to v is obtained as follows. Let o(v, i) be the i-th element of o(v):
$$o(v, i) = \mathrm{sgn}\Big[\sum_{a \in D(v)} K(a, i)\Big], \qquad i = 1, \ldots, k,$$
where sgn(x) = 1 if x > 0, and sgn(x) = 0 if x ≤ 0.

We shall call the system described above distributed associative memory (DAM). It differs from Kanerva's [7] SDM in the manner in which the set of addresses which store the data is chosen. In DAM, the set D(v) is chosen in the deterministic manner given by (1). In SDM, a random subset ℬ of the set 𝒜 of all possible 2^k memory locations is chosen for reasons of size, and this subset is chosen so that E[|ℬ|] ≪ |𝒜|. Thus the probability of using identical memory locations for distinct patterns can be significant. This probability becomes small if the number of distinct patterns M to be stored in the memory is small compared to the capacity of the memory: M ≪ E[|ℬ|]. However, as shown below, the superimposition of different information patterns in the same locations is acceptable as long as we can be sure that these do not bias each other in a systematic manner, i.e., as long as different information patterns can be perceived by each other as being “unbiased noise.”
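To make the definitions concrete, the following minimal Python sketch renders the write and read operations above for a small k, enumerating the full address space {0,1}^k; the class name DAM, the parameters d and addresses, and the example values are our own illustrative choices, not an implementation taken from the paper.

```python
# Minimal DAM write/read sketch (illustrative only; all names are ours).
from itertools import product

def hamming(a, b):
    """Hamming distance between two equal-length binary tuples."""
    return sum(x != y for x, y in zip(a, b))

class DAM:
    def __init__(self, k, d, addresses=None):
        self.k = k                      # vector length
        self.d = d                      # neighborhood radius in Eq. (1)
        # Full address space by default (feasible only for small k);
        # a sparse subset of addresses, as in SDM, could be supplied instead.
        self.addresses = addresses or list(product((0, 1), repeat=k))
        # One set of k counters per location, all initially zero.
        self.K = {a: [0] * k for a in self.addresses}

    def neighborhood(self, v):
        """D(v): all locations within distance d of v (Eq. (1))."""
        return [a for a in self.addresses if hamming(a, v) < self.d]

    def write(self, v):
        """Add +1 or -1 to counter i of every location in D(v)."""
        for a in self.neighborhood(v):
            for i in range(self.k):
                self.K[a][i] += 2 * v[i] - 1

    def read(self, v):
        """o(v): sign of the summed counters over D(v)."""
        sums = [sum(self.K[a][i] for a in self.neighborhood(v))
                for i in range(self.k)]
        return tuple(1 if s > 0 else 0 for s in sums)

# Example: store a pattern, then read it back from a noisy probe.
mem = DAM(k=8, d=2)
pattern = (1, 0, 1, 1, 0, 0, 1, 0)
mem.write(pattern)
probe = (1, 0, 1, 1, 0, 0, 1, 1)      # one-bit error in the last position
print(mem.read(probe))                # recovers the stored pattern
```

For realistic k one would of course keep only a sparse set of addresses, as SDM does; the sketch enumerates the whole space purely for readability.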
2.1 DISTRIBUTED ASSOCIATIVE MEMORY
In this section we shall introduce several extensions to distributed associative memory. They concern both the manner in which reading can be carried out and the nature of the information stored. We shall also discuss the introduction of forgetfulness in the memory.

Iterative read. To read the data corresponding to vector v, we first compute o(v) as above, then o(o(v)), etc. The procedure is stopped after i iterations, when o^i(v) is “very close” to o^{i−1}(v) according to an appropriate stopping criterion; o^i(v) is then the value of the data which is read. Thus iterative read uses the data which has been read to improve its estimate of the address, and so on, until the procedure does not produce any further improvement.
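A possible rendering of the iterative read, reusing the illustrative DAM sketch above; the stopping rule used here (an exact fixed point, capped by max_iters) is just one concrete choice of the “appropriate stopping criterion”.

```python
def iterative_read(mem, v, max_iters=10):
    """Feed the output back in as the address until it stabilizes."""
    current = tuple(v)
    for _ in range(max_iters):
        nxt = mem.read(current)
        if nxt == current:        # stopping criterion: o^i(v) == o^{i-1}(v)
            break
        current = nxt
    return current

# Example (with mem and probe from the previous sketch):
# print(iterative_read(mem, probe))
```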
Strength of stimuli. When one wishes to store a pattern v, it may be useful to consider not only the pattern but also the “strength” of the pattern. A stimulus will be a couple (v, e), where v is a pattern and e is a positive integer representing the strength of the stimulus. Thus, when a stimulus is stored, the writing algorithm becomes

$$K(a, i) = K(a, i) + (2v(i) - 1)\,e, \qquad i = 1, \ldots, k,$$

for each a ∈ D(v). This allows the user of the memory to introduce a measure of the importance of the information being stored, or perhaps of the confidence one has in it. Similarly, the information we read from memory may also carry this type of information, so that the generalized output becomes the vector O(v) = (O(v, 1), ..., O(v, k)), where

$$O(v, i) = \Big( \mathrm{sgn}\Big[\sum_{a \in D(v)} K(a, i)\Big],\ \Big|\sum_{a \in D(v)} K(a, i)\Big| \Big).$$

Forgetfulness. If an associative memory is used over long time periods, it may be useful to be able to erase the memory. An obvious way of doing this is to set all of its contents to zero; however, this would in general be too brutal. On the other hand, it is very difficult to set the memory “selectively” to zero for certain of its contents. Therefore, it may be quite convenient to erase “old” information as the memory operates. In order to achieve this it suffices to modify the writing algorithm as follows:

$$K(a, i) = K(a, i)\,p + (2v(i) - 1)(1 - p), \qquad i = 1, \ldots, k,$$

for each a ∈ D(v), where 0 < p < 1 is used to reduce the effect of past information on the memory contents.
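The strength-weighted write, the generalized (sign, magnitude) output, and the forgetting rule can be sketched as follows, again on top of the illustrative DAM class; the function names and the keyword decay (standing for the factor p above) are our own.

```python
def write_with_strength(mem, v, e):
    """Weighted write: each counter in D(v) moves by (2v(i)-1)*e."""
    for a in mem.neighborhood(v):
        for i in range(mem.k):
            mem.K[a][i] += (2 * v[i] - 1) * e

def read_with_strength(mem, v):
    """Generalized output: (sign, magnitude) of each counter sum over D(v)."""
    sums = [sum(mem.K[a][i] for a in mem.neighborhood(v))
            for i in range(mem.k)]
    return [(1 if s > 0 else 0, abs(s)) for s in sums]

def write_with_forgetting(mem, v, decay):
    """Forgetting write: old contents of the locations in D(v) shrink by the
    factor decay (the paper's p), while the new pattern is added with weight 1-decay."""
    for a in mem.neighborhood(v):
        for i in range(mem.k):
            mem.K[a][i] = mem.K[a][i] * decay + (2 * v[i] - 1) * (1 - decay)
```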
3. ASYMPTOTIC ANALYSIS [4]
Let us denote by k = 0, 1, ... the successive instants at which information is stored into the memory, and let w(k) be the binary vector representing the information which is being stored at time k. Thus W = {w(k), k = 0, 1, ...} is the sequence of vectors being stored. We denote by w(i, k) the i-th element of the vector w(k); d_v(w(k)) denotes the size of the intersection of the sets D(v) and D(w(k)), while d_v denotes the size of D(v).

Suppose that we are interested in the remanence of a vector v which is stored regularly in the memory; thus, there is some subsequence of W all of whose elements are either identical to v or are “close” to v. Let C(v) and C′(v) be (not necessarily disjoint) subsets of 𝒜: w(k) ∈ C(v) if at time k, v is being written, possibly with some bit errors, while if w(k) ∈ C′(v), then at time k the intention is not to write v. The memory will be observed via the contents of o(v) at time k, and more particularly through the variable V(i, k) = Σ_{a ∈ D(v)} K(a, i). We assume without loss of generality that v(i) = 1 and shall examine P[V(i, k) > 0], since it is the probability that the i-th element of o(v) and v(i)
coincide at time k. At any instant k, we need to consider three possibilities concerning w(i, k): (i) w(k) ∈ C(v) and w(i, k) = v(i) = 1; (ii) w(k) ∈ C(v) and w(i, k) = 0, so that there is a bit error in position i of w(k); (iii) w(k) ∈ C′(v). This yields:
$$V(i, k + 1) = \begin{cases} V(i, k) + d_v(w(k)), & \text{if (i) is true} \\ V(i, k) - d_v(w(k)), & \text{if (ii) is true} \\ V(i, k) + B(i, k), & \text{if (iii) is true} \end{cases} \qquad (2)$$
where B(i, k) is the effect of the possibly non-empty intersection between D(v) and D(w(k)) in case (iii), so that |B(i, k)| = d_v(w(k)) for w(k) ∈ C′(v). In order to proceed with a probabilistic analysis, let us make the following assumptions:
— The {B(i, k), k = 0, 1, ...} are independent and identically distributed random variables; clearly we always have |B(i, k)| ≤ d_v.
— The variables which indicate whether w(k) is in C(v), denoted {1(w(k) ∈ C(v))}, are also i.i.d. random variables, and we write p = P{w(k) ∈ C(v)}; p is the probability that, at any instant k, the intention is to store v.
— We denote by f the probability of a bit error in the i-th position when we wish to store v: f = P[w(i, k) = 0 | w(k) ∈ C(v)].

Using (2) and the above probabilistic assumptions, we use the central limit theorem (since V(i, k) is a sum of independent random variables) to write, as k → ∞:
$$\frac{V(i, k)}{k} \;\approx\; N\!\left( p(1 - f)\,m - p f\, n + (1 - p)\, b,\ \ \frac{S\,p(1 - f) + Q\,p f + T(1 - p)}{k} \right) \qquad (3)$$

where

m = E[d_v(w(k)) | (i)],   n = E[d_v(w(k)) | (ii)],   b = E[B(i, k)],
S = Var[d_v(w(k)) | (i)],   Q = Var[d_v(w(k)) | (ii)],   T = Var[B(i, k)].

It is easy to see that n < m. Thus, intuitively, DAM will have a satisfactory behavior if E[V(i, k)/k] > 0. As a consequence of (3) we see that P[V(i, k) > 0] → 1, implying that in the long run the DAM will have stored the i-th position v(i) of v “perfectly” (with probability 1), under the following conditions:

(a) If the values stored which are different from v (i.e., when w(k) ∈ C′(v)) do not introduce a bias, b = E[B(i, k)] = 0, we see that it suffices that the bit error probability f, for any stored v, be f < 0.5. For a positive bias b > 0, obviously the same condition will be sufficient.

(b) If, on the other hand, we have the worst case where the bias is negative, b < 0, then the necessary condition for the SDM to have “perfect behavior in the long run” is
$$f < \frac{1}{2}\left[1 - \frac{(1 - p)\,|b|}{p\, m}\right].$$
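The long-run behavior described in condition (a) can be checked with a small Monte Carlo sketch of recurrence (2); the distributions assumed below for d_v(w(k)) and for the unbiased interference B(i, k), and all numerical values, are arbitrary illustrative choices, not part of the model.

```python
import random

def simulate_V(steps, p, f, d_v=10, rng=random):
    """Simulate recurrence (2) for one bit position, with unbiased interference (b = 0)."""
    V = 0
    for _ in range(steps):
        if rng.random() < p:                    # cases (i)/(ii): the intention is to write v
            overlap = rng.randint(1, d_v)       # d_v(w(k)): assumed uniform on 1..d_v
            V += -overlap if rng.random() < f else overlap
        else:                                   # case (iii): some other pattern is written
            overlap = rng.randint(0, d_v)       # intersection size, possibly zero
            V += rng.choice((-1, 1)) * overlap  # B(i, k): symmetric, hence E[B] = 0
    return V

# With f < 0.5 and no bias, the estimate of P[V(i, k) > 0] should approach 1 as k grows.
for k in (100, 1_000, 10_000):
    trials = 200
    hits = sum(simulate_V(k, p=0.1, f=0.3) > 0 for _ in range(trials))
    print(k, hits / trials)
```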
3.1 THE SIGNAL-TO-NOISE RATIO
Since by (3) the variable V(i, k)/k is approximately Gaussian for large enough k, we can examine the signal-to-noise ratio R(k) concerning the storage of the i-th bit v(i) of v just after time k. Assuming that b = 0, it is given by

$$R(k) = \frac{p\, m\,(1 - f)\, k^{0.5}}{\left[(p f n)^2 + Q\, p f + T(1 - p)\right]^{0.5}}.$$
The worst case is obtained by letting n = d_v and Q = T = (d_v)^2, since these are their largest possible values. Without loss of generality, let us write

$$m = (1 - x)\, d_v$$

for some 0 ≤ x ≤ 1; we then obtain a lower bound (worst case) on the signal-to-noise ratio R(k), which is

$$R(k) \ge \frac{p\,(1 - f)(1 - x)\, k^{0.5}}{\left[(f p)^2 + f p + 1 - p\right]^{0.5}}.$$
The parameter x is a measure of the error made in writing v when w(i, k) = 1, i.e., when the i-th bit is correctly written as v(i) = 1, so that we are certain that x < 1. We see that in the worst case the signal-to-noise ratio increases as the square root of k. From the expression for R(k) we can also easily see the effect of p. If the number of patterns which have to be stored in memory is M, we may assume that p = 1/M. Thus, the signal-to-noise ratio is approximately inversely proportional to the number of patterns to be stored.
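As a quick numerical illustration of the worst-case bound and of the effect of p = 1/M, the following sketch evaluates the lower bound above for a few values of M; the values chosen for f, x, and k are arbitrary.

```python
def snr_lower_bound(k, p, f, x):
    """Worst-case bound: p(1-f)(1-x)k^0.5 / [(fp)^2 + fp + 1 - p]^0.5."""
    return p * (1 - f) * (1 - x) * k ** 0.5 / ((f * p) ** 2 + f * p + 1 - p) ** 0.5

# With p = 1/M, the bound falls roughly as 1/M for moderate M.
for M in (2, 10, 100):
    print(M, round(snr_lower_bound(k=10_000, p=1 / M, f=0.1, x=0.2), 3))
```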
4. ASSOCIATIVE MEMORY AND THE COMPUTATION OF MEMBERSHIP FUNCTIONS
Let us now see how DAM can be used to compute the membership functions used in fuzzy decision processes. Consider a set of M patterns {v_{01}, ..., v_{0M}} which have been stored in the associative memory, i.e., over a large number of occurrences, the values w(k) of the information written into the distributed associative memory using the writing algorithm of Section 2 are among these M patterns, or patterns “close” to these M patterns. Each pattern v_{0i} can be considered to be the “typical” value of the following set:

$$V_{0i} = \{ v : \|v - v_{0i}\| \le R \}, \qquad i = 1, \ldots, M,$$
where R is an appropriate distance. When a pattern v is presented to the distributed associative memory and a read operation takes place, yielding the output value o(v), the following functions are computed: for i = 1, ..., M we take

$$\mu_{0i}(v) = 1 - \frac{\| o(v) - o(v_{0i}) \|}{n}.$$
Thus we are computing the membership function of v with respect to the “perception” that the associative memory has of v and of v_{0i}.
This “perception” is, of course, based on the real data which has been stored in the associative memory, i.e., on the sequence w(k), k = 0, 1, ..., which is the data actually introduced into the memory. Obviously, 0 ≤ μ_{0i}(v) ≤ 1, since the distance || · || between any two vectors can never exceed n, the length of the vectors; indeed, the distance between two binary vectors is greatest when all their elements are different. We may also consider the vector membership function

$$\mu(v) = (\mu_{01}(v), \ldots, \mu_{0M}(v)),$$

which will provide a global characterization of v with respect to the “perception” which the distributed associative memory has of the set of typical patterns {v_{01}, ..., v_{0M}}.
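Using the DAM and hamming sketches of Section 2, the membership computation can be written as follows; typical_patterns stands for {v_{01}, ..., v_{0M}}, and the Hamming distance plays the role of || · ||.

```python
def membership(mem, v, typical_patterns):
    """mu_{0i}(v) = 1 - ||o(v) - o(v_{0i})|| / n for each typical pattern v_{0i}."""
    n = mem.k                       # length of the stored vectors
    o_v = mem.read(v)
    mus = []
    for v0 in typical_patterns:
        o_v0 = mem.read(v0)
        mus.append(1.0 - hamming(o_v, o_v0) / n)
    return mus                      # the vector membership function mu(v)

# Example (with mem, pattern, probe, and hamming from the earlier sketch):
# print(membership(mem, probe, [pattern]))
```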
5. CONCLUSIONS
Starting with the concept of sparse distributed memory introduced by Kanerva [7], we have introduced some modifications and extensions which we call distributed associative memory. Using methods of statistical analysis, we show that distributed associative memory has learning capacity even in the presence of errors in the information which is presented to it and of noise in the data. We then indicate how it can be used to compute membership functions useful in fuzzy decision-making.

REFERENCES

1. J. Bourrely and E. Gelenbe, Mémoires associatives: évaluation et architectures, Comptes Rendus Acad. Sci. Paris, t. 309, Série II, 523-526 (1989).
2. C. F. Foster, Content Addressable Parallel Processors, Van Nostrand-Reinhold, New York, 1976.
3. E. Gelenbe, Random neural networks with negative and positive signals and product form solution, to appear in Neural Computation, Vol. 1, No. 4 (1989).
4. E. Gelenbe, An evaluation of sparse associative memory, submitted to Neural Computation (1990).
5. D. O. Hebb, The Organization of Behavior, John Wiley & Sons, New York, 1984.
6. J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci. USA, 79:2554-2558 (1982).
7. P. Kanerva, Self-propagating search: A unified theory of memory, Ph.D. thesis, Stanford University, 1984. See also Sparse Distributed Memory, MIT Press, Cambridge, Mass., 1988.
8. T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.
9. T. Kohonen, E. Oja, and P. Lehtiö, Storage and processing of information in distributed associative memory, in Parallel Models of Associative Memory (G. E. Hinton and J. A. Anderson, Eds.), Lawrence Erlbaum Associates, Hillsdale, N.J., 1981, pp. 105-143.
10. D. Marr, Simple memory: a theory for archicortex, Philosophical Transactions of the Royal Society of London, B-262:23-81 (1971).
11. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing, Vols. I and II, Bradford Books & MIT Press, Cambridge, Mass., 1986.

Received 1 April 1990; revised 25 July 1990