
Neural Networks, Vol. 2, pp. 289-293, 1989. Printed in the USA. All rights reserved. Copyright 1989 Maxwell Pergamon Macmillan plc.

ORIGINAL CONTRIBUTION

Information Storage and Effective Data Retrieval in Sparse Matrices

H. J. Bentz, M. Hagstroem, and G. Palm

Requests for reprints should be sent to Prof. Dr. Hans J. Bentz, FB Mathematik/Informatik, University of Osnabrueck, Albrechtstr. 38, D-4500 Osnabrueck.

1. INTRODUCTION

In recent years there has been an increasing interest in neural-network implementations of computational tasks (Hopfield & Tank, 1986). One of these tasks, where the computational analysis of the performance of a neural-network implementation is comparatively advanced, is associative memory. The most obvious reason for this recent development is the fact that the units in a neural network can work on the task in parallel, and thus a considerable gain in speed of performance is to be expected. It is usually conceded that such a gain in speed has to be paid for (more than proportionally) in terms of hardware. In the associative memory task it could be shown that even a serial simulation of the associative memory network would have to perform fewer operations than a conventional implementation of an associative memory in terms of bitwise masked comparisons (Palm, 1987). In this article we consider a variation of a serial implementation of an associative memory network. The basic idea is to use a sparse matrix of zeros and ones as the network connectivity matrix C = (c_ij), where


c_ij denotes the connectivity from i to j, and to represent this matrix by means of pointers. More specifically, the matrix is considered as an m x n matrix, where m << n. The n columns of the matrix represent certain features of the patterns to be stored (in our example in the next section the patterns are typed words and the features are triplets of consecutive letters), and the m rows of the matrix represent the patterns themselves (or the desired responses or commands associated to these patterns). When the rth pattern is stored (learned), the set of its features is laid down as a sequence of zeros and ones in the rth row of the storage matrix (where a 1 signifies presence and a 0 absence of a feature). A stored pattern is retrieved from the memory by inputting a similar pattern to the matrix, that is, converting it into a vector x of 0-1-feature-indicators which is multiplied by the matrix C. The response of the memory is the pattern (or row) r that has the largest entry in the output vector y = Cx. Similar schemes have been investigated by several authors (e.g., Hopfield & Tank, 1986; Kohonen, 1984; Palm, 1980; Rumelhart, McClelland, & the PDP Research Group, 1986); here we want to concentrate on the investigation of the computational advantages of a serial implementation by means of pointers. In the work of Palm (e.g., Palm, 1987) it has been stressed that such a scheme works more efficiently if the ones in the row- and column-vectors to be


associated are sparse. In this application there is only one row active at a time, but the number of columns activated by a pattern equals the number of features that are present in the pattern. Thus the requirement of sparseness translates into the requirement that all features should have an equally low probability of being present in a pattern. In the course of the subsequent analysis it will turn out that the algorithm indeed works faster than conventional approaches and requires at most a little more memory space, provided that the patterns are extremely sparse. Thus one has to consider techniques of representing a pattern in terms of a very large number of features which all have a very low probability of occurring.

2. ASSOCIATIVE STORAGE AND RETRIEVAL USING POINTER-STRUCTURES

The matrix which contains the patterns representing the stored items (data) is filled with "zeros" and "ones." As we intend to use a code where the ones are sparsely distributed in the code-string, we may expect a large proportion of the matrix to be "empty." Some space is thus not used directly, and one may ask whether it cannot be saved by defining the matrix-scheme in a different way. In fact, the method we think of allows one to structure a set of data with the help of "pointers." We will apply this idea to our matrix.

To make things clear, let us briefly discuss an example. Given three items, say (a) TEA, (b) TEAM, and (c) CREAM, which have to be stored in the matrix. For simplicity we have chosen an input code which discriminates the items according to triplets of consecutive letters. The output consists of the joint address-numbers of the items (i.e., (a), (b), (c) here). It is very useful to add some silent sign, # say, to each item, just to indicate the beginning and the end of the word. Among the many (different) possibilities for the matrix-arrangement with pointers, we will discuss a setting where an initial pointer is reserved for every column of the matrix. In this setting the coordinates of the "pattern-atoms" read as in Table 1.

TABLE 1
Entries of features in the headline of the matrix, pointing to the addresses of occurrence. The two dots stand for the empty stretches of the headline between occurring triplets.

..  #CR  ..  #TE  ..  CRE  ..  EAM  ..  REA  ..  TEA  ..  AM#  ..  EA#  ..
    (c)      (a)      (c)      (b)      (c)      (a)      (b)      (a)
             (b)               (c)               (b)      (c)

Another possibility would be to install a new pointer for each newly occurring triplet. In this case one would have to replace the two dots occurring repeatedly in the "headline" of Table 1 by arrows. The number N of the column of a certain triplet uvw in the matrix is found by the formulas

N(#, v, w) = W·v + w + 1,
N(u, v, w) = W²·(u + 1) + W·v + w + 1,
N(u, v, #) = W²·(W + 1) + W·u + v + 1,

depending on whether the silent sign occupies the first, no, or the last position. Here W = 26 is the length of the alphabet, with A = 0, B = 1, ..., Z = 25. The first triplet, #AA, has the number N(#, 0, 0) = 1, and for the last, ZZ#, one finds the number N(25, 25, #) = 18928. The latter also indicates the number n of columns of the matrix. Now the entries in the matrix are represented by pointers, leading to the arrangement of Table 1. The empty rest of the matrix-scheme is omitted.

The answer to an input-question can be found as follows. Assume TEA is offered in the input-string. The adjoint triplets #TE, TEA, EA# will call the addresses (a), (b); (a), (b); and (a), respectively, which leads, on level 3, to the output with address (a), i.e., TEA. In case there is no unique decision on the output-address, more than one answer is associated to the input-string, and in the output all possible options can be shown.
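As an illustration of this scheme, the following short Python sketch (ours, not part of the original article; all names are our own choices) stores the three example items and answers the query TEA. A dictionary stands in for the headline with its initial pointers, lists stand in for the pointer columns, and our reading of the column-number formula is included as a check.

    # Minimal sketch of the pointer-based associative memory of Section 2
    # (illustrative only; names and data structures are our own choices).

    def triplets(word):
        """Features of a word: triplets of consecutive letters, with the
        silent sign '#' marking beginning and end (TEA -> #TE, TEA, EA#)."""
        w = "#" + word + "#"
        return [w[i:i + 3] for i in range(len(w) - 2)]

    def column_number(t, W=26):
        """Column number N of a triplet, our reading of the formula above;
        it yields N('#AA') = 1 and N('ZZ#') = 18928."""
        val = lambda c: ord(c) - ord("A")       # A = 0, ..., Z = 25
        u, v, w = t
        if u == "#":
            return W * val(v) + val(w) + 1
        if w == "#":
            return W * W * (W + 1) + W * val(u) + val(v) + 1
        return W * W * (val(u) + 1) + W * val(v) + val(w) + 1

    def store(memory, address, word):
        """Learning: append the item's address to the column of each feature."""
        for t in triplets(word):
            memory.setdefault(t, []).append(address)

    def retrieve(memory, word):
        """Retrieval: walk the columns of the input's triplets, add up an
        indicator per address, and return the address(es) on the top level."""
        level = {}
        for t in triplets(word):
            for address in memory.get(t, []):   # unknown triplet: empty column
                level[address] = level.get(address, 0) + 1
        best = max(level.values(), default=0)
        return sorted(a for a, v in level.items() if v == best), best

    memory = {}
    for address, word in [("a", "TEA"), ("b", "TEAM"), ("c", "CREAM")]:
        store(memory, address, word)

    print(column_number("#AA"), column_number("ZZ#"))   # 1 18928
    print(retrieve(memory, "TEA"))                      # (['a'], 3): level 3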

3. OPTIMISATION OF CODING PARAMETERS

In this section we will define some parameters concerning both the storage capacity and the retrieval speed of the pointer arrangement. Still we think of a matrix-scheme with an extremely clipped Hebb-like synaptic rule; that is, each cell in the matrix is filled with either "0" or "1", and only the "ones" are represented by pointers.

Storage Capacity

Let Z be the number of words to store in the scheme. Let B be the number of bits per word.


In terms of matrix-parameters this corresponds to the amount of information

B = n · (-p·ld p - q·ld q),   (2)

where n equals the number of columns of the matrix, p equals the probability of finding a "1" at a certain position, and q = 1 - p; ld denotes the base-2 logarithm. Thus the total amount of information to store is given by the product Z·B. This has to be related to the storage space that is necessary to create the pointer-structure described above. The quotient of these two amounts will serve as a criterion for the (relative) storage capacity of such schemes, so we have to investigate the term

H = (bits of the words to store) / (bits necessary for the pointer structure).

Explicitly,

H = n·Z·(-p·ld p - q·ld q) / [n·δ + n·p·Z·(ld Z + δ)].   (3)

The numerator is the product Z·B (bits). The denominator consists of several terms: n·p·Z·ld Z, necessary to express the Z addresses; n·p·Z·δ, to express the pointers according to Table 1; and n·δ, for the initial pointers of Table 1. Here δ is some constant depending on the machine and the programming-language, namely the number of bits needed to specify one pointer. Furthermore, we write "ln" to denote the natural logarithm.

This expression has to be optimized with respect to p. Standard calculations lead to the equation

ln(q/p) · [δ·ln 2 + p·Z·(ln Z + δ·ln 2)] = Z·(ln Z + δ·ln 2) · (-p·ln p - q·ln q).   (4)

Some obvious estimations give the simplified relation

(1/p) · ln(1/p) = Z·(ln Z + δ·ln 2) / (δ·ln 2),   (5)

which we call the "Optimality-Condition," from which the optimal p can be determined as soon as Z is known.

Remark: Relation (5) gives a very interesting hint towards the magnitude of p. This number must be small if Z >> 1, which means that again one has to use a coding with rather sparsely distributed "ones."

As to the number H, which cannot be expressed by easy means as a function of p (or Z) alone, the following simplified expression serves as a lower bound (for sufficiently large Z):

Z·p / (δ·ln 2) ≤ H,   (6)

where Z and p are related by the optimality-condition (5).
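For illustration (ours, not from the paper), the optimality-condition (5) can be solved numerically for p, and the bound (6) evaluated, for δ = 32; up to rounding this reproduces the entries of Table 2 in Section 4 below:

    import math

    def optimal_p(Z, delta=32):
        """Solve (1/p)*ln(1/p) = Z*(ln Z + delta*ln 2)/(delta*ln 2) for p
        by bisection on x = 1/p (x*ln x is increasing for x > 1)."""
        rhs = Z * (math.log(Z) + delta * math.log(2)) / (delta * math.log(2))
        lo, hi = math.e, rhs + 2.0              # bracket for x = 1/p
        for _ in range(200):
            mid = (lo + hi) / 2
            if mid * math.log(mid) < rhs:
                lo = mid
            else:
                hi = mid
        return 1.0 / lo

    for Z in (100, 1_000, 10_000, 100_000, 1_000_000):
        p = optimal_p(Z)
        H = Z * p / (32 * math.log(2))          # lower bound (6)
        print(f"Z = {Z:>9,}   1/p = {1 / p:>9,.0f}   H >= {H:.1%}")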

Now, given δ and Z, the formulas (5) and (6) give an estimate of the best possible p and of the storage capacity of the scheme.

Retrieval Speed

As to the retrieval speed, we have to investigate six steps, mainly:

1. coding of the input;
2. search through the columns along the pointers;
3. copy the relevant addresses into a list (the list for the final output);
4. either fill the address into the list, or add up the threshold indicator at the address in the list;
5. find the maximum threshold indicator in the list;
6. find the attached address(es) and send out.

Here we omit the first step, because the time needed to transform one input string heavily depends on the code function. As long as this function is not known, it is useless to base calculations on it. (In our concrete example it does not contribute much; see below.) For the second step, the length of the columns is important. As the retrieval runs via addresses, the average column-length is of interest. This number can be calculated: we find n·p·Z/n, or simply pZ. The third and fourth steps consist of arithmetical additions and pointer settings, according to the threshold and the existence of addresses in the list. In detail: assume that the input string hits k "atoms" in the headline. The attached columns have to be transformed into a list. Here either some place must be prepared for the addresses or, if already present, some addition must be made to fix the actual indicator appended to the single entries. In step five the maximum indicator has to be found. This number indicates the address which contains the output string (there may be more than one). Altogether the machine has to perform about:

Step 2: k·kpZ comparisons (upper bound)
Step 3: kpZ comparisons, kpZ additions
Step 4: kpZ insertions
Step 5: kpZ comparisons
Step 6: kpZ comparisons

Total: k·kpZ + 5kpZ, which is the (average) number of operations on the memory-structure needed to find an "answer" to the input. Denote this number by T. In order to express T in terms of Z and B, we make use of the relations k = p·n and n = B/(-p·ld p):

T = pZB²/(-ld p)² + 5pZB/(-ld p).   (7)


By omitting the square in the denominator, an upper bound for this expression is found:

T ≤ pZB²/(-ld p) + 5pZB/(-ld p)
  = ln 2 · [pZB²/(-ln p) + 5pZB/(-ln p)]
  = (ln 2)² · (B² + 5B) · δ / (ln Z + δ·ln 2),   (8)

where the last step uses the optimality-condition (5). Furthermore, omitting ln Z in the denominator leads to

T ≤ (ln 2)² · (B² + 5B) · δ / (δ·ln 2) = ln 2 · (B² + 5B),   (9)

which gives an upper bound for T that is, surprisingly, independent of Z. Compared with bitwise masked comparisons, where the number of operations is proportional to Z·B, the method analyzed here is by far the faster one for (sufficiently) large Z. The respective time-units depend on the CPU and the program. We have done some estimations in our concrete example.

4. CONCRETE NUMERICAL EXAMPLE

Storage Space

In this section we will introduce some calculations regarding different values of Z. In our machine and programming language a pointer demands 4 Bytes, hence δ = 32. Some rough estimates are given in Table 2.

TABLE 2
Storage capacity H as a function of the number Z of items and of the parameter p (δ = 32)

Z           (1/p)·ln(1/p) = Z(ln Z + 32·ln 2)/(32·ln 2)     p             H
100                 121                                     1/34          13.23%
1,000             1,316                                     1/240         18.95%
10,000           14,286                                     1/1,882       24.16%
100,000         153,846                                     1/15,760      28.68%
1,000,000     1,666,667                                     1/137,600     33.03%

We have realised an associative memory based on the pointer representation of the matrix, as explained above. About 5000 normal words of maximal length 16 and average length 8 were stored. Input code: "triplets" of letters, as in the example of Table 1. Output code: the addresses 1 to 5000. The matrix-representation itself would need a space of more than 1000 KByte, too much for the standard RAM in our PC/AT. The pointer structure, however, only needs about 300 KByte, which fits well into the RAM.

This result can be checked by the following calculation. The "headline" looks as in Table 1 and has an initial length of about 19,000. In our particular problem it can be reduced to about 15,000, because some triplets of letters do not occur in normal language. So it needs 15,000 times 4 Bytes, which equals 60,000 Bytes. The field of the columns needs 2 (address) plus 4 (pointer) Bytes per entry. Hence the denominator of (3) amounts to

15,000 · 4 + 8 · 5000 · (2 + 4) ≈ 300,000 Bytes.

The numbers 8 (the average number of ones in the input-string), 15,000 (the length of the headline) and p are intimately related, as we may assume p = 8/15,000 = 1/1875. This p is fairly close to the 1/1882 found in the third entry of Table 2. So, under these conditions, the Table tells us how many words one may store in the scheme such that the space is well used, plus how well it is used.
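This arithmetic can be rechecked directly (a trivial sketch of ours, using the byte counts quoted above):

    # Storage check for the concrete example, with the figures from the text:
    headline  = 15_000      # initial pointers (triplets occurring in language)
    ptr_bytes = 4           # bytes per pointer (delta = 32 bits)
    entry     = 2 + 4       # address plus pointer bytes per column entry
    k, Z      = 8, 5000     # average triplets per word, number of words

    print(headline * ptr_bytes + k * Z * entry)   # 300000 Bytes, ~300 KByte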

Retrieval Speed

The coding of the input in the retrieval-phase is rather fast in comparison to the columns-operation, so we may neglect the time it needs. The search for the correct addresses is the most time-consuming task in our example. According to the numbers of additions and comparisons noted above, the machine has to do up to (k + 4)·kpZ comparisons and kpZ additions. Altogether, as k = 8 and Z = 5000 in the present case, the computer performs 8 · 5000/1875 additions and 12 · 8 · 5000/1875 comparisons. Now, the PC/AT on which our program runs needs less than 7 · 10⁻⁶ seconds for a comparison and for an addition, respectively. So, for this part of the operations, the computer answers in a fraction of a second. We calculate the average retrieval time:

(8 + 5) · 8 · (5000/1875) · 7 · 10⁻⁶ ≈ 0.0020 (seconds),

which also was the retrieval time we observed on our computer. The retrieval time still stays far below 1 second, even if the full number of 10,000 items, according to Table 2, were stored.

Remark: As normal language words are rather redundant with respect to the occurring letters, it can happen singularly that the column length is stretched by a factor of 100, as we have seen from some internal statistics. In this "worse" case, the following number gives an estimate for the upper bound of the retrieval time in our example:

(8 + 5) · 8 · (5000/1875) · 100 · 7 · 10⁻⁶ < 0.20 (seconds).
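Both estimates follow from the operation counts (again a sketch of ours, with the 7-microsecond figure quoted above for the PC/AT):

    k, Z, p = 8, 5000, 1 / 1875   # hits per input, stored words, density
    op_time = 7e-6                # seconds per comparison or addition (PC/AT)

    ops = (k + 5) * k * p * Z     # (k+4)*kpZ comparisons plus kpZ additions
    print(f"average:    {ops * op_time:.4f} s")        # ~0.002 s
    print(f"worst case: {ops * 100 * op_time:.2f} s")  # < 0.20 s (columns x100)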

To compare this result with formula (9), we first have to estimate B. The average word length equals 8, and one letter carries 5 bits of information (roughly), so we find B = 40. Hence

T ≤ ln 2 · (1600 + 200) ≈ 1250,

and the time needed to perform this number of operations will be 1250 · 7 · 10⁻⁶ < 0.009 (seconds).

5. INTERPRETATION OF THE RESULTS

The question arises in which cases the studied method is recommendable. It depends on the relation of the numbers Z and B, plus on the way of coding (which determines the parameter p). In our opinion the method is applicable if B is of about the same size for all items to store and Z is large compared to B. If these requirements are met, one has to find a way of coding such that the parameter p is small. This implies that pZ, which gives the average number of "ones" in a column of the matrix, is not too big, and, consequently, "short" columns in the pointer representation can be expected. Hence the total number of pointers (time- and space-consuming) needed for the scheme remains small, compared with the "many" empty places in the full bitwise matrix representation. Small p means sparse coding. The optimality condition (5) serves as a tool to determine the magnitude of this parameter.

How difficult or easy is sparse coding? In many problems sparse coding can easily be realized. Note that such codes are not uniquely determined, in general. In written-language problems there is a big variety of sparse codes, because there are many obvious correlations of the letters in words to be used (in our example we have used triplets of consecutive letters), and to find an appropriate p close to the optimal one (according to the actual problem) should not be too difficult. However, there are also problems where the "right" type of features cannot be seen immediately; this implies further investigations of the problem. We hope that this note helps to encourage many more researchers to investigate storage-problems with respect to sparse coding, not only in order to present alternative and sometimes more adequate solutions, but also to gain more experience with this new coding technique.

The implementations of associative memories that we have realized so far were mostly based on simulations of neural networks on a (conventional) computer (Bentz, 1988; Bentz & Meierarend, 1988). In the future, many processes in the "learning-phase" as well as in the retrieval-phase could run completely independently and in parallel. Thus, hardware realizations of massively parallel systems are highly needed. Such configurations would allow one to speed up the operations in the associative memory extremely well, such that a large amount of information could be stored and at the same time the stored items could be read out very fast.

REFERENCES

Bentz, H. J. (1988). DIKAS: A model of a neuronal network with extremely clipped Hebb-like synaptic rule. Unpublished manuscript, Universität Osnabrück.

Bentz, H. J., & Meierarend, C. (1988). Assoziative Speicher als Bausteine für Expertensysteme. In A. B. Cremers & W. Geisselhardt (Eds.), 1. Anwenderforum Expertensysteme: Proceedings. Duisburg: Universität Duisburg.

Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625-633.

Kohonen, T. (1984). Self-organization and associative memory. New York: Springer-Verlag.

Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 19-31.

Palm, G. (1987). Computing with neural networks. Science, 235, 1197-1198.

Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (Eds.). (1986). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.