
INFORMATION SCIENCES 53, 135-158 (1991)

Fault-Tolerant Database Using Distributed Associative Memories*

VLADIMIR CHERKASSKY
Department of Electrical Engineering, University of Minnesota, Minneapolis, Minnesota 55455

MALATHI RAO
Network Systems Corporation, Brooklyn Park, Minnesota 55428

and

HARRY WECHSLER
Department of Computer Science, George Mason University, Fairfax, Virginia 22030

Communicated by Ahmed K. Elmagarmid

*This work was supported in part by the Graduate School at the University of Minnesota and by NSF grant EET-8713563.

ABSTRACT

This paper describes an approach to fault-tolerant database retrieval using the distributed-associative-memory (DAM) paradigm. The fault tolerance is with respect to noise (errors) in the input key and/or corruption in the memory itself. Experimental results are presented to demonstrate the feasibility of our approach. These results show how fault-tolerant retrieval is affected by various factors, such as the type of stored data, the database size, and the type of errors. We also show how the DAM model can be applied to implement a fault-tolerant relational database, in which selection operations may be easily performed with a noisy key/subkey, and complex queries can be answered correctly even if there are spelling inconsistencies in the internal representation.

1. INTRODUCTION

Humans have the capacity to recognize mutilated words correctly. In fact, the evidence suggests that humans do not recognize individual letters in order to construct a word, but rather recognize the entire word, based on information of various kinds, and thence know which individual letters are correct [1].

It has been estimated that the English language is over 50% redundant; as a consequence, the characters surrounding those in error might be sufficient to detect errors and suggest corrections. The task we are concerned with is that of robust data retrieval in situations where the input key and the storage (memory) are error prone. This has numerous applications, e.g., in optical character recognition (OCR), spelling-checker programs, and database retrieval. Much research has been done, and we classify the different approaches that have been used into four categories:

(1) Dictionary lookup methods. Given an input key that may be noisy, the dictionary is searched for that key. If the word appears in the dictionary, it still does not necessarily mean that there is no error. The error may be inherently undetectable, i.e., one word in the dictionary may have been corrupted to read like another valid word. Such errors cannot be discovered unless the broader context is taken into consideration, and hence we will not be concerned with errors of this type. If the word is not in the dictionary, then the input key is assumed to be noisy. There are three main classes of errors:

(a) Deletion, i.e., a letter (or letters) from the word is missing.
(b) Substitution, i.e., a letter (or letters) in the word is replaced by some other letter(s).
(c) Insertion, i.e., an additional letter (or letters) has been introduced in the word.

Substitution errors are handled by comparing the input key with all words in the dictionary that have the same length as the input key and then returning the nearest word. The processing time for insertion and deletion errors is much greater, since the input key must be compared against all words in the dictionary. Moreover, the comparison is not straightforward, because the jth letter in the input key may not correspond to the jth letter in the actual word. This is why many implementations of dictionary lookup handle only substitution errors. The straightforward application of such methods gives very good error-correction rates, but the computation time becomes prohibitive for dictionaries large enough for any practical application, particularly when all three types of errors must be handled. Therefore, in practical applications (e.g., in spelling-checker programs [2]) the performance of dictionary lookup methods is usually improved by using other techniques, e.g.:

(a) Two-level search strategy [2]. The most common words are searched first.
(b) Data-compression techniques. For example, a large dictionary contains many words with the same root. Hence, all identical roots can be replaced with one-character tokens, so that much of the dictionary's vocabulary comes from prefix-root and suffix-root logic [3].

(2) Feature-extraction methods. These methods use two-letter sequences (bigrams) or three-letter sequences (trigrams), which are extracted from the word to serve as "features" of the word. An approach used in the TYPO spelling-checker program on UNIX [4] is based on the fact that the frequency of these bigrams and trigrams in English text varies greatly, many being extremely rare. If the word in question (token) contains several very rare bigrams or trigrams, it is potentially misspelt. Kohonen [5] discusses a dictionary search method in which n-grams are extracted from the word to serve as features of the word. A hashing function is defined that maps an n-gram to a particular memory slot, and pointers are constructed from each n-gram's slot to all the words in the dictionary for which that n-gram is a feature. In order to find the word in the dictionary closest to a search key, the n-grams of the search key are extracted and their pointers are followed to obtain a set of candidate words. From this set of candidate words the one with the maximum number of pointers is selected as the actual word. When two or more candidate words have the maximum number of pointers pointing to them, the words are compared individually with the search key for translation (shift) errors, in order to find the closest word.

(3) Markov-chain and probability-distribution approximation methods [6, 7]. These methods are based on the Viterbi algorithm [7] and use dynamic programming. It is assumed that the language is an mth-order Markov source, and hence no dictionary is needed. These methods use probability distributions and return the word that yields the highest probability for a given input word. With the Markov-chain method the system may return a word that was never stored in the first place. Markov-source-based models do not perform as well as the dictionary lookup methods, but have execution times that do not vary with the number of words stored or the types of errors being considered.

(4) Hybrid methods [8]. The hybrid methods first carry out a Markov-source-based search and follow it with a dictionary search to ensure that the word is actually a part of the database.

We focus in this paper on a different approach to fault-tolerant database retrieval: we view the task of information retrieval based on partial contents as a generic recognition problem and propose to use the distributed-associative-memory (DAM) model to solve it. Our motivation is based on the following recent developments:

(1) The growing problem of information explosion (large intractable databases) suggests that data should be filed by associations rather than by indexing.

(2) The recent research in neural networks [9-11] suggests the possibility of massively parallel (collective) computation radically different from the traditional von Neumann approach.

Neural networks, also known as parallel-distributed-processing (PDP) models, achieve massive parallelism via dense interconnection of simple computational elements ("neurons"). Most neural networks can also be viewed and/or used as associative memories. For example, when only a part of an input pattern is available as an input (stimulus), a neural network outputs the complete response pattern (recall).

It is worthwhile to briefly survey the origins of neural networks. The idea of a distributed network for computational tasks like recognition was suggested by Rosenblatt [12] in the form of the two-layer perceptron. Minsky and Papert [13] showed that such networks could fail on even simple tasks like implementing the XOR function. Limitations of the early perceptrons can be overcome by using multiple-layer networks of nonlinear units. It has been shown by Kolmogorov that any continuous function of n variables can be computed using a three-layer perceptron with nonlinear units [10]. Unfortunately, the theorem is not of the constructive type. More recently, however, PDP connectionist models have been suggested in the form of multilayer networks with hidden units. The learning algorithm is based on the back-propagation learning introduced by Rumelhart et al. [11]. Even though such an algorithm does not have a proof of convergence, it has proved very successful in NETtalk, a system developed by Sejnowski [14] for word pronunciation. The connectionist models could be implemented in parallel (SIMD) on a Connection Machine [15] running in the "marker-passing" mode [16]. Using such an approach, Stanfill and Waltz [17] suggested the idea of memory-based reasoning (similarity-based induction) and were able to duplicate, to a close approximation, the results obtained by Sejnowski.

Distributed associative memories (DAMs) are another class of neural networks. The particular kind of DAM that we deal with in this paper was introduced by Kohonen [9]. It is a memory matrix which, like an adaptive filter, can modify the flow of information. Stimulus vectors are associated with response vectors, and the result of this association is spread over the entire memory space. Distributing in this manner means that information about a small portion of the association can be found in a large area of the memory. New associations are placed over the older ones, and they are allowed to interact. Because information is distributed in memory, the overall function of the memory becomes resistant to both faults in memory and degraded (noisy) stimulus vectors. When the stimulus and response vectors are identical, the memory is of the autoassociative type. When they are different, the memory is of the heteroassociative type.

The above discussion illuminates several properties of distributed associative memories which differ from more traditional notions of memory. Because the associations are allowed to interact with each other, an implicit representation of structural relationships and contextual information can develop, and as a consequence a very rich level of interactions can be captured.

Since there are few restrictions on what vectors can be associated, there can exist extensive indexing and cross-referencing in the memory. Distributed associative memory captures a distributed representation which is context dependent. A recent comparison of several distributed associative memories for storing binary vectors [18] shows that the DAM as defined by Kohonen provides superior performance.

We have used the DAM model for the task of fault-tolerant database retrieval. The rest of the article is organized in four sections. In Section 2 we discuss the theory behind DAMs. In Section 3 we introduce the DAM model that we used for data-retrieval purposes. Section 4 contains the results of our experiments, which show the appropriateness of such a model for fault-tolerant database implementation. Conclusions and future research directions are discussed in Section 5.

2. DAM BACKGROUND

We first describe the DAM model in terms of representation and algorithms [9] and then briefly discuss the fault-tolerant aspects of DAMs. There are two main phases in the operation of DAMs:

(1) Memory construction phase, when the memory matrix is created from a given set of associations, or stimulus-response vector pairs.

(2) Recall phase, when a noisy (or perfect) version of a stimulus vector is used to retrieve its associated response vector (or a close approximation of it).

The construction stage assumes that there are n pairs of vectors that are to be associated by the distributed associative memory. This can be written as

    M s_i = r_i,    i = 1, ..., n,    (1)

where s_i denotes the ith stimulus vector (of dimension m) and r_i denotes the ith corresponding response (stored) vector (of dimension l). In this paper, bold letters are used to denote (column) vectors. In order to find the memory matrix M, we need to solve the following equation:

    M S = R,    (2)

where S = [s_1 s_2 ... s_n] and R = [r_1 r_2 ... r_n] are matrices composed of stimulus and response column vectors, respectively. A unique solution of this equation does not necessarily exist for an arbitrary group of associations. When the number of associations, n, is smaller than m, the length of the vectors to be associated, the system of equations is underconstrained; when n > m, the system is overconstrained. The constraint used to solve for a unique matrix M is that of minimizing the squared error ||MS - R||^2, which yields the following solution:

    M = R S+,    (3)

where S+ is the Moore-Penrose generalized inverse of S [9]. The computationally intensive module needed to compute the generalized inverse of a matrix implements the least-squares solution based on the Gram-Schmidt orthogonalization process [19]. Our simulation experiments, which implement such an approach, are described in Section 4. The recall operation projects an unknown stimulus vector ŝ onto the space defined by S, followed by a mapping onto the space R, as shown below:

    r̂ = M ŝ = (R S+ S S+) ŝ = (R S+)(S S+ ŝ).    (4)

The resulting projection yields the response vector r̂. If the memorized stimulus vectors are independent and the unknown stimulus vector ŝ is one of the memorized vectors s_k, then the recalled vector will be the associated response vector r_k. If the memorized stimulus vectors are dependent, then the vector recalled by one of the memorized stimulus vectors will contain not only the associated response vector but also some crosstalk from the other stored response vectors.

The recall can be viewed as a weighted sum of the response vectors. The recall begins by assigning weights according to how well the unknown stimulus vector matches the memorized stimulus vectors, using a linear least-squares classifier. The response vectors are multiplied by the corresponding weights and summed together to build the recalled response vector. The recalled response vector is usually dominated by the memorized response (stored) vector corresponding to the memorized stimulus vector which is closest to the unknown stimulus vector. The distributed associative memory will have interactions between the different associations, and this allows some generalization of responses to previously unknown stimuli.

The recall operation is implemented through the use of M as a projection operator. The retrieved data r̂ obeys the relationship

    r̂ = r̃ + r*,    (5)

where r̃ plays the role of an optimal associative recollection (a linear regression in terms of M). The residual, r*, is orthogonal to the space spanned by M and represents the novelty in r̂ with respect to the data stored in M.

It is the decomposition of r̂ in terms of (r̃, r*) that can be used to facilitate the learning of new things.

The retrieval operation for the autoassociative case, i.e., the attempt to recall the memory contents for an input key t, is given as

    t̂ = M t,    (6)

where t̂ is the recalled approximation of the association. The recall operation for the heteroassociative case can retrieve either the forcing (stimulus) function s or the coupling (response) r. Specifically, if the corresponding matrices are S and R, respectively, then

    if t ∈ {r}, then M = S R+ and ŝ = M t;    (7)

    if t ∈ {s}, then M = R S+ and r̂ = M t.    (8)
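To make the construction and recall steps concrete, the following short sketch (our illustration only, not the authors' code; it uses NumPy's generic pinv routine rather than the Gram-Schmidt procedure of [19], and all variable names are ours) builds M = RS+ as in (3) and performs the heteroassociative recall of (8):

    import numpy as np

    def build_memory(S, R):
        # M = R S+  (equation (3)); S is m x n, R is l x n
        return R @ np.linalg.pinv(S)

    def recall(M, t):
        # r-hat = M t  (equation (8)), for a possibly noisy input key t
        return M @ t

    # toy example with random (nearly independent) stimulus vectors
    rng = np.random.default_rng(0)
    m, l, n = 50, 10, 5                          # stimulus length, response length, number of pairs
    S = rng.standard_normal((m, n))              # columns are the stimulus vectors s_i
    R = np.eye(l)[:, :n]                         # columns are the response vectors r_i
    M = build_memory(S, R)
    t = S[:, 2] + 0.1 * rng.standard_normal(m)   # noisy version of the third stimulus
    print(int(np.argmax(recall(M, t))))          # expected to print 2

Because the recalled vector is dominated by the response associated with the stored stimulus closest to the input, the argmax recovers the index of the corrupted association.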

The error-correcting capabilities can be explained as follows [9]. Assume that there are n associations between stimulus vectors of length m and response vectors of length l. This means that the memory matrix has lm elements. Also assume that the noise added to each element of a memorized stimulus vector is independent and zero mean, with a variance of σ_i^2. The recall from the memory is then

    r = r_k + v_o = M(s_k + v_i) = r_k + M v_i,    (9)

where v_i is the input noise vector and v_o is the output noise vector. The ratio of the average output noise variance to the average input noise variance is

    σ_o^2 / σ_i^2 = (1/l) Tr[M M^T].    (10)

For the autoassociative case (l = m) this simplifies to [9]

    σ_o^2 / σ_i^2 = n / m.    (11)

This says that when a noisy version of a memorized input vector is applied to the memory, the output noise variance is reduced relative to the input noise variance by the factor n/m, i.e., recall improves as the number of memorized vectors becomes small compared with the number of elements in the vectors. For the heteroassociative memory matrix a similar formula holds [20]:

    σ_o^2 / σ_i^2 = (1/l) Tr[R R^T] Tr[(S^T S)^{-1}].    (12)
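Relation (11) is easy to check numerically. The sketch below (again only our illustration, with independent Gaussian stimulus vectors standing in for stored data) estimates the output-to-input noise-variance ratio of an autoassociative memory and compares it with n/m:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 200, 20                            # vector length and number of stored vectors
    S = rng.standard_normal((m, n))
    M = S @ np.linalg.pinv(S)                 # autoassociative memory: projector onto span(S)

    sigma_i = 0.3
    noise = sigma_i * rng.standard_normal((m, 10000))
    print(np.var(M @ noise) / np.var(noise))  # approximately n/m = 0.1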

Another way of viewing this error-correcting process is to notice that the memory matrix is the orthogonal projection matrix for the set of stimulus vectors. The noise vector in this m-dimensional space will be projected onto the space spanned by the n memorized vectors. The parts of the noise vector that are orthogonal to the memorized stimulus vectors will be suppressed, and this accounts for the noise reduction in the output recall vector. Fault tolerance is a by-product of the distributed nature and error-correcting capabilities of the distributed associative memory. By distributing the information, no single memory cell carries a significant portion of the information critical to the overall performance of the memory. We may further distinguish three important DAM paradigms:

(1) Autoassociator. Stimulus vectors are associated with themselves, and the system is supposed to "store" these vectors. Later, during the recall phase, when a degraded version of a stored vector is presented, the system should produce an improved version of it.

(2) Heteroassociator. The system stores (stimulus, response) vector pairs. When a stimulus (possibly noisy) is presented to the system, it is supposed to produce (recall) the correct response vector.

(3) Classifier. This is an important special case of the heteroassociator. In this paradigm, response vectors represent a fixed set of classes (categories) into which the stimulus vectors are to be classified. During memory construction the system is presented with the correct classification of each stimulus vector, i.e., it learns to associate given (stimulus, class) pairs. During the recall phase, when a stimulus (possibly distorted) is presented, the system should classify it properly.

3. DATABASE RETRIEVAL

We have developed and implemented a DAM-based model for fault-tolerant information retrieval of alphanumeric data. Our model allows one to retrieve information in the presence of noise (errors) in the input key (index) and/or the memory itself. We created a database of names from a student directory. The purpose is to identify prestored name(s) with noisy (misspelt) inputs using associative recall.


The standard DAM model operates on real-valued vectors. Hence, in order to apply this model successfully to database retrieval one has to address the following problems:

(1) Input (data) representation. It is necessary to find a "good" hashing function for mapping alphanumeric records onto real-valued vectors. A hashing function used to encode names should satisfy the following requirements:

(a) The DAM model expects all stored vectors to be of the same size. Therefore, a name to be stored in the database is transformed into a real-valued vector of fixed length using some hashing function. This fixed-length vector then serves as a stimulus to the model.

(b) The hashing function must have the property that a typical error (e.g., a wrong, omitted, or inserted letter) in the input key at the time of retrieval will produce a hash vector that is close to the hash vector of the error-free name. This implies, for example, that the standard ASCII encoding of letters would not work, because an insertion or deletion error (missing letter) in the input key would result in a stimulus vector totally different from the stimulus of the error-free name.

Our DAM model for database retrieval employs three kinds of hash functions, known as n-gram extraction methods (1 <= n <= 3). The hash function creates a vector that is 26^n elements long. Each element represents an n-gram corresponding to a set of n adjacent letters. An element has the value one if the n-gram to which it corresponds appears at least once in the name. In our model the last and the first letters of a name are considered adjacent, so that the last n - i letters and the first i letters of a name also form an n-gram. For every n-gram that is absent from the name the corresponding element in the vector is set to zero. We shall also refer to these methods as alphabet extraction, bigram extraction, or trigram extraction, depending on the value of the parameter n. Obviously, these methods can be extended to encode alphanumeric records (however, the hashing-vector length will be larger for the larger alphabet size). Clearly, the hash vectors thus obtained (and the corresponding memory matrix) are typically very sparse. This fact can potentially be used to improve the DAM performance by using data-compression techniques or special data structures for sparse matrices. Using the frequency of occurrence of n-grams in the English language should also improve the performance of this model. We, however, obtained very good results even without the use of frequencies.
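For concreteness, a minimal sketch of the bigram (n = 2) variant of this encoding, with the wrap-around adjacency described above; the function name and details are our own illustration, not code taken from the paper:

    import numpy as np

    def bigram_vector(name):
        # 26*26-element binary vector; the entry for bigram "ab" is 1 if "ab"
        # occurs in the name, with the last and first letters treated as adjacent
        v = np.zeros(26 * 26)
        letters = [c for c in name.lower() if c.isalpha()]
        for i in range(len(letters)):
            a = ord(letters[i]) - ord('a')
            b = ord(letters[(i + 1) % len(letters)]) - ord('a')
            v[26 * a + b] = 1.0
        return v

    print(int(bigram_vector("cesnik").sum()))   # 6 bigrams: ce, es, sn, ni, ik, kc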

(2) Appropriate DAM paradigm. The choice among the three possible paradigms was influenced by the fact that we could not find a natural representation of alphanumeric data by real-valued vectors that satisfies the closeness property of hashing functions described above and at the same time allows a (unique) inverse mapping from the hash vector back to the original alphanumeric record. For example, whereas it is easy to map a record onto a real vector using an n-gram hashing method, it is not possible to obtain the original record from its real-vector representation. Therefore, an autoassociative DAM cannot be used. We have chosen to use the classifier paradigm for database retrieval. Each class consists of a correct name and all of its variations (misspellings). During memory construction, the system is presented only with correct names, so that each name corresponds to a different class. During recall, the system should be able to properly classify misspelt input names.

Operation of the DAM model for database retrieval is described next. During the memory construction phase, every name is encoded by a hash function into a stimulus vector of fixed length m. Let N be the name matrix with p rows and n columns, where n is the number of names in the database and p is the maximum size (number of letters) of any name. Let S be the stimulus matrix with m rows and n columns, where each column is the hash vector of a (prestored) name. In order to achieve correct classification from noisy inputs, the output (response) vectors of the DAM model serve as identification tags for the classes, or prestored names. We use orthonormal column vectors as identification tags, e.g., r_1 = [1, 0, ..., 0]^T, ..., r_n = [0, 0, ..., 1]^T. Thus our model appears as a classifier, or heteroassociative DAM, with the response matrix R = [r_1, r_2, ..., r_n] being simply an identity matrix. The memory matrix for this DAM is M = R S+ = S+.

During the recall phase the input key is converted to its hash vector t. We then multiply S+ by t to obtain the identification tag r. This vector r is treated like a histogram that indicates how close the vector t is to each of the stored hash vectors. If r[i] is the maximum element in the vector r, then the ith hash vector is closest to t, and the system returns the ith name in the name matrix N as the output of the recall phase. In situations where more than one element in r is very close in value (e.g., 1%) to the maximum element, the system makes a note of all those elements and returns the names corresponding to all those elements. Note that in our classifier model the recalled name is one of the prestored names closest to the noisy input, whereas in image-recognition applications using autoassociative DAM [9, 211 the recalled image is usually a weighted linear combination of prestored images. The latter works well for image recognition, since grey-scale images can be naturally represented as real-valued vectors.
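Continuing the encoding sketch above (and still as our own illustration, reusing the bigram_vector helper), the whole classifier amounts to one pseudoinverse at construction time and one matrix-vector product per query, followed by the near-maximum rule just described:

    import numpy as np

    names = ["cers", "cervantes", "cervenka", "cesnik",
             "chabot", "chace", "chadbourn", "chae"]
    S = np.column_stack([bigram_vector(x) for x in names])   # 676 x n stimulus matrix
    S_pinv = np.linalg.pinv(S)                               # M = R S+ = S+, since R = I

    def retrieve(key, tolerance=0.01):
        r = S_pinv @ bigram_vector(key)            # histogram over the stored names
        keep = r >= r.max() * (1.0 - tolerance)    # maximum and near-maximum elements
        return [names[i] for i in np.flatnonzero(keep)]

    print(retrieve("cesnk"))    # expected to recover ['cesnik'] despite the deleted letter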

4. EXPERIMENTAL RESULTS

The DAM model for database retrieval was implemented in a UNIX environment on a Sun 3/75 workstation. The results are presented in two subsections: (1) fault-tolerant retrieval, including the effects of DAM saturation and memory corruption; (2) relational-database implementation.

4.1. FAULT-TOLERANT RETRIEVAL

We performed several experiments to evaluate the fault tolerance of database retrieval. A small database was created with names taken from a student directory. Each name consisted of up to 11 letters, and we considered the following two types of data:

(1) Random (noncorrelated) data, i.e., names extracted randomly from the student directory [see Figure 1(a)].

(2) Highly correlated data, i.e., names taken from the same page of the student directory [see Figure 1(b)].

(a) random data:     rothgeb, randolph, vohnoutka, perlman, weingart, fathima, armineh, xerxes
(b) correlated data: cers, Cervantes, cervenka, cesnik, chabot, chace, chadbourn, chae

Fig. 1. Sample contents of a database of size 8: (a) random data; (b) correlated data.

Hashing method: bigram.  Size of database: 8.  Type of data: correlated.
List of names: cers, Cervantes, cervenka, cesnik, chabot, chace, chadbourn, chae.
1) Input key: cesnik   Contents of vector r: [0, 0, 0, 1, 0, 0, 0, 0]
2) Input key: cesnk    Contents of vector r: [.15, .1, .01, .98, 0, .02, 0, .1]

Fig. 2. Contents of the recalled histogram vector r.

Fault tolerance is analyzed with respect to the following common types of input errors:

(1) Single-letter deletion errors. In our experiments we delete every letter in each name (one at a time) and then try to retrieve the correct name.

(2) Single-letter substitution errors. For every letter we tried two or three typical misspellings.

An example of the DAM recall operation (see Figure 2) shows the elements of the recalled histogram vector obtained when recalling a name from a database of size 8 (correlated data, bigram method). Although we did not collect statistics on the computational performance (speed) of our software model running on the Sun 3/75 workstation, we indicate here sample response times for a database of 20 names using the bigram extraction method: (1) memory construction, 0.25 sec; (2) data retrieval, 0.1 sec.

In all experiments, we used the same nonlinear decision rule during recall to obtain a correct classification from the DAM output vector:

(1) Find the maximum element in the output histogram vector and all "near-maximum" elements close to it (within 1%).

(2) Set all these elements to 1, and set all other elements to zero. Note that each element equal to 1 corresponds to a prestored (correct) name.

(3) The system then returns the correct name(s) corresponding to one or more output classes.

Notice that multiple retrievals are possible. Even though the multiple-retrieval capability does not seem particularly important for spelling-checker applications, it becomes extremely useful for partial-key (subkey) retrieval in database applications (see Section 4.2).

As a measure of fault-tolerant retrieval we consider the percentage of correct retrievals with respect to a particular type of error in the input key. In our experiments we consider the following factors which affect fault-tolerant retrieval: (1) the type of data stored, i.e., random versus correlated; (2) the database size, i.e., the number of names stored; (3) the type of hashing function used (alphabet, bigram, or trigram extraction); and (4) the type of input errors.
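The single-error test keys used in this protocol are easy to generate. The sketch below is our own (it reuses the names list and the retrieve routine sketched in Section 3, and random replacement letters stand in for the "typical misspellings" mentioned above); it also scores the percentage of correct retrievals:

    import random

    def deletion_variants(name):
        # every single-letter deletion, one at a time
        return [name[:i] + name[i + 1:] for i in range(len(name))]

    def substitution_variants(name, per_letter=2, seed=0):
        # a few single-letter substitutions per position
        rng = random.Random(seed)
        letters = "abcdefghijklmnopqrstuvwxyz"
        return [name[:i] + rng.choice(letters) + name[i + 1:]
                for i in range(len(name)) for _ in range(per_letter)]

    def percent_correct(names, variants_of):
        trials = [(n, v) for n in names for v in variants_of(n)]
        hits = sum(1 for n, v in trials if retrieve(v) == [n])
        return 100.0 * hits / len(trials)

    print(percent_correct(names, deletion_variants))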

Results are presented in the form of tables showing the percentage of correct retrievals as a function of database size. Figures 3 and 4 show the performance of fault-tolerant retrieval for random and correlated data, respectively. As expected, the performance degrades with increasing database size for all three hashing methods on both random and correlated data.

Words in memory    % correct (substitution errors)    % correct (deletion errors)
      8                        100                              100
     12                         92                               97
     16                         83                               97
     20                         68                               83
     24                         51                               71
     28                         51                               82

Fig. 3a. Fault-tolerant retrieval for random data: performance of the alphabet extraction method.

Words in memory    % correct (substitution errors)    % correct (deletion errors)
      8                        100                              100
     16                        100                              100
     20                        100                              100
     24                        100                              100
     28                         92                               98
     70                         90                               96
    100                         88                               93
    200                         69                               70
    300                         15                               15

Fig. 3b. Fault-tolerant retrieval for random data: performance of the bigram method.

The results obtained with the alphabet-extraction method on random data (Figure 3a) also exhibit the same trend, except for a quirk at a database size of 28 that can be explained by the instability of the generalized-inverse heteroassociative memory [20]. As expected, the trigram method shows better fault tolerance than the bigram method, which in turn shows better results than the simpler method based on alphabet extraction. Although we did not gather exhaustive statistics about other kinds of errors (e.g., insertion errors, or multiple-letter deletion and/or substitution), we found that the system handles them very well. For example, Figure 5 shows the statistics for double errors (e.g., one misspelling and one insertion) obtained with the bigram method for correlated data. It is interesting to compare these results with the similar statistics for single errors [Figure 4(b)]: the performance of fault-tolerant retrieval for double errors is only about 10% worse than the performance for single errors. Also note that the computational complexity of the DAM-based data retrieval does not depend on the number of errors. This compares very favorably with many other methods for robust data retrieval (e.g., dictionary lookup methods), which exhibit rapidly degrading performance for multiple errors.

Words in memory    % correct (substitution errors)    % correct (deletion errors)
      8                        100                              100
     16                        100                              100
     20                        100                              100
     24                        100                              100
     28                        100                              100
     70                        100                              100
    100                        100                              100
    200                        100                              100
    300                        100                              100

Fig. 3c. Fault-tolerant retrieval for random data: performance of the trigram method.

Our experimental results can be explained qualitatively using the general DAM analysis discussed in Section 2. Namely, according to [9, 20], when a noisy version of a memorized input vector is applied to the memory, the recall is improved by a factor corresponding to the ratio of the vector dimensionality to the number of memorized vectors.

We performed several experiments to analyze how robust retrieval is affected by the size of memory (the DAM saturation effect). Figure 6 shows the maximum database size for which the percentage of correct retrievals (assuming deletion errors) exceeds a certain value. The table shows that when the database size exceeds 20 the alphabet-extraction method's retrieval rate falls below 66%, while that of the bigram method deteriorates below 66% only when the number of stored names exceeds 200. The trigram method yields good results even when more than 300 names are stored. This illustrates the saturation effect typical of DAM-based models. We did not continue the experiments with larger databases to determine when the trigram method saturates, because the computation time for the trigram method is quite large. Saturation is a consequence of the crosstalk effect caused by the memory being nonorthogonal due to similarity among the stored stimuli.

Words in memory    % correct (substitution errors)    % correct (deletion errors)
      8                         91                               98
     12                         84                               89
     16                         65                               70
     20                         60                               66
     24                         56                               61
     28                         50                               58

Fig. 4a. Fault-tolerant retrieval for correlated data: performance of the alphabet extraction method.

Words in memory    % correct (substitution errors)    % correct (deletion errors)
      8                        100                              100
     16                        100                              100
     20                        100                              100
     24                         96                               98
     28                         94                               98
     70                         89                               97
    100                         85                               92
    200                         60                               67
    300                          8                               12

Fig. 4b. Fault-tolerant retrieval for correlated data: performance of the bigram method.

Words in memory    % correct (substitution errors)    % correct (deletion errors)
      8                        100                              100
     16                        100                              100
     20                        100                              100
     24                        100                              100
     28                        100                              100
     70                        100                              100
    100                        100                              100
    200                        100                              100
    300                        100                              100

Fig. 4c. Fault-tolerant retrieval for correlated data: performance of the trigram method.

This explains why the alphabet-extraction method with a vector size of 26 saturates faster than the bigram method with a vector size of 26^2, which in turn saturates faster than the trigram method with a vector size of 26^3.

The DAM database is a distributed memory. Therefore, when a portion of the memory is corrupted, the system does not lose its retrieval capabilities altogether, but degrades gracefully. Portions of the memory were randomly selected and set to 0.5. The results are shown in Figure 7. Initially we had set portions of the memory to zero, as in earlier experiments with image recognition [9], but this did not seem to affect the performance of the system at all, because in our case the original (error-free) memory matrix is sparse. This experiment was performed on a database of 28 correlated names using the bigram hashing method. These results show that the DAM database is very resilient to corruption in the memory itself.
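The corruption experiment is equally simple to mimic: a random fraction of the entries of the memory matrix is overwritten with 0.5 and the retrieval test is rerun. A minimal sketch of our own, operating on the S_pinv matrix of the Section 3 sketch, which plays the role of the memory here:

    import numpy as np

    def corrupt(M, fraction, seed=0):
        # overwrite roughly `fraction` of the memory entries with 0.5
        rng = np.random.default_rng(seed)
        M = M.copy()
        M[rng.random(M.shape) < fraction] = 0.5
        return M

    # e.g. replace S_pinv by corrupt(S_pinv, 0.20) and re-run percent_correct(...)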

4.2. RELATIONAL DATABASE

Among several competing database models (hierarchical, relational, network, etc.) the relational model seems to be the most appropriate, because its representation of data in terms of relations can be easily transformed to associations of the DAM type. A relational database contains one or more relations. Each relation has one or more attributes. The entries in a relation are known as tuples. No duplicate tuples are permitted in a relation. Every relation has one primary key that may be a single attribute or a composite of two or more attributes of that relation.

Words in memory    % correct retrievals
      8                   100
     16                    95
     20                    91
     24                    90
     28                    85
     70                    80
    100                    72
    200                    48
    300                     3

Fig. 5. Fault-tolerant retrieval in the case of double errors. Type of data: correlated; hashing method: bigram; type of errors: misspelling & insertion.

Fig. 6. Saturation effects observed on correlated data: the maximum database size for which the percentage of correct retrievals exceeds a given value, for the alphabet-, bigram-, and trigram-extraction methods.

Fig. 7. Graceful degradation in the case of memory corruption: % correct retrievals as a function of % memory corrupted (0% corruption gives 98% correct retrievals). Database size: 28; type of errors: deletion; type of data: correlated; hashing method: bigram extraction.

There are two types of operations supported by a relational database [22]:

(1) The first type uses one relation as the input to create a new relation, e.g.:

select selects tuples from a relation based on the values of some of its attributes.

project creates a new relation that has only selected attributes of the input relation.

(2) The second type uses two relations as input to create a new relation, e.g.:

union, intersect treat the tuples in a relation like elements of a set, and perform the corresponding set-theoretical operations.

join creates a cross product of two relations. A tuple from the first relation is merged with a tuple from the second relation to form a tuple of the new relation, provided they match on a preselected attribute. The new relation has the attributes of both input relations. This operation is useful in answering recursive queries. For example, given a relation with two attributes, "father" and "son," one can use the join operation to find out who the grandson of a person is by joining the relation with itself, so that the son attribute of the first copy matches the father attribute of the second copy.

We developed a simple relational database system based on the bigram-hashing DAM model. In our system, alphabetic data are stored in the form of relations (tables) which are implemented as DAM associations.

Our assumption about the alphabetic type of the data is not restrictive; the same model can be used for storing, for example, alphanumeric data (of course, this would result in an increased size of the hash vector and the memory matrix). Implementation of the relational database operations requires that the corresponding DAM model be able to perform selective associative recall (i.e., to select tuples based on the values of some attributes, or subkeys). In general it has become evident that the selectivity of associative mappings is better when the mappings are mutually more orthogonal, i.e., the quality of associative recall is better if arbitrary fragments of stimulus vectors are as orthogonal as possible with respect to the corresponding fragments in all other stimulus vectors [9].

Relation 1. Attributes: Profession, Place of work. Table size: 15.
Sample data (Profession, Place of work): (Nurse, Hospital), (Doctor, Hospital), (Student, School), (Housewife, Home), ...   (a)

Q: Select all tuples where Place of work = "Hospital"   (b)

Q: Select the tuple where Profession = "Nrse"
R: Nurse, Hospital   (c)

Fig. 8. Fault-tolerant relational database: (a) sample relation in the database; (b) select query with multiple retrieval; (c) select query with noisy subkey.

Select operations on a relation are based on the value of either the entire key or some subkeys. An encoding method must be chosen so that the fragment corresponding to a subkey in a tuple is as orthogonal as possible to the fragment corresponding to the same subkey in any other tuple of that relation. Hence the stimulus vector that we store for any key is obtained by concatenating the bigram hash vectors (of size 26 x 26 = 676) for each of the subkeys. Therefore the length of the stimulus vector is 676k, where k is the number of attributes in the primary key. During memory construction, these error-free vectors are associated with output classes, as described in Section 3. While retrieving data we place zeros in the portion of the stimulus vector corresponding to subkeys whose value we do not know or do not care about (this guarantees orthogonality between different subkeys), and the system returns the entire tuple.

Our system can create a number of relations, each of which has a primary key that may be simple or composite. We also implemented a select operation that selects tuples based on the value of the primary key or on subkeys of the primary key. Figure 8 shows a sample relation that we created and some of the select operations performed on it. Note that the system performs multiple retrievals in the presence of noisy subkey values. We implemented the join operation, so that the system can answer complex queries by forming only those tuples that are appropriate for the reply.
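A hedged sketch of this subkey mechanism (our own code and names; bigram_vector is the helper sketched in Section 3): each attribute contributes one 676-element fragment, a query puts zeros in the fragments of unspecified attributes, and the same near-maximum rule yields single or multiple tuples:

    import numpy as np

    def tuple_vector(values):
        # concatenate one 676-element bigram fragment per attribute
        return np.concatenate([bigram_vector(v) for v in values])

    def query_vector(values):
        # unknown / don't-care subkeys become all-zero fragments
        return np.concatenate([bigram_vector(v) if v is not None else np.zeros(676)
                               for v in values])

    tuples = [("nurse", "hospital"), ("doctor", "hospital"),
              ("student", "school"), ("housewife", "home")]
    S_rel = np.column_stack([tuple_vector(t) for t in tuples])   # 1352 x 4
    S_rel_pinv = np.linalg.pinv(S_rel)

    def select(values, tolerance=0.01):
        r = S_rel_pinv @ query_vector(values)
        keep = r >= r.max() * (1.0 - tolerance)
        return [tuples[i] for i in np.flatnonzero(keep)]

    print(select((None, "hospital"), tolerance=0.3))  # both Hospital tuples; a loose tolerance
                                                      # is used since only 4 tuples are stored
    print(select(("nrse", None)))                     # the (nurse, hospital) tuple, despite the typo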

Relation 1. Attributes: Profession, Place of work. Primary key: Profession & Place of work.
Relation 2. Attributes: Name of person, Profession. Primary key: Name of person & Profession.

Sample query: Where does Zulema work?
Response: Hospital

Fig. 9. Join queries implemented on a relational database which contains spelling inconsistencies.

Figure 9 shows a pair of relations that we created and some of the queries that were answered correctly in the presence of spelling inconsistencies in the two tables. Similarly, our system can correctly handle recursive queries in the presence of noise (errors) in the input key and/or the internal representation (tables). The main advantage of a relational database based on the DAM model is its fault tolerance. Selection operations performed with a noisy key/subkey take the same time as retrieval with a correct key. Recursive and join queries can be answered even if there are spelling inconsistencies in the stored data.

5. DISCUSSION AND FUTURE EXTENSIONS

We have successfully used the DAM model to create a fault-tolerant database that handles noisy inputs in the same way as error-free inputs. The robustness of such a model is significantly affected by the hashing method used to encode the input key (data) and by the database size. We have implemented three different encoding methods and found that the bigram and trigram methods perform quite well. The bigram method allows us to store about 200 words (of correlated data) in the memory before robust retrieval is significantly degraded by memory saturation. The trigram method has a much higher error-correction rate than the bigram method, but it is computationally more intensive, as explained below.

Database creation is a time-consuming operation, because the generalized inverse of the stimulus matrix of dimension m (dimension of the hash vector) by n (number of names) must be evaluated. Fortunately this is a one-time operation, and it does not affect the retrieval time. The retrieval (recall) corresponds to a matrix-vector multiplication and takes O(m^2) time. This explains why the trigram method with m = 17,576 is much slower than the bigram method with m = 676. There is an obvious tradeoff between the quality of recall and the retrieval time.

We have also shown that our model is extremely resilient to corruption in the memory itself. For a database of 28 words, if 20% of the memory is corrupted, the probability of correct retrieval is 0.85; and if 60% of the memory is corrupted, correct retrieval is still possible with probability 0.825. Further studies are needed to investigate the effects of memory corruption for larger database sizes (i.e., when memory saturation becomes significant). We are presently concerned with the possibility of regenerating a faulty memory to its original condition, and with developing alternative hashing methods (for decreasing the saturation effect).

We also demonstrated that the DAM model can be used to implement a fault-tolerant relational database. The main advantage of our implementation seems to be its tolerance to noise in the attribute values during database operations.

A lot of work still remains to be done on the relational-database implementation. At this stage we have implemented only some of the basic functions. We also must modify the system to decide when it should return a null relation instead of simply returning the closest solution all the time.

We are also concerned with the possibility of large-scale implementations. It is appropriate to mention here that neural-network models do not scale well with size (this is equally true for hardware and software implementations). Thus for a large-scale database we propose to investigate an internal decomposition into several DAMs, i.e., using several smaller-size DAMs to store the data. Under such an approach the same stimulus vector applied to several DAMs generates several response vectors, one of which (the one closest to the stimulus) is chosen as the true recall. Besides reducing the memory-saturation effect, this approach has the additional advantage of incremental memory construction, so that when adding or modifying information in the database only small portions of the DAM need to be changed. We are also looking at several other alternatives for efficient large-scale software implementations, e.g.:

(1) Using sparse-matrix techniques to reduce the computational costs of memory construction and retrieval.

(2) Using parallel computers for large matrix operations. In particular, large-scale matrix operations can be efficiently implemented on hypercube multiprocessors if large matrices are partitioned into block submatrices that are stored and manipulated in the nodes of the hypercube [23].
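The internal decomposition into several smaller DAMs mentioned above can be read, for instance, as in the sketch below (our own illustration, reusing the bigram_vector helper from Section 3 and using the height of the recalled histogram peak as the closeness measure, which is only one of several reasonable choices):

    import numpy as np

    def build_partitions(names, parts):
        # split the stored names across several smaller DAMs
        chunks = [names[i::parts] for i in range(parts)]
        return [(np.linalg.pinv(np.column_stack([bigram_vector(x) for x in c])), c)
                for c in chunks]

    def recall_partitioned(memories, key):
        # apply the same stimulus to every DAM and keep the strongest response
        best_score, best_name = -np.inf, None
        for S_pinv_k, chunk in memories:
            r = S_pinv_k @ bigram_vector(key)
            i = int(np.argmax(r))
            if r[i] > best_score:
                best_score, best_name = r[i], chunk[i]
        return best_name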

REFERENCES

1. G. T. Toussaint, The use of context in pattern recognition, Pattern Recognition 10:189-204 (1978).
2. J. L. Peterson, Computer programs for detecting and correcting spelling errors, Comm. ACM 23(12):676-687 (1980).
3. P. Lemmons, Five spelling-correction programs for CP/M-based systems, Byte 11:434-448 (1981).
4. R. Morris and L. L. Cherry, Computer detection of typographical errors, IEEE Trans. Professional Comm. PC-18:54-64 (1975).
5. T. Kohonen, Content-Addressable Memories, Springer-Verlag, 1980.
6. R. W. Ehrich, Can a priori probabilities help in character recognition?, J. Assoc. Comput. Mach. 11:465-470 (1964).
7. G. D. Forney, The Viterbi algorithm, Proc. IEEE 61:268-278 (1973).
8. J. J. Hull, S. N. Srihari, and R. Choudhari, An integrated algorithm for text recognition: Comparison with a cascaded algorithm, IEEE Trans. PAMI 5(4):384-395 (1983).
9. T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, 1984.
10. R. P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine, 1987, pp. 4-22.
11. J. L. McClelland, D. E. Rumelhart, and the PDP Research Group (Eds.), Parallel Distributed Processing, Vols. 1, 2, MIT Press, 1986.
12. F. Rosenblatt, Principles of Neurodynamics, Spartan, New York, 1962.
13. M. Minsky and S. Papert, Perceptrons, MIT Press, 1969.
14. T. J. Sejnowski and C. R. Rosenberg, NETtalk: A Parallel Network That Learns to Read Aloud, TR JHU/EECS-86/01, Johns Hopkins Univ., 1986.
15. W. D. Hillis, The Connection Machine, MIT Press, 1985.
16. S. E. Fahlman and G. E. Hinton, Connectionist architectures for artificial intelligence, Computer, 1987, pp. 100-109.
17. C. Stanfill and D. Waltz, Toward memory-based reasoning, Comm. ACM 29(12):1213-1239 (1986).
18. G. S. Stiles and D. L. Deng, A quantitative comparison of the performance of three discrete distributed associative memory models, IEEE Trans. Comput. C-36, No. 3 (1987).
19. B. Rust, W. R. Burrus, and C. Schneeberger, A simple algorithm for computing the generalized inverse of a matrix, Comm. ACM:381-387 (1966).
20. G. S. Stiles and D. L. Deng, On the effect of noise in the Moore-Penrose generalized inverse associative memory, IEEE Trans. PAMI 7(3):358-360 (1985).
21. J. M. Char, V. Cherkassky, H. Wechsler, and G. L. Zimmerman, Distributed and fault-tolerant computations for retrieval tasks, IEEE Trans. Comput. 37(4):484-490 (1988).
22. C. J. Date, An Introduction to Database Systems, Addison-Wesley, 1986.
23. V. Cherkassky and R. Smith, Efficient mapping and implementation of matrix algorithms on a hypercube, J. Supercomputing 2:7-27 (1988).

Received 7 September 1987; revised 7 January 1988