Int. J. Man-Machine Studies (1979) 11, 201-212
A bottom-up and top-down approach to using context in text recognition†

RAJJAN SHINGHAL
Department of Computer Science, Concordia University, Montreal, Québec, Canada H3G 1M8

AND GODFRIED T. TOUSSAINT
School of Computer Science, McGill University, Montreal, Québec, Canada H3A 2K6

(Received 15 September 1978)

Existing approaches to using contextual information in text recognition tend to fall into two categories: dictionary look-up methods and Markov methods. Markov methods use transition probabilities between letters and represent a bottom-up approach to using context which is characterized by being very efficient but exhibiting mediocre error-correcting capability. Dictionary look-up methods, on the other hand, constrain the choice of letter sequences to be legal words and represent a top-down approach characterized by impressive error-correcting capabilities at a stiff price in storage and computation. In this paper, a combined bottom-up top-down algorithm is proposed. Exhaustive experimentation shows that the algorithm achieves the error-correcting capability of the dictionary look-up methods at half the cost.
1. Introduction

Human information processing models differentiate between two types of processing. Systems that process patterns by analysing the data or input information with ever increasing levels of sophistication are referred to as data-driven or bottom-up systems. Those that start from overall expectations are called conceptually-driven or top-down systems. These concepts are more fully discussed in Norman & Bobrow (1976) and Palmer (1975). In machine recognition either a bottom-up or a top-down approach has generally been used in implementing algorithms when the input is in the form of spoken or printed text. Examples of bottom-up approaches are the use of syntactic rules (Kashyap & Mittal, 1977) and the Markov methods (e.g., Abend, 1968; Baker, 1975; Chung, 1975; Forney, 1973; Neuhoff, 1975; Raviv, 1967; Shinghal, 1977; Shinghal & Toussaint, 1978). The Markov methods invoke the assumption that the language is a Markov source. Thus these methods require in their implementation the a priori knowledge of the statistical structure of the language. Examples of top-down approaches are dictionary look-up methods (e.g., Abe & Fukumura, 1971; Bledsoe & Browning, 1966; Carlson, 1966; Cornew, 1968; Duda & Hart, 1968; Reddy, 1976; Shinghal, 1977; Vossler & Branston, 1964; Wolf, 1976). The dictionary methods require that the word to be recognized exist in a previously compiled dictionary available to the recognizer.

† This research was supported by the National Research Council of Canada, Grant Number NRC-A9293.
The dictionary methods are thus restricted to a previously defined vocabulary. Dictionary methods have low error-rates but suffer from large storage demands and high computational complexity. The Markov methods have the inverse characteristics. There exist some hybrid methods (Toussaint, 1977), which use the characteristics of both the bottom-up and the top-down approaches; for example, Riseman & Ehrich (1971) used quantized bigram probabilities (all non-zero bigram probabilities were changed to 1) for different word-position combinations in a word to limit dictionary search. This paper presents a combined bottom-up top-down approach, which will be referred to as the Predictor-Corrector algorithm. Experiments on the recognition of English text by three different methods are described: a bottom-up approach, a top-down approach, and the Predictor-Corrector algorithm. The performance versus cost of these methods is compared. The text-recognition problem is described in section 2. A simulation of the text-recognition problem is given in section 3. Section 4 details the three approaches to text-recognition and describes the experiments conducted. Conclusions from the results of the experiments are given in section 5.
2. The text-recognition problem

The text to be recognized comprises patterns, which are presented sequentially to the recognizer. On each pattern a set of measurements, called the feature-vector, is made. Based on the feature-vectors an unknown pattern is then recognized as one of 27 pattern-classes or characters: ␢ (blank) or the letters "A" to "Z". Let

X̄ = X₀, X₁, X₂, ..., Xₙ, Xₙ₊₁

be a sequence of feature-vectors presented to the recognizer such that n ≥ 1. X₀ and Xₙ₊₁ are blanks. X₁ to Xₙ is a string of feature-vectors representing a word. A text is thus recognized one word at a time. Let P(X̄ | Λ = Z) denote the probability of X̄ conditioned on the sequence of pattern-classes

Λ = Λ₀, Λ₁, Λ₂, ..., Λₙ, Λₙ₊₁

taking on the values

Z = Z₀, Z₁, Z₂, ..., Zₙ, Zₙ₊₁.

Let P(Λ = Z) denote the a priori probability of the vector Λ taking on the values Z. The probability of correctly recognizing X̄ is maximized by selecting that sequence of characters which maximizes the a posteriori probability P(Λ = Z | X̄), or a monotonic function of it. For simplicity in notation, P(Λ = Z) will be written as P(Z), P(Λ = Z | X̄) as P(Z | X̄), and P(X̄ | Λ = Z) as P(X̄ | Z). By Bayes' rule,

log P(Z | X̄) = log P(X̄ | Z) + log P(Z) − log P(X̄).
Since P(X̄) is not a function of Z, maximizing log P(Z | X̄) over Z is equivalent to maximizing

log P(X̄ | Z) + log P(Z)                                                        (2.1)

over Z. To reduce the computation required in expression (2.1), the following three assumptions are made.
Assumption 1. The feature-vectors X₀, X₁, ..., Xₙ₊₁ are conditionally independent; that is,

log P(X̄ | Z) = Σ_{i=0}^{n+1} log P(Xᵢ | Zᵢ).

As the experiments described in this paper used machine-printed characters, this assumption is valid (Toussaint, 1977). Thus, maximizing expression (2.1) is equivalent to maximizing

Σ_{i=0}^{n+1} log P(Xᵢ | Zᵢ) + log P(Z).                                        (2.2)
Assumption 2. All blanks are perfectly recognizable; that is,

P(␢ | X₀) = P(␢ | Xₙ₊₁) = 1   and   P(␢ | Xᵢ) = 0   for 1 ≤ i ≤ n.

Thus Z₀ and Zₙ₊₁ are ␢, and Zᵢ (1 ≤ i ≤ n) can take on any of the values of the 26 letters of the English alphabet. Since P(Z₀ | X₀) and P(Zₙ₊₁ | Xₙ₊₁) have been assumed to be constant, it follows from Bayes' rule that P(X₀ | Z₀) and P(Xₙ₊₁ | Zₙ₊₁) are constant, too. Thus maximizing expression (2.2) is equivalent to maximizing

Σ_{i=1}^{n} log P(Xᵢ | Zᵢ) + log P(Z).                                          (2.3)
Assumption 3. The words in the English language are a Markov source of order one; that is,

P(Zₖ | Z₀, Z₁, Z₂, ..., Zₖ₋₁) = P(Zₖ | Zₖ₋₁)   for 1 ≤ k ≤ n + 1.

Maximizing expression (2.3) is then equivalent to maximizing

g(X̄, Z) = Σ_{i=1}^{n} log P(Xᵢ | Zᵢ) + Σ_{i=1}^{n+1} log P(Zᵢ | Zᵢ₋₁).          (2.4)
The P(Zᵢ | Zᵢ₋₁) are transition probabilities, which should be available to the recognizer. g(X̄, Z) should be maximized over the 26ⁿ values of Z₁, Z₂, ..., Zₙ. That, however, leads to a combinatorial problem in computation. The algorithms described in section 4 maximize g(X̄, Z) using different approaches. To varying degrees, all of them reduce the computation required as compared to the maximization of g(X̄, Z) over 26ⁿ sequences. To conduct the experiments a text-recognition problem was simulated.
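To make the objective concrete, the short Python sketch below evaluates the score g(X̄, Z) of expression (2.4) for a single candidate letter sequence. It is only an illustration: the function name, the use of "#" for the blank, and the dictionary-based representation of the log-likelihoods and log transition probabilities are assumptions of the sketch, not part of the original system.

```python
def word_score(log_likelihoods, word, log_trans, blank="#"):
    """Score a candidate letter sequence Z for a word of length n, as in
    expression (2.4): the log-likelihood of each letter plus the log
    transition probabilities, with blanks assumed on both sides of the word.

    log_likelihoods : list of dicts, log_likelihoods[i][c] = log P(X_{i+1} | c)
    word            : candidate string Z_1 ... Z_n
    log_trans       : dict of dicts, log_trans[a][b] = log P(b | a)
    """
    n = len(word)
    padded = blank + word + blank                 # Z_0 = Z_{n+1} = blank
    score = 0.0
    for i in range(1, n + 1):                     # likelihood term, i = 1..n
        score += log_likelihoods[i - 1][padded[i]]
    for i in range(1, n + 2):                     # transition term, i = 1..n+1
        score += log_trans[padded[i - 1]][padded[i]]
    return score
```

Maximizing this score over all 26ⁿ letter sequences is precisely the exhaustive search that the algorithms of section 4 are designed to avoid.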
3. Simulation of the text-recognition problem

An English corpus of 531,445 words was compiled (Toussaint & Shinghal, 1978). Special symbols, numerals and punctuation were deleted from the corpus, leaving only the 27 characters: ␢ and the 26 letters from A to Z. These characters are also referred to as C₁ to C₂₇. Between any two words in the corpus there was one blank. From the corpus, estimates of the unigram probabilities P̂(Cᵢ) and the transition probabilities P̂(Cᵢ | Cⱼ) were obtained for 1 ≤ i, j ≤ 27. Thus P̂(Cᵢ) is the estimate of the probability of occurrence of character Cᵢ in the corpus, and P̂(Cᵢ | Cⱼ) is the estimate of the probability of occurrence of character Cᵢ immediately after character Cⱼ in the corpus. The use of these unigram and transition probabilities in maximizing a heuristic model of equation (2.4) is explained in section 4.

The texts to be used as data for the experiments were compiled using patterns taken from Ryan's data set (Pattern Recognition Data Base No. 1.1.1A, IEEE Computer Society). The patterns were mixed-font, machine-printed letters quantized on a 24 × 24 grid. The letters were size-normalized to reduce the sensitivity of the feature-extraction scheme to the variation in the sizes of the letters. Each 24 × 24 size-normalized pattern was split into thirty-six 4 × 4 non-overlapping regions. A feature-vector Ȳ = (y₁, y₂, ..., y₃₆) was extracted from each pattern such that yⱼ equals the number of dark points in the jth region for 1 ≤ j ≤ 36. Thus 0 ≤ yⱼ ≤ 16.

The 13,337 patterns available were split into a training set of 6651 patterns and a testing set of 6686 patterns. The recognizer was trained on samples from the training set. Two texts, called the Old Passage and the New Passage, were compiled using samples of the testing set. The Old Passage comprised arbitrarily chosen segments from the corpus described above. The New Passage comprised arbitrarily chosen segments from publications not included in the corpus, and thus played no part in estimating the unigram and transition probabilities. The Old Passage contained 2521 words and the New Passage contained 2256 words. These two texts were used as data in the recognition experiments described in section 4. Two texts were compiled so that, if the experimental results were similar on both, the conclusions drawn from them would be more robust, in the sense that the statistics of the English language would have been shown to be accurately estimated.
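As an illustration of the two preprocessing steps just described, the sketch below estimates unigram and transition probabilities from a cleaned corpus string by relative frequency, and extracts the 36-dimensional feature-vector from a 24 × 24 binary pattern. The function names, the use of "#" for the blank character, and the unsmoothed relative-frequency estimator are assumptions made for the example.

```python
from collections import Counter

BLANK = "#"                       # stands in for the blank character

def estimate_probabilities(corpus):
    """Approximate relative-frequency estimates of the unigram probabilities
    P(C_i) and the transition probabilities P(C_i | C_j) from a corpus string
    containing only the 26 letters and the blank character (no smoothing)."""
    unigram = Counter(corpus)
    bigram = Counter(zip(corpus, corpus[1:]))          # (C_j, C_i) pairs
    p_unigram = {c: unigram[c] / len(corpus) for c in unigram}
    p_trans = {prev: {} for prev in unigram}
    for (prev, cur), count in bigram.items():
        p_trans[prev][cur] = count / unigram[prev]
    return p_unigram, p_trans

def extract_features(pattern):
    """Feature-vector (y_1, ..., y_36): the number of dark points in each of
    the thirty-six non-overlapping 4 x 4 regions of a size-normalized
    24 x 24 binary pattern, given as 24 rows of 24 zeros and ones."""
    features = []
    for r in range(0, 24, 4):                          # top-left row of a region
        for c in range(0, 24, 4):                      # top-left column of a region
            features.append(sum(pattern[r + i][c + j]
                                for i in range(4) for j in range(4)))
    return features                                    # each entry lies in 0..16
```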
4. Experiments

This section describes text-recognition experiments using contextual information conducted with the following algorithms:
(1) the modified Viterbi algorithm (MVA), which is a bottom-up approach;
(2) a dictionary look-up algorithm (DLA), which is a top-down approach;
(3) the proposed Predictor-Corrector algorithm (PCA), which is a combined bottom-up top-down approach.
In each algorithm the objective is to maximize g(X̄, Z) of equation (2.4). The algorithms use different techniques of "pruning" to maximize g(X̄, Z) over fewer than the possible 26ⁿ sequences. The description of each algorithm is given below, followed by the experiment conducted using the algorithm. The computational complexity CC(n) is also given for
each algorithm. CC(n) is the count of the units of computation required in maximizing g(X̄, Z) by the corresponding algorithm; in other words, it is the number of units of computation required to recognize a word of length n. One experiment each for the MVA and the DLA and two experiments for the PCA were conducted. The motivation for each experiment is explained. The New and the Old Passages were used as separate pieces of data for each of the experiments and their results are also given separately. The experiments are then compared in terms of their performance and cost. The cost of an algorithm is the expected cost in recognizing a word by the algorithm, where

expected cost per word = (memory required, in megabytes, on the IBM 360/75) × Σₙ Q(n)·CC(n)
and Q(n) = the probability of occurrence of a word of length n in English text. The estimation of Q(n) was described by Shinghal et al. (1978). The memory required is for storing either (a) the transition and unigram probabilities, as for the MVA, or (b) the dictionary, as for the DLA, or (c) both, as for the PCA. All experiments were conducted on an IBM 360/75 then available at McGill University, Montreal, and hence memory requirements are as for that machine.

4.1. THE MODIFIED VITERBI ALGORITHM

The MVA is formally described in detail by Shinghal et al. (1978). Informally, the MVA contains two steps: the selection-of-alternatives and path-tracing. For every Xᵢ (1 ≤ i ≤ n) the d (1 ≤ d ≤ 26) most likely letters into which Xᵢ can be classified are selected. These d alternatives are those that contribute to the d largest weighted likelihoods, log P(Xᵢ | Zᵢ) + log P(Zᵢ). It should be noted that the P(Zᵢ) are unigram probabilities; their estimation was described in section 3. After the selection-of-alternatives, a path is traced through the d by n directed graph such that the d most probable letter-sequences ending with the d alternatives of Xᵢ are chosen iteratively from X₁ to Xₙ. Of these, the one which is most probable to terminate with a blank is chosen as the n-letter word. The value of d, called the "depth of search", is a heuristic which reduces the number of paths to be considered in the directed graph. It was shown by Shinghal et al. (1978) that the computational complexity of this algorithm is a function of both n and d, and is written as V(n, d), where

V(n, 1) = 51n,

V(n, d) = 26n + n Σ_{i=1}^{Min[26−d, d]} (26 − i) + (3d − 1)(1 + (n − 1)d)   for 2 ≤ d ≤ 25,     (4.1)

and V(n, 26) = 77(26n − 25).
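The Python sketch below illustrates the two MVA steps in the form just described: selection of the d alternatives with the largest weighted likelihoods, followed by a Viterbi-style trace through the d by n graph that ends on the path most probable to terminate with a blank. It is a simplified reading of the algorithm, not a transcription of the authors' implementation; it reuses the dictionary-based probability tables of the earlier sketches and assumes every needed transition probability is present (i.e. smoothed away from zero).

```python
import heapq

ALPHABET = [chr(ord("A") + k) for k in range(26)]

def modified_viterbi(log_likelihoods, log_unigram, log_trans, d=5, blank="#"):
    """Sketch of the MVA: log_likelihoods[i][c] = log P(X_{i+1} | c),
    log_unigram[c] = log P(c), log_trans[a][b] = log P(b | a)."""
    n = len(log_likelihoods)

    # Selection-of-alternatives: keep the d letters with the largest
    # weighted likelihoods log P(X_i | c) + log P(c) at every position.
    alternatives = [
        heapq.nlargest(d, ALPHABET, key=lambda c: ll[c] + log_unigram[c])
        for ll in log_likelihoods
    ]

    # Path-tracing: for each surviving letter at position i, keep only the
    # best-scoring sequence ending in that letter (d paths per position).
    best = {c: log_trans[blank][c] + log_likelihoods[0][c] for c in alternatives[0]}
    back = [{c: None for c in alternatives[0]}]
    for i in range(1, n):
        new_best, new_back = {}, {}
        for c in alternatives[i]:
            prev = max(best, key=lambda p: best[p] + log_trans[p][c])
            new_best[c] = best[prev] + log_trans[prev][c] + log_likelihoods[i][c]
            new_back[c] = prev
        best, back = new_best, back + [new_back]

    # Choose the path most probable to terminate with a blank, then backtrack.
    last = max(best, key=lambda c: best[c] + log_trans[c][blank])
    word = [last]
    for i in range(n - 1, 0, -1):
        word.append(back[i][word[-1]])
    return "".join(reversed(word))
```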
The optimal value of d was experimentally shown to be 5 by Shinghal et al. (1978). The text-recognition experiment with the MVA was therefore conducted with d = 5. The error-rate (probability of misclassification of letters) for the Old Passage and the New Passage was 0.128 and 0.131, respectively. The cost was 2.85 units.
4.2. A DICTIONARY LOOK-UP ALGORITHM
This method, described by Toussaint (1977), is an extension of the Bledsoe-Browning (1966) algorithm. It is assumed that there exists a dictionary of words, and that any word to be classified occurs in the dictionary. The input feature-vector sequence X₁, X₂, ..., Xₙ is recognized as that n-letter word which maximizes g(X̄, Z) over all the n-letter words in the dictionary. Let g(X̄, Z) be called the score of the word Z. Let the VALUE of the word Z₁, Z₂, ..., Zₙ be written as VAL(Z₁, Z₂, ..., Zₙ) and be defined by the formula

VAL(Z₁, Z₂, ..., Zₙ) = Σ_{i=1}^{n+1} log P̂(Zᵢ | Zᵢ₋₁).
Thus the second term of g(X̄, Z) gives the VALUE of the word. The VALUE of any word is a constant. During the design stage of the recognizer, the VALUE of every word in the dictionary is computed and stored with the dictionary. Thus every word in the dictionary is associated with its VALUE.

The mathematical model for the computational complexity is now derived. Since the VALUE of a word is already stored with the dictionary, n additions are needed to compute the score of a word. If the dictionary has qₙ words of length n, then the number of additions required to calculate the scores of all n-letter words is nqₙ. To select the word with the largest score needs (qₙ − 1) comparisons. Thus the computational complexity in recognizing a word of length n is D(n), where

D(n) = nqₙ + qₙ − 1.                                                            (4.2)
It is assumed that a comparison and an addition require one unit of computation each. It is seen that the larger the dictionary, the greater the computational complexity.

For the experiment, a dictionary of 11,603 distinct words was compiled. The dictionary contained all words present in the New and Old Passages plus words from arbitrarily chosen segments of the corpus described in section 3. The constituents of the dictionary are shown in Table 1. This was considered to be a dictionary of a conveniently large size. Transition probabilities estimated from the corpus were used to compute the VALUE of every word in the dictionary, and the VALUE of every word was stored in the dictionary. The Old and the New Passages were recognized. The error-rates were 0.034 and 0.036 for the Old and the New Passages respectively. The cost was 786.5 units. Comparing this result to the result of the experiment with the MVA, this experiment gave a remarkably lower error-rate, but at 276 times the cost.
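A minimal sketch of the recognition-time look-up follows, under the assumption that the dictionary is held as lists of (word, VALUE) pairs indexed by word length; because the VALUE is precomputed, only the n likelihood additions are performed per candidate word, as in the complexity count above. The data layout and function name are illustrative.

```python
def dictionary_lookup(log_likelihoods, dictionary):
    """Recognize the word as the n-letter dictionary entry with the largest
    score sum_i log P(X_i | Z_i) + VAL(Z).
    log_likelihoods[i][c] = log P(X_{i+1} | c);
    dictionary[n] = [(word, VALUE), ...] for the n-letter words."""
    n = len(log_likelihoods)
    best_word, best_score = None, float("-inf")
    for word, value in dictionary[n]:
        score = value + sum(log_likelihoods[i][word[i]] for i in range(n))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```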
TABLE 1
Constituents of the dictionary

Length of word, n    Number of such words in the dictionary, qₙ
        1                     22
        2                     89
        3                    397
        4                   1007
        5                   1384
        6                   1773
        7                   1888
        8                   1666
        9                   1364
       10                    918
       11                    575
       12                    292
       13                    154
       14                     76
       15                     30
       16                     10
       17                      5
       18                      1
     Total                11,603
4.3. THE PROPOSED BOTTOM-UP TOP-DOWN APPROACH

To reduce the cost of the dictionary method, one approach would be to reduce the size of the dictionary. But, if that is done, then the number of words that can be recognized is small, which is not useful from a practical point of view. So, a reasonably large dictionary is needed for a recognizer. In the proposed Predictor-Corrector algorithm the cost is lowered, not by reducing the memory requirement, but by reducing the computational complexity.

To implement the Predictor-Corrector algorithm, the dictionary is rearranged so that it consists of lists e₁, e₂, e₃, ... such that eₙ contains all the n-letter words of the dictionary for n ≥ 1. Furthermore, within eₙ the words have been sorted on their VALUEs and arranged in descending order. The list eₙ has qₙ words, and the words will be individually referred to as eₙ₁, eₙ₂, ..., eₙqₙ. Table 2 shows a typical list, e₃, in which e₃₁ = "THE", e₃₂ = "ATE", and so on. If Z is an n-letter word, then eₙⱼ is called a mate of Z iff

|VAL(eₙⱼ) − VAL(Z)| ≤ |VAL(eₙᵢ) − VAL(Z)|   for 1 ≤ i ≤ qₙ.

If VAL(eₙⱼ) = VAL(Z), then eₙⱼ is called the perfect mate of Z. Let ⌊a⌋ mean the floor of a, and ⌈a⌉ mean the ceiling of a. Thus ⌊7.2⌋, ⌊7.5⌋, ⌊7.6⌋ are all equal to 7; ⌈7.2⌉, ⌈7.5⌉, ⌈7.6⌉ are all equal to 8.
TABLE 2
Sample list e₃ in the dictionary

Word    Value
THE     -18.39
ATE     -19.20
ARE     -19.77
TIN     -19.82
TEN     -20.05
ONE     -20.31
In the Predictor-Corrector algorithm, though the entire dictionary is stored in memory, scores are calculated for only a fraction f (0 ≤ f ≤ 1) of the words in any list eₙ. The value of f is a heuristic decided upon by the user. Let q′ₙ = ⌈f·qₙ⌉. Thus scores will be calculated for only q′ₙ words of the list eₙ. Let the designated word be that word which the input pattern sequence X̄ is recognized as. It was seen earlier that the MVA has a low cost but gives a high error-rate. The DLA has the inverse characteristics. The PCA is a compromise: recognize X̄ by the MVA to get Z; check to see if Z is in the dictionary; if so, Z is the designated word; else calculate the scores of the words in the neighbourhood (the size of which depends on f) of the mate (which can be considered the "closest fit") of Z. The word with the largest score is then the designated word.
The Predictor-Corrector Algorithm

[B1] Predictor Step
Recognize the input pattern sequence X̄ by the modified Viterbi algorithm. Let the output be the word Z. Calculate VAL(Z). Using binary search, find the mate eₙⱼ of Z. If eₙⱼ is a perfect mate of Z, then Z is the designated word, so stop; else go to [B2], the Corrector Step.

[B2] Corrector Step
Set

t ← Max(1, j − ⌈q′ₙ/2⌉),
b ← Min(qₙ, j + ⌈q′ₙ/2⌉).

Calculate the scores of all the words eₙₖ (t ≤ k ≤ b) and find the word eₙₘ (t ≤ m ≤ b) that has the largest score of these. eₙₘ is the designated word, so stop.

A comment is necessary about the way it is checked, during the Predictor Step, that Z is in the dictionary. This is done by calculating VAL(Z) and testing whether the dictionary has a word eₙⱼ of the same length as Z with VAL(Z) = VAL(eₙⱼ). There is an implied assumption in this: no two distinct words have the same length and the same VALUE. Extensive testing has shown this assumption to be true.
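The sketch below puts the two steps together in Python, reusing the modified_viterbi sketch of section 4.1 and the probability tables assumed there. It departs from the text in small, labelled ways: indices are 0-based, and the lists are kept sorted by VALUE in ascending rather than descending order so that the standard library bisect routine can perform the binary search. Everything else (interfaces, names) is an assumption of the example.

```python
import bisect
import math

def predictor_corrector(log_likelihoods, log_unigram, log_trans,
                        lists, f=1.0, d=2, blank="#"):
    """lists[n] = [(VALUE, word), ...] for the n-letter dictionary words,
    sorted by VALUE in ascending order."""
    n = len(log_likelihoods)
    e_n, q_n = lists[n], len(lists[n])
    q_prime = math.ceil(f * q_n)

    def value(word):                 # VAL(Z): the transition-probability term only
        return sum(log_trans[a][c] for a, c in zip(blank + word, word + blank))

    # [B1] Predictor Step: recognize with the MVA, then binary-search for the mate.
    z = modified_viterbi(log_likelihoods, log_unigram, log_trans, d=d, blank=blank)
    val_z = value(z)
    k = bisect.bisect_left(e_n, (val_z,))
    candidates = [i for i in (k - 1, k) if 0 <= i < q_n]
    j = min(candidates, key=lambda i: abs(e_n[i][0] - val_z))    # mate of Z
    if e_n[j][0] == val_z:
        return z                     # perfect mate: Z is the designated word

    # [B2] Corrector Step: score roughly q'_n words centred on the mate and
    # return the one with the largest score (stored VALUE plus n additions).
    t = max(0, j - math.ceil(q_prime / 2))
    b = min(q_n - 1, j + math.ceil(q_prime / 2))
    _, best_word = max(
        e_n[t:b + 1],
        key=lambda vw: vw[0] + sum(log_likelihoods[i][vw[1][i]] for i in range(n)),
    )
    return best_word
```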
The mathematical model for the computational complexity of the PCA is now derived. In the Predictor Step, V(n, d) units of computation are required by the MVA; n additions to compute VAL(Z); and ⌊log₂ qₙ⌋ + 1 comparisons to find the mate of Z, using a binary search as given by Knuth (1975). In the Corrector Step, f·D(n) units of computation are required. But, it should be noted that whereas the Predictor Step is executed for all the words in the text, the Corrector Step is executed for only a fraction β of the words in the text. Thus the computational complexity of the Predictor-Corrector algorithm, from equations (4.1) and (4.2), is PC(n, d, f), where

PC(n, d, f) = V(n, d) + n + ⌊log₂ qₙ⌋ + 1 + βf(nqₙ + qₙ − 1).

To compute the computational complexity per word in a text-recognition experiment, it would be necessary to observe the value of β. Two experiments were carried out: Experiment (a) was carried out to determine the optimal depth of search d for the MVA in the Predictor Step. Experiment (b) was carried out to determine the optimal value of f. Both experiments used the dictionary that was compiled for the DLA.
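For readers who wish to reproduce the cost figures, the following is a hedged transcription of the complexity model: V(n, d) from equation (4.1), D(n) from equation (4.2), PC(n, d, f), and the expected cost per word defined at the start of this section. The helper names and the interfaces (Q as a dictionary, CC as a function) are assumptions of the sketch.

```python
import math

def V(n, d):
    """MVA computational complexity, equation (4.1)."""
    if d == 1:
        return 51 * n
    if d == 26:
        return 77 * (26 * n - 25)
    return (26 * n
            + n * sum(26 - i for i in range(1, min(26 - d, d) + 1))
            + (3 * d - 1) * (1 + (n - 1) * d))

def D(n, q_n):
    """DLA computational complexity, equation (4.2), with q_n n-letter words."""
    return n * q_n + q_n - 1

def PC(n, d, f, q_n, beta):
    """PCA computational complexity; beta is the observed fraction of words
    for which the Corrector Step was executed."""
    return V(n, d) + n + math.floor(math.log2(q_n)) + 1 + beta * f * D(n, q_n)

def expected_cost(memory_megabytes, Q, CC):
    """Expected cost per word: memory (in megabytes) times sum_n Q(n) * CC(n),
    where Q[n] is the probability of an n-letter word in English text."""
    return memory_megabytes * sum(Q[n] * CC(n) for n in Q)
```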
4.3.1. Experiment (a)
It was shown by Shinghal et al. (1978) that d for the MVA need not exceed 5. Thus the Old and the New Passages were recognized by the PCA for d = 1, 2, 3, 4, and 5. The value of f was chosen to be 1. Table 3 shows the error-rates and costs for the different values of d. From Table 3, it is seen that the error-rate remained the same for all values of d, and that the cost was lowest at d = 2. This held true for both the Old and the New Passages. It was thus concluded that the optimal value of d in the Predictor Step is 2 for f = 1.

TABLE 3
Results of the experiment to find the optimum value of the depth of search in the Predictor Step

                       Old Passage                New Passage
Depth of search    Error rate     Cost        Error rate     Cost
       1              0.034      452.59          0.037      445.35
       2              0.034      415.56          0.037      414.76
       3              0.034      422.28          0.037      427.10
       4              0.034      446.65          0.037      450.68
       5              0.034      472.56          0.037      477.39
4.3.2. Experiment (b)
The Old and the New Passages were recognized by the PCA. The value of d was fixed at 2, because it was concluded in Experiment (a) that this was the optimal value. The value of f was varied from 0.4 to 1.0 in steps of 0.1. Computational considerations did not allow both d and f to be varied.
FIG. 1. Old Passage: error-rate versus cost for the Predictor-Corrector algorithm (d = 2, f = 0.4 to 1.0) and the dictionary look-up algorithm.

FIG. 2. New Passage: error-rate versus cost for the Predictor-Corrector algorithm (d = 2, f = 0.4 to 1.0) and the dictionary look-up algorithm.
Figures 1 and 2 show plots of the error-rate versus cost for the recognition of the Old and the New Passages. To facilitate the comparison of the PCA with the DLA, the error-rate of the DLA experiment is also shown in the two plots. It should be noticed from Figs 1 and 2 that as f decreases, the error-rate increases sharply. At f = 1, the cost of the PCA is half that of the DLA for the same error-rate. The reason that the cost dropped so sharply is that for a little more than half the words (55.3% in the New Passage and 55.2% in the Old Passage) the Corrector Step was not executed; the Predictor Step was sufficient to recognize the word. It can also be concluded that it is best to set f = 1, thus using the full dictionary.
5. Discussion and conclusions

An algorithm (Predictor-Corrector) has been proposed, for using contextual information in text recognition, that combines the benefits of both the modified Viterbi algorithm and a dictionary look-up algorithm. The proposed algorithm was experimentally compared to the modified Viterbi algorithm and the dictionary look-up algorithm. The experimental results showed that the Predictor-Corrector algorithm can achieve the same low error-rate as the dictionary look-up algorithm, but at half the cost. This happens when the parameter f is set equal to 1. This also implies that VAL is a very poor measure of the proximity of words in the dictionary, and suggests that searching for a better measure might be a fruitful area for further research. Finally, it should be noted that the Predictor-Corrector algorithm is a combined bottom-up and top-down approach to context as discussed in the introduction. Hence this algorithm represents a formal model of the bottom-up top-down theoretical construct presently being considered in information processing models of human behaviour.
References

ABE, K. & FUKUMURA, T. (1971). Word-searching methods in character recognition system with dictionary. Systems, Computers, Controls, 2 (5), 1-9.
ABEND, K. (1968). Compound decision procedures for unknown distributions and for dependent states in nature. In KANAL, L. N., Ed., Pattern Recognition, pp. 207-249. Washington, D.C.: Thompson Book Company.
BAKER, J. K. (1975). Stochastic modelling for automatic speech understanding. In REDDY, D. R., Ed., Speech Recognition, pp. 521-542. New York: Academic Press.
BLEDSOE, W. W. & BROWNING, J. (1966). Pattern recognition and reading by machine. In UHR, L., Ed., Pattern Recognition, pp. 301-316. New York: John Wiley.
CARLSON, G. (1966). Techniques for replacing characters that are garbled on input. Proceedings of the Spring Joint Computer Conference, pp. 189-192.
CHUNG, S. S. (1975). Using contextual constraints from the English language to improve the performance of character recognition machines. M.Sc. Thesis, School of Computer Science, McGill University, Montreal.
CORNEW, R. W. (1968). A statistical method of error correction. Information and Control, 12 (2), 79-82.
DUDA, R. O. & HART, P. E. (1968). Experiments in the recognition of hand-printed text: Part II--context analysis. AFIPS Conference Proceedings, 33, 1139-1149.
FORNEY, G. D., JR. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61, 268-278.
KASHYAP, R. L. & MITTAL, M. C. (1977). A new method for error correction in strings with applications to spoken word recognition. IEEE Computer Society Conference on Pattern Recognition and Image Processing, Troy, N.Y., pp. 76-82.
KNUTH, D. E. (1975). The Art of Computer Programming, Volume 3: Sorting and Searching. Reading, Mass.: Addison-Wesley.
NEUHOFF, D. L. (1975). The Viterbi algorithm as an aid in text recognition. IEEE Transactions on Information Theory, IT-21 (2), 222-226.
NORMAN, D. A. & BOBROW, D. G. (1976). On the role of active memory processes in perception and cognition. In COFER, C. N., Ed., The Structure of Human Memory. San Francisco: W. H. Freeman.
PALMER, S. E. (1975). Visual perception and world knowledge. In NORMAN, D. A., RUMELHART, D. E. & THE LNR RESEARCH GROUP, Eds, Explorations in Cognition. San Francisco: W. H. Freeman.
RAVIV, J. (1967). Decision making in Markov chains applied to the problem of pattern recognition. IEEE Transactions on Information Theory, IT-13 (4), 536-551.
REDDY, D. R. (1976). Speech recognition by machine: a review. Proceedings of the IEEE, 64, 501-531.
RISEMAN, E. M. & EHRICH, R. W. (1971). Contextual word recognition using binary digrams. IEEE Transactions on Computers, C-20 (4), 397-403.
SHINGHAL, R. (1977). Using contextual information to improve performance of character recognition machines. Ph.D. dissertation, School of Computer Science, McGill University, Montreal.
SHINGHAL, R. & TOUSSAINT, G. T. (1978). Experiments in text-recognition with the modified Viterbi algorithm. IEEE Computer Society Workshop on Pattern Recognition and Artificial Intelligence, Princeton (also to appear in the Special Issue on Pattern Recognition and Artificial Intelligence of the IEEE Transactions on Pattern Analysis and Machine Intelligence).
TOUSSAINT, G. T. (1977). The use of context in pattern recognition. In Proceedings of the IEEE Computer Society Conference on Pattern Recognition and Image Processing, Troy, N.Y., pp. 1-10 (also in Pattern Recognition, 1978, 10, 189-204).
TOUSSAINT, G. T. & SHINGHAL, R. (1978). Cluster analysis of English text. In Proceedings of the Pattern Recognition and Image Processing Conference, Chicago, pp. 164-172.
VOSSLER, C. M. & BRANSTON, N. M. (1964). The use of context for correcting garbled English text. In Proceedings of the ACM 19th National Conference, pp. D2 4-1 to D2 4-3.
WOLF, J. J. (1976). Speech recognition and understanding. In FU, K. S., Ed., Digital Pattern Recognition, pp. 167-203. New York: Springer-Verlag.