Print keyword spotting with dynamically synthesized pseudo 2D HMMs

Pattern Recognition Letters 25 (2004) 999–1011 www.elsevier.com/locate/patrec

Beom-Joon Cho *, Jin H. Kim
Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusung-dong, Yusung-ku, Taejon 305-701, South Korea
Received 20 June 2003; received in revised form 28 January 2004; available online 9 April 2004
* Corresponding author. E-mail address: [email protected] (B.-J. Cho).

Abstract

We propose a new method of dynamically synthesizing Korean character image templates and then converting them into P2DHMMs in real time. This method is more advantageous than whole-character HMMs in memory requirement as well as training difficulty. © 2004 Elsevier B.V. All rights reserved.

Keywords: Pseudo 2D HMM; Character modeling; Keyword spotting; Document retrieval

1. Introduction

In the field of OCR the neural network is a highly successful model for recognizing machine-printed characters. However, one problem with the neural network is that the sequential nature of text running left to right is not well captured without sophisticated network architectures like that of the TDNN (Lang et al., 1990). As a result, most neural network systems with ordinary architectures assume an external segmentation of character blocks prior to recognition. In this case the overall system performance is usually limited by the performance of the segmenter and the quality of the resulting segments. Another problem
with the neural network model is that it is a purely holistic model that cannot be decomposed, analyzed, or synthesized; therefore training thousands of character models is extremely difficult, if not impossible.

Since the early nineties another model has entered the arena of document analysis: the hidden Markov model, or HMM. Stimulated by its success in speech recognition, its capability of modeling variability and sequential structure has been exploited successfully in diverse areas (Rabiner, 1989). The application of the HMM benefits from a wide range of experience accumulated in speech recognition and many other fields. Since document texts run sequentially and, mostly, left to right, it is natural that the idea of using HMMs occurs to researchers. To date, HMM applications to English have been reported in several places in the literature (Anigbogu and
Belaid, 1995; Kuo and Agazzi, 1993; Chen et al., 1993).

Although texts run linearly, individual character patterns are not linear but two-dimensional. This fact has not been a barrier to the modeling of Latin alphabet-based texts, which run strictly left to right at the letter level; there the idea is straightforward. In the case of Korean Hangul characters, however, the problem is not so simple. At the character level or above, texts run linearly. One problem is that there are thousands of characters used in Hangul texts, and we may need a corresponding number of models. Another problem, below the character level, is that a Korean Hangul character is composed of either two or three graphemes arranged two-dimensionally so as to fit into a rectangle. The two-dimensional composition of grapheme models in Hangul is not straightforward, and thus the HMM has not been tried for machine-print character recognition (Yang and Oh, 2001). Without doubt, however, the compositional approach to character modeling is more advantageous than designing thousands of whole-character models with regard to memory requirements as well as training difficulty.

This research is focused on the application of the HMM method to the analysis of document text images. The basic idea lies in the real-time generation of Korean Hangul character models for spotting keywords in the content analysis of optical documents. In the proposed method, individual character models are synthesized in real time using trained grapheme image templates. Since characters are two-dimensional, it is natural to believe that the 2D HMM, an extension of the standard HMM, will be helpful and offer great potential for analyzing and recognizing character patterns. However, a fully connected 2D HMM leads to an algorithm of exponential complexity (Levin and Pieraccini, 1992). To avoid this problem, the connectivity of the network has been reduced in several ways, two among which are the Markov random field and its variants (Chellapa and Chatterjee, 1985) and the pseudo 2D HMM (Agazzi and Kuo, 1993; Agazzi et al., 1993; Kuo and Agazzi, 1993). The latter model, called the P2DHMM, is a very simple and efficient 2D model that retains all of the useful HMM features. This
paper focuses on the real-time construction of Hangul character P2DHMMs using trained grapheme image templates. We believe the proposed method is feasible and particularly appropriate thanks to the absence in Korean of natural italic fonts corresponding to English italics, which is a rationale for using the P2DHMM.

In the proposed method, we prepare a set of grapheme image samples for each grapheme and obtain their average, a grapheme template. Then, by superposing appropriate grapheme templates, we can compose a character image template. Finally, this character template is converted into a P2DHMM in a systematic way. In this method, the new idea of location-preserving 2D superposition is very simple yet elegant and efficient for real-time processing. The idea of character composition is not new, but its application to strictly 2D model design is; this is especially true within the 2D HMM framework. Another feature of the proposed method is the conversion of the grayscale template into a P2DHMM, which is theoretically correct in the sense of maximum likelihood estimation. An additional noteworthy feature is model size reduction exploiting the information redundancy in the templates: successive HMM states are merged based on the similarity between their output probability distributions (PDs). The resulting models are often much smaller than the original and thus speed up the spotting task, and sometimes even improve the performance.

The rest of the paper is organized as follows. In Section 2 we briefly review the HMM. In Section 3 the pseudo 2D HMM and its algorithm are described, and a procedure for developing character models is discussed in detail. Section 4 describes the auxiliary models needed for the proposed method of key character spotting. Section 5 presents results from preliminary experiments. Section 6 concludes the paper.

2. Hidden Markov model

The hidden Markov model is a doubly stochastic process that can be described by three sets of probabilistic parameters as $\lambda = (A, B, \pi)$. Given a set of N states and a set V of observable symbols,
the parameters are formally defined by Rabiner (1989):

• Transition probability distribution:
  $A = \{a_{ij} = p(q_t = j \mid q_{t-1} = i),\ 1 \le i, j \le N\}$, where $\sum_j a_{ij} = 1$.

• Output probability distribution:
  $B = \{b_i(v) = p(x_t = v \mid q_t = i),\ 1 \le i \le N,\ v \in V\}$, where $\sum_v b_i(v) = 1$.

• Initial transition probability distribution:
  $\pi = \{\pi_i = p(q_1 = i),\ 1 \le i \le N\}$, where $\sum_i \pi_i = 1$.

The most frequent task with an HMM is the evaluation of how well the model generates an input sequence $X = x_1, x_2, \ldots, x_T$. It is given by the following matching score, a likelihood function of the sequence observed from the model:

  $P(X \mid \lambda) = \sum_{Q} \pi_{q_1} b_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(x_t)$   (1)

Although simple in form, the time requirement is exponential. Thanks to the use of the DP technique, this can be computed in time linear in T. However, when it comes to the 2D HMM formulation, even the DP technique alone is not enough. One research direction is the structural simplification of the model, and the pseudo 2D HMM is one solution.

3. Pseudo 2D HMM construction

3.1. Description

The pseudo 2D HMM in this paper is realized as a horizontal connection of vertical sub-HMMs ($\lambda_k$). However, this is not the only realization: the alternative is a vertical connection of horizontal sub-HMMs, as in the work of Xu and Nagy (1999). In order to implement a continuous forward search method and sequential composition of word models, the former type has been used in this research.

There are three kinds of parameters in the P2DHMM. However, since the graphical configuration is two-dimensional, we further divide the Markov transition parameters into super-state transition and sub-state transition probabilities, denoted respectively as

  $\tilde{a}_{kl} = p(r_{t+1} = l \mid r_t = k), \quad 1 \le k, l \le N$

and

  $a_{ij} = p(q_{s+1} = j \mid q_s = i), \quad 1 \le i, j \le M$

where $r_t$ denotes a super-state, which corresponds to a sub-HMM $\lambda_k$, and $q_s$ denotes a sub-state observing the s-th pixel. The model has N super-states, and each sub-HMM $\lambda_k$ is defined as a standard HMM consisting of M states. The sub-states of a sub-HMM have an output distribution function, whereas the super-states of the super-HMM do not: the observation of a super-state is none other than the frame observed by its sub-states, so there is no separate output PD. This is formulated explicitly in Eq. (2) in the following subsection.

3.2. Evaluation algorithm

Let us consider the t-th vertical frame $X_t = x_{1t}, x_{2t}, \ldots, x_{St}$, $1 \le t \le T$, in a text line image. This is a one-dimensional sequence like X in Eq. (1). It is modeled by a sub-HMM $\lambda_k$ with the likelihood $P(X_t \mid \lambda_k)$; each sub-HMM $\lambda_k$ may be regarded as a super-state whose observation is a vertical frame of pixels:

  $P_{r_t}(X_t \mid \lambda_{r_t}) = \sum_{Q} \pi_{q_1} b_{q_1}(x_{1t}) \prod_{s=2}^{S} a_{q_{s-1} q_s} b_{q_s}(x_{st})$   (2)
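As a concrete illustration of Eq. (2) (and of the Viterbi approximation used later in the paper), the following minimal sketch is our own, not the authors' implementation: it scores one binary vertical pixel frame against a sub-HMM in log space, replacing the sum over state sequences with the best-path (Viterbi) maximum. The array names and the Bernoulli black/white output model are assumptions.

```python
import numpy as np

def frame_log_score(col, log_pi, log_a, log_b):
    """Best-path (Viterbi) approximation of log P(X_t | lambda_k) for one
    vertical frame, cf. Eq. (2).

    col    : length-S sequence of pixel values (0 = white, 1 = black)
    log_pi : (M,) log initial sub-state probabilities
    log_a  : (M, M) log sub-state transition matrix
    log_b  : (M, 2) log output probabilities of white/black per sub-state
    """
    delta = log_pi + log_b[:, col[0]]
    for x in col[1:]:
        # best predecessor for every sub-state, then emit the next pixel
        delta = (delta[:, None] + log_a).max(axis=0) + log_b[:, x]
    return delta[-1]  # constrained to end in the last (bottom) sub-state
```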

Now let us consider a bitmap image, which we define as a sequence of such vertical frames, $X = X_1, X_2, \ldots, X_T$. Each frame is modeled by a super-state, i.e., a sub-HMM. Let $\Lambda$ be a sequential concatenation of sub-HMMs. Then the evaluation of $\Lambda$ given the sample image X is

  $P(X \mid \Lambda) = \sum_{R} P_1(X_1) \prod_{t=2}^{T} \tilde{a}_{r_{t-1} r_t} P_{r_t}(X_t)$   (3)

where it is assumed that the super-state process starts only from the first state. The $P_{r_t}$ function is the super-state likelihood. Note that both Eqs. (2) and (3) can be effectively approximated by the Viterbi score.

One immediate goal of the Viterbi search is the calculation of the matching likelihood score between X and an HMM. The objective function for an HMM $\lambda_k$ is defined by the maximum likelihood as

  $D(X_t, \lambda_k) = \max_{Q} \prod_{s=1}^{S} a_{q_{s-1} q_s} b_{q_s}(x_{st})$   (4)

where $Q = q_1, q_2, \ldots, q_S$ is a sequence of states of $\lambda_k$, and $a_{q_0 q_1} = \pi_{q_1}$. $D(X_t, \lambda_k)$ is the similarity score between two sequences of different length. The basic idea behind the efficiency of the DP computation lies in formulating the expression in a recursive form:

  $\delta_s^k(j) = \max_i \delta_{s-1}^k(i)\, a_{ij}^k\, b_j^k(x_{st}), \qquad j = 1, \ldots, M_k,\; s = 1, \ldots, S,\; k = 1, \ldots, K$   (5)

where $\delta_s^k(j)$ denotes the probability of observing the partial sequence $x_{1t}, \ldots, x_{st}$ in model k along the best state sequence reaching state j at time/step s. Note that

  $D(X_t, \lambda_k) = \delta_S^k(N_k)$   (6)

where $N_k$ is the final state of the state sequence. The above recursion constitutes the DP in the lower-level structure of the P2DHMM. The remaining DP in the upper level of the network is similarly defined by

  $D(X, \Lambda) = \max_{R} \prod_{t=1}^{T} \tilde{a}_{r_{t-1} r_t} D(X_t, \lambda_{r_t})$   (7)

which can likewise be reformulated in a recursive form. Here $\tilde{a}_{r_1 r_2}$ denotes the probability of transition from super-state $r_1$ to $r_2$. According to the formulation described thus far, a P2DHMM adds only one parameter set, namely the super-state transitions, to the conventional HMM parameter sets. Therefore it is a simple extension of the conventional HMM.
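For illustration only, the upper-level DP implied by Eq. (7) can be written as a standard Viterbi pass over super-states. The sketch below is ours, not the authors' code; the log-domain formulation, the array names, and the assumption that the path starts in the first super-state and ends in the last one are ours.

```python
import numpy as np

def upper_level_viterbi(frame_scores, log_super_trans):
    """Upper-level DP of Eq. (7): align T vertical frames to the N super-states
    of one character P2DHMM and return the best log score.

    frame_scores    : (T, N) array, entry (t, k) = log D(X_t, lambda_k), i.e.
                      the sub-HMM Viterbi score of frame t under super-state k
    log_super_trans : (N, N) log super-state transition matrix, log a~_{kl}
    """
    T, N = frame_scores.shape
    score = np.full(N, -np.inf)
    score[0] = frame_scores[0, 0]          # start in the first super-state
    for t in range(1, T):
        # best predecessor super-state for each super-state, then add frame score
        score = (score[:, None] + log_super_trans).max(axis=0) + frame_scores[t]
    return score[N - 1]                    # end in the last super-state
```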

3.3. Design of character models

One of the most important tasks in hidden Markov modeling is estimating the probabilistic parameters. For this task we assume a set of typical samples of character images $X = \{X^{(1)}, \ldots, X^{(D)}\}$ of equal dimensions. Different sizes raise no problem if we scale the images bilinearly. Moreover, the scale difference in test images is naturally resolved by the HMM method.

The focus of this section lies in the construction of the P2DHMM for a Korean Hangul character. A Hangul character consists of either two or three graphemes (phonetic consonant and vowel letters). The composition follows a general rule to fit the graphemes into a rectangle. There are six types of combination (Fig. 1) according to the shape of the vowel (horizontal, vertical, or both) and the presence of a consonant suffix.

Fig. 1. Six ways of composing Korean syllable characters by arranging graphemes inside a character block. A grapheme changes its shape accordingly. From left to right: Types I, II, III, IV, V and VI.

The proposed method of model creation is based on the given set of bitmap images. The overall procedure is shown in Fig. 2 and explained as follows:

Fig. 2. Model design procedure; steps (1) and (2) are carried out off line, whereas (3)–(5) are done at run time.


(1) Grapheme segmentation. This step involves extracting the individual graphemes from character samples while retaining their location inside the box enclosing the character. As illustrated in Fig. 3, the graphemes are separated while retaining their position within the box. In its simplest form this step is the most costly one in the proposed method. However, the problem can be avoided by using a bootstrapping strategy or a somewhat more sophisticated prototyping idea (Xu and Nagy, 1999).

Fig. 3. Korean graphemes separated out from a syllabic character for /han/. From left to right: the initial consonant, the vowel, and the suffix consonant. Note that the grapheme position in the original character block is retained.

(2) Average the extracted samples. There is now a set of grapheme samples. First we classify the samples according to the type of the grapheme arrangement pattern of the original character. For the initial consonant grapheme there are six types (see Fig. 1), and there are two for each vowel grapheme. Then take the sample average of each set of categorized images pixel by pixel, so that a smooth grayscale-like image is obtained (see Fig. 3). Assuming binary samples, the average intensity of the pixel at (i, j) is

  $\bar{x}_{ij} = N_{ij} / N$

where $N_{ij}$ is the number of samples whose (i, j) pixel is black (or white) and N is the total number of samples. Essentially, the training phase is finished at this stage.

(3) Character image construction. From this step on, the process belongs to the decoding or recognition phase, which is performed in real time. Here the given task is to spot or recognize a character. For this task the image template for the key character is synthesized in the image domain from the component grapheme images generated in the previous step (see Fig. 4). The value of the (i, j)-th pixel of the character template is the maximum of the two or three pixels at (i, j) from the grapheme planes. Fig. 4 shows a sample composition result with the composition boundaries emphasized (a small code sketch of steps (2) and (3) is given after the figure).

Fig. 4. (a) Character image construction and (b) the result with composition boundaries emphasized.
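To make steps (2) and (3) concrete, the sketch below is our own illustration, not the authors' code: it averages location-preserving binary grapheme samples into grayscale templates and composes a character template by pixel-wise maximum. The function names, the NumPy arrays and the 0/1 pixel convention are assumptions.

```python
import numpy as np

def grapheme_template(samples):
    """Step (2): pixel-wise average of location-preserving binary grapheme
    samples (each an HxW array with 1 = black), giving a grayscale template."""
    return np.mean(np.stack(samples, axis=0), axis=0)

def compose_character(*grapheme_planes):
    """Step (3): location-preserving superposition of two or three grapheme
    template planes; each pixel of the character template takes the maximum
    value over the planes."""
    return np.maximum.reduce([np.asarray(p, dtype=float) for p in grapheme_planes])

# usage sketch: consonant, vowel and (optionally) suffix are HxW grapheme
# templates of the same character-box size, chosen according to the
# character's composition type (Fig. 1):
#   char_template = compose_character(consonant, vowel, suffix)
```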


Fig. 5. P2DHMMs after (a) the conversion stage and (b) the state-reduction stage.

(4) Conversion into a P2DHMM. Given a character image, it is straightforward to construct a P2DHMM. First, assign a state to every pixel, with the output probability given by the intensity value. Then the states are linked according to the topological constraints of the P2DHMM: vertical sub-state transitions, and then horizontal transitions between super-states (see Fig. 5(a)). Note that all the transition probabilities are one and there are no self-transitions; there is no space-warping in the current model.

(5) State merge. The model resulting from the previous stage is quite big, with a great deal of redundancy. The immediate goal of this stage is to reduce that redundancy. For this we consider the whole operation in two stages: first over vertical frames, and then over individual pixels in each frame. For the first step we define the distance between two frames, say $x_j$ and $x_{j+1}$ of equal length S, as an $\alpha$-norm:

  $D(x_j, x_{j+1}) = \| w_j x_j - (1 - w_j) x_{j+1} \|_{\alpha}$

where $w_j$ is the expected number of frames which have been merged to produce $x_j$. If $\alpha = 2$, this measures the dissimilarity in the least-squares sense. Then we estimate the transition probability similarly, or the whole transition parameter set may be replaced by state duration probabilities.

When two or more successive states are statistically similar in their output probability (grayscale), they can be replaced with a new node with a modified output probability

  $x_{ij} = w_i x_{ij} + w_{i+1} x_{i+1,j}$

where $w_i$ is a weight that depends on the duration (the number of merged states, or repetitions) of state i and satisfies $w_i + w_{i+1} = 1$. The state similarity is measured by the output probabilities of the states. A graphical illustration of the resulting model is shown in Fig. 5(b). The proposed procedure of creating the statistical model is theoretically correct in the maximum likelihood sense. One remaining problem with the method lies in the final stage of merging states. However, it is justified because, although the method of state merging itself is still coarse, the idea of merging is correct in the information-theoretic sense.
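The sketch below (ours, not the authors' implementation) shows one plausible reading of steps (4) and (5): the grayscale template is turned into a column-wise state structure, and neighbouring columns are then merged greedily whenever their distance falls below a threshold. Normalizing the merge weights to sum to one is our assumption, consistent with the constraint $w_i + w_{i+1} = 1$ above.

```python
import numpy as np

def template_to_p2dhmm(template):
    """Step (4), roughly: one super-state per column and one sub-state per
    pixel; each sub-state emits 'black' with probability equal to its
    (normalized) grayscale value, and transitions form a strict
    left-to-right / top-to-bottom chain with no self-transitions."""
    return [col for col in np.asarray(template, dtype=float).T]

def merge_columns(columns, threshold, alpha=2):
    """Step (5), first stage: greedily merge neighbouring columns whose
    weighted alpha-norm distance is below `threshold`, replacing them by a
    duration-weighted average; `weights` records how many original columns
    each merged column represents."""
    merged, weights = [columns[0].copy()], [1.0]
    for col in columns[1:]:
        w = weights[-1]
        dist = np.linalg.norm(w / (w + 1) * merged[-1] - 1 / (w + 1) * col, ord=alpha)
        if dist < threshold:
            merged[-1] = (w * merged[-1] + col) / (w + 1)   # running average
            weights[-1] += 1.0
        else:
            merged.append(col.copy())
            weights.append(1.0)
    return merged, weights
```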

4. Keyword spotting model

For the keyword spotting task, we developed two more classes of P2DHMMs in addition to the key character models; these are then combined into a network model for continuous decoding of input streams.

4.1. System overview

The preceding section focused on the design of keyword models. Once the keyword models are created we can continue to the task of spotting: the search for the existence and the location of keyword occurrences in a document image. The overall system is shown in Fig. 6. In a spotting task we are given two types of input: a document image and a small set of keywords to search for in the image. Both are shown in the leftmost blocks of the figure. The upper row of blocks corresponds to the design of the spotting network, which is what we are mainly interested in in this paper. The spotting network consists of three types of component models: keyword models, filler models, and a white space model. They are described in the following sections.

Fig. 6. System overview.

The second row of blocks corresponds to the ordinary processing of any OCR system: input image, preprocessing, and then target search. Given a document image obtained through an optical scanner, we first project the pixels horizontally to detect text lines. Next we project the text line pixels vertically to extract word blocks. These word block images are the unit of the DP search. Using the spotting network designed in this section, the spotting search determines whether each block is a keyword block or not. In the ensuing sections we describe in detail the method of creating a spotting network.

4.2. Filler model

In keyword spotting tasks, a filler corresponds to anything between the interesting things, i.e., the keywords; it is also called a non-key character. A filler model is then defined as the model for all non-key characters. For convenience, however, it does not discriminate keys from non-keys but models all kinds of character patterns statistically. The desired characteristic of the filler model $\lambda_F$ is that

  $f(x_K \mid \lambda_F) < f(x_K \mid \lambda_K)$ and

  $f(x_F \mid \lambda_F) > f(x_F \mid \lambda_K)$

where $f(\cdot)$ is a probability density function, $\lambda_K$ is a key model for the key pattern $x_K$, and $x_F$ is a non-key pattern. In general, however, character patterns are not completely random and there is a certain degree of similarity between some characters. In addition, it is not easy to design a single good model for the numerous patterns of all characters. According to the work of Lee and Kim (1999), the filler model can behave as a threshold. For better thresholding on Korean Hangul characters, we defined six fillers, one for each of the six character composition types of Fig. 1.

Fig. 7 shows the filler images before conversion to P2DHMMs. They are simple arithmetic averages over a large set of character samples. Unlike the key character models, the filler models are not synthesized in real time; rather, they are prepared once and for all from the image templates. Compared to key model construction, filler model creation is very simple.

There are a number of ways to refine the types of fillers. Each of the six types of characters can be further subdivided according to its shape features. This may especially be the case with the second type, where upward-horizontal vowels (e.g., the left character in Fig. 10) and downward-horizontal vowels (e.g., the right character in Fig. 10) are sometimes considered quite different. There are many more such distinctive features.


Fig. 7. Bitmap images for filler models.

However, we believe such an elaboration is not needed unless there is a severe performance problem. Furthermore, what we are modeling is fillers, not individual characters. It should also be noted that the filler models must not be too strong compared with the key character models.

4.3. White space model

The region excluding the text is white space. Here the white space is limited to the white frames between characters. It is modeled with a small number of nodes; in practice the state merge step reduces the nodes to one or two most of the time.

4.4. Spotting network

For the character spotting task we have designed a network-based transcription model (Fig. 8). It is a circular digraph with a backward link via the space model, so that it can model arbitrarily long sequences of non-key as well as key patterns. Given such a network, an input sequence is aligned to every possible path circulating through the network. One circulation is called a level; an l-level path hypothesis represents a string of l characters (Meyers and
Rabiner, 1981). The result is retrieved from the best hypothesis.

4.5. Search method

The spotting network models a small set of key patterns and is used to locate them while ignoring the remaining words of no interest. One efficient search is the one-stage DP. For continuous spotting with forward scanning, we applied a modified form of two-level one-stage DP; this performs a single forward pass consisting of alternating partial forward search and output. The time requirement for the DP is $O(N^2 S T)$, where N is the total number of states or nodes, and S and T are the frame length and the number of vertical frames in a line (Sakoe, 1979), respectively. In the proposed method, this is reduced to $O(N S T)$.
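As a rough illustration of how the pieces fit together (a deliberately simplified sketch of ours, not the authors' two-level one-stage DP: it uses a fixed uniform frame-to-super-state alignment and a single filler score as the threshold, and all names are assumptions), a word block can be scored against each keyword model and accepted only if it beats the filler score:

```python
import numpy as np

def word_log_score(block, model, frame_scorer):
    """Loose word-level score: assign the T vertical frames of a word block
    uniformly to the model's super-states in order and sum the frame log
    scores (a real system would run the two-level one-stage DP instead).
    Assumes T >= number of super-states."""
    T, n_super = block.shape[1], len(model)
    bounds = np.linspace(0, T, n_super + 1).astype(int)
    score = 0.0
    for k in range(n_super):
        for t in range(bounds[k], bounds[k + 1]):
            # frame_scorer: any per-frame log score of column t under super-state k
            score += frame_scorer(block[:, t], model[k])
    return score

def spot(block, keyword_models, filler_model, frame_scorer):
    """Label a word block with the best-scoring keyword only if it beats the
    filler score, which acts as an adaptive threshold (Lee and Kim, 1999)."""
    filler = word_log_score(block, filler_model, frame_scorer)
    scores = {w: word_log_score(block, m, frame_scorer) for w, m in keyword_models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > filler else None
```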

Fig. 8. Circular network of P2DHMMs for spotting keywords.

5. Experiments

In this section three sets of test results are presented, in order of increasing rigor: first, character spotting with a small set of keywords over ten-point printed documents; then keyword spotting with a whole set of keyword models; and finally a keyword spotting experiment in a practical setting. For the second and third sets of experiments, we prepared hand-segmented letter images from a set of 29,500 characters, and the six filler images from the same set. For the word spotting task we tested about 540 frequently used keywords over 480 journal abstract images.

5.1. Experiment I: Hangul character spotting

One significant characteristic of Korean text is that there are no natural italic fonts. This observation justifies the use of a simple image-based modeling scheme. In the initial experiment a limited test was performed using 10 point (Myongjo font) character images scanned at 200 dpi resolution. The letter models were created from the hand-segmented letter images. In this test we prepared only a single image for each letter and blurred it with a Gaussian filter, to the effect of averaging multiple images. The most frequently used 97 character classes were used in the character (not word) spotting task; this character set constitutes approximately half of the test text.

The test results were analyzed in terms of correct spotting (H), false positives (P) and false negatives (N). The overall spotting performance
was 79.7 percent, as shown in Table 1. In order to better understand the performance and its weaknesses, we break the result down into character-type hits and failures in the same table. The character Types I–VI correspond to the six different arrangements (see Fig. 1) of Hangul vowels and consonants. Here a type hit means that the type of the character is correct regardless of the correctness of the label. According to the table, the hit ratios of Types III and V are relatively low, and for these types false acceptance and rejection are high. We noted this fact in order to refine the models and tune the merge parameters for the next set of experiments.

Fig. 9 shows a sample result containing spotted characters with enclosing boxes and labels. Although not marked in the figure, all the characters are correctly transcribed into either key labels or appropriate filler types. In the test results we found one interesting case of failure that cannot be easily resolved with the current P2DHMM architecture. The case is illustrated in Fig. 10, where the two images are potentially very similar when analyzed on a vertical-frame basis. This shows the effect of the false Markovian assumption, which results in an exponential state duration distribution. The problem can be removed simply by introducing state or model duration parameters at the cost of extra computation, which we avoided.

Table 1
Spotting results for Korean Hangul characters; all figures are in %. H = h/(h+p+n), P = p/(h+p+n), and N = n/(h+p+n), where h = the number of correct hits, p = the number of false positives, and n = the number of false negatives.

           H      P      N      # Classes   Remarks
Overall    79.7   10.9    9.4   97          Character spotting
Type I     90.9    0.0    9.1   20          Type spotting
Type II    91.7    8.3    0.0   22          Type spotting
Type III   81.3   12.5    6.3   17          Type spotting
Type IV    88.9   11.1    0.0   18          Type spotting
Type V     80.0    0.0   20.0   19          Type spotting
Type VI    87.5    0.0   12.5   11          Type spotting

Fig. 9. A snapshot of the character spotting system.


Fig. 10. Two characters which are similar in the context of P2DHMM and vertical frames.

5.2. Experiment II: word spotting


A word is a linear left-to-right concatenation of characters in the Hangul writing system. For the word spotting task we tested a mixture of 14 keyword models on a set of one hundred journal paper abstract images. In this test we fixed the filler models optimized previously, since they need not be created dynamically at run time. For an optimal choice we carried out a series of tests varying the size of the filler models. Contrary to our expectation, the model size proved not to be a determining factor. Hence we chose the filler models as small as possible, all with six super-states and seven sub-states, for training images of 27 × 32.

Fig. 11 shows the performance change obtained by varying the state-merging threshold. In the graph the highest hit rate (H) reaches 66.7% at a state-merging threshold of 0.03. In this case the recall is very high but the precision is sacrificed considerably; the best precision is obtained at threshold 0.01. Compare the performance figures of the proposed method with those of spotting with Baum–Welch-trained P2DHMMs, which are optimal among the set of models tested with different sizes (see Fig. 12). The latter models record 77.3% (H* for the hit rate), 22.6% (P* for the false positives)
and 0.1% (N* for the false negatives), which are separately marked at the right end of the graph. According to these figures, the greatest source of error is false acceptances, which, we suspect, is primarily due to the stronger optimization of the Baum–Welch-trained models compared to the overly general filler models.

Fig. 11. Word spotting result: hit (H), false-positive (P) and false-negative (N) rates in % as a function of the state-merging threshold (0.007–0.04), with the Baum–Welch reference levels H* (77.3%), P* (22.6%) and N* (0.01%) marked at the right end.

Fig. 12. The performance (H*, P*, N*) of Baum–Welch-trained models with increasing size, i.e., the number of super-states × the number of sub-states, from 4×5 up to 14×16.

In general, the Baum–Welch modeling method for the P2DHMM, although more optimal and superior in performance, cannot be used for large-vocabulary keyword spotting tasks, which would require training tens of thousands of P2DHMMs and preparing a huge number of character samples. This implies that the proposed method of dynamic synthesis of key character P2DHMMs has a definite advantage over traditional Baum–Welch modeling. Furthermore, if higher precision is desired, we can pass the spotted word images to a high-performance recognizer for more accurate spotting. This will still be far faster than full recognition of the whole documents.

Fig. 13 gives a sample result, part of a screen shot, showing correct classification and filler type classification. Note the small gaps between fillers; they denote the white spaces between characters.

5.3. Experiment III: keyword set spotting

In the final set of experiments we compared the hit ratio while varying the number of keywords sought at a time. Table 2 summarizes the result. When the number of keywords N = 1, the hit ratio
reached its peak, above the character spotting performance. As N increases, the confusion among words also increases, gradually degrading the performance. The last column corresponds to the highest hit ratio in the preceding experiment. When N is moderately large, however, the word spotting task is more successful than the individual character spotting task, where we used about four character models at a time, i.e., about two Korean Hangul syllable characters in a word.

To the best of the authors' knowledge, there has been no previous research dealing with Korean keyword spotting in document images using HMMs. However, referring to a result reported by Yang and Oh (2001), their method is based on whole-character matching and two-stage wavelet coefficient matching: coarse matching followed by fine matching. Although a fair comparison is not possible per se, let us refer to the single-candidate case. Their system recorded 87.48% recall and 87.25% precision, which can very roughly be compared to our single-keyword result in Table 2. One problem with their system is the requirement of preparing the entire set of character images for each font type. One feature of Yang and Oh's method is its speed, which is possible thanks to the two-stage processing. If a similar first-stage coarse matching were employed in our method, a similar speedup would be expected in our system.

Fig. 13. Sample result (part of a screen shot), showing correct spotting and filler type classification.

Table 2
Character and word spotting performance with an increasing number of keyword models spotted at a time

                    Character spotting   Word spotting (N = the number of keywords)
                    (N = 2)              N = 1     N = 2.5     N = 14
Hit ratio (H) (%)   75.0                 86.3      83.8        66.7


5.4. Discussion

The three sets of experiments show the strengths and weaknesses of the proposed method. The performance of the proposed method is relatively low. This is largely due to the low discriminative power of the P2DHMM as well as the use of raw pixel intensities as features. Note, however, that in the target task the issue is often not greater accuracy but the efficiency or speed of accessing desired documents in a huge document image database. In this case the relevant performance measures are not the simple recognition or hit rate but the recall rate and the precision.

The use of filler models is the very factor that gives spotting systems their speed. We limited the number of filler models to six, the number of Korean character types described in Sections 3.3 and 4.2. There is of course a compromise between time and performance.

We have chosen the HMM framework for our spotting task since it is highly suited to sequential signal analysis, and the P2DHMM is a useful option for sequential processing in 2D text image retrieval tasks. In this context the greatest contribution of the proposed method lies in the real-time synthesis of HMMs, which is impossible with traditional whole word/character-based methods. Unlike conventional methods, we do not need to prepare hundreds or thousands of character samples for each of the 2350 character classes, nor train and store all the character models; the modeling is very easy and efficient. Moreover, the proposed method is not limited to Korean characters. It can be applied to any script system involving a similar synthesis of alphabets or components.

The HMM is well known for its capability of modeling the variability found in many pattern recognition tasks. Since the average grapheme images have been prepared from a large collection of images, with and without noise, the resulting models can accept even noisy images to a certain extent. Moreover, it is not too demanding to assume a normal scanning condition.

One problem with the method is the heuristic threshold parameter used in the state-merging step. For automating and optimizing the process we
need a more sophisticated or theoretically founded method. One such option is the use of a distance metric between the output distributions (Kullback, 1968). This constitutes a future direction of our research; in fact, we believe that kind of solution is the only one in urgent demand. Finally, there remains the problem of the insufficient discrimination power of the HMM. This has been the greatest persistent problem of HMM-based methods, and a more robust form of HMM with increased discrimination power has been an important topic for a number of researchers. We will employ their results in the future for a higher-performance retrieval system.

6. Conclusion

Using a set of letter image templates, we proposed a very effective method for the real-time synthesis of keyword P2DHMMs: given a set of keyword labels, we can construct the corresponding keyword models on the spot. The proposed method utilizes the principle of composing Hangul syllable characters. The composition itself is very efficient, and its conversion to a P2DHMM is highly intuitive considering that we are dealing with machine-printed character images. Based on experimental results from the application to keyword spotting tasks, we consider the proposed method highly feasible and adequately performing under realistic retrieval conditions, thus meeting our ultimate demand for application to content-based document image indexing and retrieval.

Acknowledgements

Beom-Joon Cho is on the faculty of Computer Engineering at Chosun University, Kwangju, Korea. This study was supported by research funds from Chosun University, 2000.

References

Agazzi, O.E., Kuo, S., 1993. Hidden Markov model based optical character recognition in the presence of deterministic transformations. Pattern Recognition 26 (12), 1813–1826.

Agazzi, O.E., Kuo, S., Levin, E., Pieraccini, R., 1993. Connected and degraded text recognition using planar hidden Markov models. Proc. ICASSP, Minneapolis, 27–30.
Anigbogu, J.C., Belaid, A., 1995. Hidden Markov models in text recognition. Internat. J. Pattern Recognition Artificial Intelligence 9 (6), 925–958.
Chellapa, R., Chatterjee, S., 1985. Classification of textures using Gaussian Markov random fields. IEEE Trans. ASSP 33 (4), 959–963.
Chen, F.R., Wilcox, L.D., Bloomberg, D.S., 1993. Detecting and locating partially specified keywords in scanned images using hidden Markov models. Proc. Second Internat. Conf. Document Analysis and Recognition, 133–138.
Kullback, S., 1968. Information Theory and Statistics. Dover Publications, New York.
Kuo, S., Agazzi, O.E., 1993. Keyword spotting in poorly printed texts using pseudo 2D hidden Markov models. Proc. IEEE Conf. CVPR, New York, 15–17.
Lang, K., Waibel, A., Hinton, G., 1990. A time delay neural network architecture for isolated word recognition. Neural Networks 3, 23–44.


Lee, H.K., Kim, J.H., 1999. An HMM-based threshold model approach for gesture recognition. IEEE Trans. PAMI 21 (10), 961–973.
Levin, E., Pieraccini, R., 1992. Dynamic planar warping for optical character recognition. Proc. ICASSP 3, San Francisco, CA, 149–152.
Meyers, C.S., Rabiner, L.R., 1981. A level building dynamic time warping algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-29 (2), 284–297.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257–286.
Sakoe, H., 1979. Two-level DP-matching––a dynamic programming-based pattern matching algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-27 (6), 588–595.
Xu, Y., Nagy, G., 1999. Prototype extraction and adaptive OCR. IEEE Trans. PAMI 21 (12), 1280–1296.
Yang, J.-H., Oh, I.-S., 2001. Fast retrieval of Korean words based on two-pass processing. Korea Information Science Society, Workshop on Computer Vision and Pattern Recognition, pp. 105–106.