Print keyword spotting with dynamically synthesized pseudo 2D HMMs

Pattern Recognition Letters 25 (2004) 999–1011 www.elsevier.com/locate/patrec

Beom-Joon Cho *, Jin H. Kim
Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusung-dong, Yusung-ku, Taejon 305-701, South Korea
Received 20 June 2003; received in revised form 28 January 2004; available online 9 April 2004
* Corresponding author. E-mail address: [email protected] (B.-J. Cho).

Abstract

We propose a new method of dynamically synthesizing Korean character image templates and then converting them into P2DHMMs in real time. This method is more advantageous than whole-character HMMs in memory requirement as well as training difficulty. © 2004 Elsevier B.V. All rights reserved.

Keywords: Pseudo 2D HMM; Character modeling; Keyword spotting; Document retrieval

1. Introduction

In the field of OCR the neural network is a highly successful model for recognizing machine-printed characters. However, one problem with the neural network is that the sequential nature of text running left to right is not well captured without sophisticated network architectures like that of the TDNN (Lang et al., 1990). As a result, most neural network systems with ordinary architectures assume an external segmentation of character blocks prior to recognition. In this case the overall system performance is usually limited by the performance of the segmenter and the quality of the resulting segments. Another problem
with the neural network model is that it is a purely holistic model that cannot be decomposed, analyzed, or synthesized; therefore training thousands of character models is extremely difficult, if not impossible.

Since the early nineties another model has entered the arena of document analysis: the hidden Markov model, or HMM. Stimulated by its success in speech recognition, its capability of modeling variability and sequential structure has been exploited successfully in diverse areas (Rabiner, 1989). The application of the HMM benefits from a wide range of experience accumulated in speech recognition and many other fields. Since document texts run sequentially and, mostly, left to right, it is natural that the idea of using HMMs occurs to researchers. To date, HMM applications to English have been reported in several places in the literature (Anigbogu and
Belaid, 1995; Kuo and Agazzi, 1993; Chen et al., 1993).

Although texts run linearly, individual character patterns are not linear but two-dimensional. This fact has not been a barrier to the modeling of Latin alphabet-based texts, which run strictly left to right at the letter level; there the idea is straightforward. In the case of Korean Hangul characters, however, the problem is not so simple. At the character level or above, texts run linearly. One problem is that there are thousands of characters used in Hangul texts, and we may need a corresponding number of models. Another problem, below the character level, is that a Korean Hangul character is composed of either two or three graphemes arranged two-dimensionally so as to fit into a rectangle. The two-dimensional composition of grapheme models in Hangul is not straightforward, and thus the HMM has not been tried for machine-print character recognition (Yang and Oh, 2001). Without doubt, however, the compositional approach to character modeling is more advantageous than designing thousands of whole-character models with regard to memory requirements as well as training difficulty.

This research is focused on the application of the HMM method to the analysis of document text images. The basic idea lies in the real-time generation of Korean Hangul character models for spotting keywords in the content analysis of optical documents. In the proposed method, individual character models are synthesized in real time using trained grapheme image templates. Since characters are two-dimensional, it is natural to believe that the 2D HMM, an extension of the standard HMM, will be helpful and offer great potential for analyzing and recognizing character patterns. However, a fully connected 2D HMM leads to an algorithm of exponential complexity (Levin and Pieraccini, 1992). To avoid this problem, the connectivity of the network has been reduced in several ways, two among which are the Markov random field and its variants (Chellapa and Chatterjee, 1985) and the pseudo 2D HMM (Agazzi and Kuo, 1993; Agazzi et al., 1993; Kuo and Agazzi, 1993). The latter model, called the P2DHMM, is a very simple and efficient 2D model that retains all of the useful HMM features. This
paper focuses on the real-time construction of Hangul character P2DHMMs using trained grapheme image templates. We believe the proposed method is feasible and particularly appropriate thanks to the absence in Korean of natural italic fonts corresponding to English italics, which is a rationale for using the P2DHMM.

In the proposed method, we prepare a set of grapheme image samples for each grapheme and obtain their average, a grapheme template. Then, by superposing appropriate grapheme templates, we can compose a character image template. Finally, this character template is converted into a P2DHMM in a systematic way. In this method, the new idea of location-preserving 2D superposition is very simple yet elegant and efficient for real-time processing. The idea of character composition is not new, but its application to strictly 2D model design is; this is especially true within the 2D HMM framework. Another feature of the proposed method is the conversion of the grayscale template into a P2DHMM, which is theoretically correct in the sense of maximum likelihood estimation. An additional noteworthy feature is model size reduction exploiting the information redundancy in the templates: successive HMM states are merged based on the similarity between their output probability distributions (PDs). The resulting models are often much smaller than the original and thus speed up the spotting task, and sometimes even improve the performance.

The rest of the paper is organized as follows. In Section 2 we briefly review the HMM. In Section 3 the pseudo 2D HMM and its algorithm are described, and a procedure for developing character models is discussed in detail. Section 4 describes the auxiliary models needed for the proposed method of key character spotting. Section 5 presents results from preliminary experiments. Section 6 concludes the paper.

2. Hidden Markov model

The hidden Markov model is a doubly stochastic process that can be described by three sets of probabilistic parameters as $\lambda = (A, B, \pi)$. Given a set of N states and a set V of observable symbols,
the parameters are formally defined by Rabiner (1989):

• Transition probability distribution:
  $A = \{a_{ij} = p(q_t = j \mid q_{t-1} = i),\ 1 \le i, j \le N\}$, where $\sum_j a_{ij} = 1$.

• Output probability distribution:
  $B = \{b_i(v) = p(x_t = v \mid q_t = i),\ 1 \le i \le N,\ v \in V\}$, where $\sum_v b_i(v) = 1$.

• Initial transition probability distribution:
  $\pi = \{\pi_i = p(q_1 = i),\ 1 \le i \le N\}$, where $\sum_i \pi_i = 1$.

The most frequent task with an HMM is the evaluation of how well the model generates an input sequence $X = x_1, x_2, \ldots, x_T$. It is given by the following matching score, a likelihood function of the sequence observed from the model:

  $P(X \mid \lambda) = \sum_{Q} \pi_{q_1} b_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(x_t)$   (1)

Although simple in form, the time requirement is exponential. Thanks to the use of the DP technique, this can be computed in time linear in T. However, when it comes to the 2D HMM formulation, even the DP technique alone is not enough. One research direction is the structural simplification of the model, and the pseudo 2D HMM is one solution.

3. Pseudo 2D HMM construction

3.1. Description

The pseudo 2D HMM in this paper is realized as a horizontal connection of vertical sub-HMMs ($\lambda_k$). However, this is not the only realization: the alternative is a vertical connection of horizontal sub-HMMs, as in the work of Xu and Nagy (1999). In order to implement a continuous forward search method and sequential composition of word models, the former type has been used in this research.

There are three kinds of parameters in the P2DHMM. However, since the graphical configuration is two-dimensional, we further divide the Markov transition parameters into super-state transition and sub-state transition probabilities, denoted respectively as

  $\tilde{a}_{kl} = p(r_{t+1} = l \mid r_t = k), \quad 1 \le k, l \le N$

and

  $a_{ij} = p(q_{s+1} = j \mid q_s = i), \quad 1 \le i, j \le M$

where $r_t$ denotes a super-state, which corresponds to a sub-HMM $\lambda_k$, and $q_s$ denotes a sub-state observing the s-th pixel. The model has N super-states, and each sub-HMM $\lambda_k$ is defined as a standard HMM consisting of M states. The sub-states of a sub-HMM have an output distribution function, whereas the super-states of the super-HMM do not: the observation of a super-state is none other than the frame observed by its sub-states, so there is no separate output PD. This is formulated explicitly in Eq. (2) in the following subsection.

3.2. Evaluation algorithm

Let us consider the t-th vertical frame $X_t = x_{1t}, x_{2t}, \ldots, x_{St}$, $1 \le t \le T$, in a text line image. This is a one-dimensional sequence like X in Eq. (1). It is modeled by a sub-HMM $\lambda_k$ with the likelihood $P(X_t \mid \lambda_k)$; each sub-HMM $\lambda_k$ may be regarded as a super-state whose observation is a vertical frame of pixels:

  $P_{r_t}(X_t \mid \lambda_{r_t}) = \sum_{Q} \pi_{q_1} b_{q_1}(x_{1t}) \prod_{s=2}^{S} a_{q_{s-1} q_s} b_{q_s}(x_{st})$   (2)
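As a concrete illustration of Eq. (2) (and of the Viterbi approximation used later in the paper), the following minimal sketch is our own, not the authors' implementation: it scores one binary vertical pixel frame against a sub-HMM in log space, replacing the sum over state sequences with the best-path (Viterbi) maximum. The array names and the Bernoulli black/white output model are assumptions.

```python
import numpy as np

def frame_log_score(col, log_pi, log_a, log_b):
    """Best-path (Viterbi) approximation of log P(X_t | lambda_k) for one
    vertical frame, cf. Eq. (2).

    col    : length-S sequence of pixel values (0 = white, 1 = black)
    log_pi : (M,) log initial sub-state probabilities
    log_a  : (M, M) log sub-state transition matrix
    log_b  : (M, 2) log output probabilities of white/black per sub-state
    """
    delta = log_pi + log_b[:, col[0]]
    for x in col[1:]:
        # best predecessor for every sub-state, then emit the next pixel
        delta = (delta[:, None] + log_a).max(axis=0) + log_b[:, x]
    return delta[-1]  # constrained to end in the last (bottom) sub-state
```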

Now let us consider a bitmap image, which we define as a sequence of such vertical frames, $X = X_1, X_2, \ldots, X_T$. Each frame is modeled by a super-state, i.e., a sub-HMM. Let $\Lambda$ be a sequential concatenation of sub-HMMs. Then the evaluation of $\Lambda$ given the sample image X is

  $P(X \mid \Lambda) = \sum_{R} P_1(X_1) \prod_{t=2}^{T} \tilde{a}_{r_{t-1} r_t} P_{r_t}(X_t)$   (3)

where it is assumed that the super-state process starts only from the first state. The $P_{r_t}$ function is the super-state likelihood. Note that both Eqs. (2) and (3) can be effectively approximated by the Viterbi score.

One immediate goal of the Viterbi search is the calculation of the matching likelihood score between X and an HMM. The objective function for an HMM $\lambda_k$ is defined by the maximum likelihood as

  $D(X_t, \lambda_k) = \max_{Q} \prod_{s=1}^{S} a_{q_{s-1} q_s} b_{q_s}(x_{st})$   (4)

where $Q = q_1, q_2, \ldots, q_S$ is a sequence of states of $\lambda_k$, and $a_{q_0 q_1} = \pi_{q_1}$. $D(X_t, \lambda_k)$ is the similarity score between two sequences of different length. The basic idea behind the efficiency of the DP computation lies in formulating the expression in a recursive form:

  $\delta_s^k(j) = \max_i \delta_{s-1}^k(i)\, a_{ij}^k\, b_j^k(x_{st}), \qquad j = 1, \ldots, M_k,\; s = 1, \ldots, S,\; k = 1, \ldots, K$   (5)

where $\delta_s^k(j)$ denotes the probability of observing the partial sequence $x_{1t}, \ldots, x_{st}$ in model k along the best state sequence reaching state j at time/step s. Note that

  $D(X_t, \lambda_k) = \delta_S^k(N_k)$   (6)

where $N_k$ is the final state of the state sequence. The above recursion constitutes the DP in the lower-level structure of the P2DHMM. The remaining DP in the upper level of the network is similarly defined by

  $D(X, \Lambda) = \max_{R} \prod_{t=1}^{T} \tilde{a}_{r_{t-1} r_t} D(X_t, \lambda_{r_t})$   (7)

which can likewise be reformulated in a recursive form. Here $\tilde{a}_{r_1 r_2}$ denotes the probability of transition from super-state $r_1$ to $r_2$. According to the formulation described thus far, a P2DHMM adds only one parameter set, namely the super-state transitions, to the conventional HMM parameter sets. Therefore it is a simple extension of the conventional HMM.
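For illustration only, the upper-level DP implied by Eq. (7) can be written as a standard Viterbi pass over super-states. The sketch below is ours, not the authors' code; the log-domain formulation, the array names, and the assumption that the path starts in the first super-state and ends in the last one are ours.

```python
import numpy as np

def upper_level_viterbi(frame_scores, log_super_trans):
    """Upper-level DP of Eq. (7): align T vertical frames to the N super-states
    of one character P2DHMM and return the best log score.

    frame_scores    : (T, N) array, entry (t, k) = log D(X_t, lambda_k), i.e.
                      the sub-HMM Viterbi score of frame t under super-state k
    log_super_trans : (N, N) log super-state transition matrix, log a~_{kl}
    """
    T, N = frame_scores.shape
    score = np.full(N, -np.inf)
    score[0] = frame_scores[0, 0]          # start in the first super-state
    for t in range(1, T):
        # best predecessor super-state for each super-state, then add frame score
        score = (score[:, None] + log_super_trans).max(axis=0) + frame_scores[t]
    return score[N - 1]                    # end in the last super-state
```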

3.3. Design of character models

One of the most important tasks in hidden Markov modeling is estimating the probabilistic parameters. For this task we assume a set of typical samples of character images $X = \{X^{(1)}, \ldots, X^{(D)}\}$ of equal dimensions. Different sizes raise no problem if we scale the images bilinearly. Moreover, the scale difference in test images is naturally resolved by the HMM method.

The focus of this section lies in the construction of the P2DHMM for a Korean Hangul character. A Hangul character consists of either two or three graphemes (phonetic consonant and vowel letters). The composition follows a general rule to fit the graphemes into a rectangle. There are six types of combination (Fig. 1) according to the shape of the vowel (horizontal, vertical, or both) and the presence of a consonant suffix.

Fig. 1. Six ways of composing Korean syllable characters by arranging graphemes inside a character block. A grapheme changes its shape accordingly. From left to right: Types I, II, III, IV, V and VI.

The proposed method of model creation is based on the given set of bitmap images. The overall procedure is shown in Fig. 2 and explained as follows:

Fig. 2. Model design procedure; steps (1) and (2) are carried out off line, whereas (3)–(5) are done at run time.


(1) Grapheme segmentation. This step involves extracting the individual graphemes from character samples while retaining their location inside the box enclosing the character. As illustrated in Fig. 3, the graphemes are separated while retaining their position within the box. In its simplest form this step is the most costly one in the proposed method. However, the problem can be avoided by using a bootstrapping strategy or a somewhat more sophisticated prototyping idea (Xu and Nagy, 1999).

Fig. 3. Korean graphemes separated out from a syllabic character for /han/. From left to right: the initial consonant, the vowel, and the suffix consonant. Note that the grapheme position in the original character block is retained.

(2) Average the extracted samples. There is now a set of grapheme samples. First we classify the samples according to the type of the grapheme arrangement pattern of the original character. For the initial consonant grapheme there are six types (see Fig. 1), and there are two for each vowel grapheme. Then take the sample average of each set of categorized images pixel by pixel, so that a smooth grayscale-like image is obtained (see Fig. 3). Assuming binary samples, the average intensity of the pixel at (i, j) is

  $\bar{x}_{ij} = N_{ij} / N$

where $N_{ij}$ is the number of samples whose (i, j) pixel is black (or white) and N is the total number of samples. Essentially, the training phase is finished at this stage.

(3) Character image construction. From this step on, the process belongs to the decoding or recognition phase, which is performed in real time. Here the given task is to spot or recognize a character. For this task the image template for the key character is synthesized in the image domain from the component grapheme images generated in the previous step (see Fig. 4). The value of the (i, j)-th pixel of the character template is the maximum of the two or three pixels at (i, j) from the grapheme planes. Fig. 4 shows a sample composition result with the composition boundaries emphasized (a small code sketch of steps (2) and (3) is given after the figure).

Fig. 4. (a) Character image construction and (b) the result with composition boundaries emphasized.
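To make steps (2) and (3) concrete, the sketch below is our own illustration, not the authors' code: it averages location-preserving binary grapheme samples into grayscale templates and composes a character template by pixel-wise maximum. The function names, the NumPy arrays and the 0/1 pixel convention are assumptions.

```python
import numpy as np

def grapheme_template(samples):
    """Step (2): pixel-wise average of location-preserving binary grapheme
    samples (each an HxW array with 1 = black), giving a grayscale template."""
    return np.mean(np.stack(samples, axis=0), axis=0)

def compose_character(*grapheme_planes):
    """Step (3): location-preserving superposition of two or three grapheme
    template planes; each pixel of the character template takes the maximum
    value over the planes."""
    return np.maximum.reduce([np.asarray(p, dtype=float) for p in grapheme_planes])

# usage sketch: consonant, vowel and (optionally) suffix are HxW grapheme
# templates of the same character-box size, chosen according to the
# character's composition type (Fig. 1):
#   char_template = compose_character(consonant, vowel, suffix)
```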


Fig. 5. P2DHMMs after (a) the conversion stage and (b) the state-reduction stage.

(4) Conversion into a P2DHMM. Given a character image, it is straightforward to construct a P2DHMM. First, assign a state to every pixel, with the output probability given by the intensity value. Then the states are linked according to the topological constraints of the P2DHMM: vertical sub-state transitions, and then horizontal transitions between super-states (see Fig. 5(a)). Note that all the transition probabilities are one and there are no self-transitions; there is no space-warping in the current model.

(5) State merge. The model resulting from the previous stage is quite big, with a great deal of redundancy. The immediate goal of this stage is to reduce that redundancy. For this we consider the whole operation in two stages: first over vertical frames, and then over individual pixels in each frame. For the first step we define the distance between two frames, say $x_j$ and $x_{j+1}$ of equal length S, as an $\alpha$-norm:

  $D(x_j, x_{j+1}) = \| w_j x_j - (1 - w_j) x_{j+1} \|_{\alpha}$

where $w_j$ is the expected number of frames which have been merged to produce $x_j$. If $\alpha = 2$, this measures the dissimilarity in the least-squares sense. Then we estimate the transition probability similarly, or the whole transition parameter set may be replaced by state duration probabilities.

When two or more successive states are statistically similar in their output probability (grayscale), they can be replaced with a new node with a modified output probability

  $x_{ij} = w_i x_{ij} + w_{i+1} x_{i+1,j}$

where $w_i$ is a weight that depends on the duration (the number of merged states, or repetitions) of state i and satisfies $w_i + w_{i+1} = 1$. The state similarity is measured by the output probabilities of the states. A graphical illustration of the resulting model is shown in Fig. 5(b). The proposed procedure of creating the statistical model is theoretically correct in the maximum likelihood sense. One remaining problem with the method lies in the final stage of merging states. However, it is justified because, although the method of state merging itself is still coarse, the idea of merging is correct in the information-theoretic sense.
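The sketch below (ours, not the authors' implementation) shows one plausible reading of steps (4) and (5): the grayscale template is turned into a column-wise state structure, and neighbouring columns are then merged greedily whenever their distance falls below a threshold. Normalizing the merge weights to sum to one is our assumption, consistent with the constraint $w_i + w_{i+1} = 1$ above.

```python
import numpy as np

def template_to_p2dhmm(template):
    """Step (4), roughly: one super-state per column and one sub-state per
    pixel; each sub-state emits 'black' with probability equal to its
    (normalized) grayscale value, and transitions form a strict
    left-to-right / top-to-bottom chain with no self-transitions."""
    return [col for col in np.asarray(template, dtype=float).T]

def merge_columns(columns, threshold, alpha=2):
    """Step (5), first stage: greedily merge neighbouring columns whose
    weighted alpha-norm distance is below `threshold`, replacing them by a
    duration-weighted average; `weights` records how many original columns
    each merged column represents."""
    merged, weights = [columns[0].copy()], [1.0]
    for col in columns[1:]:
        w = weights[-1]
        dist = np.linalg.norm(w / (w + 1) * merged[-1] - 1 / (w + 1) * col, ord=alpha)
        if dist < threshold:
            merged[-1] = (w * merged[-1] + col) / (w + 1)   # running average
            weights[-1] += 1.0
        else:
            merged.append(col.copy())
            weights.append(1.0)
    return merged, weights
```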

4. Keyword spotting model

For the keyword spotting task, we developed two more classes of P2DHMMs in addition to the key character models; these are then combined into a network model for continuous decoding of input streams.

4.1. System overview

The preceding section focused on the design of keyword models. Once the keyword models are created we can continue to the task of spotting: the search for the existence and the location of keyword occurrences in a document image. The overall system is shown in Fig. 6. In a spotting task we are given two types of input: a document image and a small set of keywords to search for in the image. Both are shown in the leftmost blocks of the figure. The upper row of blocks corresponds to the design of the spotting network, which is what we are mainly interested in in this paper. The spotting network consists of three types of component models: keyword models, filler models, and a white space model. They are described in the following sections.

Fig. 6. System overview.

The second row of blocks corresponds to the ordinary processing of any OCR system: input image, preprocessing, and then target search. Given a document image obtained through an optical scanner, we first project the pixels horizontally to detect text lines. Next we project the text line pixels vertically to extract word blocks. These word block images are the unit of the DP search. Using the spotting network designed in this section, the spotting search determines whether each block is a keyword block or not. In the ensuing sections we describe in detail the method of creating a spotting network.

4.2. Filler model

In keyword spotting tasks, a filler corresponds to anything between the interesting things, i.e., the keywords; it is also called a non-key character. A filler model is then defined as the model for all non-key characters. For convenience, however, it does not discriminate keys from non-keys but models all kinds of character patterns statistically. The desired characteristic of the filler model $\lambda_F$ is that

  $f(x_K \mid \lambda_F) < f(x_K \mid \lambda_K)$ and

  $f(x_F \mid \lambda_F) > f(x_F \mid \lambda_K)$

where $f(\cdot)$ is a probability density function, $\lambda_K$ is a key model for the key pattern $x_K$, and $x_F$ is a non-key pattern. In general, however, character patterns are not completely random and there is a certain degree of similarity between some characters. In addition, it is not easy to design a single good model for the numerous patterns of all characters. According to the work of Lee and Kim (1999), the filler model can behave as a threshold. For better thresholding on Korean Hangul characters, we defined six fillers, one for each of the six character composition types of Fig. 1.

Fig. 7 shows the filler images before conversion to P2DHMMs. They are simple arithmetic averages over a large set of character samples. Unlike the key character models, the filler models are not synthesized in real time; rather, they are prepared once and for all from the image templates. Compared to key model construction, filler model creation is very simple.

There are a number of ways to refine the types of fillers. Each of the six types of characters can be further subdivided according to its shape features. This may especially be the case with the second type, where upward-horizontal vowels (e.g., the left character in Fig. 10) and downward-horizontal vowels (e.g., the right character in Fig. 10) are sometimes considered quite different. There are many more such distinctive features.


Fig. 7. Bitmap images for filler models.

However, we believe such an elaboration is not needed unless there is a severe performance problem. Furthermore, what we are modeling is fillers, not individual characters. It should also be noted that the filler models must not be too strong compared with the key character models.

4.3. White space model

The region excluding the text is white space. Here the white space is limited to the white frames between characters. It is modeled with a small number of nodes; in practice the state merge step reduces the nodes to one or two most of the time.

4.4. Spotting network

For the character spotting task we have designed a network-based transcription model (Fig. 8). It is a circular digraph with a backward link via the space model, so that it can model arbitrarily long sequences of non-key as well as key patterns. Given such a network, an input sequence is aligned to every possible path circulating through the network. One circulation is called a level; an l-level path hypothesis represents a string of l characters (Meyers and
Rabiner, 1981). The result is retrieved from the best hypothesis.

4.5. Search method

The spotting network models a small set of key patterns and is used to locate them while ignoring the remaining words of no interest. One efficient search is the one-stage DP. For continuous spotting with forward scanning, we applied a modified form of two-level one-stage DP; this performs a single forward pass consisting of alternating partial forward search and output. The time requirement for the DP is $O(N^2 S T)$, where N is the total number of states or nodes, and S and T are the frame length and the number of vertical frames in a line (Sakoe, 1979), respectively. In the proposed method, this is reduced to $O(N S T)$.
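As a rough illustration of how the pieces fit together (a deliberately simplified sketch of ours, not the authors' two-level one-stage DP: it uses a fixed uniform frame-to-super-state alignment and a single filler score as the threshold, and all names are assumptions), a word block can be scored against each keyword model and accepted only if it beats the filler score:

```python
import numpy as np

def word_log_score(block, model, frame_scorer):
    """Loose word-level score: assign the T vertical frames of a word block
    uniformly to the model's super-states in order and sum the frame log
    scores (a real system would run the two-level one-stage DP instead).
    Assumes T >= number of super-states."""
    T, n_super = block.shape[1], len(model)
    bounds = np.linspace(0, T, n_super + 1).astype(int)
    score = 0.0
    for k in range(n_super):
        for t in range(bounds[k], bounds[k + 1]):
            # frame_scorer: any per-frame log score of column t under super-state k
            score += frame_scorer(block[:, t], model[k])
    return score

def spot(block, keyword_models, filler_model, frame_scorer):
    """Label a word block with the best-scoring keyword only if it beats the
    filler score, which acts as an adaptive threshold (Lee and Kim, 1999)."""
    filler = word_log_score(block, filler_model, frame_scorer)
    scores = {w: word_log_score(block, m, frame_scorer) for w, m in keyword_models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > filler else None
```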

Fig. 8. Circular network of P2DHMMs for spotting keywords.

5. Experiments

In this section three sets of test results are presented, in order of increasing rigor: first, character spotting with a small set of keywords over ten-point printed documents; then keyword spotting with a whole set of keyword models; and finally a keyword spotting experiment in a practical setting. For the second and third sets of experiments, we prepared hand-segmented letter images from a set of 29,500 characters, and the six filler images from the same set. For the word spotting task we tested about 540 frequently used keywords over 480 journal abstract images.

5.1. Experiment I: Hangul character spotting

One significant characteristic of Korean text is that there are no natural italic fonts. This observation justifies the use of a simple image-based modeling scheme. In the initial experiment a limited test was performed using 10 point (Myongjo font) character images scanned at 200 dpi resolution. The letter models were created from the hand-segmented letter images. In this test we prepared only a single image for each letter and blurred it with a Gaussian filter, to the effect of averaging multiple images. The most frequently used 97 character classes were used in the character (not word) spotting task; this character set constitutes approximately half of the test text.

The test results were analyzed in terms of correct spotting (H), false positives (P) and false negatives (N). The overall spotting performance
was 79.7 percent, as shown in Table 1. In order to better understand the performance and its weaknesses, we break the result down into character-type hits and failures in the same table. The character Types I–VI correspond to the six different arrangements (see Fig. 1) of Hangul vowels and consonants. Here a type hit means that the type of the character is correct regardless of the correctness of the label. According to the table, the hit ratios of Types III and V are relatively low, and for these types false acceptance and rejection are high. We noted this fact in order to refine the models and tune the merge parameters for the next set of experiments.

Fig. 9 shows a sample result containing spotted characters with enclosing boxes and labels. Although not marked in the figure, all the characters are correctly transcribed into either key labels or appropriate filler types. In the test results we found one interesting case of failure that cannot be easily resolved with the current P2DHMM architecture. The case is illustrated in Fig. 10, where the two images are potentially very similar when analyzed on a vertical-frame basis. This shows the effect of the false Markovian assumption, which results in an exponential state duration distribution. The problem can be removed simply by introducing state or model duration parameters at the cost of extra computation, which we avoided.

Table 1
Spotting results for Korean Hangul characters; all figures are in %. H = h/(h+p+n), P = p/(h+p+n), and N = n/(h+p+n), where h = the number of correct hits, p = the number of false positives, and n = the number of false negatives.

           H      P      N      # Classes   Remarks
Overall    79.7   10.9    9.4   97          Character spotting
Type I     90.9    0.0    9.1   20          Type spotting
Type II    91.7    8.3    0.0   22          Type spotting
Type III   81.3   12.5    6.3   17          Type spotting
Type IV    88.9   11.1    0.0   18          Type spotting
Type V     80.0    0.0   20.0   19          Type spotting
Type VI    87.5    0.0   12.5   11          Type spotting

Fig. 9. A snapshot of the character spotting system.


Fig. 10. Two characters which are similar in the context of P2DHMM and vertical frames.

5.2. Experiment II: word spotting


A word is a linear left-to-right concatenation of characters in the Hangul writing system. For the word spotting task we tested a mixture of 14 keyword models on a set of one hundred journal paper abstract images. In this test we fixed the filler models optimized previously, since they need not be created dynamically at run time. For an optimal choice we carried out a series of tests varying the size of the filler models. Contrary to our expectation, the model size proved not to be a determining factor. Hence we chose the filler models as small as possible, all with six super-states and seven sub-states, for training images of 27 × 32.

Fig. 11 shows the performance change obtained by varying the state-merging threshold. In the graph the highest hit rate (H) reaches 66.7% at a state-merging threshold of 0.03. In this case the recall is very high but the precision is sacrificed considerably; the best precision is obtained at threshold 0.01. Compare the performance figures of the proposed method with those of spotting with Baum–Welch-trained P2DHMMs, which are optimal among the set of models tested with different sizes (see Fig. 12). The latter models record 77.3% (H* for the hit rate), 22.6% (P* for the false positives)
and 0.1% (N* for the false negatives), which are separately marked at the right end of the graph. According to these figures, the greatest source of error is false acceptances, which, we suspect, is primarily due to the stronger optimization of the Baum–Welch-trained models compared to the overly general filler models.

Fig. 11. Word spotting result: hit (H), false-positive (P) and false-negative (N) rates in % as a function of the state-merging threshold (0.007–0.04), with the Baum–Welch reference levels H* (77.3%), P* (22.6%) and N* (0.01%) marked at the right end.

Fig. 12. The performance (H*, P*, N*) of Baum–Welch-trained models with increasing size, i.e., the number of super-states × the number of sub-states, from 4×5 up to 14×16.

In general, the Baum–Welch modeling method for the P2DHMM, although more optimal and superior in performance, cannot be used for large-vocabulary keyword spotting tasks, which would require training tens of thousands of P2DHMMs and preparing a huge number of character samples. This implies that the proposed method of dynamic synthesis of key character P2DHMMs has a definite advantage over traditional Baum–Welch modeling. Furthermore, if higher precision is desired, we can pass the spotted word images to a high-performance recognizer for more accurate spotting. This will still be far faster than full recognition of the whole documents.

Fig. 13 gives a sample result, part of a screen shot, showing correct classification and filler type classification. Note the small gaps between fillers; they denote the white spaces between characters.

5.3. Experiment III: keyword set spotting

In the final set of experiments we compared the hit ratio while varying the number of keywords sought at a time. Table 2 summarizes the result. When the number of keywords N = 1, the hit ratio
reached its peak, above the character spotting performance. As N increases, the confusion among words also increases, gradually degrading the performance. The last column corresponds to the highest hit ratio in the preceding experiment. When N is moderately large, however, the word spotting task is more successful than the individual character spotting task, where we used about four character models at a time, i.e., about two Korean Hangul syllable characters in a word.

To the best of the authors' knowledge, there has been no previous research dealing with Korean keyword spotting in document images using HMMs. However, referring to a result reported by Yang and Oh (2001), their method is based on whole-character matching and two-stage wavelet coefficient matching: coarse matching followed by fine matching. Although a fair comparison is not possible per se, let us refer to the single-candidate case. Their system recorded 87.48% recall and 87.25% precision, which can very roughly be compared to our single-keyword result in Table 2. One problem with their system is the requirement of preparing the entire set of character images for each font type. One feature of Yang and Oh's method is its speed, which is possible thanks to the two-stage processing. If a similar first-stage coarse matching were employed in our method, a similar speedup would be expected in our system.

Fig. 13. Sample result (part of a screen shot), showing correct spotting and filler type classification.

Table 2
Character and word spotting performance with an increasing number of keyword models spotted at a time

                    Character spotting   Word spotting (N = the number of keywords)
                    (N = 2)              N = 1     N = 2.5     N = 14
Hit ratio (H) (%)   75.0                 86.3      83.8        66.7


5.4. Discussion

The three sets of experiments show the strengths and weaknesses of the proposed method. The performance of the proposed method is relatively low. This is largely due to the low discriminative power of the P2DHMM as well as the use of raw pixel intensities as features. Note, however, that in the target task the issue is often not greater accuracy but the efficiency or speed of accessing desired documents in a huge document image database. In this case the relevant performance measures are not the simple recognition or hit rate but the recall rate and the precision.

The use of filler models is the very factor that gives spotting systems their speed. We limited the number of filler models to six, the number of Korean character types described in Sections 3.3 and 4.2. There is of course a compromise between time and performance.

We have chosen the HMM framework for our spotting task since it is highly suited to sequential signal analysis, and the P2DHMM is a useful option for sequential processing in 2D text image retrieval tasks. In this context the greatest contribution of the proposed method lies in the real-time synthesis of HMMs, which is impossible with traditional whole word/character-based methods. Unlike conventional methods, we do not need to prepare hundreds or thousands of character samples for each of the 2350 character classes, nor train and store all the character models; the modeling is very easy and efficient. Moreover, the proposed method is not limited to Korean characters. It can be applied to any script system involving a similar synthesis of alphabets or components.

The HMM is well known for its capability of modeling the variability found in many pattern recognition tasks. Since the average grapheme images have been prepared from a large collection of images, with and without noise, the resulting models can accept even noisy images to a certain extent. Moreover, it is not too demanding to assume a normal scanning condition.

One problem with the method is the heuristic threshold parameter used in the state-merging step. For automating and optimizing the process we
need a more sophisticated or theoretically founded method. One such option is the use of a distance metric between the output distributions (Kullback, 1968). This constitutes a future direction of our research; in fact, we believe that kind of solution is the only one in urgent demand. Finally, there remains the problem of the insufficient discrimination power of the HMM. This has been the greatest persistent problem of HMM-based methods, and a more robust form of HMM with increased discrimination power has been an important topic for a number of researchers. We will employ their results in the future for a higher-performance retrieval system.

6. Conclusion

Using a set of letter image templates, we proposed a very effective method for the real-time synthesis of keyword P2DHMMs: given a set of keyword labels, we can construct the corresponding keyword models on the spot. The proposed method utilizes the principle of composing Hangul syllable characters. The composition itself is very efficient, and its conversion to a P2DHMM is highly intuitive considering that we are dealing with machine-printed character images. Based on experimental results from the application to keyword spotting tasks, we consider the proposed method highly feasible and adequately performing under realistic retrieval conditions, thus meeting our ultimate demand for application to content-based document image indexing and retrieval.

Acknowledgements

Beom-Joon Cho is on the faculty of Computer Engineering at Chosun University, Kwangju, Korea. This study was supported by research funds from Chosun University, 2000.

References

Agazzi, O.E., Kuo, S., 1993. Hidden Markov model based optical character recognition in the presence of deterministic transformations. Pattern Recognition 26 (12), 1813–1826.

Agazzi, O.E., Kuo, S., Levin, E., Pieraccini, R., 1993. Connected and degraded text recognition using planar hidden Markov models. Proc. ICASSP, Minneapolis, 27–30.
Anigbogu, J.C., Belaid, A., 1995. Hidden Markov models in text recognition. Internat. J. Pattern Recognition Artificial Intelligence 9 (6), 925–958.
Chellapa, R., Chatterjee, S., 1985. Classification of textures using Gaussian Markov random fields. IEEE Trans. ASSP 33 (4), 959–963.
Chen, F.R., Wilcox, L.D., Bloomberg, D.S., 1993. Detecting and locating partially specified keywords in scanned images using hidden Markov models. Proc. Second Internat. Conf. Document Analysis and Recognition, 133–138.
Kullback, S., 1968. Information Theory and Statistics. Dover Publications, New York.
Kuo, S., Agazzi, O.E., 1993. Keyword spotting in poorly printed texts using pseudo 2D hidden Markov models. Proc. IEEE Conf. CVPR, New York, 15–17.
Lang, K., Waibel, A., Hinton, G., 1990. A time delay neural network architecture for isolated word recognition. Neural Networks 3, 23–44.


Lee, H.K., Kim, J.H., 1999. An HMM-based threshold model approach for gesture recognition. IEEE Trans. PAMI 21 (10), 961–973.
Levin, E., Pieraccini, R., 1992. Dynamic planar warping for optical character recognition. Proc. ICASSP 3, San Francisco, CA, 149–152.
Meyers, C.S., Rabiner, L.R., 1981. A level building dynamic time warping algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-29 (2), 284–297.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257–286.
Sakoe, H., 1979. Two-level DP-matching––a dynamic programming-based pattern matching algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-27 (6), 588–595.
Xu, Y., Nagy, G., 1999. Prototype extraction and adaptive OCR. IEEE Trans. PAMI 21 (12), 1280–1296.
Yang, J.-H., Oh, I.-S., 2001. Fast retrieval of Korean words based on two-pass processing. Korea Information Science Society, Workshop on Computer Vision and Pattern Recognition, pp. 105–106.