Pattern Recognition 47 (2014) 1039–1050
Statistical script independent word spotting in offline handwritten documents

Safwan Wshah, Gaurav Kumar, Venu Govindaraju
Department of Computer Science and Engineering, University at Buffalo, 113 Davis Hall, Amherst, NY 14260-2500, United States
Article info
Available online 10 October 2013

Abstract
We propose a statistical, script-independent, line-based word spotting framework for offline handwritten documents based on hidden Markov models. We present an exhaustive study and comparison of filler models and background models for better representation of background (non-keyword) text. The candidate keywords are pruned in a two-stage spotting framework using character based and lexicon based background models. The system deals with a large vocabulary without the need for word or character segmentation. The script-independent word spotting system is evaluated on mixed corpora of public datasets from several scripts: IAM for English, AMA for Arabic and LAW for Devanagari.
Keywords: script independent; keyword spotting; hidden Markov models
1. Introduction

The recognition of unconstrained offline handwritten documents has been a major area of research during the last decades. Due to the large variability in writing styles and the huge vocabulary, the problem is still far from being completely solved [22,34]. As a result, word spotting has been proposed as an alternative to full transcription for retrieving keywords from document images [27]. The inputs to a word spotting system are a document set or database and an element denoted as the query; the output is a set of images or sub-images from the database that are relevant to the query, making it similar to a classical information retrieval system. Word spotting finds application in many areas such as information retrieval and indexing of handwritten documents that are made available for searching and browsing [9]. An extensive number of multilingual handwritten documents and forms are sent every day to companies for processing [6]; an efficient retrieval system for these documents saves these companies time and money. As another example, many libraries around the world have recently digitized their valuable handwritten books, transcribed in many scripts and ranging from ancient to modern times, and word spotting can make these books searchable. The trend in word spotting systems is to propose methods that achieve high accuracy and high speed and work on any language with minimal preprocessing steps such as preparing the query format or word segmentation. Our goal in this work is to develop approaches that improve word spotting performance in
handwritten documents by simulating any keyword, even those unseen in the training corpus, effectively dealing with a large background vocabulary without the need for word or character segmentation, and making the approach scalable over many languages such as English, Arabic, and Devanagari. We also evaluate the performance of the system and compare it to current approaches. In the proposed method, we assume minimal preprocessing during training and validation. Our method does not require segmentation of lines into words because it works on the line as one unit. The required keyword and non-keyword models are generated at run time to look for a given keyword with minimal preprocessing. In this work we elaborate on our earlier work [36], where we introduced a script-independent, word-segmentation-free keyword spotting framework based on hidden Markov models (HMMs). This framework is scalable across multiple scripts. We learn HMMs of trained characters and combine them to simulate any keyword, even those unseen in the training corpus. We use filler models for better representation of non-keyword image regions, avoiding the limitations of line-based keyword spotting techniques that rely largely on lexicon-free score normalization and white space separation. Our system is capable of dealing with a large background vocabulary without the need for word or character segmentation and is scalable over many languages such as English, Arabic and Devanagari. The main characteristic of the proposed approach is the use of script-independent methods for feature extraction, training and recognition. This paper provides a detailed description of the framework setup and a detailed evaluation of the proposed technique. Key attributes such as feature extraction and the different filler and background models proposed in this work are evaluated extensively, along with an analysis of system complexity. In the
experimental evaluation section, the system is evaluated on individual and mixed public datasets of English, Arabic and Devanagari manuscripts using different sizes of the keyword list. The proper modeling of the filler and background models is investigated, and the system is compared with the model presented by Fischer et al. [13] on English, Arabic and Devanagari, showing better performance on all the languages. The rest of the paper is structured as follows. We present related work and categorize existing approaches in Section 2. In Section 3 we describe our approach, including image preprocessing, feature extraction and the spotting framework. We discuss the complexity of our system in Section 4 and the experimental evaluation in Section 5.
Fig. 1. Feature extraction using a sliding window.
2. Related work

One of the first word spotting approaches for document images was proposed in [19]. Since then, many word spotting approaches have been proposed. Word spotting approaches are mainly divided into two types based on the query input: query-by-example and query-by-string [25]. The query-by-example or template-based approach requires query images, which are hard to prepare and may not exist in the training set. In the template-based approach, the input image is matched against a set of template keyword images and the output is the set of images most similar to the query image. The image is represented as a sequence of features and usually compared with the dynamic time warping (DTW) technique [24,19,32]. The main advantage of this approach is that minimal learning is involved; however, it has limitations when dealing with a wide variety of unknown writers [13]. There are also segmentation-free techniques, such as Rusiñol et al. [26] and Leydier et al. [18], that work at the document level, detecting interest points using gradient or scale-invariant transform features. Query-by-example is not the focus of our work. Query-by-string refers to word spotting techniques where the input is the string that needs to be located [7,5]. Query-by-string is more complicated than query-by-example because the keyword models need to be created even though no samples of that keyword may exist in the training set. We further categorize query-by-string techniques based on their processing level as below.

2.1. Word recognition based spotting

In word-based spotting, such as Rodríguez-Serrano and Perronnin [25] and Saabni and El-Sana [27], the HMM model for each keyword is trained separately. The score of each HMM is normalized with respect to the score of an HMM with the same topology trained on all non-keywords. This approach relies heavily on perfect word segmentation and requires several samples of each keyword in the training set. In a similar way, [8,33] use a word-segmentation-free technique to train character models to build the keywords as well as the non-keywords. The drawbacks of their approach are the confidence measure with respect to a general non-keyword model that represents everything but keywords, and the dependence on the white space model to segment the words.

2.2. Line recognition based spotting

In the line-based approach, the word or character segmentation step is done during the spotting process. Chan et al. [4] and Edwards et al. [7] train character HMMs from manually segmented templates, assuming small variation in the data. Fischer et al. [13] proposed a line-level approach using HMM character models under the assumption that no more than one keyword can be spotted in a
Fig. 2. Main model, (a) keywords and filler models, (b) score normalization using background models.
given line. Their approach outperformed the template-based methods for a single writer with few training samples and for multiple writers with many training samples. A major drawback of their approach is the dependency on the white space to separate keywords from the rest of the text. This not only has a large influence on the spotting results but also prevents the system from being scalable to other languages such as Arabic, in which the space can occur within or between words, revealing little information about the word boundaries [4]. Besides, the lexicon-free approach to modeling the non-keywords has a large negative effect on their system performance as well. Frinken et al. [15] proposed a neural network based spotting system. It parses the line to recognize the sequence of characters and maps each character's position and its probability. It then takes the sequence of character probabilities, a dictionary, and a language model and computes a likely sequence of words. The drawback of this approach is the dependency on the recognition system. In addition, increasing the number of keywords increases the accuracy due to the use of an efficient language model based on a big dictionary, making it more like a recognizer than a spotting system.
Table 1. Filler models.

Characters: a group of character models used as fillers.
Bigram context-dependent character models: all bigram context-dependent characters not found in the keyword list are used as filler models.
Bigram context-independent sub-words: all bigram character sequences not found in the keyword list are modeled and used as filler models.
Trigram context-dependent character models: all trigram context-dependent characters not found in the keyword list are used as filler models.
Trigram context-independent sub-words: all trigram character sequences not found in the keyword list are modeled and used as filler models.
In general, neural network based keyword spotting depends crucially on the amount of training data [14], and it is hard to segment and train for other languages.
3. Proposed approach

We propose a novel script-independent, line-based spotting framework. Given a line image, the goal is to locate and recognize one or more occurrences of a keyword. The algorithm starts by detecting the candidate keywords using a recognizer that searches for the keywords in a line. The keyword models consist of all keywords built by concatenating their HMM character models. The HMM-based recognizer uses a Viterbi beam search decoder [23] to parse the entire line, finding the maximum probability path between the keyword and filler models as shown in Fig. 2. Each candidate keyword is extracted from the line using its start and end positions. The score of each candidate keyword is then normalized with respect to the score of the word background models. Our approach utilizes both filler and background models; the models are explained in Sections 3.3 and 3.4.
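For illustration, the following Python sketch outlines the two-stage flow just described: candidate detection on the whole line, followed by background-model score normalization. The helper names (decode_line, score_with_background) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the two-stage spotting flow, assuming hypothetical helpers:
#   decode_line()            - Viterbi beam search over keyword + filler models
#   score_with_background()  - re-scores a candidate segment with a background model

def spot_keywords(line_features, keyword_models, filler_models, background_model, threshold):
    """Return accepted (keyword, start, end) hits for one text line."""
    hits = []
    # Stage 1: parse the whole line; candidates are segments aligned to keyword models.
    candidates = decode_line(line_features, keyword_models, filler_models)
    for keyword, start, end, s_score in candidates:
        segment = line_features[start:end]
        # Stage 2: normalize the keyword score against the background model
        # and by the candidate width, then threshold (cf. Eqs. (5)-(8)).
        s_bg = score_with_background(segment, background_model)
        ratio = s_score / s_bg
        if ratio / (end - start) > threshold:
            hits.append((keyword, start, end))
    return hits
```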
Fig. 3. Lexicon based word background model.
3.1. Preprocessing and feature extraction

The quality of the input image is enhanced by removing noise through smoothing and morphological operations. Common shapes such as dots, lines and hole punches are also removed using additional preprocessing steps. A skew correction algorithm such as the one proposed by Yan [37] is used at both the document and line levels. For line segmentation, the robust algorithm presented by Shi et al. [31] is used. The height of each segmented line image is resized to a fixed value while maintaining the aspect ratio. In analytical recognition systems, representing the image as a sequence of features allows a powerful use of hidden Markov models (HMMs), which are mainly used for sequence modeling. Thus, instead of extracting features from the whole image, a sliding window is moved over the image from left to right. At each position n, a feature vector f_n is computed from only the pixels inside the sliding window, as shown in Fig. 1. The sliding window preserves the left-to-right writing nature of the document as well as the variable word length property. The most popular sliding window features were presented by Favata and Srikantan [10] and Vinciarelli and Luettin [35]. In Vinciarelli and Luettin [35], the sliding window is split into 4 × 2 cells known as bins, and the pixel count in each bin is considered to be a feature, resulting in a 16-dimensional feature vector. We denote these as intensity features. The gradient, structural, and concavity (GSC) features presented by Favata and Srikantan [10] achieved state-of-the-art results for Arabic handwritten word recognition [28]. The GSC features are multi-resolution features that combine three different attributes of the character shape: the gradient (representing the local orientation of strokes), the structural features (which extend the gradient to longer distances and provide information about stroke trajectories), and the concavity features (which capture stroke relationships at
Fig. 4. Word background model using character filler models.
long distances). The best performance was found with the combination of the gradient features from GSC and the intensity features. A 20-pixel-wide sliding window with 85% overlap was used; these values were determined empirically on a validation dataset. For each window, gradient and intensity features were extracted. The sliding window was divided into two vertical bins according to the center of mass. For each bin, the gradient direction of each pixel was calculated; the gradient direction at any pixel of image I(x, y) is defined as
\phi = \tan^{-1}\left(\frac{G_y}{G_x}\right)    (1)

where

G_x = I(x+1, y) - I(x-1, y), \quad G_y = I(x, y+1) - I(x, y-1)    (2)
For each pixel, the gradient was calculated and the angle was uniformly quantized into eight directions. Each orientation was accumulated into a histogram, and after processing all the pixels in the bin, the gradient histogram was normalized with respect to its maximum value. Each value in the normalized histogram was considered as an independent feature. Thus, the dimension of the gradient features was 8 (directions) × 2 (bins) = 16. For the intensity features, the adjusted sliding window was divided horizontally into four regions based on the center of mass. For each region, the black-to-white pixel ratio was calculated and
considered as an independent feature; the total number of intensity features was thus 8. In total, 24 features were extracted per sliding window.
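A minimal sketch of the window-level feature computation described above (two vertical bins, an 8-bin gradient histogram per bin, and four intensity regions per bin). This is an illustrative reading of the description, not the authors' code; the bin boundaries, equal region splits and normalization details are assumptions.

```python
import numpy as np

def window_features(win):
    """win: 2D binary image patch (H x W), ink = 1. Returns a 24-dim feature vector:
    16 gradient features (8 directions x 2 vertical bins) + 8 intensity features."""
    H, W = win.shape
    # Split the window into two vertical bins at the horizontal center of mass.
    cols = win.sum(axis=0)
    cx = int(round((cols * np.arange(W)).sum() / max(cols.sum(), 1)))
    cx = min(max(cx, 1), W - 1)
    bins = [win[:, :cx], win[:, cx:]]

    feats = []
    for b in bins:
        # Gradient direction histogram (Eqs. (1)-(2)), quantized into 8 directions.
        gy, gx = np.gradient(b.astype(float))
        ang = np.arctan2(gy, gx)[b > 0]                 # angles at ink pixels only
        hist, _ = np.histogram(ang, bins=8, range=(-np.pi, np.pi))
        hist = hist / max(hist.max(), 1)                # normalize by the maximum bin
        feats.extend(hist.tolist())
    for b in bins:
        # Intensity features: four horizontal regions per bin (equal splits here for
        # simplicity; the paper splits according to the center of mass), ink ratio each.
        for region in np.array_split(b, 4, axis=0):
            feats.append(region.mean() if region.size else 0.0)
    return np.asarray(feats)                            # 16 + 8 = 24 features
```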
3.2. HMM character models

The HMM is a powerful statistical tool for modeling generative sequences that can be characterized by an underlying process generating an observable sequence. HMMs are widely used in machine learning and pattern recognition for sequence modeling tasks such as speech, handwriting and gesture recognition. State-of-the-art systems for speech recognition [12], online [30] and offline [34,3] handwriting recognition, behavior analysis in videos [2], and others are all HMM based. In addition, when training the HMM character models there is no need to provide exact character positions along with the training transcription, as the character positions are estimated by the Baum-Welch algorithm; this reduces the time required to prepare the transcription of the training data. The feature extraction procedure converts the text line into a sequence of features F, as shown in Fig. 1. The generated sequence is considered as the observations O, where the i-th observation O_i corresponds to the i-th 24-dimensional feature vector f_i. For each character, a 14-state linear-topology HMM is implemented; the number of states per character was identified empirically. For each state S_i, the observation probability distribution is modeled by a continuous Gaussian mixture density

b_i(o) = \sum_{m=1}^{M} c_m \, N(o; \mu_m, \Sigma_m)    (3)

where o is the observation vector, M is the number of mixture components, c_m is the weight of the m-th component such that c_1 + c_2 + \dots + c_M = 1, and N is the 24-dimensional Gaussian PDF with mean vector \mu_m and diagonal covariance matrix \Sigma_m:

N(o; \mu_m, \Sigma_m) = \frac{1}{\sqrt{(2\pi)^{24}\,|\Sigma_m|}} \exp\left(-\frac{1}{2}(o-\mu_m)^{T}\Sigma_m^{-1}(o-\mu_m)\right)    (4)
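For concreteness, a small NumPy sketch of evaluating the state emission density of Eqs. (3) and (4) with diagonal covariances is given below; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

def emission_density(o, weights, means, diag_vars):
    """Evaluate b_i(o) of Eq. (3) for one HMM state.
    o:         (24,) observation vector
    weights:   (M,) mixture weights c_m, summing to 1
    means:     (M, 24) component means mu_m
    diag_vars: (M, 24) diagonal entries of Sigma_m
    (In practice this is computed in log space to avoid underflow.)"""
    d = o.shape[0]
    density = 0.0
    for c, mu, var in zip(weights, means, diag_vars):
        # Eq. (4) with a diagonal covariance: |Sigma| is the product of the variances,
        # and the quadratic form reduces to a weighted sum of squared differences.
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.prod(var))
        quad = np.sum((o - mu) ** 2 / var)
        density += c * norm * np.exp(-0.5 * quad)
    return density
```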
The character models are concatenated to form words and sequences of words. As noted above, exact character positions do not need to be provided with the training transcription; they are estimated by the Baum-Welch algorithm, which reduces the time required to prepare the training data.

3.3. Filler models
Filler models are used to model the non-keywords without explicitly defining them. They allow the separation of keyword text from non-keyword background. While proper modeling of the non-keywords reduces the false positive rate, proper modeling of the keywords increases the true positive rate. We investigate several filler models based on sub-words, including characters and sequences of characters. Table 1 summarizes the filler models that we propose and evaluate: characters, bigram sub-words and trigram sub-words. All these models are compared in terms of accuracy, computational complexity and simplicity of implementation.

In the case of the context-dependent bigram and trigram models, all character models are trained based on their position in the context, and all context-dependent characters not appearing in the keyword models are used as fillers. This technique requires exceptionally large training data in order to train the large number of context-dependent character models. In the case of the context-independent filler models, all the non-keyword character sequences not appearing in the keywords are used as filler models; since the number of non-keyword sequences is huge, this adds more complexity to the system. Character filler models (CFMs) can significantly reduce the computational complexity, making them more attractive for real applications due to the small number of models and their high efficiency. Each CFM is an HMM with exactly the same implementation as the character models but trained on different classes. The number of CFMs is expected to affect the performance, and thus different numbers of CFMs are evaluated for each language. The clustering of character models into CFMs is implemented as described in Algorithm 1 (a sketch of this procedure is given after the algorithm). The candidate keywords detected with the filler models are then pruned using the word background models to efficiently reduce the false positive rate.

Algorithm 1. Character filler model clustering.
INPUT: HMM character models, validation dataset, number of required filler models.
OUTPUT: Character filler models.
Initialization: INPUT <- HMM character models, OUTPUT <- empty.
Step 1:
  for each character model in INPUT do
    for each other character model in INPUT do
      Merge the pair of character models.
      PAIRS[accuracy, character pair] <- validation-set accuracy after merging.
    end for
    MaxPair <- the pair with maximum accuracy in PAIRS.
    Merge the corresponding pair (MaxPair) and store the merged model in OUTPUT.
    Delete the pair from INPUT.
  end for
Step 2: Relabel the validation dataset according to the new models.
Step 3: if the size of OUTPUT equals the number of required filler models then stop,
        else INPUT <- OUTPUT, OUTPUT <- empty, go to Step 1.
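The following Python sketch illustrates the greedy merging idea of Algorithm 1 in a simplified form (one merge per iteration rather than the batched passes above). The helpers merge_models and validation_accuracy are hypothetical stand-ins for HMM merging and re-evaluation; the sketch only illustrates the control flow.

```python
def cluster_filler_models(char_models, validation_set, n_fillers,
                          merge_models, validation_accuracy):
    """Greedy clustering of character models into character filler models,
    in the spirit of Algorithm 1: repeatedly merge the pair of models whose
    merge yields the best validation accuracy, until n_fillers models remain.
    merge_models(a, b) and validation_accuracy(models, data) are assumed helpers."""
    models = list(char_models)
    while len(models) > n_fillers:
        best = None  # (accuracy, i, j, merged_model)
        for i in range(len(models)):
            for j in range(i + 1, len(models)):
                cand = merge_models(models[i], models[j])
                trial = [m for k, m in enumerate(models) if k not in (i, j)] + [cand]
                acc = validation_accuracy(trial, validation_set)
                if best is None or acc > best[0]:
                    best = (acc, i, j, cand)
        _, i, j, merged = best
        # Replace the best-scoring pair with its merged model and re-evaluate.
        models = [m for k, m in enumerate(models) if k not in (i, j)] + [merged]
    return models
```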
Fig. 5. Dataset samples, (a) Arabic, (b) English, (c) Devanagari.
Table 2. Number of text lines used for training, validation and testing, and the character statistics of each script.

Dataset                Training lines   Testing lines   Validation lines   # Writers   # Characters   # Average char/model   # Unique characters
IAM data (English)     3000             1000            1000               650         168,013        2625                   64
AMA data (Arabic)      5700             1000            1000               35          272,180        1756                   155
LAW data (Devanagari)  3000             1000            1000               1000        119,980        1714                   70
Table 3. MAP results for combinations of sets of features.

Features                 Description                                                          MAP (English)   MAP (Arabic)   MAP (Devanagari)
Profile features         The nine profile features presented by Marti and Bunke [20]         47.2            46.1           45.8
GSC [11]                 Gradient, structural, and concavity, as described by Favata and
                         Srikantan [11]                                                       48.6            50.2           51.4
Intensity features (I)   4 x 2 cells known as bins, as described by Vinciarelli and Luettin   41.1            40.1           42.6
IG                       Combining intensity and gradient                                     54.2            53.7           53.8
IS                       Combining intensity and structural                                   49.6            48.5           50.3
IC                       Combining intensity and concavity                                    51.7            50.9           51.3
IGSC                     Combining intensity, gradient, structural, and concavity             50.9            50.7           50.6
IGS                      Combining intensity, gradient, and structural                        47.6            48.6           49.6
ISC                      Combining intensity, structural, and concavity                       49.5            49.7           50.5
IGC                      Combining intensity, gradient, and concavity                         52.2            51.1           51.3
Fig. 6. Filler type performance for 100 keywords.
3.4. Background models

Score normalization is an effective method to enhance accuracy. It is applied as a rejection strategy to decrease the false positive rate [25]. In this paper we present two novel methods for score normalization. The first is based on score normalization between the candidate keyword and non-keyword scores, as shown in Fig. 3; we refer to it as the lexicon based background model. The other is based on the character filler models, as shown in Fig. 4, and is referred to as the character based background model.
3.4.1. Lexicon based background model

In this technique, the background model is represented by all or a subset of the non-keywords. A reduced lexicon is used to overcome the high computational complexity that results from using all non-keywords. Because the filler models represent the non-keyword regions, a candidate keyword recognized in the filler model stage is either correct or similar to the keyword. The reduction of the background model is based on the Levenshtein distance between all non-keywords in the dictionary and the candidate keyword text: a non-keyword is added to the reduced lexicon if its edit distance to the candidate keyword is less than a certain threshold (a small illustrative sketch of this reduction is given at the end of this subsection). Different reduction ratios are studied in the experiments section. The reduced lexicon can be computed once for each keyword without adding computation cost to the system. In general, for lexicon based background models, the likelihood ratio R between the keyword candidate score S_score and the word background model score S_Lexicon_score is given by Eq. (5) below.
Table 4. Mean precision of the filler types for different numbers of keywords in the list.

Filler type                     30      100     500
Character filler models         63.2    58.98   52.69
Context-independent bigram      21.44   22.99   24.36
Context-independent trigram     17.14   17.36   18.38
Context-dependent bigram        28.58   26.37   24.27
Context-dependent trigram       35.18   32.64   30.33
R = \frac{S_{score}}{S_{Lexicon\_score}}    (5)
If R is larger than 1, the candidate is most likely a keyword. The likelihood ratio R is further normalized by the keyword width W, and a positive match is declared if the normalized score exceeds a threshold T:

\frac{R}{W} > T    (6)
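A minimal Python sketch of the lexicon reduction and normalization steps just described, using a plain edit-distance implementation. The function and threshold names mirror Eqs. (5) and (6) but the code is illustrative, not the authors' implementation.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def reduced_lexicon(candidate_text, non_keywords, max_dist):
    """Keep only non-keywords close (in edit distance) to the candidate keyword text."""
    return [w for w in non_keywords if edit_distance(candidate_text, w) <= max_dist]

def accept_candidate(s_score, s_lexicon_score, width, T):
    """Eqs. (5)-(6): width-normalized likelihood ratio against the lexicon background score."""
    R = s_score / s_lexicon_score
    return (R / width) > T
```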
3.4.2. Character based background model

The second background model is based on the character filler models, as shown in Fig. 4. The candidate keyword is evaluated over the background models as the best path between the candidate keyword characters and their corresponding character filler models.
Fig. 7. Number of character filler models vs. the mean average precision (MAP) for English.
Fig. 8. Number of character filler models vs. the MAP for Arabic.
Fig. 9. Number of character filler models vs. the MAP for Devanagari.
This gives the amount of separation between the keyword and the background. The complexity of this technique is very low
compared to the lexicon free and reduced lexicon techniques. The normalized likelihood score is the ratio R between the keyword candidate score S_score and the sum of the background character
scores S_bkscore(i):

R = \frac{S_{score}}{\sum_i S_{bkscore}(i)}    (7)
If R is close to 1, the candidate is most likely a keyword. The likelihood ratio R is normalized by the width of the keyword W, and a positive match is declared if the normalized score lies within certain thresholds; the values of T1 and T2 are evaluated experimentally on a validation dataset to maximize the mean average precision:

T_1 > \frac{R}{W} > T_2    (8)
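A small illustrative check of the character based normalization of Eqs. (7) and (8); the names are placeholders.

```python
def accept_candidate_char_based(s_score, s_bk_scores, width, T1, T2):
    """Eqs. (7)-(8): ratio of the keyword score to the summed character filler
    (background) scores, normalized by the keyword width and gated by two thresholds."""
    R = s_score / sum(s_bk_scores)
    return T1 > (R / width) > T2
```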
4. System complexity

The word spotting framework contains two main parts: the keyword models, and the filler and background models. We found that CFMs are the best filler models due to their small number and high efficiency. In this section, we analyze the complexity of the full system for the lexicon based and the character based background models described in Section 3. For the word spotting system, the complexity of the Viterbi algorithm recognizing a line of length L is O(m^2 L), where m is the number of models. The complexity of the system using the lexicon based background models is

complexity_{Lb} = keyword models' complexity + background models' complexity = O(k^2 L) + O(r^2 L_n)    (9)

where k is the number of keywords, L is the length of the text line, L_n is the length of the candidate keywords detected in line L, and r is the reduced lexicon size. The complexity of the system using the character based background models is

complexity_{cb} = keyword models' complexity + background models' complexity = O(k^2 L) + O(c^2 L_n)    (10)

where c is the average number of characters in a keyword. Note that, since the number of character filler models is constant and usually small for each script, we ignore them in evaluating the system complexity.

In general, if the size of the keyword list k is increased, the complexity also increases. The lexicon size r has a large effect on the complexity: if the lexicon is strongly reduced, the complexity is reduced significantly as well. The experimental results show that a 93% reduced lexicon works approximately as efficiently as the full lexicon. In the character based background model, the complexity is lower than in the lexicon based approach because c is much smaller than r; the average number of keyword characters does not exceed 15 for English, Arabic, or Devanagari, resulting in a huge reduction in complexity.
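As a rough back-of-the-envelope illustration, using figures reported elsewhere in the paper (a lexicon of about 9000 unique words, a 93% reduction ratio, and keywords of at most about 15 characters), the background-model terms of Eqs. (9) and (10) compare as follows; the exact counts are only indicative.

r_{full} \approx 9000 \Rightarrow r_{full}^2 \approx 8.1 \times 10^7
r_{reduced} \approx 0.07 \times 9000 \approx 630 \Rightarrow r_{reduced}^2 \approx 4.0 \times 10^5
c \approx 15 \Rightarrow c^2 = 225

so the character based background term is roughly three orders of magnitude cheaper than the reduced-lexicon term and roughly five orders of magnitude cheaper than the full-lexicon term.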
5. Experimental setup

We evaluate our system on three publicly available datasets: the public IAM dataset [21] for English, the public AMA dataset [1] for Arabic, and the LAW dataset [16] for Devanagari.

IAM English dataset: a modern English handwritten dataset consisting of 1539 pages of text from the Lancaster-Oslo/Bergen corpus [17], written by 657 writers. For more details refer to Marti and Bunke [21].

AMA Arabic dataset: a modern Arabic handwritten dataset of 200 unique documents, transcribed by 25 writers using various writing utensils for a total of 5000 documents. For more details refer to AMA [1].

LAW Devanagari dataset: the Devanagari lines are synthetically formed by randomly concatenating up to 5 words, separated by a random space size, from a dataset containing 26,720 handwritten words written in the Hindi and Marathi languages (Devanagari script). For more details refer to Jayadevan et al. [16].

Table 2 summarizes the statistics of the datasets used for training and validation, and sample line images are shown in Fig. 5.

Table 5. MAP results for the different types of background models.
BKM type           MAP
0% reduction       58.98
87% reduction      54.63
93% reduction      55.36
96% reduction      54.02
Char-based model   40.12
Fig. 10. Performance of the background models.
Fig. 11. Performance with varying numbers of keywords.
Table 6. Performance with varying numbers of keywords.

Keyword list size   MAP
30                  62.02
100                 57.7
500                 49.32
5.1. Performance evaluation

We evaluate the proposed word spotting system with filler and word background models on the IAM data because of its complexity and large variation. The results are measured using recall, precision and mean average precision (MAP):

Recall = \frac{TP}{TP + FN}    (11)

Precision = \frac{TP}{TP + FP}    (12)
where TP is the number of true positives, FN the number of false negatives and FP the number of false positives. The mean average precision (MAP) is given by the area under the recall vs. precision curve.

5.2. Performance with different features

We implemented the three main methods for extracting features from the sliding window discussed in Section 3.1: the profile features of Marti and Bunke [20] and the window features of Vinciarelli and Luettin [35] and of Favata and Srikantan [10]. The results of these features and their combinations in the word spotting system are reported in Table 3, which shows the performance of the word spotting system with different types of features. We randomly selected 100 keywords and used the 93% reduced lexicon based BKM. In the profile features presented in [20], a sliding window of one pixel width is used, consisting of nine geometrical features, as noted in Section 3.1. In the features presented by Vinciarelli and Luettin [35] and Favata and Srikantan [11], the width of the sliding window is more than one pixel; we refer to these as window features. Vinciarelli and Luettin [35] split the sliding window into 4 × 2 cells known as bins; the pixel count in each bin is considered as a feature, resulting in a 16-dimensional feature vector. We refer to these as intensity features (I). The gradient, structural, and concavity (GSC) features presented by Favata and Srikantan [11] achieved state-of-the-art results for
Arabic handwritten word recognition [29]. The GSC features measure image characteristics at local, intermediate, and large scales, as discussed in Section 3.1. Profile features and window features cannot be combined due to the locality of the profile features, which work on a one-pixel-wide window. The combinations of the intensity (I) and GSC features are shown in Table 3. For each feature, we chose the optimum parameters that achieve the highest performance. In the combined features, we used two vertical bins, as described in Section 3.1. The results show that the best features are the intensity and gradient features described in Section 3.1. As the features are extracted from the sliding window area only, the sliding window acts as a local measurement unit, and the overlap between sliding windows captures the sequential relation between the extracted windows. The intensity and gradient features showed high accuracy because they represent the local description of the image pixels. The structural and concavity features did not yield promising results because they represent intermediate and large-scale characteristics that are not usually captured by the sliding window.

5.3. Evaluation of filler models

We evaluate the five filler models discussed in Section 3.3 on the IAM dataset, which consists of samples taken from a large number of writers and contains about 9000 unique words. The background model was held fixed and consisted of all non-keywords in the corpus. Three experiments were carried out for all filler types with different keyword list sizes: 30, 100 and 500. Fig. 6 shows the results for 100 random keywords, and the mean precision of the filler types for different numbers of keywords in the list is shown in Table 4. The character filler models outperform all the other types; their superior performance and the low complexity associated with the small number of models make them the most attractive choice. The system performance is affected by the size of the keyword list because the rate of false positives increases as the size of the list increases, thereby decreasing the precision.

5.4. Evaluation of the number of character fillers

As the character filler models outperformed all the other filler types, we evaluated our system with different numbers of character filler models. The models' excellent performance and low
Fig. 12. System performance with English, Arabic, and Devanagari.
Fig. 13. Performance comparison with Fischer's algorithm on different keyword sizes.
Table 7. MAP results of our proposed system and Fischer's algorithm for different numbers of keywords.

System                Number of keywords   MAP
Fischer               100                  9.39
Fischer               500                  11.54
Our proposed system   100                  57.70
Our proposed system   500                  49.32
complexity make them the most attractive model. Each language has a different number of character models and, thus, a different optimum number of filler models. The optimum number of character filler models was evaluated experimentally. Figs. 7, 8, and 9 show the performance with different numbers of character fillers for English, Arabic, and Devanagari, respectively. The best number of CFMs found for English, Arabic, and Devanagari was 4, 11, and 7, respectively.

5.5. Word background models

Both the lexicon based and the character based word background models are evaluated on the English IAM dataset
with 100 randomly selected keywords and the number of character filler models fixed to 4. The IAM dataset was chosen due to its large number of writers and large lexicon (up to 9000 unique words). The main reason for using the lexicon based background model is to apply lexicon reduction, which lowers the computational complexity without affecting the performance. The reduction based on the Levenshtein distance is an effective method for reducing large lexicons, with only a slight effect on the system performance due to the similarity between the detected candidate and the keyword text. Experiments were carried out for the full and reduced lexicons with different reduction ratios, as shown in Fig. 10 and Table 5. The results of the character based background models are also shown in Fig. 10, demonstrating that this method can spot keywords without knowledge of the full non-keyword list.
Fig. 14. Performance on English, Arabic, and Devanagari compared with Fischer's line-based approach.
The lexicon based background models showed high performance even at high lexicon reduction ratios: for a lexicon reduction of 93%, the performance decreased by only 6.5%, due to the similarity between the detected candidates and the keyword text. The low complexity character based models showed promising results relative to their level of complexity.
Table 8. MAP results of our proposed system and Fischer's algorithm on different scripts.

System                Script       MAP
Fischer               English      9.39
Fischer               Arabic       4.18
Fischer               Devanagari   8.72
Our proposed system   English      57.70
Our proposed system   Arabic       55.1
Our proposed system   Devanagari   55.3
5.6. Performance with different numbers of keywords

The size of the keyword list is also a major criterion for evaluating a word spotting system. The performance of a word spotting system gradually decreases as the number of keywords increases, but the drop should not be too large. We investigated the effect of the number of keywords on the IAM dataset. The best performing filler models, the CFMs, were used together with a reduced background model lexicon with a 93% reduction ratio. The results are shown in Fig. 11 and Table 6. As the number of keywords increased, more false positives were detected and, thus, the performance of the system decreased. The system showed high performance for a reasonable number of keywords (100).

5.7. Performance on Arabic and Devanagari

The proposed system is independent of the script used. We evaluated the proposed approach on other scripts, namely Arabic and Devanagari. We used the IAM dataset for English, the AMA dataset for Arabic, and the LAW dataset for Devanagari. We randomly selected 100 keywords for Arabic and English and 30 keywords for Devanagari. For the fillers, we used the CFMs and two BKM types: the 93% reduced lexicon background models and the character based background models. The data used for training and validation are summarized in Table 2. The results show high performance for English, Arabic, and Devanagari. To the best of our knowledge, we are the first to apply a line-based approach to Arabic and Devanagari. The results are shown in Fig. 12.
5.8. System comparison

We compared the proposed system with the method described in [13]. In their work, no specific information was reported about their results, such as the keywords used; we therefore implemented their method and used the same features described in Section 3.1 in order to compare the strength of the models only. The systems were compared on keyword lists of sizes 100 and 500, as shown in Fig. 13 and Table 7. The systems were also compared on other scripts, namely Arabic and Devanagari, as shown in Fig. 14 and Table 8; we used 100 keywords for Arabic and English and 30 keywords for Devanagari. We also compared the speed of [13] with that of our proposed system. We built their algorithm using the same features and HMM models used in our approach. Both systems were compared in exactly the same environment, on the same machine with 4 GB RAM and an Intel Core 2 Duo 3.17 GHz CPU, and with the same training and testing data. Tables 7 and 8 show that our proposed system outperforms Fischer's algorithm on English, Arabic and Devanagari for different keyword list sizes. The best results of Fischer's algorithm are obtained when each line contains only one keyword. Table 9 shows the speed comparison of Fischer et al. [13] with our proposed system for the two proposed background models (lexicon based and character based). The results in Table 9 show that the lexicon free approach presented by Fischer was faster than the proposed lexicon based approach when few keywords are used; for larger numbers of keywords (100), both systems had approximately the same speed. The proposed character based system, on the other hand, was much faster than [13], irrespective of the number of keywords.

6. Conclusion
We proposed a statistical script independent word spotting system based on hidden Markov models (HMMs). The system uses a line based learning technique to learn individual character models. We propose and experiment with different types of filler
Table 9. Speed (ms/line) of our proposed system and Fischer's algorithm [13] on the IAM and AMA datasets with different numbers of keywords.

Dataset           Number of keywords   Proposed system (char. based)   Proposed system (lexicon based)   Fischer's system
IAM for English   50                   69                              230                               180
IAM for English   100                  142                             260                               255
IAM for English   500                  472                             623                               758
AMA for Arabic    50                   70                              236                               223
AMA for Arabic    100                  136                             273                               295
AMA for Arabic    500                  485                             676                               1023
and background models on three different scripts: English [21], Arabic [1] and Devanagari (LAW) [16]. Filler models are used for a better representation of non-keyword image regions. Efficient score normalization techniques using the lexicon based and character based background models are implemented to reduce the false positive rate. Apart from being script independent, the system has the added advantage of avoiding any word segmentation for spotting, which is crucial for scripts such as Arabic where word segmentation accuracy is quite low. We also outperform the state-of-the-art line based learning approach. In the future, language modeling of the keywords and non-keywords will be investigated to increase system performance; a language model assigns a probability to a sequence of words, and both a joint language model over keywords and non-keywords and separate language models for keywords and non-keywords can be evaluated. Working on the top-n recognition results instead of the top-1 can also be investigated for better performance. In addition, extending the research to regular expressions would be interesting.
Conflict of interest statement

None declared.

References

[1] Applied Media Analysis, Arabic-Handwritten-1.0, 2007. URL <http://appliedmediaanalysis.com/Datasets.htm>.
[2] M. Brand, V. Kettnaker, Discovery and segmentation of activities in video, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 844–851.
[3] H. Bunke, Recognition of cursive roman handwriting—past, present and future, in: ICDAR, 2003, p. 448.
[4] J. Chan, C. Ziftci, D. Forsyth, Searching off-line Arabic documents, in: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), IEEE Computer Society, Washington, DC, USA, 2006, pp. 1455–1462. URL http://dx.doi.org/10.1109/CVPR.2006.269.
[5] C. Choisy, Dynamic handwritten keyword spotting based on the NSHP-HMM, in: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 1, September 2007, pp. 242–246.
[6] D.S. Doermann, The indexing and retrieval of document images: a survey, Computer Vision and Image Understanding 70 (3) (1998) 287–298.
[7] J. Edwards, Y. Whye, T. David, F. Roger, B.M. Maire, G. Vesom, Making Latin manuscripts searchable using gHMMs, in: NIPS, vol. 17, 2005, pp. 385–392.
[8] M. El-Yacoubi, M. Gilloux, J.-M. Bertille, A statistical approach for phrase location and recognition within a text line: an application to street name recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2) (2002) 172–188.
[9] R. Farrahi Moghaddam, M. Cheriet, M.M. Adankon, K. Filonenko, R. Wisnovsky, Ibn Sina: a database for research on processing and understanding of Arabic manuscripts images, in: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS '10), ACM, New York, NY, USA, 2010, pp. 11–18. URL http://dx.doi.acm.org/10.1145/1815330.1815332.
[10] J.T. Favata, G. Srikantan, A multiple feature/resolution approach to handprinted digit and character recognition, International Journal of Imaging Systems and Technology 7 (4) (1996) 304–311. URL http://dx.doi.org/10.1002/(SICI)1098-1098(199624)7:4<304::AID-IMA5>3.0.CO;2-C.
[11] J.T. Favata, G. Srikantan, A multiple feature/resolution approach to handprinted digit and character recognition, International Journal of Imaging Systems and Technology 7 (4) (1996) 304–311. URL http://dx.doi.org/10.1002/(SICI)1098-1098(199624)7:4<304::AID-IMA5>3.0.CO;2-C.
[12] G.A. Fink, Markov Models for Pattern Recognition: From Theory to Applications, Springer, 2008.
[13] A. Fischer, A. Keller, V. Frinken, H. Bunke, HMM-based word spotting in handwritten documents using subword models, in: Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR '10), IEEE Computer Society, Washington, DC, USA, 2010, pp. 3416–3419. URL http://dx.doi.org/10.1109/ICPR.2010.834.
[14] V. Frinken, A. Fischer, H. Bunke, R. Manmatha, Adapting BLSTM neural network based keyword spotting trained on modern data to historical documents, in: Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition (ICFHR '10), IEEE Computer Society, Washington, DC, USA, 2010, pp. 352–357. URL http://dx.doi.org/10.1109/ICFHR.2010.61.
[15] V. Frinken, A. Fischer, R. Manmatha, H. Bunke, A novel word spotting method based on recurrent neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 211–224.
[16] R. Jayadevan, S.R. Kolhe, P.M. Patil, U. Pal, Database development and recognition of handwritten Devanagari legal amount words, in: International Conference on Document Analysis and Recognition, 2011, pp. 304–308.
[17] S. Johansson, G. Leech, H. Goodluck, Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers, 1978.
[18] Y. Leydier, A. Ouji, F. LeBourgeois, H. Emptoz, Towards an omnilingual word retrieval system for ancient manuscripts, Pattern Recognition 42 (9) (2009) 2089–2105. URL http://dx.doi.org/10.1016/j.patcog.2009.01.026.
[19] R. Manmatha, C. Han, E.M. Riseman, Word spotting: a new approach to indexing handwriting, in: Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96), IEEE Computer Society, Washington, DC, USA, 1996, p. 631.
[20] U.-V. Marti, H. Bunke, Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system, International Journal of Pattern Recognition and Artificial Intelligence 15 (1) (2001) 65–90.
[21] U.-V. Marti, H. Bunke, The IAM-database: an English sentence database for offline handwriting recognition, International Journal on Document Analysis and Recognition 5 (2002) 39–46. URL http://dx.doi.org/10.1007/s100320200071.
[22] R. Plamondon, S.N. Srihari, On-line and off-line handwriting recognition: a comprehensive survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1) (2000) 63–84. URL http://dx.doi.org/10.1109/34.824821.
[23] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 1989, pp. 257–286.
[24] T.M. Rath, R. Manmatha, Word spotting for historical documents, International Journal on Document Analysis and Recognition (2007) 139–152.
[25] J.A. Rodríguez-Serrano, F. Perronnin, Handwritten word-spotting using hidden Markov models and universal vocabularies, Pattern Recognition 42 (9) (2009) 2106–2116.
[26] M. Rusiñol, D. Aldavert, R. Toledo, J. Lladós, Browsing heterogeneous document collections by a segmentation-free word spotting method, in: 2011 International Conference on Document Analysis and Recognition, 2011, pp. 63–67.
[27] R. Saabni, J. El-Sana, Keyword searching for Arabic handwritten documents, in: 11th International Conference on Frontiers in Handwriting Recognition (ICFHR 2008), Montreal, 2008, pp. 716–722.
[28] S. Saleem, H. Cao, K. Subramanian, M. Kamali, R. Prasad, P. Natarajan, Improvements in BBN's HMM-based offline Arabic handwriting recognition system, in: Proceedings of the 2009 10th International Conference on Document Analysis and Recognition (ICDAR '09), IEEE Computer Society, Washington, DC, USA, 2009, pp. 773–777. URL http://dx.doi.org/10.1109/ICDAR.2009.282.
[29] S. Saleem, H. Cao, K. Subramanian, M. Kamali, R. Prasad, P. Natarajan, Improvements in BBN's HMM-based offline Arabic handwriting recognition system, in: Proceedings of the 2009 10th International Conference on Document Analysis and Recognition (ICDAR '09), IEEE Computer Society, Washington, DC, USA, 2009, pp. 773–777. URL http://dx.doi.org/10.1109/ICDAR.2009.282.
[30] M. Schenkel, I. Guyon, D. Henderson, On-line cursive script recognition using time-delay neural networks and hidden Markov models, Machine Vision and Applications 8 (4) (1995) 215–223.
[31] Z. Shi, S. Setlur, V. Govindaraju, A steerable directional local profile technique for extraction of handwritten Arabic text lines, in: ICDAR, 2009, pp. 176–180.
[32] S.N. Srihari, H. Srinivasan, C. Huang, S. Shetty, Spotting words in Latin, Devanagari and Arabic scripts, Vivek: A Quarterly Indian Journal of Artificial Intelligence (2006).
[33] S. Thomas, C. Chatelain, L. Heutte, T. Paquet, An information extraction model for unconstrained handwritten documents, in: Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR '10), IEEE Computer Society, Washington, DC, USA, 2010, pp. 3412–3415. URL http://dx.doi.org/10.1109/ICPR.2010.833.
[34] A. Vinciarelli, A survey on off-line cursive word recognition, Idiap-RR-43-2000, IDIAP, 2000.
[35] A. Vinciarelli, J. Luettin, Off-line cursive script recognition based on continuous density HMM, Idiap-RR-25-1999, IDIAP, 1999.
[36] S. Wshah, G. Kumar, V. Govindaraju, Script independent word spotting in offline handwritten documents based on hidden Markov models, in: Proceedings of the 2012 13th International Conference on Frontiers in Handwriting Recognition, 2012.
[37] H. Yan, Skew correction of document images using interline cross-correlation, CVGIP: Graphical Models and Image Processing 55 (6) (1993) 538–543.
Safwan Wshah completed his PhD degree in Computer Science and Engineering at the University at Buffalo, State University of New York, USA, in 2012 and is now working as a research scientist at Xerox. His areas of interest include document image processing, natural language processing, pattern recognition and biometrics.

Gaurav Kumar received his MS degree in Computer Science from the University at Buffalo, State University of New York, USA, in 2011. Since then he has been a PhD student at the University at Buffalo under Dr. Venu Govindaraju. He has worked as a Research and Graduate Assistant at the University at Buffalo. His research interests include document analysis, keyword spotting, graphical models and computer vision.
Venu Govindaraju is a SUNY Distinguished Professor of Computer Science and Engineering at the University at Buffalo, State University of New York. He received his B-Tech (Honors) from the Indian Institute of Technology (IIT), Kharagpur, and his PhD degree from UB. He is the founding director of the Center for Unified Biometrics and Sensors. He has authored more than 350 scientific papers and graduated 25 doctoral students. He has been a lead investigator on projects funded by government and industry for about 60 million dollars. Dr. Govindaraju is a recipient of the IEEE Technical Achievement award, and is a fellow of the AAAS, the ACM, the IAPR, the IEEE, and SPIE.