Bayesian background models for keyword spotting in handwritten documents









Gaurav Kumar, Venu Govindaraju

Department of Computer Science and Engineering, University at Buffalo, 113 Davis Hall, Amherst, NY, USA, 14260-2500

Abstract

Background in a handwritten document is anything other than the words we are interested in. The characteristics of the background are typically captured by a background model to achieve spotting in handwritten documents. We propose two such Bayesian background models for keyword spotting in handwritten documents. First, we present a background model based on the Bayesian generalized linear model, called the Variational Dynamic Background Model (VDBM), and second, we propose a Bayesian generalized kernel background model, called the BGKBM. Given a set of handwritten documents and a set of keyword and non-keyword scores, the models learn an efficient Bayesian rejection criterion to output the most confident keyword regions in the handwritten document. For the VDBM, the parameters are inferred using variational methods, and for the BGKBM, inference is carried out with a proposed Markov Chain Monte Carlo (MCMC) approach. The models are built on top of the scores returned by a handwritten recognizer for keywords and non-keywords. The approach is recognition based and works at the line level. The methods have been validated on the publicly available IAM dataset and compared with other state-of-the-art line level keyword spotting approaches.

Keywords: Handwriting Recognition, Keyword Spotting, Bayesian Generalized Linear Models, Bayesian Generalized Kernel Models

∗ Corresponding author. Email address: [email protected] (Gaurav Kumar, Venu Govindaraju)



Figure 1: Sample image from the IAM dataset. For the purpose of keyword spotting, the foreground comprises all occurrences of the keywords we are interested in (Labour, resolution). All other regions are considered background.

1. Introduction

Handwriting recognition and word spotting have drawn keen interest from researchers over the past two decades. Transcribing entire handwritten documents is a separate area of research and has not yet been achieved in an unconstrained environment. Keyword spotting thus becomes a viable alternative in cases where the recognition of every word in the document might be too difficult a task for any recognizer. The highest accuracy that a recognizer can achieve for a word lexicon of about 10,000 words for a Latin script such as English is close to 60%. This is largely due to the large variations in handwriting samples taken from multiple writers, even for the same character. Keyword spotting is thus a good alternative: we reduce the lexicon to only the words we are interested in and let the recognizer output each region of the document with the keyword it could possibly belong to, along with a confidence score.


1.1. Keyword Spotting

In keyword spotting, we are interested in only a few keywords rather than transcribing the entire handwritten document, as shown in Figure 1. The problem has been approached in two distinct ways in the literature. In the first approach, the query word is an image: given a collection of handwritten documents, the goal is to find regions in the documents that are close to the query image in the feature space. In the other approach, the query word is submitted as text. The former approach is often termed Search by Example, the latter Search by String [1]. Work on Search by Example started with a strong focus on historical documents. These methods are often referred to as template based approaches: the query word is matched against trained templates of words in the lexicon, and the regions of the document closest to the templates in feature space are returned as output. They largely fail to capture variations in writing styles. Manmatha et al. [2], in their seminal work, proposed the first search by example method for indexing historical documents. Rath et al. [3] applied dynamic time warping (DTW) to the features extracted from the query word image and from segmented regions of the handwritten document to identify a match. However, such approaches are not the focus of this work. Our focus is on the Search by String strategy, which often relies on a recognizer that takes the keyword list as its lexicon and outputs a matching score for each region of the document belonging to a particular word from the lexicon. We review these approaches in section 2; broadly, they can be categorized into word based and line based recognition approaches. Word based approaches were introduced by [1], where an individual word Hidden Markov Model was trained for each word in the lexicon. The documents were segmented into lines, and the lines were segmented into words. The word images were passed to the individual word HMM models and a confidence score was obtained. Although this approach outperforms the template based approaches, it relies on word segmentation, which introduces another level of error, especially for scripts like Arabic where it is difficult to find proper word gaps. In addition, for any change in the keyword list the word models must be retrained, which is not scalable. To overcome these drawbacks, line level approaches were first proposed by Fischer et al. [4], where documents were segmented at the line level and each line was passed to a Hidden Markov Model based line model that allowed the occurrence of a keyword succeeded or preceded by any number of filler models separated by spaces. While the approach was novel, it allowed only a single occurrence of a keyword in a line and relied heavily on finding the space between the keyword and filler models. Wshah et al. [5] extended this idea and proposed another line level word spotting system that allows any number of occurrences of one or more keywords in a line and uses filler models and background models for further pruning. Although this approach outperformed Fischer's work, a huge amount of training is involved in learning the filler and background models.

In our prior work, Kumar et al. [6], we proposed a line level strategy that avoids any use of background or filler models with similar results. The work used local character level scores and global word level scores for the candidate keyword regions obtained from the recognizer and learned a variational dynamic background model to build a Bayesian rejection strategy that prunes the candidate keyword regions. Built on top of a Bayesian logistic regression classifier, the model learned an optimal decision boundary between the keyword and non-keyword regions using variational methods. In this work we represent these models as a special form of Bayesian Generalized Linear Model and, in addition, propose a Bayesian Generalized Kernel Background Model (BGKBM) that provides similar and slightly improved results. The BGKBM applies the Bayesian generalized kernel model and tries to find a function belonging to a Reproducing Kernel Hilbert Space that minimizes the loss in predicting the output label of a candidate keyword region.

The rest of the paper is organized as follows. Section 2 covers related and prior work. The variational dynamic background model and its inference are covered in section 3.1, followed by the Bayesian generalized kernel background model in section 3.2. We give a brief overview of our recognizer in

Figure 2: Sample from IAM dataset showing variations in handwriting style

Figure 3: Feature Extraction for Recognition

section 4, and the experimental evaluation is presented in section 5.

2. Background and Related Work

2.1. State-of-the-art word spotting frameworks

As discussed in section 1, there are two distinct ways of approaching the word spotting problem, namely query by example and query by string [1]. A number of approaches have been proposed in both areas, and each has advantages and disadvantages. Query by example can be considered a template based approach, where the similarity between a set of features extracted from the input image and standard templates of the keywords is computed with some distance metric. The query by string approach, on the other hand, can be considered a recognition based approach, where candidate keywords need to be recognized as a particular string in order to separate keyword regions from non-keywords. Some of the existing state-of-the-art approaches are discussed below.


2.1.1. Template Based Approach

Over the years, a number of approaches have been proposed in this category [2, 3, 7]. Manmatha et al. [2], in their seminal work, proposed a way of indexing and retrieving historical handwritten documents by applying a word spotting strategy based on template matching. The approach created equivalence classes of words, where each equivalence class consisted of different samples of the same word from different writers. These equivalence classes were then compared in Euclidean and affine space to find the best matching candidate keywords. Madhvanath et al. [7] proposed the use of holistic features extracted from word images treated as single entities and applied dynamic time warping (DTW) to find the nearest matches. Rath et al. [3] built on these ideas and used DTW on clusters of words created from sample data. An important drawback of such approaches is that they require at least one sample of each keyword during training and often fail to capture the variation in writing styles.

2.1.2. Recognition Based Approach


The recognition based approaches have evolved over the years and overcome the above drawbacks of the template based approaches; in particular, Hidden Markov Model (HMM) based systems do not require samples of the keywords to be present in the training set, and with enough training samples the variation in styles is also taken into account. The recognition based keyword spotting approaches are further divided into word based and line based. In the word based spotting approaches [8, 9], HMM models with the same topology are trained for each keyword and non-keyword. A major drawback of such systems is their lack of robustness to changes in the keyword list, since the system must be retrained for a new keyword, and their heavy reliance on word segmentation. To overcome these drawbacks, line level word spotting systems were proposed by Fischer et al. [4], where words are spotted at the line level with the input image being a line instead of a word. They proposed a keyword model for line images in which the keyword could be present at the beginning, in the middle or at the end. The non-keyword regions were modeled by filler models, which were separated from the keyword by spaces. A major drawback of their system is that the model assumed a single occurrence of the keyword in a line and relied on spaces to spot the keywords. Recently, Wshah et al. [5] proposed another line model that could incorporate


any number of occurrences of keywords in a line. Though it outperformed Fischer et al. [4], a good amount of training is required to learn the background and filler models. In our prior work [6], we built a similar line level keyword spotting framework that learns a Bayesian model for pruning the candidate keyword regions returned by the recognizer. Taking a line image and a set of keywords as input, the recognizer outputs a set of candidate keyword regions along with confidences at the word and character level. The confidences extracted from a set of labeled keyword and non-keyword samples are passed to a Bayesian logistic regression classifier that learns to provide a binary output of a region being a keyword or non-keyword. The classifier is learned using variational methods, and hence we term it the Variational Dynamic Background Model.


2.2. Bayesian Kernel Methods

We propose here another Bayesian model using the Bayesian formulation of Reproducing Kernel Hilbert Spaces, and hence it is worth highlighting some related areas where kernel methods have been applied in a Bayesian setting. Kernel methods and their Bayesian counterparts have been applied quite frequently in various applications. Mallick et al. [10] applied the Bayesian interpretation of the Reproducing Kernel Hilbert Space to the classification of tumors in gene expression data. Zhang et al. [11] used a Bayesian formulation of kernel methods for multi-class support vector machines. Bobb et al. [12] applied another variation, which they termed Bayesian Kernel Machine Regression (BKMR), to estimate the health effects of multiple pollutants, using it to identify associations between pollutants and health symptoms. Menor et al. [13] applied such methods in bioinformatics, predicting sites of post-translational modification of proteins using probabilistic kernel methods that outperformed common kernel methods such as Support Vector Machines (SVM) and Relevance Vector Machines (RVM). Gönen et al. [14] proposed an efficient Bayesian formulation for combining multiple kernels and showed its performance in diverse areas such as bioinformatics and image processing.

3. Proposed Keyword Spotting Framework

A document typically contains anywhere from tens to hundreds of words, and often we are interested in just a few keywords, which may or may not occur in the document. If the recognizer is not good enough to capture all variations in writing styles, the number of false positives will be high. A good rejection strategy is therefore required to prune the output of the recognizer. In a recognition based keyword spotting framework, a filler model or background model [5] is often applied to separate keyword regions from non-keywords. The filler and background models are learned in different ways, either using statistical tools like Hidden Markov Models [5] or using recurrent neural networks [15]. The filler and background models, together or alone, serve as the rejection criterion that filters out the non-keyword regions from the keyword regions.

We approach the problem of building the rejection criterion in a novel way. In a recognition based keyword spotting framework, the vocabulary size of the recognizer is reduced. The proposed model is based on the following hypothesis: given an input sequence of frames, the recognizer will always output one of the words in the vocabulary with some confidence score. At any given time, the following possibilities exist:

• Possibility 1: The input image is a genuine keyword image and the output text is the same keyword.

• Possibility 2: The input image is a genuine keyword image but the output text is a different keyword.

• Possibility 3: The input image does not belong to any keyword and the output is a keyword.

We start with the following assumptions:


• Assumption 1: Given a genuine keyword image, the recognizer will be highly confident, and the corresponding global word level score and local character level scores will also be high.

• Assumption 2: Conversely, if the frames do not represent any keyword, the confidence of the recognizer at both the character level and the word level will be low.

While Assumption 1 addresses the first possibility, the latter addresses possibilities 2 and 3. It is on these assumptions that we build our rejection criterion, which we term the Dynamic Background Model (DBM). In this work we propose two different background models learned from the character level and word level scores obtained from the recognizer. We previously proposed Dynamic Background Models learned from the score features and coupled them with a segmentation free recognizer [16] and a segmentation based recognizer [17]; the models were learned on scores obtained from the two different recognizers. We further extended the idea and added a Bayesian perspective to the models [16] to accommodate variations in writing styles, using a Bayesian logistic regression classifier learned in a variational framework. It is noteworthy that logistic regression is a specialized form of Generalized Linear Model in which the link function is the logit. Here, we represent the variational dynamic background model (VDBM) as a Bayesian generalized linear model.

3.1. Variational Dynamic Background Model


We first formulate the task of keyword spotting as a Bayesian Generalized Linear Model (BGLM) and then extend it with Bayesian Generalized Kernel Models. Let $X \in \mathbb{R}^m$ be the features extracted from $L$ labeled samples of keywords and non-keywords and let $Y \in \{0, 1\}$ be the corresponding labels. Our goal is to predict a binary output $y_t \in \{0, 1\}$ given an input $x_t$. The GLM has the form:

$$y_t = \sigma(W^T x_t) + \epsilon_t \quad (1)$$

Figure 4: Score feature extraction for given candidate image

where

$$\sigma(\epsilon) = \frac{1}{1 + e^{-\epsilon}} \quad (2)$$

In the case of our GLM, the predicted value $Y \in \{0, 1\}$ indicates whether the region is a keyword or a non-keyword. The model output is predicted using a conditional Bernoulli distribution. The predictor function is a linear transformation of the input data, the score features, onto the parameter space through multiplication by a regression vector $W$: $f(x) = W^T X$ for an input score feature $X$. The logit link is inverted by the logistic function $1/(1 + e^{-f(x)})$, which maps the predictor onto the Bernoulli mean. The score features are obtained from the local character level scores and global word level scores of the regions provided by the recognizer, as shown in figure 4. The input to the recognizer is a line image and a grammar that allows one or more occurrences of the keywords and of a set of stopwords separated by spaces. The output from the recognizer is a set of candidate keyword regions that the recognizer considers to be one of the keywords, as shown in figure 4. Along with the candidate keyword regions, the recognizer

Figure 5: Variational Dynamic Background Model. The shaded circles represent the observed variables, the feature vectors xt and the corresponding labels yt. The hidden weights W, whose variational distribution has the functional form of a Gaussian N, are represented by a blank circle, and the arrows show dependencies. The approximate distribution is parameterized by the variational parameter λ.

also returns the top N matching keywords and their scores. These form the first set of features for the dynamic background model. The candidate keyword regions are passed to the recognizer again to obtain the best possible segmentation points of the candidate keyword image, given a grammar that allows only the characters of the output keyword in the same sequence. This step is required to obtain a proper segmentation of the candidate keyword image based on the keyword it corresponds to; the underlying assumption is that for an impostor image the segmented regions will often not be proper. The segments obtained from the candidate regions are then scored to obtain the top K character level scores for each segment. The values N and K are fixed empirically. We denote these features as score features, $S_f$. Since the length of the keywords varies, we fix the length of the feature vector to $K * \text{max\_word\_length} + N$ and set all unknown values to zero. Thus the length of the resulting feature vector $S_f$ is given by:

$$|S_f(x)| = K * \text{max\_word\_length} + N \quad (3)$$
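To make the construction of $S_f$ concrete, the sketch below assembles the fixed-length score feature vector of equation (3) in Python. The function name, the layout (word level scores first, then per-segment character scores), and the zero padding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def score_features(word_scores, char_scores, K, N, max_word_length):
    """Build the fixed-length score feature vector S_f of equation (3).

    word_scores : top-N global word-level scores for the candidate region
    char_scores : list with the top-K character-level scores per segment
    Unfilled positions are left at zero, as described in the text.
    """
    sf = np.zeros(K * max_word_length + N)
    sf[:len(word_scores[:N])] = word_scores[:N]
    for i, seg in enumerate(char_scores[:max_word_length]):
        seg = np.asarray(seg[:K], dtype=float)
        start = N + i * K
        sf[start:start + K] = np.pad(seg, (0, K - len(seg)))
    return sf

# Hypothetical usage for a 5-character candidate with K = 3, N = 5, max length 10
sf = score_features([0.9, 0.7, 0.4, 0.2, 0.1],
                    [[0.8, 0.6, 0.3]] * 5, K=3, N=5, max_word_length=10)
print(sf.shape)   # (35,) = 3 * 10 + 5
```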

The maximum likelihood estimate for the Bernoulli case of our GLM is given by:

$$p(Y|W, X) = \prod_{t=1}^{R} p(y_t = 1|x_t)^{y_t}\,\bigl(1 - p(y_t = 1|x_t)\bigr)^{1-y_t} = \prod_{t=1}^{R} \sigma(W^T x_t)^{y_t}\,\bigl(1 - \sigma(W^T x_t)\bigr)^{1-y_t} \quad (4)$$

Thus the log likelihood is given by:

$$\mathcal{L} = \sum_{t=1}^{R} \ln p(y_t|x_t) = \sum_{t=1}^{R} \left[ y_t \ln \sigma_t + (1 - y_t) \ln(1 - \sigma_t) \right] \quad (5)$$

where $\sigma_t = \sigma(W^T x_t)$. The posterior of $W$ is intractable due to the non-Gaussian nature of the logistic likelihood. This problem can be approached in two ways. One is the Laplace approximation, in which the width of the distribution is approximated at the maximum a posteriori (MAP) solution by fitting a Gaussian whose curvature matches the Hessian at the MAP solution [18]. However, such methods often fail to capture the true width of the distribution: they may miss areas of high probability mass because they focus only on the region where the probability density is highest. The other approach is the variational approximation, in which, instead of using the MAP solution, a best-fitting distribution closest to the actual posterior is estimated. The distance between the distributions is measured using the Kullback-Leibler (KL) divergence, a non-symmetric measure of the difference between two probability distributions P and Q. Essentially, the KL divergence, denoted $D_{KL}(P||Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}$, is the amount of information lost when the distribution Q is used as an estimate of P.

3.1.1. Variational Inference

The graphical representation of our generalized linear model framework is shown in figure 5. The observed variables are shaded circles representing the feature vectors x and the corresponding labels y. The plate notation denotes R samples of such data, and the arrows indicate dependencies. The hidden (unobserved) variable in the framework is represented by a blank circle, and λ is the variational parameter. The posterior distribution of the hidden weights given


Figure 6: EM Algorithm for Variational Inference

Require: $\xi^{old}$
1: {E-Step}
2: $\lambda(\xi_r^{old}) = \frac{1}{2\xi_r^{old}}\left(\sigma(\xi_r^{old}) - \frac{1}{2}\right)$
3: $\Sigma_R^{-1} = \Sigma_0^{-1} + 2\sum_{r=1}^{R} \lambda(\xi_r^{old})\, x_r x_r^T$
4: $\mu_R \leftarrow \Sigma_R\left(\Sigma_0^{-1}\mu_0 + \sum_{r=1}^{R}\left(y_r - \tfrac{1}{2}\right) x_r\right)$
5: return $q(W) \leftarrow \mathcal{N}(W \mid \mu_R, \Sigma_R)$
1: {M-Step}
2: $(\xi_r^{new})^2 = x_r^T\left(\Sigma_R + \mu_R\mu_R^T\right) x_r$
3: return $\xi_r^{new}$

the target label and the data is represented as:

$$p(W|Y, X) \propto p(Y|X, W)\, p(W) \quad (6)$$

where $p(Y|X, W)$ is the sigmoid likelihood and $p(W)$ is the Gaussian prior on the hidden weights. Consider a variational lower bound on the sigmoid likelihood of the target label [19]:

$$p(Y|X, W) \geq h(W, \xi) \quad (7)$$

where $h(W, \xi)$ is given by:

$$h(W, \xi_t) = e^{z_t y_t}\, \sigma(\xi_t)\, \exp\left\{-\frac{z_t + \xi_t}{2} - \lambda(\xi_t)\left(z_t^2 - \xi_t^2\right)\right\} \quad (8)$$

where $z_t$ is given by $W^T x_t$. The variational posterior distribution $q(W)$ is obtained from:

$$p(Y|X, W)\, p(W) \geq h(W, \xi)\, p(W) \propto q(W) = \mathcal{N}(W|\mu_n, \Sigma_n) \quad (9)$$

Thus $q(W)$ is the variational approximation of $p(W|Y, X)$ and has the functional form of a Gaussian. The approximate inference is achieved by an Expectation Maximization (EM) procedure, as shown in figure 6. In the E-step, the variational parameter is kept fixed and the expected complete-data log likelihood is calculated. The M-step re-estimates the parameter by maximizing the lower bound. The variational dynamic background model (VDBM) sits on top of any recognition based word spotting framework.
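A minimal numerical sketch of the E- and M-step updates of Figure 6 is given below, assuming a zero-mean, unit-covariance Gaussian prior on W as used in our experiments; it is intended only to illustrate the updates and is not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lam(xi):
    # lambda(xi) = (sigma(xi) - 1/2) / (2 xi); guard the xi -> 0 limit (1/8)
    xi = np.where(np.abs(xi) < 1e-8, 1e-8, xi)
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def fit_vdbm(X, y, n_iter=50):
    """Jaakkola-Jordan variational EM for Bayesian logistic regression.

    X : (R, m) score-feature matrix, y : (R,) binary labels.
    Returns the Gaussian posterior q(W) = N(mu_R, Sigma_R).
    """
    R, m = X.shape
    mu0, Sigma0_inv = np.zeros(m), np.eye(m)      # zero-mean, unit-covariance prior
    xi = np.ones(R)                               # initial variational parameters
    for _ in range(n_iter):
        # E-step: update q(W) with xi held fixed
        L = lam(xi)
        Sigma_R = np.linalg.inv(Sigma0_inv + 2.0 * (X.T * L) @ X)
        mu_R = Sigma_R @ (Sigma0_inv @ mu0 + X.T @ (y - 0.5))
        # M-step: re-estimate xi by maximising the lower bound
        A = Sigma_R + np.outer(mu_R, mu_R)
        xi = np.sqrt(np.einsum('ri,ij,rj->r', X, A, X))
    return mu_R, Sigma_R

def predict_keyword_prob(X_new, mu_R):
    # prediction with the posterior mean (a common simplification)
    return sigmoid(X_new @ mu_R)
```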

3.2. Bayesian Generalized Kernel Background Model

In the variational dynamic background model, we placed a prior on the individual weight parameters for each score feature and approximated the posterior using variational inference. The variational approximation in essence models a Gaussian distribution over weights that best separates keywords from non-keywords in the data. We next transform our data to a possibly infinite-dimensional space using kernels and the kernel trick. The goal is to model a non-parametric Bayesian distribution for the given data: we seek a distribution over weights attached to individual samples so that the margin between keywords and non-keywords is maximized. Inspired by Zhang et al. [20], we first represent our dynamic background model as a GLM and let $\phi(\cdot)$ be a feature function that maps $X$ to a Hilbert functional space $\mathcal{H}(X)$. Given binary labels $Y \in \{0, 1\}$ for keywords and non-keywords and score features $X \in \mathbb{R}^M$ obtained as in section 3.1, the goal is to find $f \in \mathcal{H}_K$ that minimizes the loss function:

$$\min_{f \in \mathcal{H}_K}\ \left\{ \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \frac{\lambda}{2}\,\|f\|^2_{\mathcal{H}_K} \right\} \quad (10)$$

where $\mathcal{H}_K$ is the Reproducing Kernel Hilbert Space, $\lambda$ is the regularization constant, and the norm $\|f\|^2_{\mathcal{H}_K}$ is given by:

$$\|f\|^2_{\mathcal{H}_K} = \sum_{i,j=1}^{n} \beta_i \beta_j K(x_i, x_j) \quad (11)$$

where $K(\cdot,\cdot)$ is the kernel function and the $\beta_i$ are the weights associated with each input. By the Representer Theorem [21], the solution to equation 10 can be expressed as:

$$f(x) = \omega + \sum_{i=1}^{n} \beta_i K(x, x_i) \quad (12)$$

where $\omega$ is a constant. Equation 10 can then be written in terms of the weight vector $\beta$ as:

$$\min_{\omega, \beta}\ \left\{ \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, \omega + k_i^{\prime}\beta\bigr) + \frac{\lambda}{2}\,\beta^{\prime} K \beta \right\} \quad (13)$$

where $\beta$ is the $n \times 1$ weight vector, $K$ is the $n \times n$ kernel matrix $[k_1, k_2, \ldots, k_n]$ and $k_i = \bigl(K(x_i, x_1), K(x_i, x_2), \ldots, K(x_i, x_n)\bigr)^{\prime}$. As in Zhang et al. [20], we express the kernel in terms of a feature function $\psi(x)$. Given a Mercer kernel $K: X \times X \rightarrow \mathbb{R}$, there exists a feature function $\psi$ that maps the input space $X$ to a feature space $F \subseteq \mathbb{R}^p$. The feature vector of $x$ can be represented as $\psi(x) = (\psi_1(x), \psi_2(x), \ldots, \psi_p(x))$ such that $K(x_i, x_j) = \psi(x_i)\,\psi(x_j)^{\prime}$. The feature function $\psi$ can be expressed in terms of the orthogonal eigenfunctions of the square-integrable Hilbert functional space $L_2(X)$ and the corresponding eigenvalues as $\psi_j(x) = \sqrt{\lambda_j}\,\phi_j(x)$, $j = 1, \ldots, p$, with

p X

0

qj ψj (x) = ω + ψ(x) q

(14)

j=1 0

where q = [q1 , q2 ..qp ] . Since p can be infinite, we approximate f (·) by making all qi = 0 for all i > n where n is the number of samples. It is easy to show 0

that q = βΨ and β = K−1 Ψq where K = ΨΨ . As in Zhang et al. [20], let qk be all independent random variables sampled from a gaussian distribution 235

with mean E(qk ) = 0 and variance E(qk2 ) = g −1 . Hence prior on β can be represented as β ∼ N (0, g −1 K −1 ) which was termed as Silverman g-prior by Zhang.

3.2.1. Inference We assume that g is sampled from a Gamma distribution Gamma(k, θ) with shape parameter k and scale parameter θ and ω is sampled from a Gaussian distribution N (0, σω ). The joint probability of our framework can thus be represented as: p(y, z, β, ω, g) = p(y|z)p(z|β, ω)p(β|g)p(g); 15

(15)

We try to formulate Markov Chain Monte Carlo approach to solve for the parameters. Since β is sampled from Normal Distribution β ∼ N (0, g −1 K −1 ). Its conjugate prior is a Normal distribution: p(β|z) ∼ N (µpo , σpo ) (16)

µpo = σpo Kz 0

σpo = (g −1 K −1 + KK )−1 β and z, however are strongly correlated, hence as suggested by Holmes et al. [22], we jointly update β and z such that: p(β, z|y) = p(z|y)p(β|z);

(17)

where p(β|z) is obtained from equation 16. In order to obtain p(z|y) we apply a Gibbs Sampling approach as suggested by Holmes et al.[22]. p(zi |z−i , yi ) ∝

N (µi , σi )I(zi > 0)

yi = 1

(18)

N (µi , σi )I(zi ≤ 0) otherwise

µi = xi ∗ µpo − wi ∗ (zi − xi ∗ µpo ) where

(19)

σi = 1 + wi 0

0

wi = Hii where H = Kσpo K /(1 − Kσpo K ) and z−i represent all elements of z except ith element zi . The posterior mean 240

old µpo is recalculated after every update of zi as µpo = µold po + Si (zi − zi ) where

Si is the ith column of S = σpo K. Our MCMC algorithm can thus be summarized as: • Sample g from Gamma Distribution • Sample β from the Multivariate Gaussian Distribution. 245

• Compute each zi from p(zi,j |β, ω). • Calculate misclassification error for each zi . • Update gamma with the misclassification error and iterate.

16

4. Segmentation Free Recognizer The recognizer used in this work is a line level recognizer. It is trained on 250

line images taken from publicly available IAM dataset [23]. The documents are segmented into lines by algorithm proposed by Shi et al. [24]. The height of each line image is fixed to 100 pixels and line image is resized retaining its aspect ratio. The height normalized line image is skew corrected by method proposed by Yan et al. [25]. Features are extracted using a sliding window to

255

learn character level Hidden Markov Models (HMM) for each character. 4.1. Feature Extraction The input line image is divided into overlapping frames of fixed width 20 with 85% overlap. Each frame is divided into two vertical bins so that the number of foreground pixels are almost same in both bins. From each bin

260

two kinds of features are extracted, the Gradient, Structural and Concavity (GSC) features proposed by Favata et al.[26] and intensity features proposed by Vinciarelli et al.[27]. In the GSC features, the gradient features capture the local orientation of strokes, the structural features provide information about stroke trajectories and the concavity features capture the stroke relationship.

265

Wshah et al. [5] showed a comparison of all possible combination of these features and concluded that the combination of gradient features from GSC and Intensity features from Vinciarelli showed best results. We use a similar combination of gradient and intensity features for each frame. The gradient features are obtained by dividing each frame into two equal halves based on the

270

center of mass. In each half, gradient is calculated for each pixel based on its neighboring pixel and quantized into angles. The angles are then binned into 8-directional bins and a normalized histogram of gradient is calculated for each half. They thus constitute 8 × 2 direction vector for each frame. To calculate the intensity features, each frame is divided into 4 ∗ 2 cells and for each cell the

275

number of black and white pixel ratio is calculated and stored as 8 dimensional vector. Thus the overall dimension of feature vector extracted from each frame is 16 + 8 = 24. 17

4.2. Character Models Features extracted above from each frame of a line along with the transcription of the line is used to learn individual character models for each unique character in the dataset. The number of states in each character model is empirically fixed to 14. The features extracted from each frame form the 24dimensional observation vector. For each state Si the observations are modeled to be sampled from a Gaussian Mixture Model with 24 mixing components. Oi =

24 X

wj ∗ N (µj , σj )

(20)

P24

wj = 1 and N is standard

j=1

where Oi is the observation at the ith frame and 280

j=1

normal distribution.

5. Experimental Setup 5.1. Corpus We evaluate the DBM, VDBM and BGKBM models on the public IAM dataset for English [23]. The IAM dataset consists of 1539 pages of handwrit285

ten text from the LancasterOslo/Bergen corpus[28]. The dataset comprises of documents written by 657 writers including 115,320 handwritten words and close to 13,353 lines of text. The underlying lexicon includes more than 12,000 unique words. The individual handwritten text lines of the scanned forms have been extracted and thus are also available separately. The dataset contains

290

forms of unconstrained handwritten text, which were scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels. A sample document image is shown in Figure 7. 5.1.1. VDBM Evaluation The Variational Dynamic Background Model(VDBM) was learned on la-

295

beled samples of 1000 genuine keywords and non-keyword. A gaussian prior on the weights was initialized with zero mean and unit covariance. The score features obtained from the genuine and impostor samples were passed to the 18

Figure 7: Sample Scanned Document from the IAM Dataset

EM algorithm of the variational inference that ran until convergence. To evaluate the model, we chose top 60 and top 200 most frequent occurring terms in 300

the IAM dataset. Two separate experiments with top 60 and top 200 terms as keywords were conducted. The average length of the keywords for lexicon of size 60 and 200 was approximately 7 characters. In both cases the system was evaluated in terms of average precision and average recall. The results are shown in Table 1. The variational dynamic background outperforms the normal

305

dynamic background model and the work by fischer. We apply different threshold on sigmoid output from bayesian logistic regression classifier to generate the precision recall curve shown in figure 8. With increase in the size of the lexicon the precision and recall decreases. This is expected. The proposed approach outperforms fischer’s algorithm for both lexicon sizes of 60 and 200 in terms of

310

precision and recall and Wshah et al. [5] in terms of recall.
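The thresholding step can be reproduced with a few lines of Python; the scores and labels below are made-up placeholders, and scikit-learn is used only for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve  # illustrative choice of library

# `probs` stands in for the sigmoid outputs of the VDBM on candidate keyword
# regions and `labels` for the 0/1 ground truth; both arrays are hypothetical.
probs = np.array([0.92, 0.15, 0.64, 0.37, 0.81])
labels = np.array([1, 0, 1, 0, 1])
precision, recall, thresholds = precision_recall_curve(labels, probs)
for p, r, t in zip(precision, recall, np.append(thresholds, np.nan)):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```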


Figure 8: Precision-recall curves for scanned images from the IAM dataset using the Variational Dynamic Background Model and the Bayesian Generalized Kernel Background Model on the segmentation free recognizer, compared with the state of the art.

5.1.2. BGKBM Evaluation

The Bayesian Generalized Kernel Background Model was learned using 2,000 samples of keyword and non-keyword images. The Markov Chain Monte Carlo (MCMC) sampler was run for 1,000 iterations, with the first 100 iterations treated as burn-in. The shape and scale parameters of the Gamma distribution were empirically fixed to 1.1 and 2.0. About 400 labeled samples of keywords and non-keywords were held out for validation. As before, the model was evaluated on a similar setup with 60 and 200 keywords chosen from the IAM dataset. The results are shown in Table 1. The BGKBM outperforms the VDBM and the line models proposed by Wshah et al. [5] and Fischer et al. [4] in terms of recall, showing that a more complex function better models the variation in writing styles and provides a better separation between keywords and non-keywords. Since the BGKBM uses more complex kernels, it is able to better model the patterns associated with the genuine and impostor keyword scores. Both of the proposed methods outperform the other state-of-the-art approaches in terms of F-measure, which is given by $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
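For reference, the computation can be checked directly against Table 1 below; for example, the VDBM entry for lexicon size 60 follows from its precision and recall values.

```python
def f_measure(precision, recall):
    # F = 2 * P * R / (P + R)
    return 2 * precision * recall / (precision + recall)

# VDBM, lexicon size 60, from Table 1: P = 0.4451, R = 0.4992
print(round(f_measure(0.4451, 0.4992), 4))   # -> 0.4706
```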


Table 1: Spotting accuracy on the IAM dataset using the segmentation free recognizer

                       |        Lexicon Size 60              |        Lexicon Size 200
Model                  | Avg. Prec  Avg. Recall  F-measure   | Avg. Prec  Avg. Recall  F-measure
Fischer et al. [4]     |  0.2152      0.2664       0.2380    |  0.2871      0.1792       0.2207
Wshah et al. [5]       |  0.4998      0.3272       0.3955    |  0.4951      0.3066       0.3787
DBM                    |  0.3611      0.4790       0.4118    |  0.2966      0.4037       0.3420
VDBM                   |  0.4451      0.4992       0.4706    |  0.3467      0.4216       0.3805
BGKBM                  |  0.4062      0.5968       0.4834    |  0.3607      0.5584       0.4383

5.1.3. Processing Time

We also evaluated the average processing time for about 1,000 line images from the IAM dataset and compared it with other state-of-the-art approaches; the numbers are compared in Table 2. The processing of each line image has two major stages: first, score feature extraction, where the score features are extracted at the character and word level, and second, the prediction stage. The score feature extraction is common to both the VDBM and BGKBM approaches, and their prediction times are comparable, so the overall processing times of the VDBM and BGKBM are similar. While the overall processing time is better than that of Fischer, it is marginally higher than that of the character filler model approach proposed by Wshah et al. [5]. As the number of keywords increases, the score feature extraction takes a little more time, since the recognizer takes longer to return the word level scores and the corresponding character level scores. The prediction time remains constant as the lexicon size increases for both of the proposed frameworks.

6. Conclusion

We proposed two recognition based Bayesian background models for keyword spotting. Being Bayesian, the models avoid overfitting the data and are robust to unseen data. The variation in handwriting styles is well captured as

Table 2: Processing speed per line image of the proposed frameworks compared with other state-of-the-art approaches on the IAM dataset with different numbers of keywords

Dataset           | #keywords | Wshah   | Fischer | Proposed VDBM | Proposed BGKBM
IAM for English   |    60     |  83 ms  | 195 ms  |    103 ms     |    109 ms
IAM for English   |   200     | 320 ms  | 406 ms  |    343 ms     |    347 ms

well. The Bayesian formulations of the generalized linear models and generalized kernel models are presented in this work. The spotting framework learns regression coefficients in a Bayesian framework for the local character level scores and global word level scores returned by the recognizer. The accuracy of both models can be improved by including more labeled samples of keywords and non-keywords. Labeled data being expensive, active learning approaches can be applied to improve these models, and that will be our focus in future work.


7. Acknowledgment

The authors would like to thank Safwan Wshah for valuable discussions and for providing the code for his keyword spotting framework.

References

[1] J. A. Rodríguez-Serrano, F. Perronnin, Handwritten word-spotting using hidden Markov models and universal vocabularies, Pattern Recognition 42 (9) (2009) 2106–2116.

[2] R. Manmatha, C. Han, E. Riseman, Word spotting: a new approach to indexing handwriting, in: Proceedings of the 1996 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '96), 1996, pp. 631–637. doi:10.1109/CVPR.1996.517139.

[3] T. M. Rath, R. Manmatha, Word spotting for historical documents, International Journal on Document Analysis and Recognition (2007) 139–152.

[4] A. Fischer, A. Keller, V. Frinken, H. Bunke, Lexicon-free handwritten word spotting using character HMMs, Pattern Recognition Letters 33 (7) (2012) 934–942. doi:10.1016/j.patrec.2011.09.009.

[5] S. Wshah, G. Kumar, V. Govindaraju, Statistical script independent word spotting in offline handwritten documents, Pattern Recognition 47 (3) (2014) 1039–1050. doi:10.1016/j.patcog.2013.09.019.

[6] G. Kumar, S. Wshah, V. Govindaraju, Variational dynamic background model for keyword spotting in handwritten documents (2013). doi:10.1117/12.2041244.

[7] S. Madhvanath, E. Kleinberg, V. Govindaraju, Holistic verification of handwritten phrases, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (12) (1999) 1344–1356. doi:10.1109/34.817412.

[8] J. A. Rodriguez, F. Perronnin, Local gradient histogram features for word spotting in unconstrained handwritten documents, in: Proc. ICFHR 2008, 2008.

[9] R. Saabni, J. El-Sana, Keyword searching for Arabic handwritten documents, in: 11th International Conference on Frontiers in Handwriting Recognition (ICFHR 2008), Montreal, 2008, pp. 716–722.

[10] B. K. Mallick, D. Ghosh, M. Ghosh, Bayesian classification of tumours by using gene expression data, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2) (2005) 219–234. doi:10.1111/j.1467-9868.2005.00498.x.

[11] Z. Zhang, M. I. Jordan, Bayesian multicategory support vector machines, ArXiv e-prints, arXiv:1206.6863.

[12] J. F. Bobb, L. Valeri, B. Claus Henn, D. C. Christiani, R. O. Wright, M. Mazumdar, J. J. Godleski, B. A. Coull, Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures, Biostatistics. doi:10.1093/biostatistics/kxu058.

[13] M. Menor, K. Baek, G. Poisson, Probabilistic prediction of protein phosphorylation sites using classification relevance units machines, SIGAPP Applied Computing Review 12 (4) (2012) 8–20. doi:10.1145/2432546.2432547.

[14] M. Gönen, Bayesian efficient multiple kernel learning, in: Proceedings of the 29th International Conference on Machine Learning, 2012.

[15] V. Frinken, A. Fischer, R. Manmatha, H. Bunke, A novel word spotting method based on recurrent neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2) (2012) 211–224. doi:10.1109/TPAMI.2011.113.

[16] G. Kumar, S. Wshah, V. Govindaraju, R. Sitaram, Segmentation-free keyword spotting framework using dynamic background model, in: DRR, 2013.

[17] G. Kumar, Z. Shi, S. Setlur, V. Govindaraju, S. Ramachandrula, Keyword spotting framework using dynamic background model, in: The 13th International Conference on Frontiers in Handwriting Recognition (ICFHR 2012), 2012.

[18] A. Azevedo-Filho, R. D. Shachter, Laplace's method approximations for probabilistic inference in belief networks with continuous variables, in: UAI, 1994, pp. 28–36.

[19] M. Jordan, T. Jaakkola, A variational approach to Bayesian logistic regression models and their extensions, in: Workshop on Artificial Intelligence and Statistics, 1996.

[20] Z. Zhang, G. Dai, M. I. Jordan, Bayesian generalized kernel mixed models, Journal of Machine Learning Research 12 (2011) 111–139.

[21] B. Schölkopf, R. Herbrich, A. J. Smola, A generalized representer theorem, in: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory (COLT '01/EuroCOLT '01), Springer-Verlag, London, UK, 2001, pp. 416–426.

[22] C. C. Holmes, L. Held, Bayesian auxiliary variable models for binary and multinomial regression, Bayesian Analysis 1 (1) (2006) 145–168. doi:10.1214/06-BA105.

[23] U.-V. Marti, H. Bunke, The IAM-database: an English sentence database for offline handwriting recognition, International Journal on Document Analysis and Recognition 5 (2002) 39–46. doi:10.1007/s100320200071.

[24] Z. Shi, S. Setlur, V. Govindaraju, A steerable directional local profile technique for extraction of handwritten Arabic text lines, in: ICDAR, IEEE Computer Society, 2009, pp. 176–180.

[25] H. Yan, Skew correction of document images using interline cross-correlation, CVGIP: Graphical Models and Image Processing 55 (6) (1993) 538–543. doi:10.1006/cgip.1993.1041.

[26] J. T. Favata, G. Srikantan, A multiple feature/resolution approach to handprinted digit and character recognition, International Journal of Imaging Systems and Technology 7 (4) (1996) 304–311. doi:10.1002/(SICI)1098-1098(199624)7:4<304::AID-IMA5>3.0.CO;2-C.

[27] A. Vinciarelli, J. Luettin, Off-line cursive script recognition based on continuous density HMM, Idiap-RR-25-1999, IDIAP, 1999.

[28] S. Johansson, G. Leech, H. Goodluck, Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers, 1978.