Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents


Pattern Recognition 47 (2014) 1880–1890 Contents lists available at ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/pr ...



Heng Zhang, Da-Han Wang, Cheng-Lin Liu*

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Beijing 100190, PR China

Article info

Article history: Received 22 January 2013; received in revised form 11 November 2013; accepted 2 December 2013; available online 11 December 2013

Keywords: Online Chinese handwritten documents; Keyword spotting; Posterior probability; N-best list; Confidence measure; Confusion network

Abstract

In keyword spotting from handwritten documents by text query, the word similarity is usually computed by combining character similarities, which are desired to approximate the logarithm of the character probabilities. In this paper, we propose to directly estimate the posterior probability (also called confidence) of candidate characters based on the N-best paths from the candidate segmentation-recognition lattice. On evaluating the candidate segmentation-recognition paths by combining multiple contexts, the scores of the N-best paths are transformed to posterior probabilities using soft-max. The parameter of soft-max (confidence parameter) is estimated from the character confusion network, which is constructed by aligning different paths using a string matching algorithm. The posterior probability of a candidate character is the summation of the probabilities of the paths that pass through the candidate character. We compare the proposed posterior probability estimation method with some reference methods including the word confidence measure and the text line recognition method. Experimental results of keyword spotting on a large database CASIA-OLHWDB of unconstrained online Chinese handwriting demonstrate the effectiveness of the proposed method. © 2013 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +86 10 82544797; fax: +86 10 82544594. E-mail address: [email protected] (C.-L. Liu).

0031-3203/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2013.12.001

1. Introduction

With the increasing use of digitizing tablets, tablet PCs, digital pens (such as the Anoto Pen) and smart phones, online handwritten documents are produced constantly. This calls for efficient retrieval techniques to exploit the semantic information in the documents. The automatic recognition of handwritten texts, such as postal mail addresses [1], manuscripts or entire books [2], has been a focus of intensive research for several decades. However, the recognition of unconstrained handwriting is still a challenge due to the divergent writing styles, and insufficient handwriting recognition performance leads to low performance of text search. Keyword spotting from handwritten documents is of interest because keywords partially satisfy the need of information search, and the located keywords form the basis of vector space representation for document clustering and classification. Major applications of keyword spotting include handwritten notes, digital libraries [3] and historical document retrieval [4], which involve large volumes of documents and entail efficient techniques of document information retrieval.

Keyword spotting is usually accomplished by computing a similarity measure between the query word and a segmented candidate in the document. According to the word similarity scoring technique, keyword spotting methods can be grouped into image matching-based ones [5] and word/character model-based ones [6]. Both can be applied to text (keyboard) query and handwriting query. The image-to-image matching technique does not involve training because the distance measure is directly computed by flexible matching between the query (input image or an image synthesized from the text query) and the candidate word images. However, it suffers from low matching accuracy and high computational cost. In contrast, the word-model-based method can be used for retrieving multi-writer documents, by computing similarity scores using a trained character/word model. The advantage of model-based retrieval over image-to-image matching has been verified by previous works, e.g., the experiments of Cao et al. [7].

Recently, some keyword spotting methods have been proposed to search for words in text lines without the need to segment lines into individual words. A line-oriented approach uses dynamic time warping (DTW) to automatically spot keyword candidates from handwritten text lines [8]. Some keyword spotting methods are based on text line recognition using, e.g., Hidden Markov Models (HMM) [9] and recurrent neural networks (RNN) [10]. Similar methods have been proposed for keyword spotting in speech [11] as well. To better account for the geometric context, Cao et al. [7] proposed to incorporate the probability of candidate word segmentation into the word similarity to enhance the spotting performance. Some keyword spotting methods in speech [12] or handwriting [13] compute the word probability by summing up the probabilities of the paths passing through the candidate word. In such speech/handwriting recognition systems, the path probability is computed by combining the word recognition scores and the language model. Unlike the HMM, many Chinese text line recognition methods, such as those in [14] and [15], use non-probabilistic character/word models, and thus the estimation of path probability is not straightforward. In such a scenario, the estimation of path and character/word probabilities has not been investigated adequately.

The proposed method is based on N-best paths and computes the posterior probabilities of candidate characters by summing up the confidence measures of the paths. The N-best paths are searched for in the candidate segmentation-recognition lattice generated after over-segmentation, with the paths evaluated by combining multiple contexts (character recognition score, language model and geometric model) [14]. In Chinese handwriting recognition, such methods based on over-segmentation have shown superiority [14,15] compared to HMM-based recognition [16]. With the weighted combination of multiple contexts, the path score is a distance or similarity measure between the candidate segmentation path and a hypothesized string class. This is different from the HMM-based speech and handwriting recognition systems, where the path score is already a probability. The path scores can be transformed to posterior probabilities using the soft-max [17]. We propose to estimate the parameters of confidence transformation by stochastic gradient descent to optimize the margin of competing candidates in the same bin of the character confusion network (CCN), which is generated from the N-best paths of a training dataset of text lines. Character probabilities are then computed from the path probabilities and combined into word similarities. Based on path evaluation and N-best path search, the proposed keyword spotting model fully exploits the multiple contexts used in text line recognition.
We evaluated our proposed method on the online Chinese handwriting database CASIA-OLHWDB [18], and demonstrated the effectiveness of the proposed method compared with some reference methods. This work is an extension of our previous confidence-based method [19], which estimates the confidence parameter using the cross-entropy (CE) loss. The extension mainly lies in the CCN construction and decoding for confidence parameter estimation, and the extensive experimental results and discussions. The major contributions of this work are as follows. (1) We propose a character-confidence-based keyword spotting method for Chinese handwriting, which involves a large number of character classes. (2) We propose a character confidence estimation method based on CCN, which plentifully exploits the multiple contexts in candidate segmentation-recognition path evaluation. (3) We evaluate the related methods for both character confidence and word confidence estimation, and demonstrate the superiority of character confidence over direct word confidence estimation. (4) Through elaborate estimation of character confidence, the character model-based keyword spotting method yields superior retrieval performance compared to text line recognition (transcription)-based text search. The rest of this paper is organized as follows. In Section 2, we give a brief review of the related works on character/word confidence estimation. Section 3 gives the overview of our keyword spotting system. Section 4 describes our method of character confidence estimation. Section 5 presents the character confusion network decoding strategy for confidence parameter estimation. Section 6 presents the experimental results and Section 7 offers concluding remarks.

2. Related works

The estimation of character/word confidence in speech or handwriting recognition is needed because the recognition model often does not output probabilities. Though some models are derived from the Bayesian decision theory, they approximate probabilities very roughly. According to the Bayesian decision rule, the text line pattern $X$ is classified to the optimal string class (word sequence) $W_{\mathrm{opt}}$ according to the maximum posterior probability:

$$W_{\mathrm{opt}} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\,P(W). \tag{1}$$

The computation of the probability $P(X)$ is complicated and is normally omitted since it is independent of the word sequence. This facilitates the comparison of competing sequences of words, but makes the estimation of posterior probability an issue. The posterior probabilities of word sequences have been effectively estimated based on word graphs and N-best lists. The posterior probability of an individual word can then be evaluated by summing up the posterior probabilities of the paths passing through it. The posterior probabilities on the word graph (compressed lattice) [20] or word lattice [21] are usually computed using a forward–backward algorithm. Since the word graph is a compact and fairly accurate representation of all the alternative competing hypotheses of word sequences, the posterior probability calculated on it is a fairly accurate estimate. However, the generation of the word graph and the computation on it are rather complicated, especially in large vocabulary recognition systems. The N-best list [22] is a much simpler approximation to the lattice than the word graph, and can be used for estimating the posterior probabilities of words efficiently. For optimizing the parameters of confidence estimation from word graphs or N-best lists, word confusion networks [23] are generated from the training data. When constructing the confusion network, finding the optimal alignment of multiple strings is a problem for which no efficient solution is known [24], and several heuristic methods have been proposed for approximate solutions. In machine translation systems [25], the confusion networks are usually built around a “skeleton” hypothesis (usually the top rank) from the N-best list. All the hypotheses are aligned against the skeleton independently and the confusion networks are created from the union of these alignments.
In speech [26] and handwriting [13] recognition, the confusion network can be generated using a two-step word clustering procedure, including an intra-word clustering step and an inter-word clustering step.

In the field of general pattern classification, there have been many methods for transforming classifier output measures to posterior probabilities or generalized confidence measures. The commonly used methods include the soft-max [27,28], logistic regression (sigmoidal) [29], and evidence combination of sigmoidal [30,31]. These methods can be applied to the estimation of path probabilities in character string (text line) recognition, by either considering string classes or combining character probabilities into path probabilities. In character string recognition, the candidate segmentation-recognition lattice includes both legal characters and non-characters. Considering non-character samples in character confidence estimation has been shown to improve the string recognition performance [14].

Our proposed method in this work estimates the posterior probabilities of candidate segmentation-recognition paths by optimizing an objective for discriminating confusing characters, and then estimates the probabilities of characters from the path probabilities. This enables full exploitation of the contexts in text line recognition, and benefits the character-confidence-based keyword spotting.
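As a rough illustration of how character probabilities can be obtained from path probabilities over an N-best list, the following Python sketch converts path scores to posteriors with a soft-max and sums them per candidate character. It is only a sketch under stated assumptions: path scores are treated as distances (smaller is better), the scale `alpha` is fixed by hand rather than estimated, and the function names, the `(class_label, start_segment, end_segment)` tuple layout and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def path_posteriors(distances, alpha):
    """Soft-max over N-best path scores treated as distances (smaller is better)."""
    best = min(distances)
    # Subtracting the best score keeps every exponent <= 0, so exp() cannot overflow.
    weights = [math.exp(-alpha * (d - best)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def character_posteriors(paths, distances, alpha):
    """Sum path posteriors over the paths that contain each candidate character.

    Each path is a list of (class_label, start_segment, end_segment) tuples
    (a hypothetical layout chosen for this sketch).
    """
    posts = path_posteriors(distances, alpha)
    char_post = defaultdict(float)
    for path, p in zip(paths, posts):
        for cand in set(path):  # count a candidate once per path
            char_post[cand] += p
    return dict(char_post)


# Toy 3-best list: the first two paths share the candidate ("A", 0, 1).
paths = [
    [("A", 0, 1), ("B", 2, 3)],
    [("A", 0, 1), ("C", 2, 3)],
    [("D", 0, 0), ("E", 1, 3)],
]
distances = [10.0, 11.0, 13.0]
posteriors = character_posteriors(paths, distances, alpha=1.0)
```

In this toy example the shared candidate `("A", 0, 1)` collects the probability mass of both paths that pass through it, which is the effect that summing over paths is meant to capture.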

3. System overview

For possible applications of retrieval from a large database of documents, our system of online Chinese handwritten document retrieval consists of two stages: indexing and retrieval (keyword spotting), as shown in Fig. 1. The indexing stage generates candidate text line recognition results and confidence measures, while the retrieval stage spots keywords in response to the query.

In the indexing stage, the input document is first segmented into text lines (we use the algorithm of [32]). Each text line is over-segmented [33] into primitive segments (stroke blocks) by stroke grouping according to the off-stroke distance. Candidate character patterns are generated by combining consecutive segments and are assigned candidate classes by the character classifier. The candidate patterns and classes are represented in a candidate segmentation-recognition lattice, which contains many paths, each corresponding to a hypothesis of the recognition result. An example of text line segmentation and over-segmentation is shown in Fig. 2(a), and an example of a candidate segmentation lattice (only candidate patterns) is shown in Fig. 2(b). On evaluating the candidate segmentation-recognition paths by combining character classification scores, linguistic context and geometric context [14], the N-best paths are obtained using the beam search algorithm [34], wherein the partial paths ending at each candidate segmentation point are sorted and at most N partial paths with maximum scores are retained for extension. The scores of the N-best paths are converted to posterior probabilities using soft-max, and the probabilities of character classes are computed from the path probabilities. To save storage space in indexing, the candidate characters in the N-best list are rearranged into a compact lattice by merging the same candidate character in different paths. For each line, we store the number of primitive segments and the bound of each primitive segment.
At each segmentation point, we store the number of candidate characters ending at this point, and for each candidate character, its primitive segment number, candidate class labels and confidence values.

Keyword spotting is performed by dynamic search [6] from the compact lattice constructed from the N-best list, wherein the query word is compared with sequences of candidate characters (partial paths in the candidate lattice) with every primitive segment as the start. The word matching score is obtained by combining the character similarity scores. When the word similarity is greater than a pre-specified threshold, a word instance is located in the document. Fig. 3(a) shows an illustrative example of dynamic search for spotting a word from a sequence of primitive segments as shown in Fig. 3(b). The parameters for confidence transformation of N-best paths are estimated using confusion network decoding on a set of training documents, as will be elaborated in Section 5.

Fig. 1. Block diagram of the keyword spotting system. (Blocks: Handwritten Document, Document Segmentation, Candidate Lattice, Beam-Search, N-best List, Confidence Measure, Compact Lattice, Index File, Storage, Document Database, Text Query, Retrieval, Spotting Results.)

Fig. 2. (a) Text line segmentation of a document and over-segmentation of a line. (b) Candidate segmentation lattice.

Fig. 3. (a) Dynamic search for spotting a query word. (b) Spotted instances in a document.
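The search over the compact lattice described in this section can be sketched in Python as follows. This is a simplified illustration and not the dynamic search of [6]: it enumerates matching partial paths recursively, scores a word by the mean log-confidence of its characters, and assumes a hypothetical lattice layout in which `candidates` maps a start segment to a list of `(end_segment, {label: confidence})` entries. All names and the toy data are assumptions made for this example.

```python
import math


def spot_keyword(candidates, query, threshold):
    """Return (start_segment, end_segment, score) for lattice paths spelling `query`.

    `candidates` maps a start segment to a list of (end_segment, {label: prob}).
    The word score is the mean log-probability of the matched characters.
    """
    hits = []

    def extend(seg, pos, log_sum, first):
        if pos == len(query):
            score = log_sum / len(query)
            if score > threshold:
                hits.append((first, seg - 1, score))
            return
        # Only candidates starting right after the previous character are allowed.
        for end, labels in candidates.get(seg, []):
            prob = labels.get(query[pos], 0.0)
            if prob > 0.0:
                extend(end + 1, pos + 1, log_sum + math.log(prob), first)

    for start in sorted(candidates):  # every primitive segment is a possible start
        extend(start, 0, 0.0, start)
    return hits


# Toy lattice over three primitive segments, with placeholder class labels.
candidates = {
    0: [(0, {"K": 0.6, "R": 0.3}), (1, {"M": 0.4})],
    1: [(1, {"W": 0.7}), (2, {"W": 0.2})],
    2: [(2, {"Z": 0.9})],
}
hits = spot_keyword(candidates, "KW", threshold=math.log(0.5))
```

With the threshold set to log 0.5, only the instance covering segments 0 to 1 is kept; the alternative segmentation that ends at segment 2 matches the query labels but falls below the threshold.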


4. Character probability estimation

In keyword spotting, the similarity between the query word $W = c_1 \ldots c_n$ and a sequence of candidate characters $X = x_1 \ldots x_n$ is obtained by combining the similarities of the candidate characters [6]:

$$\mathrm{Sim}(W, X) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}(c_i, x_i), \tag{2}$$

where $\mathrm{sim}(c_i, x_i)$ is the similarity between candidate pattern $x_i$ and character class $c_i$. The character similarity is desired to approximate the logarithm of the posterior probability $P(c_i \mid x_i)$. We evaluate the character probability from the posterior probabilities of the N-best paths of text line recognition, while the path probabilities are transformed from the path evaluation scores.

4.1. Path evaluation

On the candidate lattice, the segmentation of a path can be denoted as a sequence of candidate patterns $X = x_1 \ldots x_n$ and its label as $C = c_1 \ldots c_n$. Each path $(C, X)$ is then evaluated by the scoring function combining multiple contexts [14]:

$$d(C, X) = -\sum_{i=1}^{n} \Big\{ k_i \log P(c_i \mid x_i) + \lambda_1 \log P(c_i \mid c_{i-1}) + \lambda_2 \log P(c_i \mid g_i^{uc}) + \lambda_3 \log P(z_i^{p} = 1 \mid g_i^{ui}) + \lambda_4 \log P(c_{i-1}, c_i \mid g_i^{bc}) + \lambda_5 \log P(z_i^{g} = 1 \mid g_i^{bi}) \Big\}, \tag{3}$$

where $k_i$ is the number of primitive segments composing the candidate character $x_i$, and the probabilities are the character classification score $P(c_i \mid x_i)$, the bi-gram linguistic score $P(c_i \mid c_{i-1})$, the unary class-dependent and class-independent geometric scores $P(c_i \mid g_i^{uc})$ and $P(z_i^{p} = 1 \mid g_i^{ui})$, and the binary class-dependent and class-independent geometric scores $P(c_{i-1}, c_i \mid g_i^{bc})$ and $P(z_i^{g} = 1 \mid g_i^{bi})$, respectively. Except for the linguistic score, all these constituent probabilities are approximated from the outputs of classifiers on character features or geometric features.

For the large category set problem of Chinese character classification, we use a state-of-the-art classifier, the modified quadratic discriminant function (MQDF) [35]. The MQDF is a modified version of the quadratic discriminant function (QDF), which is derived from the Bayesian classifier under the multivariate Gaussian density assumption. In the MQDF, the minor eigenvalues of each class in the QDF are replaced by a constant, such that only the principal eigenvectors (50 in our experiments) are used in the discriminant function. This helps reduce the computational complexity and meanwhile benefits the generalization performance.

To build the geometric models, we extract features for unary and binary geometry from the bounding boxes of a candidate character pattern and from two adjacent character patterns, respectively [15]. Due to the large number of Chinese characters and the fact that many different characters have similar geometric features, we cluster the character classes into six super-classes using the EM algorithm.

The combining weights in path evaluation are learned on text line samples by Minimum Classification Error (MCE) training [36], which has been popularly used in speech recognition and handwriting recognition. According to the previous results of Chinese handwriting recognition [14], the string recognition performance with the character classifier only is inferior, and the geometric models can increase the recognition accuracy. The statistical language model is much more effective than the geometric models, and the combination of all the context models yields the best performance, justifying that geometric context and linguistic context are complementary.

4.2. Probabilities of individual candidate characters

The probability of each candidate character can be computed on the N-best list obtained using beam search [34], wherein the partial paths ending at each candidate segmentation point are sorted and at most N partial paths with maximum scores are retained for extension. Let $b$ and $e$ denote the start and end segments of a candidate character of class $c$, respectively. The $i$-th path can be represented as a sequence of characters $(C, X^S)_i = (c_{i1}, x_{b_{i1}}^{e_{i1}}), \ldots, (c_{iM_i}, x_{b_{iM_i}}^{e_{iM_i}})$, where $M_i$ is the number of candidate characters on the path. Given a text line $X$, the posterior probability of a candidate character $(c, x_b^e)$ can be computed by summing up the posterior probabilities of the paths which contain this candidate character [12]:

$$P((c, x_b^e) \mid X) = \sum_{(C, X^S)_i \,:\, (c, x_b^e) \in (C, X^S)_i} P((C, X^S)_i \mid X). \tag{4}$$

According to Eq. (1), the posterior probability $P((C, X^S)_i \mid X)$ can be computed as

$$P((C, X^S)_i \mid X) = \frac{P((C, X^S)_i, X)}{P(X)}. \tag{5}$$

Reasonably assuming a Gaussian distribution with equal variance for the path score of different classes, the numerator is proportional to the exponential of the path score:

$$P((C, X^S)_i, X) \propto \exp[-\alpha \cdot d((C, X^S)_i, X)], \tag{6}$$

where $d((C, X^S)_i, X)$ is the path score as in Eq. (3) and the parameter $\alpha$ is estimated to optimize the probability fitting or a margin criterion. $\alpha$ is related to the variance of the path score for different path classes and controls the hardness of confidence transformation by soft-max. The probabilities of the remaining paths beyond the N-best list can be viewed as zero because they have low scores, so the probability $P(X)$ is computed by

$$P(X) = \sum_{j=1}^{N} P((C, X^S)_j, X) \propto \sum_{j=1}^{N} \exp[-\alpha \cdot d((C, X^S)_j, X)]. \tag{7}$$

The posterior probabilities of the N-best paths are then computed by normalization of Eq. (6), resulting in the soft-max:

$$P((C, X^S)_i \mid X) = \frac{\exp[-\alpha \cdot d((C, X^S)_i, X)]}{\sum_{j=1}^{N} \exp[-\alpha \cdot d((C, X^S)_j, X)]}. \tag{8}$$

Care should be taken in computing the soft-max on digital computers, as the value of the exponential can exceed the range of double-precision floating-point variables. To overcome this, we subtract the path score of the top rank from that of each rank: $d((C, X^S)_i, X) - d((C, X^S)_1, X)$. Using this difference of path scores instead of the original path score does not alter the probability value of Eq. (8), because the common factor cancels between numerator and denominator. Since the largest exponential is thereby scaled to one, the exponentials cannot overflow and the normalized probabilities remain well defined.

4.3. Accumulated probability

A candidate character can be correctly recognized even if it contains only some of the primitive segments of the true character. In this situation, its posterior probability mass may be split among different paths. To avoid this undesirable effect, which produces an underestimation of the character probability, the split posterior probabilities should be re-joined by summing up the probabilities of the intersected candidate characters with identical class label. So, given a character which intersects a specific segment $s$, its accumulated posterior probability at segment $s$, $A((c, x_b^e), s)$, is computed by

summing the posterior probabilities over all the candidate characters which have the same class label and intersect with segment $s$:

$$A((c, x_b^e), s) = \sum_{(c, x_{b'}^{e'}) \,:\, s \in (c, x_{b'}^{e'}) \cap (c, x_b^e)} P((c, x_{b'}^{e'}) \mid X). \tag{9}$$

In order to compute the posterior probability of a character which is composed of successive segments $(b, e)$, different methods based on the accumulated posterior probabilities have been proposed in [12]. According to the results therein, the best performing method is based on the maximum (best-case) $A((c, x_b^e), s)$ over the primitive segments in the character:

$$P_{\max}((c, x_b^e)) = \max_{s \in [b, e]} A((c, x_b^e), s). \tag{10}$$

5. Confusion network decoding

The confusion network is a simplified lattice with candidate alternatives (including nulls) for each bin, and it forces the competing candidates in different paths to be in the same group by aligning the candidate characters with higher similarity [23]. Fig. 4 shows the typical structures of the traditional lattice and the confusion network, where we can see that the confusion network is more compact in size and structure. During the CCN decoding, the confidence parameter $\alpha$ can be estimated by maximizing the probability of the correctly recognized character and minimizing those of the other candidates in each bin.

Fig. 4. Traditional lattice (upper) and the derived confusion network (lower).

5.1. Confusion network construction

In speech [26] and handwriting [13] recognition, the confusion network is usually generated by a two-step word clustering procedure. However, clustering Chinese handwriting is too difficult because of the writing style variation and the large category set. Considering that a candidate path in our N-best list is a sequence of candidate characters, the candidate characters on different paths can be aligned by a string matching procedure [37]. A character confusion network (CCN) is represented as a sequence of bins $B_1 \ldots B_n$, where $B_k$, $k = 1, \ldots, n$, denotes a set of alternative candidate characters. From the N-best paths of a text line, the CCN construction process is briefly described in Algorithm 1 and the dynamic programming (DP) alignment DPAlign[·] is described in Algorithm 2. The 1-best path usually has higher accuracy than the other ranks, so we initialize the CCN using the candidates in the first (top-rank) path. This is similar to [25], which chooses the best hypothesis as the skeleton and performs better than the other methods in the experiments. The remaining paths are sequentially aligned with the existing confusion network using Algorithm 2.

The DP alignment algorithm is modified from the string matching algorithm in [37], which allows two strings to be matched so that they become identical in length. Given a CCN sequence $B_1 \ldots B_n$ and a character string on a path $c_1 \ldots c_m$, the alignment is to maximize the similarity between them. We define the $k$-th bin $B_k = \{c_{kj}, j = 1, \ldots, |B_k|\}$, where $|B_k|$ is the number of candidate characters in $B_k$. The similarity between $B_k$ and a character $c_i$ is defined as the similarity between $c_{k1}$ (the candidate with maximal probability in the bin $B_k$) and $c_i$, which is calculated as the horizontal overlapping width of the two characters normalized by their average width. It is 0 if either $c_{k1}$ or $c_i$ is null ($\varepsilon$):

$$\mathrm{sim}(B_k, c_i) = \begin{cases} \mathrm{sim}(c_{k1}, c_i), & c_{k1} \neq \varepsilon \text{ and } c_i \neq \varepsilon, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

During DP alignment, a similarity matrix $D(n, m)$ is maintained. At the end of alignment, the assignment of $c_i$, $i = 1, \ldots, m$, to the bins in the confusion network can be obtained by tracing back the pointers from $D(n, m)$ to $D(0, 0)$, and the confusion network is updated accordingly. Fig. 5 shows an example of an N-best list; Fig. 6(a) shows the demonstration of the dynamic alignment in Algorithm 1; and Fig. 6(b) shows the resulting confusion network (the output of Algorithm 1).

Fig. 5. Example of N-best list (N = 5). (Rows: Ground Truth, R1, R2, R3, R4, R5.)

Algorithm 1. Confusion network construction.
Input: N-best list $\{(C, X^S)_i = (c_{i1}, x_{b_{i1}}^{e_{i1}}), \ldots, (c_{iM_i}, x_{b_{iM_i}}^{e_{iM_i}}),\ i = 1, \ldots, N\}$.
Output: character confusion network $CN = B_1 \ldots B_n$.
Initialization: $n = M_1$ and $B_k = (c_{1k}, x_{b_{1k}}^{e_{1k}})$, $k = 1, \ldots, n$.
Dynamic alignment: for $i = 2, \ldots, N$, $CN = \mathrm{DPAlign}[CN, (C, X^S)_i]$.
End

Algorithm 2. Dynamic programming alignment.
Input: character confusion network $B_1 \ldots B_n$ and the character string $c_1 \ldots c_m$ on a path.
Output: similarity matrix $\mathrm{Sim}(i, j)$ and the updated character confusion network CN.
  D(0, 0) = 0;
  for i = 1 to n do D(i, 0) = 0;
  for j = 1 to m do D(0, j) = 0;
  for i = 1 to n do
    for j = 1 to m do
      s1 = D(i − 1, j − 1) + sim(B_i, c_j);
      s2 = D(i − 1, j) + sim(null, c_j);
      s3 = D(i, j − 1) + sim(B_i, null);
      D(i, j) = max(s1, s2, s3);
      if D(i, j) = s1, point (i, j) → (i − 1, j − 1)
      else if D(i, j) = s2, point (i, j) → (i − 1, j)
      else point (i, j) → (i, j − 1)
End
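The construction in Algorithms 1 and 2 can be sketched in Python as below. This is an illustrative reading, not the authors' implementation, and it rests on several assumptions: each character is a hypothetical `(label, left, right)` tuple giving its horizontal extent, the first entry of a bin stands in for its maximal-probability candidate, explicit null entries are not stored, and a path character that is aligned to no existing bin is placed in a new bin (the text only says that the network is "updated accordingly").

```python
def overlap_sim(a, b):
    """Horizontal overlap of two characters divided by their average width."""
    (_, a_left, a_right), (_, b_left, b_right) = a, b
    overlap = min(a_right, b_right) - max(a_left, b_left)
    if overlap <= 0:
        return 0.0
    return overlap / (0.5 * ((a_right - a_left) + (b_right - b_left)))


def dp_align(bins, path):
    """Align `path` to `bins` by dynamic programming and return the updated bins."""
    n, m = len(bins), len(path)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [["stop"] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        move[i][0] = "up"
    for j in range(1, m + 1):
        move[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s1 = score[i - 1][j - 1] + overlap_sim(bins[i - 1][0], path[j - 1])
            s2 = score[i - 1][j]  # similarity with a null is 0
            s3 = score[i][j - 1]  # similarity with a null is 0
            best = max(s1, s2, s3)
            score[i][j] = best
            move[i][j] = "diag" if best == s1 else ("up" if best == s2 else "left")
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:  # trace the pointers back to (0, 0)
        if move[i][j] == "diag":
            merged = list(bins[i - 1])
            if path[j - 1] not in merged:
                merged.append(path[j - 1])
            aligned.append(merged)
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            aligned.append(list(bins[i - 1]))  # the path has a null in this bin
            i -= 1
        else:
            aligned.append([path[j - 1]])  # assumption: unmatched character opens a bin
            j -= 1
    aligned.reverse()
    return aligned


def build_ccn(n_best):
    """Initialize bins from the top-rank path, then align the remaining paths."""
    bins = [[char] for char in n_best[0]]
    for path in n_best[1:]:
        bins = dp_align(bins, path)
    return bins


n_best = [
    [("A", 0, 10), ("B", 10, 20), ("C", 20, 30)],
    [("A", 0, 10), ("D", 10, 20), ("C", 20, 30)],
    [("E", 0, 12), ("C", 20, 30)],
]
ccn = build_ccn(n_best)
```

In this toy input the second path substitutes one character and the third path skips the middle one, so the three bins collect the competing labels position by position while the skipped position simply receives no new candidate.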

5.2. Confidence parameter estimation

The probabilities of the candidate characters in each bin of the CCN sum up to 1:

$$\sum_{j=1}^{|B_k|} P(c_{kj} \mid X) = 1, \quad k = 1, \ldots, n, \tag{12}$$

where $c_{kj}$ is the $j$-th candidate in the $k$-th bin, and the probability of “null” is the complement of the probabilities of the candidates in the $k$-th bin:

$$P(\varepsilon \mid X) = 1 - \sum_{c_{kj} \neq \varepsilon} P(c_{kj} \mid X). \tag{13}$$

For confidence parameter estimation on confusion networks constructed from training text lines, we view the correctly recognized candidate characters in each bin as positive samples and the other candidates in the same bin as negative samples. Then, inspired by the log-likelihood of margin (LOGM) criterion in prototype classifier learning [38], the parameter $\alpha$ is estimated by optimizing a margin of log-likelihood loss on the bins of confusion networks:

$$\min_{\alpha} L_0 = -\sum_{b=1}^{N_b} \left[ \log P(c_{tb} \mid X) - \log P(c_{rb} \mid X) \right] = -\sum_{b=1}^{N_b} \left[ \log \sum e^{-\alpha d_{bi}} - \log \sum e^{-\alpha d_{bj}} \right], \tag{14}$$

where $N_b$ is the total number of bins from training text lines, $c_{tb}$ is the correctly recognized candidate character in a bin and $c_{rb}$ is the error candidate with the highest probability in the same bin. The second line of the above equation follows Eq. (4), with the soft-max normalizer of Eq. (8) cancelling in the difference; $d_{bi}$ and $d_{bj}$ are the path scores for the path containing $c_{tb}$ and the path containing $c_{rb}$, respectively. The objective of (14) is minimized by stochastic gradient descent to iteratively update the parameter $\alpha$ of soft-max. At each iteration, the parameter is updated on the bins of the confusion network of a training text line:

$$\alpha(t+1) = \alpha(t) - \eta(t) \cdot \frac{\partial L_0}{\partial \alpha(t)} = \alpha(t) - \eta(t) \cdot \left[ \frac{\sum d_{bi}\, e^{-\alpha(t) d_{bi}}}{\sum e^{-\alpha(t) d_{bi}}} - \frac{\sum d_{bj}\, e^{-\alpha(t) d_{bj}}}{\sum e^{-\alpha(t) d_{bj}}} \right], \tag{15}$$

where the summation is over the bins of one confusion network, and $\eta$ is the learning step, which shrinks gradually and is empirically initialized as 0.1.

In contrast to the above CCN-based confidence parameter estimation, our previous method estimated the confidence parameter by cross-entropy (CE) loss minimization [19], which treats all the candidate characters on the N-best paths equally without emphasizing the confusion of overlapping candidate characters.

6. Experimental results

We evaluated the performance of the proposed keyword spotting method on a database of online Chinese handwriting: CASIA-OLHWDB [18]. This database is divided into six datasets, three for isolated characters DB1.0–1.2 (called DB1 in brief) and three for handwritten texts DB2.0–2.2 (DB2 in brief). There are 3 912 017 isolated character samples (7356 classes) and 5092 handwritten pages (composed of 52 220 text lines and in turn 1 348 904 character samples) in total. Both the isolated data and handwritten text data have been divided into standard training set (816 writers) and test set (204 writers). Fig. 7(a) shows some samples of isolated characters, and Fig. 7(b) shows a handwritten text page with multiple lines of characters.

6.1. Experimental setting

For candidate segmentation-recognition path evaluation of text lines, the character classifier extracts features from candidate character patterns and assigns class labels to them. For character feature extraction, we use the popularly used stroke direction histogram feature, implemented by the normalization-cooperated feature extraction (NCFE) method of [39] with bi-moment normalization. The resulting 512-dimensional (8-direction decomposition and 8 × 8 sampling) NCFE feature vector is projected onto a 160D subspace learned by Fisher linear discriminant analysis (FLDA) [40].

Fig. 6. The dynamic alignment and the resulting CCN of the N-best list in Fig. 5. (a) The demonstration of dynamic alignment algorithm (panels: first step and last step of dynamic alignment in Algorithm 2). (b) The resulting character confusion network (bins B1–B8).

H. Zhang et al. / Pattern Recognition 47 (2014) 1880–1890

that the TR increases with N and nearly saturates when N > 40. Since the computing time does not increase substantially with N, and to make sure that the true characters are covered when estimating the confidence parameter of path probability, we choose N = 50. In the following experiments, the default settings are word similarity (Eq. (2)) from accumulated character probability (Eq. (10)) and confidence parameter estimation by CCN decoding (Section 5.2). For justification, they are compared with the individual probability (Eq. (4)) and with our previous confidence parameter estimation method based on cross-entropy (CE) loss minimization [19]. The performance was evaluated on the test pages of DB2 using all the words (2-character, 3-character, 4-character) as queries.
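As a minimal sketch of how character probabilities might be combined into a word similarity and thresholded, consider the following Python fragment. Eq. (2) is not reproduced in this section, so the combination rule here (the average log-probability of the matched characters) is an assumption, and the names `word_similarity`, `spot` and `floor` are hypothetical.

```python
import math

def word_similarity(char_probs, floor=1e-12):
    """Average log-probability of the characters matched to the query word.
    `floor` avoids log(0) for candidates with zero probability."""
    return sum(math.log(max(p, floor)) for p in char_probs) / len(char_probs)

def spot(candidates, threshold):
    """Keep the candidate locations whose word similarity reaches the threshold.
    `candidates` is a list of (location, [character probabilities])."""
    return [loc for loc, probs in candidates if word_similarity(probs) >= threshold]

candidates = [("line3:pos5", [0.9, 0.8]), ("line7:pos2", [0.2, 0.9])]
hits = spot(candidates, math.log(0.5))          # only the first candidate passes
all_hits = spot(candidates, float("-inf"))      # no rejection at all
```

Raising the threshold trades recall for precision; a threshold of negative infinity returns every candidate present in the N-best list.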

The parameters of the MQDF classifier were learned using 4/5 of the training set containing the training isolated characters and the characters segmented from the training text pages (4 207 801 samples in total). The remaining 1/5 of the training samples were used for parameter estimation of classifier confidence transformation (logistic regression followed by Dempster–Shafer evidence combination [14]). The four geometric models were trained using the training set of DB2 [15], and the language model, a character bi-gram, is the one used in [14]. The training set of DB2.0 was used for estimating the combining weights of path evaluation in text line recognition, and the training set of DB2.0–2.2 was used to decode CCN for estimating the confidence parameter α in path probability transformation. In over-segmentation of text lines, each candidate character pattern is assumed to have at most six primitive segments. The character classifier assigns to each character pattern the 20 candidate classes of highest scores such that the true class is included with high probability [15]. For evaluating the retrieval performance on the test set of DB2.0–2.2, we use the high-frequency words in the lexicon of the Sogou labs [41] as query words. The top 60 000 frequently used words, including 39 057 two-character words, 9975 three-character words and 9451 four-character words, were tested in our experiments. The keyword spotting performance is measured using three metrics: recall (R, percentage of correctly detected words among the true words), precision (P, percentage of correct words among the detected ones) and the F-measure (harmonic mean of R and P). Our experiments were implemented on a PC with an Intel(R) Core(TM)2 Duo CPU E8400 3.00 GHz processor and 2 GB RAM, and the algorithms were programmed in C++.
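The three metrics can be computed from detection counts as in the short Python sketch below. The function name and the counts in the example are illustrative only and are not taken from the experiments.

```python
def recall_precision_f(num_correct, num_detected, num_true):
    """Return (R, P, F) as fractions in [0, 1].
    num_correct:  detected words that are truly present
    num_detected: all words returned by the spotter
    num_true:     all true occurrences of the query words"""
    recall = num_correct / num_true if num_true else 0.0
    precision = num_correct / num_detected if num_detected else 0.0
    if recall + precision == 0.0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return recall, precision, f_measure

# Hypothetical counts: 80 correct detections out of 100 returned, 160 true words.
r, p, f = recall_precision_f(80, 100, 160)
```

Multiplying the returned fractions by 100 gives percentages in the form used by the result tables.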

6.2.2. CCN decoding versus CE loss

To justify path confidence parameter estimation by CCN decoding, we compare it with our previous method, which uses the cross-entropy loss for parameter estimation [19]. The CE loss is calculated directly on the candidate characters (divided into positive and negative samples) on the N-best paths, and is likewise minimized by stochastic gradient descent. The recall–precision curves of keyword spotting using confidence parameter estimation by the proposed CCN decoding and by CE are shown in Fig. 10. We can see that parameter estimation by CCN decoding performs slightly better

Table 1
Total recall rate (TR) for 2-character words and average recognition time (AT) with variable N.

| N      | 10    | 20    | 30    | 40    | 50    | 60    |
|--------|-------|-------|-------|-------|-------|-------|
| TR (%) | 92.40 | 93.29 | 93.54 | 93.67 | 93.75 | 93.85 |
| AT (s) | 3.01  | 3.05  | 3.35  | 3.40  | 3.55  | 3.67  |

Fig. 7. (a) Example of isolated character samples. (b) A handwritten text page. Each character sample is attached with class label.

6.2.1. Accumulated character probability

To justify the accumulation of character probability over paths, we first compare the keyword spotting performance of word similarity from individual character probability (Eq. (4)) and from accumulated character probability (Eq. (10)). The recall–precision curves of the spotting results are plotted in Fig. 8. It is evident that the accumulated character probability over paths yields higher keyword spotting performance than the individual probability. This is because some characters in text lines may be segmented into different candidate patterns in different paths, as shown in Fig. 9, and the accumulated character probability is able to re-join the probabilities of such split characters.
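The difference between the two quantities can be illustrated with a toy N-best list in Python. This is a sketch of one plausible reading, not the paper's exact definitions (which are not reproduced in this section): each path is a list of hypothetical `(first_segment, last_segment, label)` character candidates with a path probability, the individual probability sums the paths that contain exactly the same candidate, and the accumulated probability sums the paths in which a character with the same label covers a given primitive segment.

```python
def individual_prob(paths, candidate):
    """Sum of probabilities of paths containing exactly this
    (first_segment, last_segment, label) candidate."""
    return sum(p for chars, p in paths if candidate in chars)

def accumulated_prob(paths, segment, label):
    """Sum of probabilities of paths in which some character with this label
    covers the given primitive segment, whatever its exact boundaries."""
    total = 0.0
    for chars, p in paths:
        if any(s <= segment <= e and c == label for s, e, c in chars):
            total += p
    return total

# Toy N-best list: the character "ming" is segmented differently in two paths.
paths = [
    ([(0, 1, "ming"), (2, 3, "tian")], 0.5),
    ([(0, 2, "ming"), (3, 3, "tian")], 0.3),
    ([(0, 0, "ri"), (1, 1, "yue"), (2, 3, "tian")], 0.2),
]
p_individual = individual_prob(paths, (0, 1, "ming"))   # counts the first path only
p_accumulated = accumulated_prob(paths, 0, "ming")      # counts the first two paths
```

In this example the individual probability of the candidate is 0.5, while accumulation over the differently segmented candidates with the same label raises it to 0.8.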

6.2. Spotting results

To choose the number of paths N for N-best path generation, we investigated the tradeoff between the total recall rate (TR, the percentage of query words on the N-best paths) for 2-character words and the average time (AT) of N-best list generation for one text line. Table 1 shows the TR and AT for different N. We can see

Fig. 8. Recall–precision curves of keyword spotting by word similarity from individual character probability (indiv-prob) and accumulated probability (accum-prob). [Plot: precision versus recall.]

Fig. 9. The first character is split in the recognition results of different paths.

Fig. 10. Recall–precision curves of keyword spotting using confidence parameter estimation by CCN decoding and by CE. [Plot: precision versus recall; curves: CE, proposed.]

Fig. 11. Recall–precision curves of keyword spotting using different confidence parameter values. [Plot: precision versus recall; curves: al–0.1, al–1, proposed, al–100.]

than parameter estimation by CE. This is because CCN decoding better exploits the confusion of characters in different paths, and the margin criterion (14) considers the confusion between the true character and the most confusing character, while the CE loss does not exploit such confusion information explicitly.

The parameter α scales the score of each candidate path during confidence estimation. The optimal values obtained by CCN decoding and CE are 10.5 and 6.5, respectively. To examine the influence of α on the final spotting results, we set the parameter value to 0.1, 1, 10.5 and 100 for comparison. The recall–precision curves for these values are shown in Fig. 11. We can see that the optimal confidence parameter given by CCN decoding indeed yields the best performance.

6.3. Comparing with reference methods

We compare the keyword spotting performance of the proposed method with two reference methods: word confidence based on the N-best list, and text search based on text line recognition (transcription). Both reference methods exploit plentiful contexts of text lines to yield high performance.

6.3.1. Comparing with word confidence

Instead of fusing character probabilities into word similarity as in the proposed method, the word probability can be directly accumulated from the path probabilities. In [42], Pan et al. compared the posterior probabilities of words and those of subword units, and showed that the subword probability can well overcome the out-of-vocabulary (OOV) problem and significantly outperform the word probability for in-vocabulary queries.

Fig. 12. Recall–precision curves of keyword spotting by the proposed method and word probability (word-prob). [Plot: precision versus recall; curves: word-prob and proposed for 2-, 3- and 4-character words.]

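Table 2
Spotting results using the proposed character probability and the word probability.

| Word length | Proposed R (%) | Proposed P (%) | Proposed F (%) | Word probability R (%) | Word probability P (%) | Word probability F (%) |
|-------------|----------------|----------------|----------------|------------------------|------------------------|------------------------|
| 2           | 91.19          | 96.59          | 93.81          | 91.35                  | 89.99                  | 90.66                  |
| 3           | 91.70          | 97.88          | 94.69          | 91.58                  | 97.43                  | 94.41                  |
| 4           | 89.35          | 99.00          | 93.93          | 89.41                  | 98.94                  | 93.93                  |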

Similarly, we compare the keyword spotting performance of the proposed character-probability-based method and the word probability method. For fair comparison, the posterior probability of a query word is also summed over the paths that contain the word, similar to Eq. (4), and accumulated over all the candidate words that have the same label and intersect with a segment, similar to Eqs. (9) and (10). The recall–precision curves of keyword spotting using character probability and word probability are plotted in Fig. 12, and the precision and recall rates corresponding to the maximum F-measure are listed in Table 2. It is evident that for 2-character and 3-character query words, the spotting performance of the character-probability-based method is superior to that of word probability. For 4-character query words, the two methods give comparable performance. This is because with increased word length, both methods achieve a very high precision with a slightly decreased recall rate.
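To make the contrast between the two confidence measures concrete, the following Python sketch computes both on a toy N-best list. It is an illustration under assumptions, not the evaluated implementation: paths are hypothetical lists of `(first_segment, last_segment, label)` candidates with a path probability, the word probability sums the paths that contain the query as consecutive characters over a given span, and the accumulation over intersecting candidates is omitted for brevity.

```python
def word_prob(paths, query, span):
    """Sum of probabilities of paths that contain the query labels as
    consecutive characters covering exactly `span` = (first, last) segment."""
    n = len(query)
    total = 0.0
    for chars, p in paths:
        for i in range(len(chars) - n + 1):
            window = chars[i:i + n]
            labels = [c for _, _, c in window]
            if labels == query and (window[0][0], window[-1][1]) == span:
                total += p
                break  # count each path at most once
    return total

def char_prob(paths, candidate):
    """Sum of probabilities of paths that pass through one character candidate."""
    return sum(p for chars, p in paths if candidate in chars)

# Toy N-best list: only the first path contains the whole word "A B".
paths = [
    ([(0, 0, "A"), (1, 1, "B"), (2, 2, "C")], 0.4),
    ([(0, 0, "A"), (1, 1, "X"), (2, 2, "C")], 0.3),
    ([(0, 0, "Y"), (1, 1, "B"), (2, 2, "C")], 0.3),
]
p_word = word_prob(paths, ["A", "B"], (0, 1))
p_a = char_prob(paths, (0, 0, "A"))
p_b = char_prob(paths, (1, 1, "B"))
```

In this toy list the word as a whole receives 0.4, whereas each of its characters receives 0.7, because character candidates collect evidence from paths that disagree elsewhere in the word.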

6.3.2. Comparing with transcription-based search

Because of the limited accuracy of handwriting recognition, most previous works on keyword spotting avoid using a text recognition system to transcribe the handwritten text and then searching the output text. However, some experimental studies (e.g., the one in [10]) show that transcription-based spotting can perform competitively. We attribute the strength of transcription-based spotting to the plentiful exploitation of contexts in text line recognition. To evaluate the performance of the proposed character-confidence-based spotting method, we compare it with transcription-based search using a state-of-the-art text line recognition method [14], which also underlies the proposed character-confidence-based method for candidate segmentation-recognition path evaluation (as described in Section 4.1). To investigate the dependency of spotting performance on the 1-best recognition (i.e., transcription) accuracy, we divide the test set into three subsets of different degrees of recognition difficulty: Test1 (718 pages) with a character-level correct rate above 90%, Test2 (224 pages) with a correct rate between 80% and 90%, and Test3 (78 pages) with a correct rate below 80%. The 1-best recognition and keyword spotting experiments are performed on each of the three subsets and on the whole test set. Since text line recognition gives a unique text output, transcription-based word search gives a single point of recall–precision rates. The results are compared with those of the proposed method (corresponding to the maximum F-measure) in Table 3. On 2-character words, both the precision and the recall rate of transcription-based search are fairly high, but with increased word length, the precision increases while the recall rate decreases rapidly. By searching from the N-best paths, the proposed method can better balance precision and recall and achieves a higher F-measure.
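The operating point "corresponding to the maximum F-measure" can be obtained by sweeping the acceptance threshold over the scored detections, as in this Python sketch. The data layout (a list of `(similarity, is_correct)` pairs) and the example scores are hypothetical.

```python
def pr_curve(scored, num_true):
    """Sweep the threshold over all observed similarities.
    `scored` holds (similarity, is_correct) for every candidate detection;
    returns a list of (threshold, recall, precision, f_measure)."""
    points = []
    for thr in sorted({s for s, _ in scored}, reverse=True):
        accepted = [ok for s, ok in scored if s >= thr]
        correct = sum(accepted)
        precision = correct / len(accepted)
        recall = correct / num_true
        f = 2 * precision * recall / (precision + recall) if correct else 0.0
        points.append((thr, recall, precision, f))
    return points

def best_f_point(points):
    """Operating point with the maximum F-measure."""
    return max(points, key=lambda pt: pt[3])

# Hypothetical detections for one query set with 4 true occurrences.
scored = [(-0.1, True), (-0.2, True), (-0.5, False), (-0.9, True)]
points = pr_curve(scored, num_true=4)
best = best_f_point(points)
```

A search over a single transcription has no similarity score to threshold, so it yields one (recall, precision) pair instead of the list of points produced here.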
The comparison of recall–precision curves on the whole test set in Fig. 13 also shows the superiority of the proposed method. In addition to a higher F-measure, the proposed method provides flexible tradeoffs between precision and recall. This property is particularly important when we want to spot keywords with a higher recall rate while tolerating false positives. Table 4 shows the index data size and the average searching time (with threshold −∞ so as to spot all the query words present in the N-best list) for one word by the proposed method and by transcription-based search. The data size of the test pages in stroke trajectory (4 bytes for a sampled point) is 47.71 Mb. We can see that although the proposed method consumes more index storage and search time than transcription, its index file is still much smaller than the original handwriting data.

Table 3
Keyword spotting results of the proposed method and transcription-based search.

| Dataset | Accuracy (%) | Length | Proposed R (%) | Proposed P (%) | Proposed F (%) | Transcription R (%) | Transcription P (%) | Transcription F (%) |
|---------|--------------|--------|----------------|----------------|----------------|---------------------|---------------------|---------------------|
| Test1   | 95.04        | 2      | 95.43          | 97.85          | 96.63          | 95.29               | 97.90               | 96.57               |
|         |              | 3      | 95.27          | 98.46          | 96.84          | 93.71               | 99.45               | 96.49               |
|         |              | 4      | 94.06          | 99.11          | 96.52          | 92.32               | 99.58               | 95.82               |
| Test2   | 87.67        | 2      | 87.12          | 94.18          | 90.51          | 86.89               | 94.03               | 90.32               |
|         |              | 3      | 86.14          | 97.91          | 91.65          | 83.87               | 98.84               | 90.75               |
|         |              | 4      | 78.93          | 98.77          | 87.74          | 73.88               | 99.43               | 84.77               |
| Test3   | 73.29        | 2      | 68.28          | 86.55          | 76.34          | 68.90               | 84.96               | 76.09               |
|         |              | 3      | 66.53          | 92.74          | 77.48          | 60.12               | 95.54               | 73.80               |
|         |              | 4      | 63.60          | 99.32          | 77.54          | 58.33               | 99.25               | 73.48               |
| Total   | 91.74        | 2      | 91.19          | 96.59          | 93.81          | 91.45               | 96.03               | 93.68               |
|         |              | 3      | 91.70          | 97.88          | 94.69          | 89.57               | 99.16               | 94.12               |
|         |              | 4      | 89.35          | 99.00          | 93.93          | 86.78               | 99.54               | 92.72               |

Fig. 13. Recall–precision curves of keyword spotting by the proposed method and transcription-based search. [Plot: precision versus recall; transcription and proposed for 2-, 3- and 4-character words.]

Table 4
The index size (Mb) and the average keyword spotting time (ms) for one word.

| Method             | Proposed | Transcription |
|--------------------|----------|---------------|
| Index size (Mb)    | 5.32     | 2.68          |
| Spotting time (ms) | 8.06     | 2.84          |

6.4. Error analysis

The proposed character-confidence-based keyword spotting method performs well overall. The remaining errors, false negatives or false positives, can be attributed to two main causes: text line recognition error (character segmentation or recognition errors in generating the N-best paths) and the low confidence of candidate characters in the N-best list. If a character/word is not present in the N-best paths, it cannot be spotted, thus resulting in a false negative. On the other hand, correct character/word candidates in the N-best paths with low confidence and ranked behind other wrong character/word candidates will result in false positives.

The text line recognition error is caused by over-segmentation failure (under-segmentation) or character classification error. Under-segmentation results when the strokes of different characters are merged into a primitive segment. Such mis-merged characters cannot be separated in later processing. Character classification error refers to the case in which the true class of a candidate character is not included in the top ranks in path evaluation or its confidence is low, such that the true class does not appear in the top-ranked path or the N-best paths. Fig. 14 shows an example of text line recognition where both under-segmentation and mis-classification occur. Fig. 15 shows an example in which the true characters appear only in a path of low rank. In this case, the character or word confidence will be low even after probability summation over the paths. The true word will be either rejected (yielding low recall) or accepted together with some wrong words of higher confidence (yielding low precision).

Fig. 14. Example of text line recognition error. Under-segmentation occurs for the first two ground-truth characters, mis-classification occurs for many characters. After string recognition by best path search, more characters are mis-merged.

Fig. 15. Example of text line recognition error that true characters are in low-rank path.

7. Conclusion

We proposed a character-confidence-based keyword spotting method for online Chinese handwritten documents. The character confidence (posterior probability) is estimated from the N-best paths of candidate segmentation-recognition generated by a text line recognition method combining multiple contexts. The character probabilities are accumulated over paths and combined into word similarity for spotting. The confidence parameter is estimated on character confusion network (CCN) by minimizing a margin-based character discrimination objective. Our experiments on a large online handwriting database show the promise of the proposed method, and justify the benefits of character probability accumulation and CCN-based confidence parameter estimation. The superiority of the proposed method against word-confidence-based spotting is demonstrated. Compared to text line recognition (transcription)-based text search, the proposed method better balances between precision and recall and achieves higher F-measure. The spotting performance can be further improved by reducing the over-segmentation and character classification errors underlying the text line recognition.

Conflict of interest

None declared.

Acknowledgments

This work has been supported by the National Natural Science Foundation of China (NSFC) Grants 60825301 and 60933010. The authors thank Dr. Xiang-Dong Zhou for helpful discussions.

References

[1] C.-L. Liu, M. Koga, H. Fujisawa, Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading, IEEE Trans. Pattern Anal. Mach. Intell. 24 (11) (2002) 1425–1437.
[2] P.-P. Xiu, H.-S. Baird, Whole-book recognition, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2467–2480.
[3] D. Damm, C. Fremerey, V. Thomas, M. Clausen, F. Kurth, M. Müller, A digital library framework for heterogeneous music collections: from document acquisition to cross-modal interaction, Int. J. Digit. Libr. 12 (2–3) (2012) 53–71.
[4] T.-M. Rath, R. Manmatha, Word spotting for historical documents, Int. J. Doc. Anal. Recognit. 9 (2) (2007) 139–152.
[5] R. Manmatha, C. Han, E.M. Riseman, Word spotting: a new approach to indexing handwriting, in: IEEE CVPR 1996, San Francisco, CA, USA, 1996, pp. 631–637.
[6] H. Zhang, D.-H. Wang, C.-L. Liu, H. Bunke, Keyword spotting from online Chinese handwritten documents using one-versus-all character classification model, Int. J. Pattern Recognit. Artif. Intell. 27 (3) (2013).
[7] H. Cao, A. Bhardwaj, V. Govindaraju, A probabilistic method for keyword retrieval in handwritten document images, Pattern Recognit. 42 (12) (2009) 3374–3382.
[8] A. Kołcz, J. Alspector, M.-F. Augusteijn, R. Carlson, G.-V. Popescu, A line-oriented approach to word spotting in handwritten documents, Pattern Anal. Appl. 3 (2) (2000) 153–168.
[9] A. Fischer, A. Keller, V. Frinken, H. Bunke, HMM-based word spotting in handwritten documents using subword models, in: Proceedings of the 20th ICPR, Istanbul, Turkey, 2010, pp. 3416–3419.
[10] V. Frinken, A. Fischer, R. Manmatha, H. Bunke, A novel word spotting method based on recurrent neural networks, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2) (2012) 211–224.
[11] M. Wollmer, F. Eyben, J. Keshet, A. Graves, B. Schuller, G. Rigoll, Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks, in: IEEE ICASSP 2009, Taipei, Taiwan, 2009, pp. 3949–3952.
[12] F. Wessel, R. Schluter, K. Macherey, H. Ney, Confidence measures for large vocabulary continuous speech recognition, IEEE Trans. Speech Audio Process. 9 (3) (2001) 288–298.
[13] S. Quiniou, E. Anquetil, Use of a confusion network to detect and correct errors in an on-line handwritten sentence recognition system, in: Proceedings of the 9th ICDAR, Parana, Brazil, 2007, pp. 382–386.
[14] Q.-F. Wang, F. Yin, C.-L. Liu, Handwritten Chinese text recognition by integrating multiple contexts, IEEE Trans. Pattern Anal. Mach. Intell. 34 (8) (2012) 1469–1481.
[15] D.-H. Wang, C.-L. Liu, X.-D. Zhou, An approach for real-time recognition of online Chinese handwritten sentences, Pattern Recognit. 45 (10) (2012) 3661–3675.
[16] T.-H. Su, T.-W. Zhang, D.-J. Guan, H.-J. Huang, Off-line recognition of realistic Chinese handwriting using segmentation-free strategy, Pattern Recognit. 42 (1) (2009) 167–182.
[17] J.S. Bridle, Probabilistic interpretation of feedforward classification network outputs with relationships to statistical pattern recognition, in: F. Fogelman-Soulie, J. Herault (Eds.), Neurocomputing: Algorithms, Architectures and Applications, Springer, 1990, pp. 227–236.
[18] C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, CASIA online and offline Chinese handwriting databases, in: Proceedings of the 11th ICDAR, Beijing, China, 2011, pp. 37–41.
[19] H. Zhang, D.-H. Wang, C.-L. Liu, A confidence-based method for keyword spotting in online Chinese handwritten documents, in: Proceedings of the 21st ICPR, Tsukuba, Japan, 2012, pp. 525–528.
[20] S. Ortmanns, H. Ney, X. Aubert, A word graph algorithm for large vocabulary continuous speech recognition, Comput. Speech Lang. 11 (1) (1997) 43–72.

[21] T. Kemp, T. Schaaf, Estimating confidence using word lattices, in: Proceedings of the 5th ECSCT, Rhodes, Greece, 1997, pp. 827–830.
[22] B. Rueber, Obtaining confidence measures from sentence probabilities, in: Proceedings of the 5th ECSCT, Rhodes, Greece, 1997, pp. 739–742.
[23] L. Mangu, E. Brill, A. Stolcke, Finding consensus in speech recognition: word error minimization and other applications of confusion networks, Comput. Speech Lang. 14 (4) (2000) 373–400.
[24] D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol. 55 (1993) 141–154.
[25] N.-F. Ayan, J. Zheng, W.-Y. Wang, Improving alignments for better confusion networks for combining machine translation systems, in: Proceedings of the 22nd COLING, Manchester, UK, 2008, pp. 33–40.
[26] Y.-S. Fu, Y.-C. Pan, L.-S. Lee, Improved large vocabulary continuous Chinese speech recognition by character-based consensus networks, in: Proceedings of the 5th ISCSLP, Kent-Ridge, Singapore, 2006, pp. 422–434.
[27] J.-S. Denker, Y. Le Cun, Transforming neural-net output levels to probability distributions, in: Advances in Neural Information Processing Systems, Morgan Kaufmann, Los Altos, CA, 1991, pp. 853–859.
[28] C.-L. Liu, M. Nakagawa, Precise candidate selection for large character set recognition by confidence evaluation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (6) (2000) 636–642.
[29] J. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: A.J. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, 1999, pp. 61–74.
[30] J.-A. Barnett, Computational methods for a mathematical theory of evidence, in: Proceedings of the 7th IJCAI, B.C., Canada, 1981, pp. 868–875.
[31] C.-L. Liu, Classifier combination based on confidence transformation, Pattern Recognit. 38 (1) (2005) 11–28.
[32] X.-D. Zhou, D.-H. Wang, C.-L. Liu, A robust approach to text line grouping in online handwritten Japanese documents, Pattern Recognit. 42 (9) (2009) 2077–2088.
[33] X.-D. Zhou, J.-L. Yu, C.-L. Liu, T. Nagasaki, K. Marukawa, Online handwritten Japanese character string recognition incorporating geometric context, in: Proceedings of the 9th ICDAR, Curitiba, Brazil, 2007, pp. 23–26.
[34] C.-L. Liu, H. Sako, H. Fujisawa, Effects of classifier structures and training regimes on integrated segmentation and recognition of handwritten numeral strings, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1395–1407.
[35] F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified quadratic discriminant functions and the application to Chinese character recognition, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1) (1987) 149–153.
[36] B.-H. Juang, W. Chou, C.-H. Lee, Minimum classification error rate methods for speech recognition, IEEE Trans. Speech Audio Process. 5 (3) (1997) 257–265.
[37] H. Bunke, A. Sanfeliu (Eds.), Syntactic and Structural Pattern Recognition: Theory and Applications (Chapter 5), World Scientific, Singapore, New Jersey, 1994.
[38] X. Jin, C.-L. Liu, X. Hou, Regularized margin-based conditional log-likelihood loss for prototype learning, Pattern Recognit. 43 (7) (2010) 2428–2438.
[39] C.-L. Liu, X.-D. Zhou, Online Japanese character recognition using trajectory-based normalization and direction feature extraction, in: Proceedings of the 10th IWFHR, La Baule, France, 2006, pp. 217–222.
[40] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1990.
[41] SogouLab: 〈http://www.sogou.com/labs/resources.html〉.
[42] Y.-C. Pan, L.-S. Lee, Performance analysis for lattice-based speech indexing approaches using words and subword units, Pattern Recognit. 18 (6) (2010) 1562–1574.

Heng Zhang received the B.S. degree in Electronic and Information Engineering from University of Science and Technology of China, Hefei, China, in 2007, and got the Ph.D. degree in pattern recognition and intelligent systems at the Institute of Automation of Chinese Academy of Sciences (CASIA), Beijing, China, in 2013. Currently, he is an assistant professor at the CASIA. His research interests include handwriting recognition, document analysis and information retrieval.

Da-Han Wang received the B.S. degree in Automation Science and Electrical Engineering from Beihang University, Beijing, China, in 2006, and got his Ph.D. degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2012. He is now a postdoctoral research fellow at the Department of Information Science and Engineering, Xiamen University, Fujian, China. His research interests include text detection and recognition, character string recognition, object tracking, computer vision, and pattern recognition.

Cheng-Lin Liu is a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the deputy director of the laboratory. He received the B.S. degree in electronic engineering from Wuhan University, Wuhan, China, the M.E. degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has published over 180 technical papers at prestigious international journals and conferences. He is on the editorial board of journals Pattern Recognition, Image and Vision Computing, and International Journal on Document Analysis and Recognition. He is a Fellow of the IAPR, and a senior member of the IEEE.