Pattern Recognition 45 (2012) 3661–3675
An approach for real-time recognition of online Chinese handwritten sentences

Da-Han Wang (a), Cheng-Lin Liu (a,*), Xiang-Dong Zhou (b)
(a) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguan East Road, Beijing 100190, PR China
(b) Intelligence Engineering Lab & Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, P.O. Box 8718, Beijing 100190, PR China
Article history: Received 10 January 2012; Received in revised form 24 March 2012; Accepted 18 April 2012; Available online 30 April 2012

Abstract
With advances in handwriting capture devices and the computing power of mobile computers, pen-based Chinese text input is moving from character-based input to sentence-based input. This paper proposes a real-time recognition approach for sentence-based input of Chinese handwriting. The main feature of the approach is a dynamically maintained segmentation–recognition candidate lattice that integrates multiple contexts, including character classification, linguistic context and geometric context. Whenever a new stroke is produced, dynamic text line segmentation and character over-segmentation are performed to locate the position of the stroke in the text lines and update the primitive segment sequence of the page. Candidate characters are then generated and recognized to assign candidate classes, and the linguistic context and geometric context involving the newly generated candidate characters are computed. The candidate lattice is updated while the writing process continues. When the pen lift time exceeds a threshold, the system searches the candidate lattice for the sentence recognition result. Since the computation of multiple contexts consumes the majority of the computation and is performed during the writing process, the recognition result is obtained immediately after the writing of a sentence is finished. Experiments on a large database, CASIA-OLHWDB, of unconstrained online Chinese handwriting demonstrate the robustness and effectiveness of the proposed approach.
Keywords: Online Chinese handwritten sentence recognition; Real-time recognition; Dynamic text line segmentation; Dynamic over-segmentation; Dynamic candidate lattice; Path search
1. Introduction

With the proliferation of pen-based and touch-based mobile computers, online handwriting recognition has many potential applications [1–4], including text input, the recording of handwritten notes and diagrams, signature verification, and mathematical expression recognition [5]. Character recognition-based Chinese text input has been widely applied in the Chinese market. However, as handwriting capture devices and the computing power of mobile computers advance, sentence-based text input becomes possible. Compared to character-based input, sentence-based input is more natural and enables faster and more accurate input via handwritten sentence recognition incorporating contexts. Handwritten sentence (character string) recognition is a difficult contextual classification problem involving character segmentation and recognition [2,3]. There have been many efforts towards the improvement of handwritten character string recognition [6–11]. Most methods adopt the integrated segmentation–recognition strategy to overcome the ambiguity of character segmentation. In the segmentation–recognition framework, handwritten text is first
over-segmented into primitive segments, each of which is a character or a part of a character. Then candidate character patterns are generated by concatenating consecutive segments and are recognized by a character classifier to assign candidate classes. The candidate character sequences and assigned candidate classes are represented in a segmentation–recognition candidate lattice, which contains many segmentation–recognition paths, each corresponding to one recognition result. The optimal segmentation–recognition path is searched in the candidate lattice via path evaluation combining character classification scores and contexts. Fig. 1 shows a typical handwritten text recognition system (Fig. 1(a)), and an illustrative example of over-segmentation and the segmentation–recognition candidate lattice (Fig. 1(b)). The above methods, though promising, perform character segmentation and recognition after the sentence writing is finished. To achieve real-time recognition, character segmentation and recognition should be performed during the writing process, such that the result can be obtained immediately after the completion of writing. In recent years, some real-time handwriting input products (with dynamic recognition during writing) have been developed, but we have not seen an academic study addressing this problem theoretically or experimentally. Besides character string recognition, real-time recognition of handwritten sentences also involves text line segmentation, since sentences are often written in multiple lines due to the limited
Fig. 1. (a) A typical handwritten text recognition system. (b) An illustrative example of over-segmentation and the segmentation–recognition candidate lattice. Each box contains the candidate character (upper) and its candidate classes (lower). The optimal path is denoted by thick line with red characters (left one in each box) being the correct result. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
space of the writing area. Text line segmentation in real-time recognition is difficult because the lines are short, the strokes are produced dynamically, and there are often delayed strokes, which are inserted into previous characters or even previous lines. Unlike previous text line segmentation methods, which mostly group strokes into lines after all strokes have been produced, dynamic segmentation during writing can only utilize the information of part of the strokes. In this paper, we propose an approach to real-time recognition of Chinese handwritten sentences using a dynamically maintained segmentation–recognition candidate lattice. Whenever a new stroke is produced, dynamic line segmentation and character over-segmentation are performed on the stroke to update the primitive segment sequence and locate the position of the segment in the text lines. Then candidate characters are generated on the new stroke and are recognized to assign candidate classes. Meanwhile, multiple contexts, including the linguistic context and geometric context involving the newly generated candidate characters, are computed using the language model and geometric models. The candidate lattice is updated constantly while the writing process continues. When the pen lift time exceeds a threshold, the system searches the candidate lattice for the sentence recognition result using a path search algorithm as in conventional character string recognition. Since the updating of the candidate lattice consumes the majority of the computation and is performed during the writing process, the sentence recognition result is obtained immediately after a long pen lift. On top of automatic recognition, we provide editing functions for manually correcting segmentation and recognition errors to facilitate user applications. For dynamic text line segmentation in real-time recognition, we propose to adopt a statistical classifier to model the geometric relationship between the ongoing stroke and the existing text lines. By classifying extracted features of a line–stroke pair, the classifier judges whether the stroke should be assigned to a previous line or starts a new line. The method can deal with delayed strokes by grouping them into previous lines, and therefore makes the real-time recognition system more robust. For dynamic character over-segmentation, we also use a statistical classifier to model the geometric relationship between the ongoing stroke and the existing primitive segments belonging to the same line as the stroke. The classifier output on extracted features of a segment–stroke pair is transformed into a posterior probability by confidence transformation [12], which indicates the probability of the stroke belonging to the segment.
The stroke is considered to belong to the segment if the probability is greater than a threshold. By testing each segment–stroke pair, the stroke is assigned to one existing segment or starts a new segment. The position of the segment in the segment sequence is located according to the left boundaries. Similar to dynamic text line segmentation, the over-segmentation can also deal with delayed strokes. For path search after candidate lattice construction, we propose a real-time beam search algorithm. The beam search algorithm is an accelerated version of the dynamic programming (DP) algorithm obtained by pruning the partial paths at intermediate nodes. By retaining the partial optimal paths ending at each segment, we perform the search from the updated segment rather than from the start segment. We evaluated the performance of the proposed approach with respect to recognition accuracy and speed on a large database, CASIA-OLHWDB [13], of unconstrained online Chinese handwritten characters and texts, and the results demonstrate the robustness and effectiveness of the proposed approach. The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 describes the baseline character string recognition method that we customize for real-time recognition. An overview of the real-time recognition system is provided in Section 4. Section 5 presents the methods for dynamic text line segmentation, dynamic character over-segmentation, and candidate lattice updating. The real-time path search algorithm is described in Section 6. Section 7 presents the experimental results, and Section 8 offers concluding remarks. This paper is an extension of our previous conference paper [14]; it elaborates the procedures of dynamic line segmentation, dynamic character over-segmentation, and candidate character generation, incorporates geometric context into the path evaluation criterion, optimizes the combining weights, and evaluates the system quantitatively on a large database of online handwriting.
2. Related works

Chinese handwritten character string recognition is a challenging problem due to the large character set, the diversity of writing styles, the difficulty of character segmentation, and the unconstrained language domain. In particular, due to the variability of character size and position and to character touching and overlapping, the characters cannot be reliably segmented prior to character recognition. To overcome the large number of character classes and the infinite
sentence classes of Chinese texts, over-segmentation-based character string recognition approaches are commonly used [1]. Under the integrated segmentation–recognition framework, many efforts have been devoted to the key techniques of Chinese/Japanese handwritten character string recognition. In this framework, the criterion for evaluating candidate segmentation–recognition paths usually integrates multiple contexts, including character classification, linguistic context and geometric context. Among previous works, some integrated incomplete contexts [15–17], and some combined the contexts heuristically without optimizing the combining weights [8,9,18,19]. Zhou et al. optimized the combining weights using the conditional random field (CRF) model [10], which is hard to extend to language models of higher order than the bi-gram, while Zhu et al. adopted the genetic algorithm (GA) [11] to optimize the combining weights, which is computationally expensive and sensitive to some artificial parameters. Recently, Wang et al. proposed to integrate the character classification scores and linguistic context by transforming the outputs of the character classifier into posterior probabilities via confidence transformation [20], which benefits the recognition performance. Furthermore, they investigated parameter optimization for path evaluation and efficient path search, and achieved significant improvements on unconstrained handwritten Chinese texts [21]. They reported a character-level correct rate of 91.39% on an offline Chinese handwriting database, CASIA-HWDB [13]. On another offline Chinese handwriting dataset, HIT-MW, they achieved a character-level correct rate of 92.72%, which is much higher than the results previously reported in [15,22]. For online character string recognition, many works evaluated on Japanese handwritten text databases have reported higher accuracies [8–11], owing to the fact that in online handwriting recognition the stroke sequences are available for better segmenting and discriminating characters. For online Chinese character string recognition, however, few works have been reported, except that in the ICDAR 2011 competition [23] the Vision Objects system achieved a correct rate of 94.33% on the competition dataset. Real-time recognition of handwritten sentences is closely connected with online handwritten character string recognition, which uses path evaluation and search techniques similar to those of offline character string recognition. Our real-time recognition system is customized from a high-performance online handwritten character string recognition system by developing robust and efficient techniques for dynamic text line segmentation, character over-segmentation, updating of the candidate lattice, and real-time path search. Among the previous methods for text line segmentation in online handwritten documents, some segment text lines using heuristics or simple features such as horizontal projection [24,25] and off-stroke distances [8]. The methods based on optimizing line-fitting objectives [26–28] yield more reliable line partitioning. They usually take a hypothesis-and-test strategy to generate candidate line partitions and search for the optimal partition by heuristic search. To generate text line hypotheses, however, these methods require that all the strokes have been written. For real-time recognition, in contrast, line segmentation is performed on each stroke rather than on the whole page.
Character over-segmentation in online handwritten character string recognition is often performed using off-stroke (pen lift) distances, and delayed strokes are re-arranged according to some heuristic rules [9]. For over-segmentation in real-time recognition, the rules should be designed more carefully because only part of the strokes is available at the time of dynamic segmentation. Recognition speed is another important factor in real-time recognition of handwritten sentences, where character
recognition is a crucial part and consumes the majority of the computation. With over 5000 classes of frequently used characters, Chinese character recognition is a difficult classification problem. The most popularly used classifiers are the modified quadratic discriminant function (MQDF) [29] and the nearest prototype classifier (NPC) [30]. The MQDF provides higher accuracy than the NPC but suffers from high storage and computation costs. In this paper, we evaluate the performance of both the MQDF classifier and the NPC, investigating the tradeoff between recognition accuracy and speed.
3. Online handwritten character string recognition

We customize a high-performance online handwritten character string recognition system for real-time recognition. Before describing the real-time recognition approach, we describe the online handwritten character string recognition approach below. For the character string recognition system, we apply the integrated segmentation–recognition strategy, using the same framework as illustrated in Fig. 1. In the system, the input string sample (a sequence of strokes) is over-segmented and composed into sequences of candidate characters, each denoted by $X = x_1 \cdots x_n$. Each candidate character is assigned candidate classes (denoted as $c_i$) by a character classifier, and the result of character string recognition is a character string $C = c_1 \cdots c_n$. In the candidate segmentation–recognition lattice, each path $(X,C)$ is evaluated by the path evaluation criterion. In our system, we adopt the path evaluation criterion presented in [21], which is formulated from the Bayesian decision view, integrates multiple contexts including character classification, linguistic context and geometric context, and shows fairly good performance. In this paper we do not present the derivation but give the criterion directly to save space; more details can be found in [21]. Denote the score of classifying character $x$ into class $c$ given by the character classifier as $P(c|x)$. The linguistic context is given by a bi-gram language model, which gives the 2-gram probability, denoted as $P(c_i|c_{i-1})$, from character class $c_{i-1}$ to $c_i$. The unary class-dependent (uc for short) geometric score, unary class-independent (ui) geometric score, binary class-dependent (bc) geometric score and binary class-independent (bi) geometric score are denoted as $P(c|g^{uc})$, $P(z^p=1|g^{ui})$, $P(c_{i-1},c_i|g^{bc})$ and $P(z^g=1|g^{bi})$, respectively, where $g$ denotes the corresponding geometric feature and the scores are output by the geometric models classifying the extracted features. For the ui geometric model, $P(z^p=1|g^{ui})$ indicates the probability of the candidate pattern being a valid character. For the bi geometric model, $P(z^g=1|g^{bi})$ indicates the probability of the gap between two successive candidate characters being a between-character gap. The path evaluation criterion combines the multiple contexts:
$$f(X,C) = \sum_{i=1}^{n} \big\{ k_i \log P(c_i|x_i) + \lambda_1 \log P(c_i|c_{i-1}) + \lambda_2 \log P(c_i|g_i^{uc}) + \lambda_3 \log P(z_i^p=1|g_i^{ui}) + \lambda_4 \log P(c_{i-1},c_i|g_i^{bc}) + \lambda_5 \log P(z_i^g=1|g_i^{bi}) \big\}, \qquad (1)$$

where $\{\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5\}$ are five combining weights that balance the contributions of the different models, and $k_i$ is the number of primitive segments composing the candidate character. The idea of weighting the character classification score with the multiplier $k_i$ follows the variable-length HMM of [31]. This is to make the sum of classification scores insensitive to the path length (number of candidate characters), and enables optimal path search by DP.
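As a minimal illustration of how Eq. (1) is evaluated on one segmentation–recognition path, the sketch below sums the weighted log-scores over candidate characters. The dictionary keys and function name are illustrative, not part of the original system.

```python
import math

def path_score(chars, lam):
    """Evaluate Eq. (1) for one segmentation-recognition path.
    `chars` is a list of per-character scores (probabilities in (0, 1]);
    `lam` holds the five combining weights lambda1..lambda5."""
    f = 0.0
    for ch in chars:
        f += ch["k"] * math.log(ch["p_class"])   # k_i * log P(c_i | x_i)
        f += lam[0] * math.log(ch["p_bigram"])   # bi-gram linguistic context
        f += lam[1] * math.log(ch["p_uc"])       # unary class-dependent geometry
        f += lam[2] * math.log(ch["p_ui"])       # unary class-independent geometry
        f += lam[3] * math.log(ch["p_bc"])       # binary class-dependent geometry
        f += lam[4] * math.log(ch["p_bi"])       # binary class-independent geometry
    return f
```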
In [21], Wang et al. proposed to convert the outputs of the character classifier and the geometric context models into posterior probabilities by confidence transformation [12,20]. In this paper, we apply confidence transformation to the integration of multiple contexts. Specifically, for character classification, we use the Dempster–Shafer (D–S) theory of evidence [32] to combine the sigmoidal two-class probabilities into multi-class probabilities; this considers the outlier class and hence is suitable for character string recognition [20]. For the geometric context models, which have a small number of classes, we use the sigmoidal confidence transformation. The confidence parameters are estimated by minimizing the cross-entropy (CE) loss function, which is commonly used in logistic regression and neural network training, on a validation dataset (preferably different from the dataset used for training the classifiers) [12]. In the following, we briefly introduce the character classifier, the geometric context modeling, and the estimation of the combining parameters.

3.1. Character classifier

Though a large number of classifiers are available in pattern recognition, only a few of them are effective for the large category set problem of Chinese character recognition [33]. We use the MQDF and the NPC because they are among the most popular and effective ones, and the main aim of this paper is to propose and demonstrate a real-time handwritten sentence recognition approach rather than to comprehensively compare classifiers. The MQDF classifier is a modified version of the quadratic discriminant function (QDF), which is rooted in the Bayesian classifier under the assumption that the probability distribution of each class is multivariate Gaussian [29]. In the MQDF, the minor eigenvalues of each class are replaced by a constant, such that only the principal eigenvectors are used in the discriminant function. This helps reduce the computational complexity and meanwhile benefits the generalization performance. For the NPC, we test two variants depending on the prototype learning algorithm: one trained by the log-likelihood of margin criterion (NPC-LOGM) [34], and one trained by the one-vs-all criterion (NPC-OVA) [35]. The training objective of NPC-LOGM is the negative conditional log-likelihood loss (CLL), where the posterior probability is approximated by the logistic (sigmoidal) function of the hypothesis margin. For the NPC-OVA classifier, the training objective is the multi-class cross-entropy (CE) loss, where the binary posterior probability is approximated by the sigmoidal function as well. More details of the MQDF classifier, NPC-LOGM and NPC-OVA can be found in [29,34,35], respectively.
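For reference, the MQDF discriminant is commonly written in the following form (the notation here is ours; see [29] for the derivation). For class $\omega_j$ with mean $\mu_j$, principal eigenvalues $\lambda_{ji}$ and eigenvectors $\phi_{ji}$ ($i=1,\ldots,k$) of the class covariance matrix, constant $\delta_j$ replacing the minor eigenvalues, and feature dimensionality $d$, the class minimizing

$$g_j(\mathbf{x}) = \sum_{i=1}^{k} \frac{[(\mathbf{x}-\mu_j)^T\phi_{ji}]^2}{\lambda_{ji}} + \frac{1}{\delta_j}\Big[\|\mathbf{x}-\mu_j\|^2 - \sum_{i=1}^{k} [(\mathbf{x}-\mu_j)^T\phi_{ji}]^2\Big] + \sum_{i=1}^{k}\log\lambda_{ji} + (d-k)\log\delta_j$$

is selected, so that only the $k$ principal eigenvectors per class need to be stored and evaluated.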
3.2. Geometric context modeling

Geometric context has been proven effective in character string recognition [9,10] and in transcript mapping of handwritten documents [36]. Similar to the geometric modeling in [36], we design four geometric models: unary and binary class-dependent models, and unary and binary class-independent models. To build the geometric models, we extract features for unary and binary geometry from the bounding boxes of a candidate character pattern and from two adjacent character patterns, respectively [36]. Due to the large number of Chinese characters and the fact that many different characters have similar geometric features, we cluster the character classes into six super-classes using the EM algorithm. After clustering, we use a 6-class quadratic discriminant function (QDF) for the unary class-dependent model, and a 36-class QDF for the binary class-dependent model. For the class-independent geometric models, which are in essence two-class classification models, we use a linear support vector machine (SVM) [37] trained with character and non-character samples for the unary class-independent model, and similarly, a linear SVM for the binary class-independent model. In path evaluation, we convert both QDF and SVM outputs to posterior probabilities via sigmoidal confidence transformation.

3.3. Combining parameter estimation

The combining weights are learned by minimum classification error (MCE) training [38,39], which has been popularly used in speech recognition and handwriting recognition [40–42]. The objective of learning the combining weights by MCE is to optimize the string recognition accuracy. In string-level MCE training, the weights are estimated on a dataset containing $R$ string samples $D_x = \{(X^n, C_t^n)\,|\,n = 1,\ldots,R\}$, where $C_t^n$ is the ground-truth transcript of the string sample $X^n$. Following Juang et al. [38], the misclassification measure on a string sample is approximated by

$$d(X,\Lambda) = -g(X,C_t,\Lambda) + g(X,C_r,\Lambda), \qquad (2)$$

where $\Lambda$ is the parameter set, $g(X,C_t,\Lambda)$ is the discriminant function for the truth class, and $g(X,C_r,\Lambda)$ is the discriminant function of the closest rival class: $g(X,C_r,\Lambda) = \max_{C_k \neq C_t} g(X,C_k,\Lambda)$. The misclassification measure is transformed to a loss by the sigmoidal function:

$$l(X,\Lambda) = \frac{1}{1 + e^{-\xi d(X,\Lambda)}}, \qquad (3)$$

where $\xi$ is a parameter controlling the hardness of the sigmoidal nonlinearity. The parameters in MCE training are learned by stochastic gradient descent [43] on each input sample by

$$\Lambda(t+1) = \Lambda(t) - \epsilon(t)\, U\, \nabla l(X,\Lambda)\big|_{\Lambda=\Lambda(t)}, \qquad (4)$$

where $\epsilon(t)$ is the learning step, and $U$ is related to the inverse of the Hessian matrix and is usually approximated to be diagonal. In MCE training for handwritten character string recognition, the discriminant function is the path evaluation criterion (1), and the rival segmentation–recognition path, which is the one most confusable with the correct one, is obtained by beam search. Substituting the discriminant functions $f_c$ and $f_r$ of the correct and rival paths into (4), the parameters are updated iteratively as

$$\Lambda(t+1) = \Lambda(t) - \epsilon(t)\frac{\partial l(X,\Lambda)}{\partial \Lambda}\Big|_{\Lambda=\Lambda(t)} = \Lambda(t) - \epsilon(t)\,\xi\, l(1-l)\frac{\partial d(X,\Lambda)}{\partial \Lambda}\Big|_{\Lambda=\Lambda(t)} = \Lambda(t) - \epsilon(t)\,\xi\, l(1-l)(f_r - f_c). \qquad (5)$$
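A minimal sketch of one string-level MCE update (Eqs. (2)–(5)) is given below, under the assumption that the per-weight context sums of the correct and rival paths have already been accumulated, so that each path score is the dot product of the weights with those sums; the function and variable names are illustrative.

```python
import math

def mce_update(weights, feat_correct, feat_rival, lr, xi):
    """One stochastic-gradient MCE step on the combining weights.
    feat_correct / feat_rival: per-weight log-score sums of the correct path
    and the closest rival path; lr is the learning step, xi the sigmoid slope."""
    g_c = sum(w * f for w, f in zip(weights, feat_correct))   # g(X, C_t, Lambda)
    g_r = sum(w * f for w, f in zip(weights, feat_rival))     # g(X, C_r, Lambda)
    d = -g_c + g_r                                            # Eq. (2)
    l = 1.0 / (1.0 + math.exp(-xi * d))                       # Eq. (3)
    scale = lr * xi * l * (1.0 - l)                           # gradient scale from Eq. (5)
    return [w - scale * (fr - fc)
            for w, fc, fr in zip(weights, feat_correct, feat_rival)]
```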
4. Real-time recognition system

The proposed real-time recognition system consists of four main modules (Fig. 2(a)): the real-time segmentation–recognition module, the sentence recognition module, the sentence edition module and the language association module. While the real-time segmentation–recognition and sentence recognition modules are the core of the automatic recognition system, the other two modules are provided to facilitate user applications. The real-time segmentation–recognition module (Fig. 2(b)) acts whenever an ongoing stroke is produced. In line segmentation, the system judges which text line the new stroke belongs to. If the stroke belongs to a previous line, the line is updated and character over-segmentation is performed on the line. If no previous line is found to contain the stroke, the stroke is considered to start a new line and composes the first primitive segment (a stroke block) of the line. In character over-segmentation of a text line, if the stroke belongs to a previous segment of the line, the system updates that segment; otherwise it creates a new segment using the stroke and
Fig. 2. (a) Flow chart of the real-time recognition system. (b) Flow chart of the real-time segmentation–recognition module.
Fig. 3. (a) A candidate lattice and (b) the updated lattice after a new stroke. The partial lattice drawn with red lines is added to the previous lattice. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
finds the position of the new segment in the sequence of segments according to the left boundaries. After assigning the new stroke, the updated primitive segment or newly created segment is merged with its preceding segments to generate candidate characters, which are recognized by a character classifier to assign candidate classes. The new candidate characters and their assigned classes, as well as the linguistic context and geometric context scores associated with the new candidate characters, are added into the candidate segmentation– recognition lattice. Fig. 3 shows an intermediate candidate lattice and its updated form due to a new stroke. After real-time segmentation–recognition on a new stroke, if the pen lift time exceeds a threshold (adjustable by the user, e.g., 0.5 s), the result of sentence recognition is obtained by path search in the updated candidate lattice, performed by the sentence recognition module. The sentence recognition result may have errors of character segmentation or recognition. A sentence edition module is thus designed to correct such errors. Character split error can be corrected by drawing a circle embracing the split parts. Character merge error can be corrected by drawing a vertical line to separate the merged characters. After manual merge or split, the merged or
split parts are re-combined into candidate characters and reassigned candidate classes, and the updated candidate lattice is re-searched for the sentence recognition result. For a character recognition error, the candidate classes are displayed when the user clicks on the character area, and the user can select the correct class. If the correct class is not among the top ranks, the user can erase the character and rewrite it to activate real-time recognition. In the following, we elaborate the techniques in the real-time segmentation–recognition and sentence recognition modules.
5. Real-time segmentation–recognition module

On a new stroke, the real-time segmentation–recognition module performs dynamic text line segmentation, character over-segmentation, and updating of the segmentation–recognition candidate lattice. Algorithm 1 illustrates the real-time process of an ongoing stroke, where Part 1 performs dynamic text line segmentation and Part 2 performs dynamic character over-segmentation. Afterwards, candidate characters are generated from the updated primitive segment and assigned candidate classes to update the candidate lattice.
Algorithm 1. Real-time process of an ongoing stroke.

Input:
  Existing lines and segment sequence: lines, segments
  Line number: m
  A new stroke: strk
Initialization: set lineIdx = -1
// Part 1: dynamic line segmentation
For i = m to 1
  feature = LineStrokeFeature(strk, line_i),
  Classifier(feature),
  if strk belongs to line_i
    lineIdx = i; break;
  else continue;
End for.
// Part 2: dynamic character over-segmentation
If (lineIdx > 0)
  Merge strk into the lineIdx-th line,
  OverSegmentation(updated lineIdx-th line)
Else
  Create a new line using strk,
  Create the first segment of the line using strk,
  m = m + 1.
End if.
Update the segment sequence,
// Part 3
Generate candidate characters,
Update the candidate lattice.
End.
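A compact sketch of the control flow of Algorithm 1 is shown below, assuming a text line is represented simply as a list of strokes; the feature extractor, the line–stroke classifier and the downstream over-segmentation/lattice routines are passed in as callables and stand in for the components described in Sections 5.1–5.4.

```python
def process_stroke(strk, lines, line_stroke_feature, classifier,
                   over_segment, update_lattice):
    """One real-time step: assign the new stroke to a text line (Part 1),
    update the segments of that line (Part 2), then refresh the lattice (Part 3)."""
    line_idx = -1
    for i in range(len(lines) - 1, -1, -1):        # scan from the latest line backwards
        feature = line_stroke_feature(strk, lines[i])
        if classifier(feature):                    # stroke belongs to lines[i]?
            line_idx = i
            break
    if line_idx >= 0:
        lines[line_idx].append(strk)               # merge the stroke into the line
        over_segment(lines[line_idx], strk)        # dynamic over-segmentation (Algorithm 2)
    else:
        lines.append([strk])                       # the stroke starts a new line
    update_lattice(lines, strk)                    # candidate generation + lattice update
    return line_idx
```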
5.1. Dynamic line segmentation

This step assigns a new stroke (denoted as strk) to one of the m previous lines (denoted as lines) or starts a new line. In the algorithm, lineIdx is the index of the text line that the new stroke belongs to, and lineIdx = -1 indicates that the stroke starts a new line. The function LineStrokeFeature(strk, line_i) extracts geometric features characterizing the relationship between the stroke and the i-th line. Based on the features, if the classifier judges that strk belongs to line_i, then line_i is updated and over-segmentation is performed on the updated line_i. Otherwise, the process continues until a line containing the stroke is found or all the previous lines have been considered. If there is no line containing the stroke, the stroke forms a new line. We adopt a statistical classifier to model the geometric relationship of a line–stroke pair and to judge whether the stroke belongs to the line or not. To collect training samples for the two-class classifier, we extract samples from a stroke and its temporally previous lines. If the stroke belongs to the line, the sample is considered a positive one, otherwise a negative one. Samples can be extracted from ground-truthed online documents containing multiple text lines. From each positive or negative sample (a line–stroke pair), geometric features are extracted for training the classifier. For extracting geometric features from a line–stroke pair, we do not rely on temporal features such as the off-stroke distance, so as to cope with delayed strokes. Before feature extraction, the line line and the stroke strk are tentatively merged and fitted by linear regression. Denote the merged line as line_t. The line height is estimated by computing the average height of the strokes. We extract 22 features from the line–stroke pair, as listed in Table 1. The features can be divided into four categories: (1) five features related to the line line (No. 1–5 in Table 1); (2) two features related to the stroke strk (No. 6 and 7); (3) four scalar features related to the merged line line_t (No. 8–11); and (4) 11 scalar features related to the geometric relationship between the stroke strk and the line line as well as the merged line line_t (No. 12–22 in Table 1). The estimated line height of a character string is important for extracting line–stroke and segment–stroke geometric features. To estimate the line height (denoted as lineHei in this paper) robustly, all the strokes in the line are first sorted in ascending order of height, and the half of the strokes with larger heights are used to estimate the line height (as the average of the heights of the selected strokes). While writing proceeds, the estimate is updated to incorporate the new stroke. The estimate becomes more accurate as the number of strokes increases.
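A minimal sketch of this line-height estimate, assuming the per-stroke bounding-box heights are already available (names are illustrative):

```python
def estimate_line_height(stroke_heights):
    """Average the taller half of the stroke heights, as described above."""
    if not stroke_heights:
        return 0.0
    taller_half = sorted(stroke_heights)[len(stroke_heights) // 2:]
    return sum(taller_half) / len(taller_half)
```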
Fig. 4. Segment sequence of multiple lines.
Table 1
Line–stroke geometric features (the last column denotes whether the feature is normalized w.r.t. the line height or not).

No.     Feature                                                                                               Norm
1–2     Height and width of line                                                                              Y
3       The number of strokes in line                                                                         N
4       Average regression error of line: $\sigma_1^2$                                                        Y
5       Horizontal direction of the line line                                                                 N
6       Height of strk                                                                                        Y
7       Aspect ratio of strk                                                                                  N
8–9     Height and width of line_t                                                                            Y
10      Average regression error of line_t: $\sigma_2^2$                                                      Y
11      Horizontal direction of the line line_t                                                               N
12      Growth of line height                                                                                 Y
13      Change of horizontal direction                                                                        N
14      Change of average regression error                                                                    Y
15      Distance between line and strk, as the minimum distance between strk and the strokes in line          Y
16      Common area of line and strk                                                                          Y
17–18   Distances of upper/lower bound of strk to the vertical center of line along the norm direction of line   Y
19–22   Distances between the upper bounds, lower bounds, upper-lower bounds, and lower-upper bounds of line and strk   Y
5.2. Dynamic character over-segmentation

After text line segmentation, dynamic character over-segmentation is performed to update the sequence of primitive segments. In our system, the segments of multiple lines are ordered in one sequence, as depicted in Fig. 4. In the case that the ongoing stroke forms a new line, the stroke composes the first segment of the line, which is considered the last segment in the sequence. If the stroke is judged to belong to a previous line, dynamic character over-segmentation is performed on the updated line by the function OverSegmentation(updated lineIdx-th line) in Algorithm 1, as detailed in Algorithm 2. Algorithm 2 aims to locate the segment that the stroke belongs to in the text line L, similar to the line segmentation algorithm in Part 1 of Algorithm 1. Suppose there are n previous segments in L. In Algorithm 2, segIdx denotes the index of the segment that the new stroke belongs to, and segIdx = -1 indicates that the stroke starts a new segment. The function SegStrokeFeature(strk, s_i) extracts geometric features characterizing the relationship between the stroke and the i-th segment. Based on the features, the output of a classifier is transformed into a confidence measure indicating the probability of the stroke belonging to the segment. If the confidence is greater than a threshold γ, strk is considered to belong to s_i and is merged into s_i. Otherwise, the process continues until a segment containing the stroke is found. If there is no segment containing the stroke, the stroke is considered to start a new segment. The threshold γ should be large enough to avoid merge errors in over-segmentation (γ is safely set to 0.85 empirically in our system). After assignment of the stroke, the updated segment sequence of the line is sorted according to the left boundaries, performed by the function SortSegments(s_1 s_2 ... s_n s_{n+1}).

Algorithm 2. Dynamic character over-segmentation.

Input:
  Updated line: L
  Segment number: n
  Previous segments: s_1 s_2 ... s_n
  A new stroke: strk
Initialization: set segIdx = -1
For i = n to 1
  feature = SegStrokeFeature(strk, s_i),
  confidence = Classifier(feature),
  if (confidence > γ) // strk belongs to s_i
    segIdx = i; break;
  else continue;
End for.
If (segIdx > 0)
  Merge strk into the segIdx-th segment,
Else
  Create a new segment using strk,
  SortSegments(s_1 s_2 ... s_n s_{n+1}),
  n = n + 1.
End if.
End.

Fig. 5. Examples of segment sequences with a new stroke inserted. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Fig. 5 shows the two main cases of dynamic character over-segmentation on a new stroke, where the segment with a red frame indicates a newly created or an updated segment. Case A shows normally written strokes, where the stroke is written at the end of the line, while Case B shows delayed strokes inserted into previous parts. Delayed strokes also occur when a character is deleted during user edition and a new character is re-written in the same position. In Case B, if the stroke starts a new segment, the position of the newly created segment is located according to the left boundaries. For robust over-segmentation, we also adopt a statistical classifier to model the geometric relationship of a segment–stroke pair. To collect training samples of positive and negative segment–stroke pairs, we first segment the text lines of the training data into primitive segments according to the off-stroke distance and then re-arrange delayed strokes using spatial information. Each stroke is paired with its temporally preceding segments to form segment–stroke pair samples, which are positive or negative depending on whether the stroke really belongs to the segment or not. Similar to feature extraction for line segmentation, we do not rely on temporal features, so as to cope with delayed strokes. The geometric features of a segment–stroke pair include 12 features in total: three features related to the stroke strk (No. 1–3 in Table 2), two features related to the segment s_i (No. 4 and 5), two features related to the temporally merged segment s_i^t (No. 6 and 7), and five scalar features related to the relationship between strk and s_i (No. 8–12). The horizontal overlap between strk and s_i, which is important for character over-segmentation, is characterized by the horizontal relationship between them, as in features No. 9–12.
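The per-pair decision in Algorithm 2 can be sketched as follows, assuming a linear-SVM score for each segment–stroke pair is already available and is mapped to a confidence by the sigmoidal transformation; the threshold follows the 0.85 quoted above, and all names are illustrative.

```python
import math

def locate_segment(pair_scores, gamma=0.85):
    """Scan existing segments from the most recent one backwards and return the
    index of the segment the stroke is merged into, or -1 to start a new segment.
    pair_scores[i] is the classifier output for the (stroke, segment i) pair."""
    for i in range(len(pair_scores) - 1, -1, -1):
        confidence = 1.0 / (1.0 + math.exp(-pair_scores[i]))  # sigmoidal confidence
        if confidence > gamma:
            return i
    return -1
```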
5.3. Candidate character generation

Given the dynamic line segmentation and character over-segmentation results, the generation of candidate characters is straightforward. Fig. 6 shows examples of new candidate characters. The segment with a red box indicates the one updated or formed by a new stroke, and the blue frame embraces the candidate characters that start from or end at the red segment. In this paper, the maximum number of segments composing a candidate character is denoted as SN.
Table 2
Segment–stroke geometric features (the last column denotes whether the feature value is normalized w.r.t. the line height).

No.     Feature                                                                                Norm
1–2     Height and width of strk                                                               Y
3       Aspect ratio of strk                                                                   N
4–5     Height and width of s_i                                                                Y
6–7     Height and width of the temporally merged segment s_i^t                                Y
8       Common area of the bounding boxes of strk and s_i                                      Y
9       Horizontal gap between the bounds of strk and s_i                                      Y
10–12   Distances between the left bounds, right bounds, and horizontal centers of strk and s_i   Y
Fig. 6. Examples of new candidate characters. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
The generation of candidate characters is subject to some heuristic rules that reduce the number of candidate characters while guaranteeing that true characters are included: (a) the number of segments in a candidate character does not exceed the maximum number SN; (b) segments in different lines are not combined into a candidate character; (c) candidate characters with width larger than a threshold (safely set to 3·lineHei in our system) are pruned; (d) two successive segments with horizontal distance larger than a threshold (safely set to 2·lineHei in our system) are not merged.
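A sketch of how the candidate characters around the updated segment can be enumerated under rules (a)–(d) is given below. Each segment is represented here as a (line_id, left, right) tuple purely for illustration; the thresholds follow the values quoted above.

```python
def generate_candidates(segments, new_idx, SN=6, line_hei=1.0):
    """Enumerate candidate characters (as index ranges) that contain the updated
    segment `new_idx`, applying pruning rules (a)-(d)."""
    candidates = []
    for start in range(max(0, new_idx - SN + 1), new_idx + 1):
        for end in range(new_idx, min(len(segments), start + SN)):  # (a) at most SN segments
            group = segments[start:end + 1]
            if len({line_id for line_id, _, _ in group}) > 1:       # (b) one text line only
                continue
            width = max(r for _, _, r in group) - min(l for _, l, _ in group)
            if width > 3 * line_hei:                                # (c) width threshold
                continue
            gaps = [nxt[1] - cur[2] for cur, nxt in zip(group, group[1:])]
            if any(gap > 2 * line_hei for gap in gaps):             # (d) gap threshold
                continue
            candidates.append((start, end))
    return candidates
```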
5.4. Candidate lattice updating

After over-segmentation and candidate character generation, the candidate character classes and their scores, and the linguistic context and geometric context scores involving the newly generated candidate characters, are obtained using the character classifier, the language model and the geometric models, respectively, and are added into the candidate lattice. To roughly estimate the computation cost on a new stroke, we consider the costs of feature extraction and classification in character recognition and geometric context scoring, as well as of retrieving the linguistic scores. Denote the number of candidate classes maintained in the candidate lattice for each candidate character as CN and the number of newly generated candidate character patterns as PN (pattern number). In normal writing order, as in Case A of Fig. 6, the maximum number of new candidate characters is PN = SN (some candidate characters may violate conditions (b)–(d) and are pruned). When a delayed stroke is written, as in Case B of Fig. 6, where the delayed stroke falls in the pos-th segment, the number of candidate characters composed of k segments and containing the pos-th segment is k (with the start segment ranging from the (pos-k+1)-th to the pos-th). Considering candidate characters composed of 1, 2, ..., SN segments, the maximum number of candidate characters associated with the pos-th segment is PN = 1 + 2 + ... + SN = SN(SN+1)/2. The cost of character classification (including character feature extraction and classification) is proportional to the number of candidate character patterns PN. For the linguistic context given by the character bi-gram, there is no feature extraction, only retrieval of the value for pairs of successive characters from the lexicon. For each candidate class of a candidate character, there are at most SN·CN preceding classes and SN·CN succeeding classes in the candidate lattice. Since the maximum number of candidate characters associated with the current segment is PN and the maximum number of candidate classes is PN·CN, the cost of retrieving the language model is proportional to 2·PN·CN·(SN·CN) (with PN = SN(SN+1)/2) for Case B and to PN·CN·(SN·CN) (with PN = SN) for Case A (which has predecessors only). Updating the binary class-dependent geometric context is similar to updating the linguistic context, except that bi-gram retrieval is replaced by geometric feature classification. The binary class-independent geometric context and the unary geometric contexts cost less computation than the binary class-dependent geometric context because they have fewer geometric classes. Since candidate character classification and geometric context scoring cost the majority of the computation, it is beneficial to compute them in real time during the writing process. After the writing of a sentence is finished, the sentence recognition only has to search the candidate lattice.
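As a concrete instantiation of these bounds (using the default SN = 6 and CN = 10 adopted later in Section 7.4; the figures below simply plug these values into the expressions above):

$$\text{Case A: } PN = SN = 6, \qquad PN \cdot CN \cdot (SN \cdot CN) = 6 \cdot 10 \cdot 60 = 3600;$$
$$\text{Case B: } PN = \frac{SN(SN+1)}{2} = 21, \qquad 2 \cdot PN \cdot CN \cdot (SN \cdot CN) = 2 \cdot 21 \cdot 10 \cdot 60 = 25{,}200.$$

Thus a delayed stroke (Case B) incurs about seven times as many bi-gram retrievals as normal-order writing (Case A), while the character classification cost grows only from at most 6 to at most 21 candidate patterns.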
6. Sentence recognition (path search)

Sentence recognition searches the candidate lattice for the optimal segmentation–recognition path. Due to the summation nature of the path evaluation criterion of Eq. (1), the dynamic programming (DP) algorithm can be adopted for optimal path search. We further apply the beam search strategy to accelerate the DP search by pruning the partial paths at intermediate nodes. The search algorithm is suitable for real-time recognition because the retained partial paths can be extended for further path search as new strokes are continually produced. The adopted beam search algorithm is similar to that used in [21], but we implement it in a different way for efficient updating of the candidate lattice in real-time recognition. The DP search algorithm is similar to the forward procedure of the Viterbi decoding algorithm [44]. After character over-segmentation, sentences of multiple lines are represented as a sequence of primitive segments $\langle x_1, x_2, \ldots, x_T \rangle$, where $T$ is the total number of segments in the sequence. A candidate pattern consisting of $s$ segments and ending at the $t$-th segment is denoted as $x_{t-s+1,t}$ ($1 \le s \le SN$). If we assign the $c$-th candidate class ($1 \le c \le CN$) to the candidate pattern, we get one single path from the $(t-s+1)$-th segment to the $t$-th segment, denoted as $(t,s,c)$. Denote the candidate paths ending at the $t$-th segment as $P_t$, in which one single path is denoted as $p_t$. Then the forward variable can be defined as

$$f_{t,s,c} = \max_{p_{t-s} \in P_{t-s}} f(p_{t-s}, (t,s,c)), \qquad (6)$$
i.e., $f_{t,s,c}$ is the best score (highest probability) along a single path ending at the $(t-s)$-th segment, extended with a candidate character ending at the $t$-th segment and associated with class $c$. The beam search strategy accelerates DP by pruning candidate paths: among the candidate paths ending at the $(t-s)$-th segment, we retain the BW (band width) top-ranked paths and prune the others. We can then search the optimal path ending at the $t$-th segment inductively as follows:

Algorithm 3. Beam search in frame-synchronous fashion.

(1) Initialization:

$$f_{1,s,c} = \begin{cases} f_{1,s,c}, & s = 1,\ 1 \le c \le CN, \\ 0, & s \ge 2,\ 1 \le c \le CN. \end{cases}$$
If CN > BW, retain the BW top-ranked paths ending at the first segment; otherwise, retain all CN paths.

(2) Induction:

$$f_{t,s,c} = \max_{\mathrm{top}\ BW\ (s',c')} \big\{ f_{t-s,s',c'} + k \log P(c|x_{t-s+1,t}) + \lambda_1 \log P(c|c') + \lambda_2 \log P(c|g^{uc}) + \lambda_3 \log P(z^p=1|g^{ui}) + \lambda_4 \log P(c',c|g^{bc}) + \lambda_5 \log P(z^g=1|g^{bi}) \big\}.$$

Retain the BW top-ranked paths ending at the $t$-th segment.

(3) Termination:

$$f_T = \max_{(s,c)} f_{T,s,c}.$$
(4) Backtracking.

Step (1) initializes the candidate paths that contain the first segment as the candidate character pattern, using the character classification score, the linguistic context score, and the unary class-dependent and class-independent geometric context scores. The induction step, which is the heart of the algorithm, searches the optimal path for each triplet $(t,s,c)$ based on the previous optimal partial paths ending at the $(t-s)$-th segment, using the multiple contexts that have already been computed when updating the candidate lattice during writing. In the termination step, the optimal complete path is chosen from the paths ending at the last segment, and the character segmentation and recognition results are obtained in the backtracking step. In the induction step, the maximum number of candidate paths ending at the $(t-s)$-th segment is SN·CN, among which the optimal one is chosen as the preceding path of $(t,s,c)$. When BW equals SN·CN, the search process is the same as the DP algorithm. When BW < SN·CN, the search process is accelerated. From the algorithm, we can see that the path search is frame-synchronous (also called time-synchronous in speech recognition; partial paths are updated segment by segment), and the DP algorithm guarantees finding the optimal path for context models up to order 2. Since the beam search algorithm updates the optimal partial paths ending at a segment from the retained partial paths ending at the previous segments, it enables path extension when the candidate lattice is updated on new strokes. Suppose that in real-time recognition the position of the updated or newly created segment on a new stroke is pos; the system then performs beam search from the candidate paths ending at the segments preceding the pos-th segment, and extends to the succeeding segments if the new stroke is a delayed stroke.
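The frame-synchronous search can be sketched as follows. The function `lattice_score` stands in for the pre-computed lattice entries (the bracketed term of the induction step) and is an illustrative placeholder; backtracking pointers are omitted for brevity.

```python
def beam_search(T, SN, CN, BW, lattice_score):
    """Frame-synchronous beam search over the segment sequence 1..T.
    lattice_score(t, s, c, prev_c) returns the combined log-score of assigning
    class c to the candidate pattern spanning segments t-s+1..t after class prev_c."""
    beams = {0: [(0.0, 0, None)]}                   # virtual start before segment 1
    for t in range(1, T + 1):
        hyps = []
        for s in range(1, min(SN, t) + 1):          # candidate pattern of s segments
            for prev_score, _, prev_c in beams[t - s]:
                for c in range(CN):
                    score = prev_score + lattice_score(t, s, c, prev_c)
                    hyps.append((score, s, c))
        # keep only the BW best partial paths ending at segment t
        beams[t] = sorted(hyps, key=lambda h: h[0], reverse=True)[:BW]
    return max(beams[T], key=lambda h: h[0]) if beams[T] else None
```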
7. Experiments

We evaluated the performance of the proposed real-time recognition approach on a database of online Chinese handwriting, CASIA-OLHWDB [13]. This database is divided into six datasets: three of isolated characters (DB1.0–1.2) and three of handwritten texts (DB2.0–2.2). There are 3,912,017 isolated character samples and 52,221 handwritten pages (consisting of 1,348,904 character samples) in total. Both the isolated character data and the handwritten text data have been divided into standard training and test subsets. Although the handwritten text data was produced in advance, we can utilize the temporal stroke order to simulate the real-time writing process for evaluating real-time recognition performance. In sentence-based input, due to the limited writing area of mobile computers, users tend to write multiple text lines, and each line contains only a few (mostly fewer than 10) characters.
Fig. 7. (a) A handwritten text page; (b) three short pages generated from the first three lines.
Table 3
Statistics of DB2 and the generated short page dataset GDB2.

Dataset       #Page    #Line     #Line/page   #Characters   #Chars/line
DB2  Train    4072     41,710    10.24        1,082,220     25.95
DB2  Test     1020     10,510    10.30        269,674       25.66
GDB2 Train    41,710   155,368   3.72         1,082,220     6.97
GDB2 Test     10,510   38,870    3.70         269,674       6.94
To simulate this situation, we used the datasets DB2.0–2.2 (called DB2 for short) to generate short text pages, each with three to six text lines and each line consisting of six to eight characters. In DB2, a text line typically contains 20–30 characters because it was written on A4 paper using a digital pen. We split each original line into multiple lines of a short page by making the width of each line no larger than five times the average height of the original lines. Fig. 7 shows an example of the data generation: (a) is a handwritten text page, and (b) shows three short pages derived from the first three lines in (a). Table 3 provides the details of the dataset DB2 and the generated short page dataset (called GDB2 for short). The total number of strokes in the test set is 987,027. To evaluate the real-time recognition performance on short pages with delayed strokes, we produced delayed strokes in GDB2 by changing the writing order of one stroke in each page. Specifically, we randomly chose a stroke and placed it randomly after its original position. The generated short page dataset with delayed strokes is called GDB2-D for short.

7.1. Experimental setup

For dynamic line segmentation and character over-segmentation, we used a linear SVM classifier to model the geometric relationship of line–stroke pairs and segment–stroke pairs, respectively, and trained the classifiers on features extracted from the training text lines of GDB2-D. We evaluated the recognition performance using the three character classifiers introduced in Section 3.1: MQDF, NPC-LOGM, and NPC-OVA. The classifier parameters were learned on 4/5 of the training character samples (both the isolated characters in the training set of DB1 and the segmented characters in the training set of DB2, 4,207,801 samples in total), and the remaining 1/5 of the training samples were used for confidence parameter estimation. The training character samples fall into 7356 classes, including
7185 Chinese characters and 171 alphanumeric characters and symbols. For character feature extraction, we use the local stroke direction histogram feature, which has been popularly used in both online and offline handwritten character recognition. In particular, we adopt the implementation of [45] for direction feature extraction using bi-moment normalization. After direction decomposition, 8 × 8 feature values are extracted from each of eight direction planes. To reduce the complexity of the classifier, the 512D feature vector is projected onto a 160D subspace learned by Fisher linear discriminant analysis (FLDA). The character bi-gram language model was trained on a text corpus containing about 50 million characters (about 32 million words) [16]. To estimate the parameters of the geometric models and train the combining weights of the path evaluation criterion, we simulated the real-time character over-segmentation process on the training text lines of GDB2-D. In the simulation, a text line is over-segmented into primitive segments using the dynamic over-segmentation algorithm. On the segment sequence, we extracted samples of geometric features for geometric context modeling. Using the character classifier, language model and geometric context models, we then constructed the candidate lattice on the segment sequence and trained the combining weights by MCE. Table 4 shows some statistics of the character samples segmented from the test pages of DB2. The "Rec(%)" row gives the correct rate of segmented character recognition by each character classifier, and "Rec10" and "Rec20" are the cumulative accuracies of the top 10 and top 20 ranks, respectively. We can see that for all three classifiers, the correct rate on Chinese characters is the highest among the four character types, and that the MQDF classifier is the best on Chinese characters among the three classifiers. Comparing the overall correct rates, however, the NPC-LOGM classifier is the highest because it performs much better on symbols. The non-characters are abnormal samples labeled as non-characters in the database, and the outliers are characters outside the defined 7356 classes. Our experiments were run on a PC with an Intel(R) Core(TM) 2 Duo E8400 3.00 GHz processor and 2 GB RAM, and were programmed using Microsoft Visual C++.

7.2. Performance metrics

We use separate performance metrics for dynamic line segmentation and real-time sentence recognition. Many metrics have been defined for evaluating the performance of line segmentation [10,46,47]. We adopt some of them and define a new metric for the performance of real-time line segmentation. These metrics are based on the definitions of matches. A one-to-one match is a match where a detected line and a ground-truthed line contain identical strokes. A g-one-to-many match occurs when the union of two or more detected lines equals a ground-truthed line.
Similarly, a d-many-to-one match means that the union of two or more ground-truthed lines equals a detected line. Among the performance metrics presented in [28], we chose the detection rate (DR), recognition accuracy (RA) and entity detection metric (EDM):

$$DR = w_1\frac{\mathit{one2one}}{N} + w_2\frac{\mathit{g\_one2many}}{N}, \qquad RA = w_3\frac{\mathit{one2one}}{M} + w_4\frac{\mathit{d\_many2one}}{M}, \qquad EDM = \frac{2 \cdot DR \cdot RA}{DR + RA},$$
where $N$ is the number of ground-truthed lines, $M$ is the number of detected lines, and $w_1$–$w_4$ are all set to 1. DR, RA and EDM are similar to recall, precision and F-rate, respectively. The page recognition rate (PRR), defined as the percentage of pages with no segmentation error, is used to measure the page-level performance. To evaluate the performance of real-time sentence recognition, we use two character-level metrics [15,21], the correct rate (CR) and accurate rate (AR):

$$CR = (N_t - D_e - S_e)/N_t, \qquad AR = (N_t - D_e - S_e - I_e)/N_t,$$

where $N_t$ is the total number of characters in the ground-truth transcript. The numbers of substitution errors ($S_e$), deletion errors ($D_e$) and insertion errors ($I_e$) are calculated by aligning the recognition result string with the transcript by dynamic programming. The metric CR denotes the percentage of characters that are correctly recognized, while AR further penalizes the characters that are inserted due to over-segmentation. For real-time recognition of handwritten sentences, besides CR and AR, the recognition speed is of crucial importance for practical applications. We evaluated the speed using the CPU times of each separate step as well as of the whole process: line segmentation (denoted as Cl), character over-segmentation (Co), candidate lattice updating (computing the multiple contexts in the candidate lattice, Cu), path search (Cs), and the whole recognition process (Cw). Within Cu, we also count the times of character classification, geometric context scoring and linguistic context scoring separately. The time cost is averaged over the total number of strokes, since in real-time recognition each step is performed once per stroke.

7.3. Performance of dynamic line segmentation

We evaluated the performance of real-time line segmentation on the simulated short text pages. Given an online page, whenever a stroke is input, the system performs line segmentation and updates the text lines. After the last stroke is processed, the result of line segmentation is obtained.
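A minimal sketch of the character-level evaluation defined in Section 7.2 is given below: the recognition result is aligned to the transcript by edit-distance dynamic programming, and the substitution, deletion and insertion counts yield CR and AR. Tie-breaking among equal-cost alignments is an implementation choice here, not prescribed by the paper.

```python
def cr_ar(result, truth):
    """Align `result` with `truth` by DP and return (CR, AR) as in Section 7.2."""
    R, T = len(result), len(truth)
    # dp[i][j] = (edit_cost, Se, De, Ie) for result[:i] vs truth[:j]
    dp = [[None] * (T + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for j in range(1, T + 1):                          # missing transcript chars: deletions
        c, se, de, ie = dp[0][j - 1]
        dp[0][j] = (c + 1, se, de + 1, ie)
    for i in range(1, R + 1):                          # extra result chars: insertions
        c, se, de, ie = dp[i - 1][0]
        dp[i][0] = (c + 1, se, de, ie + 1)
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            sub = 0 if result[i - 1] == truth[j - 1] else 1
            c, se, de, ie = dp[i - 1][j - 1]
            best = (c + sub, se + sub, de, ie)         # match / substitution
            c, se, de, ie = dp[i][j - 1]
            best = min(best, (c + 1, se, de + 1, ie))  # deletion
            c, se, de, ie = dp[i - 1][j]
            best = min(best, (c + 1, se, de, ie + 1))  # insertion
            dp[i][j] = best
    _, Se, De, Ie = dp[R][T]
    Nt = T
    return (Nt - De - Se) / Nt, (Nt - De - Se - Ie) / Nt
```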
Table 4
Statistics of character types and recognition rates.

Classifier              All       Chinese   Symbol   Digit   Letter   Non-char   Outlier
Number                  269,674   234,078   26,753   6,931   748      745        419
MQDF        Rec(%)      85.02     89.90     48.07    69.33   73.53    0          0
            Rec10       97.41     98.00     93.93    97.10   95.72    0          0
            Rec20       98.05     98.39     96.67    97.79   96.93    0          0
NPC-LOGM    Rec(%)      87.03     88.62     78.85    72.37   67.65    0          0
            Rec10       97.17     97.40     96.92    96.57   94.12    0          0
            Rec20       98.12     98.24     98.70    98.05   96.26    0          0
NPC-OVA     Rec(%)      86.11     87.80     77.76    68.42   65.24    0          0
            Rec10       96.64     97.01     95.05    96.38   93.85    0          0
            Rec20       97.68     97.92     97.15    97.94   95.59    0          0
Table 5
Performance of dynamic line segmentation on GDB2 and GDB2-D.

Dataset    DR       RA       EDM      PRR
GDB2       0.9964   0.9960   0.9962   0.9509
GDB2-D     0.9838   0.9814   0.9826   0.9253

Table 8
CPU times (ms) in updating the candidate lattice.

Dataset    Classifier   Cu      Char     Geo     Lng
GDB2       MQDF         23.08   17.825   2.484   2.768
           NPC-LOGM     10.95   5.835    2.446   2.699
           NPC-OVA      11.02   5.881    2.452   2.683
GDB2-D     MQDF         23.35   17.960   2.547   2.843
           NPC-LOGM     11.39   6.062    2.524   2.807
           NPC-OVA      11.15   5.881    2.501   2.766
Table 6
Recognition accuracies with dynamic line segmentation.

Dataset    Classifier   CR (%)   AR (%)   ch (%)   sb (%)   dg (%)   lt (%)
GDB2       MQDF         92.13    90.12    94.09    80.24    79.01    88.65
           NPC-LOGM     89.48    87.20    91.22    80.69    69.25    81.75
           NPC-OVA      87.56    84.83    89.46    78.18    61.23    77.39
GDB2-D     MQDF         91.90    89.66    93.85    80.06    79.01    88.57
           NPC-LOGM     89.24    86.76    90.96    80.59    69.52    81.69
           NPC-OVA      87.32    84.37    89.19    78.13    61.76    77.25
Table 7 CPU times (ms) of sentence recognition with dynamic line segmentation. Dataset
3671
Classifier
Cl
Co
Cu
Cs
Cw
GDB2
MQDF NPC-LOGM NPC-OVA
0.411 0.404 0.398
0.435 0.560 0.528
23.08 10.95 11.02
0.035 0.030 0.030
23.96 11.94 12.24
GDB2-D
MQDF NPC-LOGM NPC-OVA
0.412 0.410 0.392
0.698 0.630 0.628
23.35 11.39 11.15
0.036 0.036 0.030
24.50 12.47 12.20
7.3. Performance of dynamic line segmentation

We evaluated the performance of real-time line segmentation on the simulated short-text pages. Given an online page, whenever a stroke is input, the system performs line segmentation and updates the text lines; after the last stroke is processed, the final line segmentation result is obtained. In our previous work [48], we used a linear SVM classifier to characterize the geometric relationship of a line–stroke pair, and evaluated the performance on GDB2.1-D (a subset of GDB2-D). The comparison with related methods (including those based on the off-stroke distance and the overlap of the line–stroke pair) demonstrated the robustness and effectiveness of the proposed algorithm. Hence, in this paper we only report the performance of the proposed algorithm on GDB2 and GDB2-D in Table 5, without comparing with the other methods. The results show that the algorithm performs well and can deal with delayed strokes.

7.4. Performance of real-time sentence recognition

We evaluated the performance of real-time recognition on the generated short-page datasets without (GDB2) and with delayed strokes (GDB2-D). Based on dynamic text line segmentation and candidate lattice updating on each stroke, the sentence recognition result is found using the real-time beam search algorithm with the default parameter setting SN = 6, CN = 10 and BW = 10 (these values were found to give a good tradeoff between recognition accuracy and speed).

7.4.1. Performance with dynamic line segmentation

In this case, the sentence recognition result is obtained based on dynamic text line segmentation, so line segmentation errors will cause sentence recognition errors. The recognition accuracies (CR and AR) on the test data of GDB2 and GDB2-D are listed in Table 6, and the CPU times are given in Table 7. In Table 6, the performance is also broken down by character type: Chinese characters (ch), symbols (sb), digits (dg) and letters (lt). In Table 7, the time cost is broken down by step: line segmentation (Cl), character over-segmentation (Co), candidate lattice updating (Cu), path search (Cs), and the whole process (Cw).
Table 9. Recognition accuracies (%) with ground-truth line segmentation.

Dataset  Classifier  CR     AR     ch     sb     dg     lt
GDB2     MQDF        92.30  90.53  94.28  80.32  79.28  88.73
         NPC-LOGM    89.64  87.59  91.40  80.73  69.65  81.78
         NPC-OVA     87.72  85.22  89.63  78.23  61.50  77.46
GDB2-D   MQDF        92.23  90.37  94.20  80.28  79.41  88.75
         NPC-LOGM    89.56  87.44  91.30  80.72  70.05  81.86
         NPC-OVA     87.62  85.05  89.51  78.25  61.36  77.36
Further, the CPU time in candidate lattice updating is broken down into character classification (Char), geometric context scoring (Geo), and linguistic context scoring (Lng), and is given in Table 8. From the results, we make the following observations:

(a) The character correct rate of sentence-based input is significantly higher than that of isolated character recognition (Table 4), even though sentence-based recognition involves segmentation. This is due to the important effect of contexts.
(b) For all three character classifiers, the correct rates on Chinese characters are fairly high (94.09%, 91.22%, and 89.46% for MQDF, NPC-LOGM, and NPC-OVA, respectively), but the correct rates on symbols, digits and letters are much lower. This is because the shapes of symbols, digits and letters are more easily confused.
(c) The correct rate on handwritten text with delayed strokes is only slightly lower than that on text without delayed strokes. This demonstrates the robustness of the proposed approach against delayed strokes.
(d) Among the three character classifiers, the MQDF classifier gives the highest overall correct rate and accurate rate. This is due to its advantage on Chinese characters, which also benefit from the linguistic context for improving the recognition accuracy.
(e) Tables 7 and 8 show that the major computational cost lies in updating the candidate lattice and, within it, character classification takes most of the time. Comparing the three classifiers, the MQDF classifier is the most computationally intensive and makes the whole recognition process much slower than with the NPC classifiers.
(f) The proportion of time spent on path search (Cs) in the whole recognition process is very small. This favors real-time applications, because the majority of the computation is performed during the writing process and the sentence recognition result can be obtained immediately by path search after writing is finished.
(g) Although the MQDF classifier is computationally intensive, it is acceptable for real-time applications because the majority of the computation is done during writing. Considering the tradeoff between accuracy and speed, however, the NPC-LOGM classifier is preferable.
7.4.2. Performance with ground-truth line segmentation

In this case, the ground truth of text line segmentation is used, so that sentence recognition is not disturbed by line segmentation errors. This evaluates the performance of "pure" sentence recognition, as if each sentence were always written in a single line. The recognition accuracies are shown in Table 9. Compared with Table 6, we can see that the recognition accuracies do not differ significantly between ground-truth and dynamic line segmentation. This is because the dynamic line segmentation algorithm yields very few errors.
7.4.3. Effects of parameters in path search

Both the recognition accuracy and speed depend on the parameters SN (segment number) and CN (candidate class number) in candidate lattice updating, and BW (beam width) in beam search. In the above experiments, the parameters were set to the default values (SN = 6, CN = 10, BW = 10). The choice of SN depends on the candidate character generation method; in our system, SN = 6 was chosen to guarantee that nearly all true characters can be generated by combining consecutive segments. With SN fixed, we evaluate the effects of the other two parameters, CN and BW. Experimental results with CN = 10 and various BW are shown in Fig. 8. When BW equals SN × CN (60 in this case), the performance is the same as that of the DP algorithm. We do not show the results of the NPC-OVA classifier, since its correct rate is lower than that of the NPC-LOGM classifier while their recognition speeds are the same.
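To make the roles of SN, CN and BW concrete, the following sketch shows one pruning-and-extension step of a beam search over the candidate lattice. The data structures and the purely additive scoring are simplified placeholders chosen for illustration, not the actual implementation, which combines character classification, geometric and linguistic scores as described earlier in the paper.

```python
import heapq

def prune_and_extend(paths_at, pos, lattice, bw):
    """One simplified step of beam search over the candidate lattice.

    paths_at[pos] holds (score, character_sequence) partial paths ending at
    primitive segment index pos.  lattice[(start, end)] holds at most CN
    (candidate_class, score) pairs for the candidate character made of
    segments start..end-1, with end - start <= SN.  Only the best bw
    partial paths at pos are extended."""
    best = heapq.nlargest(bw, paths_at.get(pos, []), key=lambda p: p[0])
    for score, chars in best:
        for (start, end), candidates in lattice.items():
            if start != pos:
                continue
            for cls, cand_score in candidates:        # at most CN classes
                new_path = (score + cand_score, chars + (cls,))
                paths_at.setdefault(end, []).append(new_path)
```

In this simplified view, BW bounds the number of partial paths kept at each segment boundary, while SN and CN bound the number of successors per path, so one step examines at most BW × SN × CN extensions.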
Fig. 8. Experimental results with various BW. (a) Correct rate of MQDF and NPC-LOGM classifier. (b) Recognition speed of MQDF and NPC-LOGM classifier.
Fig. 9. Experimental results with various CN. (a) Correct rate of MQDF and NPC-LOGM classifier. (b) Recognition speed of MQDF and NPC-LOGM classifier.
Fig. 10. Examples of line segmentation error. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
From the results, we can see that with BW = 10 the recognition accuracy is as high as with BW = 60, and varying BW makes little difference in recognition speed. Performance with BW = 10 and various CN is shown in Fig. 9. We can see that 10 candidate classes are sufficient: increasing CN improves the correct rate only slightly while incurring substantial extra computation.

7.5. Examples of recognition errors

The sources of real-time sentence recognition errors include text line segmentation errors, character over-segmentation failures (under-segmentation), character classification errors, and path search failures. The error rate of dynamic text line segmentation is very low, as shown in Table 5. Some examples of line segmentation errors are shown in Fig. 10, where the stroke in red is the succeeding stroke written after the first line. Line segmentation errors tend to occur at the beginning of writing, because the line height cannot be estimated precisely from the small number of strokes, and because the red stroke is rather far from the first line. This type of line segmentation error does not affect sentence recognition significantly, however, because the segment sequence remains correctly ordered; as writing continues, the temporary line segmentation error may be corrected once the line becomes longer. Over-segmentation failure
happens when a segment contains strokes belonging to different characters or when connected strokes are written. A character classification error means that the true class of a candidate character is not among its top CN candidate classes, so the correct path is not included in the candidate lattice. A path search failure occurs when the correct path, although included in the candidate lattice, cannot be found by the search algorithm, owing to imperfections in the path evaluation criterion or the search procedure. Three examples of recognition errors are shown in Fig. 11: (a) a misrecognized symbol, (b) a misrecognized Chinese character, and (c) a segmentation error. As shown in Table 6, the correct rates on symbols, letters and digits in real-time string recognition are still quite low; improving the accuracy on alphanumeric characters and symbols is a goal of future work.

7.6. Example of real-time recognition process

We have developed a prototype system on a Tablet PC to demonstrate the applicability of the proposed real-time recognition approach. User editing functions for correcting segmentation and recognition errors have also been developed, though they are not described in this paper. Fig. 12 shows some sampled steps of real-time recognition of a handwritten sentence, where the left sub-window is the writing area and the right sub-window shows the recognition result. Whenever the pen lift time exceeds a threshold, the sentence recognition result is displayed and the user has the option to edit it at that time.
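The triggering behaviour just described can be sketched as a simple per-stroke event loop. All names below, including the event source, recognizer.update_lattice, search_best_path, view.show_text, and the threshold value, are hypothetical placeholders rather than the prototype's actual API.

```python
import time

PEN_LIFT_THRESHOLD = 1.0   # seconds; an assumed value, tuned in the prototype

def recognition_loop(events, recognizer, view):
    """Per-stroke update loop (illustrative sketch).

    The heavy work (line segmentation, over-segmentation, context scoring)
    runs on every pen-up stroke event; only the fast path search runs once
    the pen has stayed lifted longer than the threshold."""
    last_pen_up = None
    for event in events:                          # pen and timer events
        if event.kind == "stroke":
            recognizer.update_lattice(event.stroke)
            last_pen_up = time.monotonic()
        elif event.kind == "tick" and last_pen_up is not None:
            if time.monotonic() - last_pen_up > PEN_LIFT_THRESHOLD:
                view.show_text(recognizer.search_best_path())
                last_pen_up = None                # wait for further strokes
```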
Fig. 11. Three examples of recognition errors. (a) A symbol is misrecognized; (b) A Chinese character is misrecognized; (c) A segmentation error. Upper, segments after over-segmentation; middle, segmentation–recognition result; bottom, ground-truth.
Fig. 12. Steps of real-time recognition on a short page.
The recognition output text extends as writing continues.

8. Concluding remarks

To meet the increasing demand for sentence-based input of handwritten characters, this paper proposes an approach to real-time recognition of Chinese handwritten sentences. The main feature of the approach is a dynamically maintained segmentation–recognition candidate lattice that integrates character classification, linguistic context and geometric context. For implementation, we proposed effective methods for dynamic text line segmentation, character over-segmentation, and path evaluation and search. The generation of candidate characters and the computation of multiple contexts, which consume the majority of the computation, are performed during the writing process. Thus, the sentence recognition result can be obtained immediately after writing is completed or after a long pen lift. Experiments on a large database of online Chinese handwriting demonstrated the robustness and effectiveness of the proposed approach.

The techniques presented in this paper can be applied to sentence-based input of Chinese/Japanese characters on pen-based devices such as Tablet PCs, electronic whiteboards, PDAs, and mobile phones. Based on the recognition performance in our experiments, the proposed approach and system are acceptable for practical applications, and the remaining recognition errors can be corrected by user editing. Nevertheless, improved automatic recognition performance is always beneficial and desirable. This can be achieved by optimizing the techniques in all steps, including line segmentation, character over-segmentation, character recognition, context modeling, and path evaluation and search. These issues are to be addressed in future work by the community.
Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 60825301 and 60933010. The authors would like to thank Fei Yin and Qiu-Feng Wang for helpful discussions.

References

[1] R. Plamondon, S.N. Srihari, On-line and off-line handwriting recognition: a comprehensive survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1) (2000) 63–84.
[2] C.-L. Liu, S. Jaeger, M. Nakagawa, Online handwritten Chinese character recognition: the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 198–213.
[3] M. Cheriet, N. Kharma, C.-L. Liu, C.Y. Suen, Character Recognition Systems: A Guide for Students and Practitioners, John Wiley & Sons, 2007.
[4] K. Hinckley, K. Yatani, M. Pahud, N. Coddington, J. Rodenhouse, A. Wilson, H. Benko, B. Buxton, Pen + Touch = New Tools, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST), New York, USA, 2010, pp. 27–36.
[5] U. Garain, B.B. Chaudhuri, Recognition of online handwritten mathematical expressions, IEEE Transactions on Systems, Man, and Cybernetics, Part B 34 (6) (2004) 2366–2376.
[6] C.-L. Liu, H. Sako, H. Fujisawa, Effects of classifier structures and training regimes on integrated segmentation and recognition of handwritten numeral strings, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (11) (2004) 1395–1407.
[7] H. Murase, Online recognition of free-format Japanese handwritings, in: Proceedings of the 9th ICPR, vol. 2, 1988, pp. 1143–1147.
[8] M. Nakagawa, B. Zhu, M. Onuma, A model of on-line handwritten Japanese text recognition free from line direction and writing format constraints, IEICE Transactions on Information and Systems E88-D (8) (2005) 1815–1822.
[9] X.-D. Zhou, J.-L. Yu, C.-L. Liu, T. Nagasaki, K. Marukawa, Online handwritten Japanese character string recognition incorporating geometric context, in: Proceedings of the 9th ICDAR, Curitiba, Brazil, 2007, pp. 48–52.
[10] X.-D. Zhou, C.-L. Liu, M. Nakagawa, Online handwritten Japanese character string recognition using conditional random fields, in: Proceedings of the 10th ICDAR, Barcelona, Spain, 2009, pp. 521–525.
[11] B. Zhu, X.-D. Zhou, C.-L. Liu, M. Nakagawa, A robust model for on-line handwritten Japanese text recognition, International Journal of Document Analysis and Recognition 13 (2) (2010) 121–131.
[12] C.-L. Liu, Classifier combination based on confidence transformation, Pattern Recognition 38 (1) (2005) 11–28.
[13] C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, CASIA online and offline Chinese handwriting databases, in: Proceedings of the 11th ICDAR, Beijing, China, 2011, pp. 37–41.
[14] D.-H. Wang, C.-L. Liu, An approach to real-time recognition of Chinese handwritten sentences, in: Proceedings of the 2nd China–Japan–Korea Joint Workshop on Pattern Recognition (CJKPR), Fukuoka, Japan, 2010, pp. 203–208.
[15] T.-H. Su, T.-W. Zhang, D.-J. Guan, H.-J. Huang, Off-line recognition of realistic Chinese handwriting using segmentation-free strategy, Pattern Recognition 42 (1) (2009) 167–182.
[16] Q.-F. Wang, F. Yin, C.-L. Liu, Integrating language model in handwritten Chinese text recognition, in: Proceedings of the 10th ICDAR, Barcelona, Spain, 2009, pp. 1036–1040.
[17] Y. Jiang, X. Ding, Q. Fu, Z. Ren, Context driven Chinese string segmentation and recognition, in: Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops, Lecture Notes in Computer Science, vol. 4109, 2006, pp. 127–135.
[18] X. Ding, H. Liu, Segmentation-driven offline handwritten Chinese and Arabic script recognition, in: Proceedings of the Summit on Arabic and Chinese Handwriting (SACH), 2006, pp. 61–73.
[19] S. Senda, K. Yamada, A maximum-likelihood approach to segmentation-based recognition of unconstrained handwriting text, in: Proceedings of the 6th ICDAR, September 2001, pp. 184–188.
[20] Q.-F. Wang, F. Yin, C.-L. Liu, Improving handwritten Chinese text recognition by confidence transformation, in: Proceedings of the 11th ICDAR, Beijing, China, 2011, pp. 518–522.
[21] Q.-F. Wang, F. Yin, C.-L. Liu, Handwritten Chinese text recognition by integrating multiple contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence, http://dx.doi.org/10.1109/TPAMI.2011.264, in press.
[22] N.-X. Li, L.-W. Jin, A Bayesian-based probabilistic model for unconstrained handwritten offline Chinese text line recognition, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), 2010, pp. 3664–3668.
[23] C.-L. Liu, F. Yin, Q.-F. Wang, D.-H. Wang, ICDAR 2011 Chinese handwriting recognition competition, in: Proceedings of the 11th ICDAR, Beijing, China, 2011, pp. 1464–1469.
[24] A.K. Jain, A.M. Namboodiri, J. Subrahmonia, Structure in on-line documents, in: Proceedings of the 6th ICDAR, Seattle, WA, 2001, pp. 844–848.
[25] E.H. Ratzlaff, Inter-line distance estimation and text line extraction for unconstrained online handwriting, in: Proceedings of the 7th IWFHR, Nijmegen, Netherlands, 2000, pp. 33–42.
[26] M. Liwicki, E. Indermuhle, H. Bunke, On-line handwritten text line detection using dynamic programming, in: Proceedings of the 9th ICDAR, Curitiba, Brazil, 2007, pp. 447–451.
[27] M. Ye, P. Viola, S. Raghupathy, H. Sutanto, C. Li, Learning to group text lines and regions in freeform handwritten notes, in: Proceedings of the 9th ICDAR, Curitiba, Brazil, 2007, pp. 28–32.
[28] X.-D. Zhou, D.-H. Wang, C.-L. Liu, A robust approach to text line grouping in online handwritten Japanese documents, Pattern Recognition 42 (9) (2009) 2077–2088.
[29] F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified quadratic discriminant functions and the application to Chinese character recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1) (1987) 149–153.
[30] C.-L. Liu, M. Nakagawa, Evaluation of prototype learning algorithms for nearest neighbor classifier in application to handwritten character recognition, Pattern Recognition 34 (3) (2001) 601–615.
[31] M.-Y. Chen, A. Kundu, S.N. Srihari, Variable duration hidden Markov model and morphological segmentation for handwritten word recognition, IEEE Transactions on Image Processing 4 (12) (1995) 1675–1688.
[32] J.A. Barnett, Computational methods for a mathematical theory of evidence, in: Proceedings of the 7th IJCAI, 1981, pp. 868–875.
[33] C.-L. Liu, H. Fujisawa, Classification and learning in character recognition: advances and remaining problems, in: S. Marinai, H. Fujisawa (Eds.), Machine Learning in Document Analysis and Recognition, Springer, 2008, pp. 139–161.
[34] X.-B. Jin, C.-L. Liu, X. Hou, Regularized margin-based conditional log-likelihood loss for prototype learning, Pattern Recognition 43 (7) (2010) 2428–2438.
[35] C.-L. Liu, One-vs-all training of prototype classifier for pattern classification and retrieval, in: Proceedings of the 20th ICPR, 2010, pp. 3328–3331.
[36] F. Yin, Q.-F. Wang, C.-L. Liu, Integrating geometric context for text alignment of handwritten Chinese documents, in: Proceedings of the 12th ICFHR, November 2010, pp. 7–12.
[37] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
[38] B.-H. Juang, W. Chou, C.-H. Lee, Minimum classification error rate methods for speech recognition, IEEE Transactions on Speech and Audio Processing 5 (3) (1997) 257–265.
[39] W. Chou, Discriminant-function-based minimum recognition error pattern-recognition approach to speech recognition, Proceedings of the IEEE 88 (8) (2000) 1201–1223.
[40] W.-T. Chen, P. Gader, Word level discriminative training for handwritten word recognition, in: Proceedings of the 7th IWFHR, Amsterdam, The Netherlands, 2000, pp. 393–402.
[41] C.-L. Liu, K. Marukawa, Handwritten numeral string recognition: character-level training vs. string-level training, in: Proceedings of the 17th ICPR, Cambridge, UK, 2004, pp. 405–408.
[42] A. Biem, Minimum classification error training for online handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (7) (2006) 1041–1051.
[43] H. Robbins, S. Monro, A stochastic approximation method, Annals of Mathematical Statistics 22 (1951) 400–407.
[44] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (1989) 257–286.
[45] C.-L. Liu, X.-D. Zhou, Online Japanese character recognition using trajectory-based normalization and direction feature extraction, in: Proceedings of the 10th IWFHR, 2006, pp. 217–222.
[46] I. Phillips, A. Chhabra, Empirical performance evaluation of graphics recognition systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (9) (1999) 849–870.
[47] A. Antonacopoulos, B. Gatos, D. Bridson, ICDAR 2007 page segmentation competition, in: Proceedings of the 9th ICDAR, Curitiba, Brazil, 2007, pp. 1279–1283.
[48] D.-H. Wang, C.-L. Liu, Dynamic text line segmentation for real-time recognition of Chinese handwritten sentences, in: Proceedings of the 11th ICDAR, Beijing, China, 2011, pp. 931–935.
Da-Han Wang received the B.S. degree in Automation Science and Electrical Engineering from Beihang University, Beijing, China, in 2006. He is currently pursuing a Ph.D. degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, handwriting recognition and retrieval, and probabilistic graphical models.
Cheng-Lin Liu is a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing, China, and is now the deputy director of the laboratory. He received the B.S. degree in electronic engineering from Wuhan University, Wuhan, China, the M.E. degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has contributed many effective methods to different aspects of handwritten document analysis. Some of his algorithms have been transferred to industrial applications including mail sorting, form processing and video text indexing. He has published over 130 technical papers at prestigious international journals and conferences. He is on the editorial board of journals Pattern Recognition, Image and Vision Computing, and International Journal on Document Analysis and Recognition. He is a senior member of the IEEE.
Xiang-Dong Zhou received the B.S. degree in Applied Mathematics and the M.S. degree in Management Science and Engineering both from National University of Defense Technology, Changsha, China, the Ph.D. degree in pattern recognition and artificial intelligence from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1998, 2003 and 2009, respectively. He was a postdoctoral fellow at Tokyo University of Agriculture and Technology from April 2009 to March 2011. From 2011, he has been a research assistant at the Intelligence Engineering Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China. His research interests include character recognition and document analysis.