An image-based automatic Arabic translation system

Yi Chang, Datong Chen, Ying Zhang, Jie Yang
School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

Article history: Received 1 June 2008; Received in revised form 11 September 2008; Accepted 18 October 2008

Keywords: Text detection; Image classification; OCR; Error correction

Abstract

In this paper, we present a system that automatically translates Arabic text embedded in images into English. The system consists of three components: text detection from images, character recognition, and machine translation. We formulate text detection as a binary classification problem and apply gradient boosting trees (GBT), support vector machines (SVM), and location-based prior knowledge to improve the F1 score of text detection from 78.95% to 87.05%. The detected text images are processed by off-the-shelf optical character recognition (OCR) software. We employ an error correction model to post-process the noisy OCR output, and apply a bigram language model to reduce word segmentation errors. The translation module is tailored with a compact data structure for hand-held devices. The experimental results show substantial improvements in both word recognition accuracy and translation quality. For instance, in the experiment on the Arabic transparent font, the BLEU score increases from 18.70 to 33.47 with the use of the error correction module.

1. Introduction

In an information society, we communicate with people and information systems through diverse media in increasingly varied environments. Advanced technologies have bridged many communication gaps. While Internet technology provides a shortcut to overcome the distance barrier, machine translation (MT) technology helps us to overcome the language barrier when communicating with people who use different languages. Much information is in written form embedded in various environments, such as on a piece of paper, on a wall, or on a bulletin board. As digital cameras become popular, an image-based MT system can capture information in a variety of environments and translate it from the source language to the target language for different applications. For example, we have developed a sign translation system that translates Chinese into English for tourist applications [1,2]. In this research, we develop an image-based system that translates Arabic text in images into English, which is a different and challenging problem due to the complex writing forms and character connectivity of the language. The system works as follows. After an image is captured from a digital camera, the system pre-processes the image to account for fonts, skew, rotation, illumination, shadows, glare, reflection, and other sources of variability. Subsequently, it automatically detects text regions in the image, performs recognition using off-the-shelf optical character


recognition (OCR) software on the text regions, and then translates the text strings into English using a state-of-the-art statistical MT system, PanDoRA, which is tailored with a compact data structure for hand-held devices [3]. In the proposed system, text region detection is addressed by a cascade approach. We first apply a set of heuristic rules on a multi-resolution pyramid of an image to extract candidate text regions. The result of this first step has a high recall but a low precision. To reduce detection false alarms, we then model the verification of the candidate text regions as a binary classification problem and train classifiers using the true positive and false positive examples extracted in the first step. Two types of learning algorithms, support vector machines (SVM) and gradient boosting trees (GBT), have been employed to address the classification problem. We use commercial OCR software to recognize Arabic text from the verified image regions. Most commercial OCR systems have achieved recognition accuracy of 99% on high quality images of printed text. However, when the input images come from domains different from the OCR training data, the accuracy of OCR is much lower than in the ideal testing situation, and the translation quality is also affected. In this paper, we use the BLEU score to measure translation quality [4]. For example, in one of our experiments, a 10.7% word recognition error rate severely reduces the translation BLEU score from 43.12 to 28.56. In this paper, we propose to enhance image-based Arabic translation using an error correction model to correct misrecognized words. We deploy a noisy channel model trained from synthetic data with different fonts and sizes to simulate real world situations. We further enhance the correction model with a bigram language model to improve the


word segmentation error correction. We perform experiments to demonstrate the proposed methods. The experimental results show a significant improvement in translation quality. For instance, in the experiment on the Arabic transparent font, the BLEU score increases from 18.70 to 33.47 with the use of the noisy channel model. The paper presents a novel application that reads and translates Arabic for English speakers using an imaging device. Our main algorithmic contributions include the following points: (1) we successfully combine the location of a candidate text region with its visual appearance and apply the GBT algorithm to enhance text region detection; the experiments show the advantage of this new feature representation in improving text region detection; and (2) we propose an error correction model to post-process the noisy OCR output; this algorithm employs a bigram language model for word segmentation. The substantial improvement in the translation result is presented and discussed in the paper. The rest of this paper is organized as follows: in Section 2, we introduce related work, and in Section 3, we describe the architecture of the image-based translation system. In Section 4, we describe our work on text region detection and classification. In Section 5, we explain our efforts to enhance translation with a noisy channel model. Finally, we conclude the work and discuss future directions in the last section.

2. Related work

Yang et al. [1] proposed a sign detection and translation system as one of the earliest works on automatically detecting, recognizing, and translating signs. Chen et al. [5] attacked the sign detection problem in a hierarchical framework of multi-resolution edge detection, color analysis, and affine rectification. In a recent work, a conditional random field model was presented by Weinman et al. [6] to capture dependencies between neighboring image regions in sign detection. Hong [7] corrected noisy OCR results through passage-level post-processing using visual and linguistic constraints. Kolak and Resnik [8] modeled a noisy channel in OCR error correction with syntactic information. Sakhr and OmniPage, two of the most commonly used Arabic OCR products, were thoroughly compared by Kanungo et al. [9]; the absolute page accuracy rates of Sakhr and OmniPage are only 90.33% and 86.89%, respectively. Taghva et al. [10] built an expert system for automatically correcting OCR errors to post-process the OCR output text in preparation for a subsequent retrieval system. The system was claimed to correct 87% of word errors in the context of text retrieval. Doermann and Yao [11] also presented a system for modeling OCR output errors. They used symbol and page models to simulate the degradation of images during scanning, decomposition, and recognition. Sato et al. [12] implemented video OCR techniques to handle the low resolution characters and extremely complex backgrounds found in digital video data. They post-processed OCR results by mapping them into a dictionary using a self-defined word similarity. OCR errors can also impair the accuracy of other applications that rely on text processing techniques. Croft et al. [13] showed that low quality OCR output can result in significant degradation of retrieval accuracy by examining information retrieval performance on OCR output. Instead of correcting OCR errors, Harding et al.
[14] used n-gram formulations with a probabilistic retrieval system, showing that retrieval performance could be improved over standard queries on the same data when degradation reached a level of 10% or worse. Similarly, Mittendorf et al. [15] showed that recognition errors can be ignored in retrieval if the number of documents and their lengths are sufficiently large.

3. System architecture and interface

As shown in Fig. 1, the image-based translation system consists of three modules. The image capture module, which handles image input, is hardware dependent. The input image is then fed into the detection, recognition, and translation module for processing. This module is a key part of the system. It first detects and extracts text regions from an image, then recognizes the text characters in the source language using image-based OCR techniques. The recognition results are then translated into a specified target language. The interaction module provides a user-friendly interface between the user and the system. Fig. 2 illustrates the prototype of the image-based Arabic translation system. The system can process input images from a file or a digital camera. The upper left window in the interface shows the input image and the bottom left window shows the detection results. The recognized text appears in the upper right window and the translation results are shown in the bottom right window. The processing pipeline of an image-based MT system consists of text detection, character recognition, and MT; a minimal sketch of this pipeline is given below. Errors in any component of an image-based translation system can affect the end-to-end performance of the entire system. In particular, errors in detection and recognition can be propagated through the system and even amplified during propagation, e.g., a misrecognized word or phrase can result in multiple translation errors. To address the problem of text detection in complex backgrounds, we have developed a hierarchical detection framework consisting of multi-resolution and multi-scale edge detection, adaptive searching, color analysis, and affine rectification [5].
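The following is a minimal, hypothetical sketch of this pipeline. The function names (detect_text_regions, recognize_arabic, correct_ocr, translate_to_english) are illustrative placeholders for the detection, OCR, correction, and PanDoRA translation components, not the actual implementation.

# Hypothetical stubs standing in for the real components; only the control
# flow (detect -> recognize -> correct -> translate) reflects the architecture.

def detect_text_regions(image):
    """Return a list of cropped sub-images that likely contain text (Section 4)."""
    raise NotImplementedError

def recognize_arabic(region_image):
    """Return the raw, possibly noisy, Arabic string for one region (off-the-shelf OCR)."""
    raise NotImplementedError

def correct_ocr(arabic_text):
    """Return a denoised Arabic string (noisy channel model, Section 5)."""
    raise NotImplementedError

def translate_to_english(arabic_text):
    """Return the English translation (statistical MT)."""
    raise NotImplementedError

def translate_image(image):
    """End-to-end flow of the image-based translation system."""
    results = []
    for region in detect_text_regions(image):
        noisy = recognize_arabic(region)
        clean = correct_ocr(noisy)
        results.append(translate_to_english(clean))
    return results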

Fig. 1. The architecture of the image-based translation system: a camera feeds the image capture module, the detection, recognition, and translation module processes the image, and the interactive module exchanges commands, results, and visual/audio output with the user.

Fig. 2. An illustration of the prototype system.


Multi-resolution and multi-scale edge detection techniques are combined to effectively detect text of different sizes. We use affine rectification to recover the deformation of detected text regions captured from non-frontal camera view angles. To refine the text detection results, we perform text detection again in the rectified regions. In the system implementation, the recognition module incorporates off-the-shelf OCR software, Sakhr Automatic Reader version 8.0 (Platinum Edition), which is one of the most commonly used Arabic OCR products. In the translation module, we use the CMU PanDoRA system [3], a phrase-based statistical MT system developed for hand-held devices. There are two decoding modes in PanDoRA: monotonic decoding and ITG-style reordering decoding. In the image-based automatic Arabic translation system, we use monotonic decoding, as the test sentences are usually very short.

4. Text region detection and classification

4.1. Multi-resolution text region detection

Due to the lighting and contrast variations in different images, a robust detection algorithm is required to adapt to different circumstances. Character resolution, which depends on image size, text font, camera view angle, and other factors, can cause severe detection errors. For example, large characters are mistakenly detected as separate segments instead of a whole character, while small characters are apt to be missed by a detection algorithm. To solve this problem, we perform detection with a multi-resolution edge detector [16]. We use an edge detection algorithm to obtain the initial candidate text regions under varying lighting conditions, and apply a multi-resolution approach to compensate for contrast variations and noise. The multi-resolution edge detector computes edge filtering, and threshold detection is then conducted to control the sensitivity of the edge detector. After edge detection, edges are clustered to find the text cue regions in terms of color, position, and size. To improve the recall of region detection, we adopt a pyramid method, which resizes the original image to different scales, applies the text region detector at each scale, and then combines the text regions from the different scales into an integrated result. In the image pre-processing steps, we deploy an affine rectification transformation to recover deformations of the text region caused by inappropriate camera view angles. Given the result of the multi-resolution edge detector, we propose an RGB-based Bayesian method to model the similarity between a detected region and its surrounding regions, and expand an original region with its neighboring regions within a predefined threshold. As post-processing for text region detection, we apply a few heuristics to filter out regions that are too large or too small and merge overlapping regions. Efficiency is a key factor for the image-based translation task in a real world application. In order to have a fast text detector, we first employ the edge-based detector proposed in Ref. [5] to coarsely detect candidate text regions. The edge-based detector assumes that (1) the text is highly contrasted with its background; (2) each word is composed of several connected regions; and (3) the characters in the same region have a similar foreground and background pattern. Edges are first extracted at multiple scales and then grouped into separate sets by recursively optimizing the intensity mean and variance together with heuristic rules defined on the three assumptions above. We have enhanced the original edge-based detector in Ref. [5] by adding more criteria for edge candidate filtering and using a pyramid structure to handle various sources of variation in our implementation. We apply only two scales in the pyramid method to keep the detection fast. The detector is tuned to keep the recall as high as possible; a simplified sketch of this edge-based candidate detection stage is given below.
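The sketch below illustrates such an edge-based, two-scale candidate detector using OpenCV (version 4 assumed); the thresholds, morphology parameters, and size filters are illustrative assumptions rather than the values used in the actual system.

import cv2
import numpy as np

def candidate_text_regions(image_bgr, scales=(1.0, 0.5), min_area=200, max_area_frac=0.3):
    """Coarse, recall-oriented candidate text detection on a two-scale pyramid.

    Edges are extracted at each scale, dilated so that character edges merge
    into word-like blobs, and the blobs' bounding boxes are mapped back to the
    original resolution. All thresholds here are illustrative only.
    """
    h, w = image_bgr.shape[:2]
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    boxes = []
    for s in scales:
        resized = cv2.resize(gray, (int(w * s), int(h * s)), interpolation=cv2.INTER_AREA)
        edges = cv2.Canny(resized, 80, 200)
        # Horizontal dilation groups neighboring character edges into word blobs.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
        blobs = cv2.dilate(edges, kernel, iterations=2)
        contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            x, y, bw, bh = cv2.boundingRect(c)
            area = (bw / s) * (bh / s)
            # Heuristic size filter: drop tiny specks and near-full-image regions.
            if min_area < area < max_area_frac * w * h:
                boxes.append((int(x / s), int(y / s), int(bw / s), int(bh / s)))
    return boxes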


Later, we propose a fine-grained text region classification module based on SVM and GBT algorithms, described in Sections 4.3.1 and 4.3.2, to increase the precision.

4.2. Text region classification

The existing text detection module can achieve a high recall rate, but its precision is rather low. How to improve the precision of the text region detection module is a key point in the system. For example, Fig. 3(a) shows an image with a false alarm region detected on it, while Fig. 3(b) illustrates the desired detection result on the same image. We formulate the false alarm removal problem as a binary region classification problem. In this paper, we study several state-of-the-art machine learning methods, including SVM with different kernels and GBT, for text region classification on images. We also incorporate prior knowledge and location information with the binary classifiers to improve the classification F1 score.

4.3. Learning methods

4.3.1. Support vector machine

SVM seeks a hyperplane that separates a set of positive and negative training examples. The hyperplane is defined as w^T x + b = 0, where w ∈ R^d is a vector orthogonal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. The decision function is the hyperplane classifier F(x) = sign(w^T x + b), and the hyperplane is subject to

y_i (w^T x_i + b) ≥ 1 − ξ_i,  for all i = 1, ..., N,

where x_i ∈ R^d is a training example with d-dimensional features, and y_i ∈ {+1, −1} denotes the label of the feature vector x_i. The margin is defined by the distance between the two parallel hyperplanes w^T x + b = 1 and w^T x + b = −1. Therefore, SVM training can be defined as the following optimization problem:

min (1/2) w^T w + λ 1^T ξ,
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i, for all i = 1, ..., N,

where λ is the regularization parameter that controls the computational complexity. A linear SVM classifies linearly separable examples using the hyperplane. If the problem is not linearly separable, we can extend the basic SVM with nonlinear kernels and predict on a high dimensional hyperplane. In our experiments, we applied the SVM-light package [17] and tried the linear, polynomial, radial basis, and sigmoid kernel functions.

4.3.2. Gradient boosting tree

The basic idea of boosting is to iteratively improve a loss function with a weak learner. GBT chooses the decision tree as its weak learner [18], and it iteratively fits an additive model to minimize the loss function L(y_i, f_T(x_i)):

f_T(x) = TR_0(x; θ_0) + Σ_{t=1}^{T} ν β_t TR_t(x; θ_t),

where TR_t(x; θ_t) is the decision tree generated at iteration t, weighted by the parameter β_t. In this formula, θ_t denotes the parameters of the decision tree, and ν is the learning rate. At iteration t, the tree TR_t(x; θ_t) is generated to fit the negative gradient of the least squares error:

θ̂_t = arg min_θ Σ_{i=1}^{N} (−G_{it} − β_t TR_t(x_i; θ))²,

where G_{it} is the gradient of the loss with respect to the current prediction function,

G_{it} = [∂L(y_i, f(x_i)) / ∂f(x_i)]_{f = f_{t−1}}.

The optimal weight β_t of the tree is computed as

β_t = arg min_β Σ_{i=1}^{N} L(y_i, f_{t−1}(x_i) + β TR_t(x_i; θ_t)).

If we choose the square error as the loss function, the gradient is simply G(x_i) = −y_i + f(x_i).
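As an illustration of these update equations, the following is a minimal from-scratch gradient boosting loop for the squared loss, using shallow regression trees as weak learners; it is a didactic sketch with assumed hyperparameters, not the implementation used in our experiments.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Fit an additive model f(x) = f0 + nu * sum_t TR_t(x).

    For the squared loss, the negative gradient is the residual y - f, so each
    tree is fitted to the current residuals (the pseudo-residuals -G_it).
    """
    f0 = y.mean()
    f = np.full(len(y), f0)            # TR_0: constant initial prediction
    trees = []
    for _ in range(n_trees):
        residual = y - f               # negative gradient for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + learning_rate * tree.predict(X)   # shrink each step by nu
        trees.append(tree)
    return f0, trees

def gbt_predict(model, X, learning_rate=0.1):
    f0, trees = model
    f = np.full(X.shape[0], f0)
    for tree in trees:
        f = f + learning_rate * tree.predict(X)
    return f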

Fig. 3. An example of text detection with and without false alarms.

Compared with nonlinear SVM, GBT is more robust for nonlinear classification tasks, since the "kernel trick" in SVM relies on the data distribution: if the data do not fit the predefined SVM kernels well in the high dimensional space, GBT performs much better than nonlinear SVM.

4.4. Data and feature

To attack the learning-based text region classification problem, we made some Arabic signs, placed them in natural scenes, and captured 313 images in different surroundings. We then applied the basic heuristic-based text region detector to the simulated images and obtained more than 650 text regions. After human labeling, we use these data as the training set. A total of 62 images were used as the testing data. They were provided by the Defense Advanced Research Projects Agency (DARPA) for the Advanced Soldier Sensor Information System and Technology (ASSIST) program, and each image contains at most one text region. The time-consuming generation cycle prevents us from building a training set as large as most other learning tasks require. As a result, we need to avoid a high dimensional feature space with very limited training examples. The intuition is that the color distribution of a text region within a sign should differ significantly from that of a non-text region. In our preliminary experiments, each candidate text region is represented by a five-dimension feature vector composed of the highest values of red, green, blue, normalized red, and normalized green. Tables 1 and 2 illustrate the binary classification F1 scores on the training set and the testing set. The baseline is the result of the existing text region detector introduced in Section 4.1. On the training set, GBT shows a much higher F1 score than linear and nonlinear SVM, and SVM also outperforms the baseline. On the testing set, all of the classifiers perform worse than the baseline due to overfitting. In all of our experiments, we also tried the radial basis and sigmoid kernel functions in SVM; their F1 scores are worse than those of the SVM with the polynomial kernel.
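A sketch of extracting this five-dimension color feature and training the two classifiers with scikit-learn is shown below. The reading of "normalized red/green" as the peak of per-pixel R/(R+G+B) and G/(R+G+B), and the estimator settings, are assumptions for illustration; the experiments used SVM-light and our own GBT implementation.

import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

def color_features(region_rgb):
    """Five-dimension feature: peak R, G, B plus peak normalized r and g.

    region_rgb is an (H, W, 3) uint8 array of a candidate text region.
    """
    pixels = region_rgb.reshape(-1, 3).astype(np.float64)
    peak_r, peak_g, peak_b = pixels.max(axis=0)
    s = pixels.sum(axis=1) + 1e-6              # avoid division by zero
    norm_r = (pixels[:, 0] / s).max()
    norm_g = (pixels[:, 1] / s).max()
    return np.array([peak_r, peak_g, peak_b, norm_r, norm_g])

def train_classifiers(regions, labels):
    """Train a polynomial-kernel SVM and a GBT on labeled candidate regions."""
    X = np.vstack([color_features(r) for r in regions])
    y = np.asarray(labels)                     # 1 = text region, 0 = false alarm
    svm = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)
    gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)
    return svm, gbt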

Table 1
Evaluation results on the training set with five-dimension features

Method            Precision (%)   Recall (%)   F1 (%)
Baseline          58.31           100          73.67
SVM linear        81.65           80.41        81.03
SVM polynomial    82.14           81.93        82.03
GBT               87.35           94.91        90.98

Table 2
Evaluation results on the testing set with five-dimension features

Method            Precision (%)   Recall (%)   F1 (%)
Baseline          65.22           100          78.95
SVM linear        75.82           76.67        76.24
SVM polynomial    77.33           64.44        70.30
GBT               76.67           80.23        78.41

4.5. Incorporating prior knowledge with classification

The location of the candidate text region provides additional information to the classifier. When a user uses an imaging device to capture the text in a picture, his/her focus is on the text rather than on its surrounding area. Therefore, we can make the reasonable assumption that text regions are likely to lie in the central part of an image. To incorporate the location information into the binary classification problem, we expand the feature vector from five dimensions to ten dimensions. The five augmented dimensions are the differences between the peak RGB values of the region itself and the peak RGB values of the most central region. If the region itself is at the center of the image, the appended five dimensions are set to 0. We also directly incorporate prior knowledge into the prediction with a linear combination:

F'(x) = α Pr(x) + (1 − α) F(x),

where F(x) is the prediction of a classifier, 0 ≤ α ≤ 1, and Pr(x) is the prior knowledge based on location information. We group all candidate regions into three types: the only region in the middle, the region nearest to the middle when no region is in the middle, and non-middle regions, and we tune different α values for the different types of regions. In our experiments, we compute the prior knowledge Pr(x) from the labels of the training data: for each type of candidate region, Pr(x) is computed as the fraction of real text regions among all candidate regions of that type. Tables 3 and 4 present the F1 scores on the training and testing sets after incorporating prior knowledge. Comparing with Tables 1 and 2, we find an F1 score gain for every classifier after incorporating the prior knowledge under the reasonable assumption that the text region is likely to be in the central part of the picture.
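A small sketch of this combination is given below; the three location types and the blending formula follow the description above, while the specific prior and α values are made-up numbers for illustration.

def combine_with_prior(classifier_score, region_type, priors, alphas):
    """Blend a classifier score with a location-based prior:
        F'(x) = alpha * Pr(x) + (1 - alpha) * F(x).

    region_type is one of "middle", "nearest_middle", or "other";
    priors[t] is the fraction of true text regions of that type in the
    training data, and alphas[t] is the weight tuned for that type.
    """
    alpha = alphas[region_type]
    prior = priors[region_type]
    return alpha * prior + (1.0 - alpha) * classifier_score

# Example with illustrative (made-up) numbers:
priors = {"middle": 0.85, "nearest_middle": 0.55, "other": 0.15}
alphas = {"middle": 0.4, "nearest_middle": 0.3, "other": 0.2}
score = combine_with_prior(0.6, "middle", priors, alphas)   # -> 0.70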


Table 3
Evaluation results on the training set with prior knowledge

Method            Precision (%)   Recall (%)   F1 (%)
Baseline          58.31           100          73.67
SVM linear        89.11           81.17        84.95
SVM polynomial    83.59           98.47        90.42
GBT               93.61           96.95        95.25

Table 4
Evaluation results on the testing set with prior knowledge

Method            Precision (%)   Recall (%)   F1 (%)
Baseline          65.22           100          78.95
SVM linear        67.97           96.67        79.82
SVM polynomial    67.67           100          80.72
GBT               77.78           98.82        87.05

Fig. 4. An example of Arabic transparent font.

Fig. 5. An example of simplified Arabic font.

Overfitting is relieved, since the F1 scores on the testing set are higher than the baseline. GBT largely outperforms the SVM, improving the F1 score from 78.95% to 87.05% on the testing set.

5. Enhance translation with a noisy channel model

5.1. Recognition problems in document translation from images

OCR is one of the most successful applications in the pattern recognition field. It is a common belief that OCR is a solved problem because many papers and patents have claimed that the recognition accuracy can be 99.99%. Although many commercial OCR systems work well on high quality scanned documents under controlled conditions, they fail in many tasks, such as video OCR, license plate OCR, and sign OCR. Current video OCR is limited to recognizing captions in video images for video indexing, or to identifying license plates on vehicles for various applications. Even at 99% accuracy, OCR will generate about 30 errors on a typical printed page of 3000 characters. A study conducted by Rice et al. [19] catalogued errors produced by OCR systems and their causes; OCR errors were organized into four major classes: imaging defects, similar symbols, punctuation, and typography. In the image-based Arabic translation system, we have found that several problems related to recognition accuracy severely impair the end-to-end system performance.

5.1.1. Font face variations

Arabic has very complex printing forms. Arabic is written from right to left, and the characters in a word may be connected to each other. Each character has four different written forms: isolated form, beginning form, middle form, and end form. Another variation comes from diverse font faces. We only address the three most commonly used Arabic fonts in this paper: Arabic transparent, simplified Arabic, and traditional Arabic. Figs. 4–6 illustrate a sentence written in these fonts. The characters in the traditional Arabic font are largely different from the characters in the other fonts. In addition, the word segmentation is quite vague in these fonts. Without knowledge of Arabic, it is quite difficult to precisely segment the word boundaries from a single image. How to handle word segmentation is also a tough problem for Arabic OCR.

5.1.2. The optimal input resolution of OCR

Commercial OCR systems are developed for fine-print documents and are usually trained on images with high resolutions. Road or street signs, however, may contain characters at much lower resolutions.

Fig. 6. An example of traditional Arabic font.

Table 5
The character recognition error rate

Font                 36 pixels (%)   60 pixels (%)   84 pixels (%)   108 pixels (%)
Arabic transparent   11.6            4.0             3.1             4.8
Simplified Arabic    17.2            4.5             5.6             5.1
Traditional Arabic   28.2            3.1             0.3             0.3

Table 6
The word recognition error rate

Font                 36 pixels (%)   60 pixels (%)   84 pixels (%)   108 pixels (%)
Arabic transparent   46.7            11.3            12.9            22.6
Simplified Arabic    53.2            17.7            19.4            21.0
Traditional Arabic   64.5            9.7             1.6             1.6

For example, Sakhr Automatic Reader version 3.0 reports 99.9% character recognition accuracy on images with 300+ dpi resolution, but according to Kanungo et al. [9], it only obtained an absolute page accuracy rate of 90.33% on lower but still reasonably high quality images. In this application, due to the size of the images captured by cameras, small text regions in an image hardly satisfy the resolution requirement of 300 dpi. Image pre-processing can lead to further information loss: we deploy a set of algorithms, such as bilinear interpolation, image binarization, and affine transformation, to handle skew, rotation, illumination, shadows, glare, and reflection in images. To locate the best range of character resolutions for the Sakhr OCR, we manually generated a set of high resolution images (600 dpi) in different fonts and sizes without adding any additional noise. The evaluation results of the OCR outputs are summarized in Tables 5 and 6. Table 5 illustrates the character error rates of the testing images in various fonts and sizes. The character alignment is performed automatically by a dynamic programming algorithm with equal penalties for deletion and insertion. We can observe that the accuracy of OCR is very sensitive to both font and size changes, and the recognition accuracy can only reach 99% with some specific parameters. For example, OCR performs best on the traditional Arabic font for images with 108-pixel characters, and much worse at 36 pixels.
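The profile-directed rescaling described here can be sketched as follows; the per-font target heights are illustrative values loosely chosen from the better-performing columns of Tables 5 and 6, and may differ from the profile used in the actual system.

import cv2

# Illustrative per-font target character heights (pixels), not the actual profile.
TARGET_HEIGHT = {
    "arabic_transparent": 60,
    "simplified_arabic": 60,
    "traditional_arabic": 108,
}

def rescale_for_ocr(region_gray, font_name, current_char_height):
    """Bilinearly interpolate a text region so its characters reach the height
    at which the OCR engine performed best for the given font."""
    target = TARGET_HEIGHT[font_name]
    scale = target / float(current_char_height)
    h, w = region_gray.shape[:2]
    new_size = (max(1, int(round(w * scale))), max(1, int(round(h * scale))))
    return cv2.resize(region_gray, new_size, interpolation=cv2.INTER_LINEAR)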


Fig. 7. An illustration of the two-step noisy channel model: English text (E) → translation channel → Arabic text (A) → image generation channel → image.

However, the behavior differs for the other two fonts, and it is difficult to find a single set of parameters that gives the most satisfactory recognition accuracy for all fonts. To achieve the best OCR performance, we use a profile to interpolate the extracted text images directly to the optimal resolution for each font. MT is based on the exact match of each word or phrase; therefore, the word error rate of OCR is more important in this application. Table 6 shows the corresponding word error rates with the same parameters. We find that the word recognition error rate is about four times larger than the character recognition error rate.

5.1.3. The OCR error propagated in statistical MT

In this paper, we use the BLEU score to measure translation quality. BLEU averages the precisions of unigrams, bigrams, and up to 4-grams, and applies a length penalty if the generated sentence is shorter than the best matching (in length) reference translation [4]. Due to the definition of the BLEU score, a recognition error will not only hurt the unigram precision but also the bigram, trigram, and 4-gram precisions, which have larger contributions to the final BLEU score. The accumulated information loss in image pre-processing can degrade the quality of the image passed to the recognition module below the requirements of the OCR software. As a result, OCR errors are propagated under the current MT metrics. Our experiment shows that a 10.7% word recognition error rate severely drops the BLEU score from 43.12 to 28.56. For some applications, such as scanned document retrieval, OCR errors are not so critical [13]: applying probabilistic IR, we can retrieve the most relevant OCR-generated documents using approximate matching techniques even without correcting OCR errors. For translation tasks, however, OCR errors cause serious degradation in the end-to-end translation quality because they are propagated by the MT engine. We cannot ignore OCR errors in the image-based translation scenario, and an effective method for OCR error correction is essential to achieve high quality MT for an image-based MT task.
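For reference, a simplified sentence-level BLEU computation (clipped n-gram precisions up to 4-grams with a brevity penalty, single reference) is sketched below. It follows the standard definition [4] but is not the evaluation script used in our experiments; the example sentences are illustrative.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions (n = 1..4), multiplied by a
    brevity penalty when the candidate is shorter than the reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_avg)

# A single misrecognized word lowers all four n-gram precisions at once:
ref = "i would like to see it".split()
print(bleu("i would like to see it".split(), ref))  # 1.0
print(bleu("i would like to saw it".split(), ref))  # noticeably lower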

Fig. 8. The decoding process of the noisy channel model in Fig. 7: image → Decoder 1 (OCR) → Arabic text (Â) → Decoder 2 (SMT) → English text (Ê).

Fig. 9. An illustration of the three-step noisy channel model: English text (E) → translation channel → Arabic text (A) → text transform channel → transformed Arabic text (A') → image generation channel → image.

Fig. 10. The decoding process of the noisy channel model in Fig. 9: image → Decoder 1 (OCR) → transformed Arabic text (Â') → Decoder 2 (text correction) → Arabic text (Â) → Decoder 3 (SMT) → English text (Ê).

5.2. OCR correction with a noisy channel model

5.2.1. Noisy channel models in the image-based document translation system

Noisy channel models are widely used in AI problems such as speech recognition and MT for correcting errors in noisy input data. A noisy channel model assumes that the source input sequence I is encoded by a noisy channel; the task is to estimate the source message with a decoder, Î = arg max_I P(I|O), based on the output sequence O. In our application, there are two noisy channels: the translation channel and the image generation channel (Fig. 7). The translation channel encodes the source language (English) into the target language (Arabic), and the image generation channel encodes the Arabic text into images. The decoding process is shown in Fig. 8. Due to the limitations of the OCR system, the output of Decoder 1 is not perfect, and OCR errors are propagated into Decoder 2. Instead of modeling OCR errors inside the SMT system, a practical solution is to take a divide and conquer strategy: add an error correction module that minimizes OCR errors by post-processing, while keeping the existing SMT system intact. Fig. 10 represents the resulting three-step decoding process with the text correction model. To model this three-step decoding process, we add a virtual noisy channel, called the text transform channel, as shown in Fig. 9.

5.2.2. Text correction

We use the Bayes rule to decode a text sequence from an OCR output A':

Â = arg max_A P(A|A') = arg max_A P(A) P(A'|A) / P(A') = arg max_A P(A) P(A'|A),

where P(A) is the Arabic language model trained from a large Arabic dataset, and P(A'|A) is the transformation model that converts a noisy text sequence into its denoised version. We assume that words are independent of each other, so the transformation model can be decomposed as

P(A'|A) = ∏_k P(a'_k | A),

where a'_k is the k-th word in A'. We further assume that the number of words in A and A' is the same and that a'_k depends only on the k-th word in A, a_k:

P(A'|A) ≈ ∏_k P(a'_k | a_k).

To learn P(a'_k|a_k), we synthesize different images from a given word a_k and use OCR to obtain the noisy texts. P(a'_k|a_k) can then be computed using maximum likelihood estimation:

P(a'_k | a_k) = count(a'_k, a_k) / count(a_k).
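A compact sketch of estimating this channel model from synthesized (ground truth, OCR output) word pairs and of applying it word by word is given below. The smoothing constant for unseen pairs and the use of a unigram prior (in place of the full bigram language model and Viterbi decoder discussed later) are simplifying assumptions.

from collections import Counter, defaultdict

def train_channel(pairs):
    """Estimate P(a'|a) by maximum likelihood from (truth, ocr_output) word pairs
    obtained by rendering vocabulary words to images and running OCR on them."""
    joint, marginal = Counter(), Counter()
    for truth, observed in pairs:
        joint[(truth, observed)] += 1
        marginal[truth] += 1
    channel = defaultdict(dict)
    for (truth, observed), c in joint.items():
        channel[truth][observed] = c / marginal[truth]
    return channel

def correct_word(observed, channel, unigram_lm, unseen=1e-6):
    """Return argmax_a P(a) * P(observed | a), decoded word by word.

    unigram_lm maps a vocabulary word to P(a); a small constant stands in for
    unseen (a, a') pairs. The full system uses a bigram LM with Viterbi decoding.
    """
    best, best_score = observed, 0.0
    for a, p_a in unigram_lm.items():
        score = p_a * channel.get(a, {}).get(observed, unseen)
        if score > best_score:
            best, best_score = a, score
    return best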


Table 7
A comparison of word recognition error rate

                       Arabic transparent (%)   Simplified Arabic (%)   Traditional Arabic (%)
OCR result             17.81                    11.42                   22.21
Correction result      10.76                    7.40                    19.49
Enhanced correction    5.96                     5.96                    19.25
Perfect segmentation   1.71                     1.60                    1.29

Fig. 11. An example of recognition and correction.

Table 8
A comparison of BLEU scores

                       Arabic transparent       Simplified Arabic       Traditional Arabic
OCR result             18.70                    25.90                   18.73
Correction result      25.19                    31.13                   21.61
Enhanced correction    33.47                    34.01                   21.80
Perfect segmentation   42.10                    42.25                   42.65

A small probability is applied for unseen word pairs. We use the well-known Viterbi algorithm to decode the most probable correct word sequence given an OCR output.

5.3. Experiments and improvement

5.3.1. Data

We evaluate the OCR correction module on the Basic Travel Expression Corpus (BTEC) Arabic–English corpus (Eck and Hori [20]). The training data contains 20,000 sentence pairs and 131,711 Arabic words; the total vocabulary size is 26,116. To generate the training data, we synthesize the words of the training vocabulary into multiple images in different fonts. The development data contains 506 Arabic sentences with a vocabulary of 2662 words, and the testing data contains 500 Arabic sentences with a vocabulary of 2566 words. The height of the text in each image is 108 pixels, and the resolution of the image is 96 dpi.

5.3.2. Experimental results

The top two rows of Table 7 compare the word recognition error rates of the OCR outputs and the noisy channel correction results on the testing text in three fonts. After applying the error correction model, the word recognition error rate for the simplified Arabic testing set is reduced from 11.42% to 7.42%, and for the Arabic transparent testing data the error rate drops from 17.81% to 10.76%. In the following step, we translate all OCR and correction results into English using the CMU PanDoRA translation system. When translating the Arabic text directly, PanDoRA generates translations with a BLEU score of 43.12, which we can regard as the translation upper bound. Table 8 reports the corresponding MT BLEU scores. According to Tables 7 and 8, a 4.02% improvement in word recognition for the simplified Arabic font brings a 6.23-point BLEU score improvement, and the highest BLEU score after correction reaches 31.13. Interestingly, the BLEU score is not consistent with the word recognition error. For example, the word recognition error rate of the Arabic transparent OCR result is smaller than that of traditional Arabic, yet the BLEU score of the latter is slightly larger than that of the former. It is also noticeable that the improvement for the traditional Arabic font with the enhanced correction model is marginal; we discuss this in Section 5.3.4.

Fig. 12. Another recognition and correction example.

5.3.3. Error analysis

Although the correction model makes a reasonable improvement in the BLEU score, the result is still much lower than the translation upper bound. There are mainly two types of errors in the OCR outputs: character recognition errors and word segmentation errors. A character recognition error (type I) happens when a character in an image is misrecognized as one or more other characters, or even skipped. The noisy channel model aims to solve type I errors. Word segmentation errors (type II) can be further divided into two sub-categories: OCR misrecognizes one single word in an image as multiple words, called an over-segment error, or OCR misrecognizes multiple adjacent words in an image as one single word, called an over-merge error. Fig. 11 illustrates an example of the two types of OCR errors. The first line in Fig. 11 is the ground truth of the text, which means "I'd like to see it". The second line in Fig. 11 is the OCR output, where the left-most word of the first line has been incorrectly segmented into two words, and the right-most word has also been misrecognized. The translation result is "unkIunkI", where "unk" is the translation of an out-of-vocabulary word. The third line in Fig. 11 is the result of text correction. As we can see, the right-most word is successfully corrected by the correction model, and the translation result is "I'd like unkI". Type II errors are not addressed by the noisy channel model. In another example, shown in Fig. 12, the first line means "how about a drink". The OCR segments the left-most word of the first line into two separate strings, with no character recognition error in them. Since these two separated strings are out of the vocabulary, the translation is "how about unk unk". The correction model corrects the second left-most string of the second line into a wrong word, "hair", and leads the final translation to be "how about hair unk", which is misleading in the end-to-end system. These results show that word segmentation errors are dominant in our application. If we manually fix all word segmentation errors in the OCR results (so-called perfect segmentation, the fourth row in Table 7), the word error rate drops to less than 2%. Correspondingly, all BLEU scores rise above 42.0, close to the translation upper bound, as shown in the fourth row of Table 8.

5.3.4. Enhance correction with bigram

To overcome the segmentation error, we enhance the noisy channel model for segmentation correction. Given a string of the OCR output, we explore all its bigrams; if a bigram is proximate to a word in the vocabulary, we replace the bigram with that word. Iterating the bigram replacement yields a new string with fewer word-split errors. In the decoding process, we use the Viterbi algorithm to decode the most probable word sequence given both the OCR output and its replacement variants. Fig. 13 explains the algorithm in detail; in our experiment, we set n to 2.


BigramCorrection(OcrOutput A'):
  A) For all segmentations in A', construct a set of bigram strings using adjacent segmentations.
  B) Iterate the replacement and generate a new string A'':
     a) If the edit distance between a bigram and a dictionary word is less than n, the bigram is called proximate to the word.
     b) If a bigram is proximate to a word in our dictionary, replace the bigram with the word.
  C) Decode both A' and A'', and choose the most probable sequence as the correction of A'.

Fig. 13. The bigram correction algorithm.
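A Python rendering of the procedure in Fig. 13 is sketched below; the edit-distance implementation and the generic scoring function standing in for the Viterbi/LM decoder in step C are our own illustrative assumptions.

def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def bigram_merge(tokens, vocabulary, n=2):
    """Merge adjacent OCR segments whose concatenation is proximate
    (edit distance < n) to a vocabulary word, as in Fig. 13, steps A and B."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            bigram = tokens[i] + tokens[i + 1]
            match = next((w for w in vocabulary if edit_distance(bigram, w) < n), None)
            if match is not None:
                merged.append(match)   # replace the over-segmented pair
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged

def bigram_correction(ocr_tokens, vocabulary, score):
    """Step C: decode both the original and the merged string and keep the
    more probable one; `score` stands in for the Viterbi/LM decoder."""
    candidate = bigram_merge(ocr_tokens, vocabulary)
    return max([ocr_tokens, candidate], key=score)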

The third row of Tables 7 and 8 presents the results of the enhanced correction model. For the Arabic transparent font, we improve the BLEU score from 18.70 to 33.47 with the enhanced correction model, while the most accurate translation result reaches 34.01. In our experiments, we also tried a trigram combination in the enhanced noisy channel model, but obtained no further improvement over the bigram model. According to our analysis, over-segment errors are typical in image text written in the Arabic transparent and simplified Arabic fonts, while over-merge errors are common in the traditional Arabic font. The enhanced correction model is designed to reduce over-segment errors and is not very helpful for the over-merge problem. As a result, the improvement for the traditional Arabic font with the enhanced correction model is not as large as for the Arabic transparent and simplified Arabic fonts. How to solve the over-merge problem is one of our future works.

6. Conclusions

In this paper, we have demonstrated improvements on multiple modules to enhance the image-based Arabic translation system. In the text region detection module, we apply different learning methods to attack the image region classification problem, and successfully incorporate location-based prior knowledge into the classifiers. In the recognition and translation module, we propose an approach to correct Arabic OCR errors in an image-based Arabic translation system. The correction model is trained with synthetic images in different fonts and sizes. We have further enhanced the correction model with a bigram language model to improve word segmentation correction. We have achieved substantial improvements in both word correction and translation accuracy. Future work on the detection module will be data driven, since the limited training data at the current stage prevent us from applying more complicated models for region detection. At the same time, the correction models we proposed are limited in the cases

of recognizing multiple adjacent words on image into one single word (over-merge error). Furthermore, those more complicated conditions with the mixture of character recognition errors and word segmentation errors are even challenging for us. In order to address these challenges, we will work on modeling the context information in the training data in the future work. References [1] J. Yang, J. Gao, Y. Zhang, A. Waibel, Toward automatic sign translation, in: Proceedings of the Human Language Technology Conference, 2001. [2] J. Yang, X. Chen, J. Zhang, Y. Zhang, A. Waibel, Automatic detection and translation of text from natural scenes, in: Proceedings of ICASSP 2002, vol. 2, 2002, pp. 2101–2104. [3] Y. Zhang, S. Vogel, PanDoRA: a large-scale two-way statistical machine translation system for hand-held devices, in: Proceedings of MT Summit XI, Copenhagen, Denmark, 2007. [4] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 20th Annual Meeting of the Association for Computational Linguistics, 2002. [5] X. Chen, J. Yang, J. Zhang, A. Waibel, Automatic detection and recognition of signs from natural scenes, IEEE Transactions on Image Processing 13 (1) (2004) 87–99. [6] J. Weinman, A. Hansen, A. McCallum, Sign detection in natural images with conditional random fields, in: IEEE International Workshop on Machine Learning for Signal Processing, 2004. [7] T. Hong, Degraded text recognition using visual and linguistic context, Ph.D. Thesis, Computer Science Department, SUNY Buffalo, 1995. [8] O. Kolak, P. Resnik, OCR error correction using a noisy channel model, in: Human Language Technology Conference (HLT 2002), San Diego, CA, 2002. [9] T. Kanungo, G. Marton, O. Bulbul, OmniPage vs. Sakhr: paired model evaluation of two Arabic OCR products, in: Proceedings of SPIE Conference on Document Recognition and Retrieval VI, San Jose, CA, 1999. [10] K. Taghva, J. Borsack, A. Condit, An expert system for automatically correcting OCR output, in: Proceedings of the SPIE—Document Recognition, 1994, pp. 270–278. [11] D. Doermann, S. Yao, Generating synthetic data for text analysis systems, in: Symposium on Document Analysis and Information Retrieval, 1995, pp. 449–467. [12] T. Sato, T. Kanade, E.K. Hughes, M.A. Smith, S. Satoh, Video OCR: indexing digital news libraries by recognition of superimposed captions, Multimedia Systems 7 (1999) 385–394. [13] W.B. Croft, S. Harding, K. Taghva, J. Borsack, An evaluation of information retrieval accuracy with simulated OCR output, Technical Report UM-CS-1993076, 1993. [14] S.M. Harding, W.B. Croft, C. Weir, Probabilistic retrieval of OCR degraded text using N-grams, in: Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, 1997. [15] E. Mittendorf, P. Schauble, P. Sheridan, Applying probabilistic term weighting to OCR text in the case of a large alphabetic library catalogue, in: Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, 1995, pp. 328–335. [16] J. Gao, J. Yang, An adaptive algorithm for text detection from natural scenes, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR01), Hawaii, 2001. [17] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the European Conference on Machine Learning, Springer, Berlin, 1998. [18] J.H. 
Friedman, Stochastic Gradient Boosting, Computational Statistics and Data Analysis 38 (4) (2002) 367–378. [19] S. Rice, G. Nagy, T. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer, Boston, 1999. [20] M. Eck, C. Hori, Overview of the IWSLT 2005 evaluation campaign, in: Proceedings of International Workshop on Spoken Language Translation, 2005, pp. 11–17.

About the Author—YI CHANG is a Search Relevance Scientist/Engineer at Yahoo! Inc. He received his M.S. degree from Carnegie Mellon University in 2006, an M.S. degree from the Chinese Academy of Sciences in 2004, and a B.S. degree from Jilin University in 2001. His research interests include information retrieval, pattern recognition, and machine learning.

About the Author—DATONG CHEN is a Systems Scientist at Carnegie Mellon University. He received his Ph.D. degree from the Swiss Federal Institute of Technology in 2003, and M.S. and B.E. degrees from Harbin Institute of Technology in 1997 and 1995. His research interests focus on assistive technology, multimedia mining, and statistical machine learning.

About the Author—YING ZHANG received his Ph.D. degree from Carnegie Mellon University in 2008, and an M.S. degree from the Chinese Academy of Sciences in 1999. His research interests are statistical natural language processing, machine translation, machine learning, data mining, and information retrieval.

About the Author—JIE YANG is a Senior Systems Scientist at Carnegie Mellon University. He obtained his Ph.D. degree from the University of Akron in 1991. He has been leading research efforts to develop visual tracking and recognition systems for multimodal human-computer interaction since 1994. His research interests are multimodal interfaces, computer vision, and pattern recognition.