Text detection in natural scene images based on color prior guided MSER




Communicated by Dr XIANG Xiang Bai

Accepted Manuscript

Xiangnan Zhang, Xinbo Gao, Chunna Tian

PII: S0925-2312(18)30488-0
DOI: 10.1016/j.neucom.2018.03.070
Reference: NEUCOM 19521

To appear in: Neurocomputing

Received date: 25 July 2017
Revised date: 4 February 2018
Accepted date: 23 March 2018

Please cite this article as: Xiangnan Zhang, Xinbo Gao, Chunna Tian, Text detection in natural scene images based on color prior guided MSER, Neurocomputing (2018), doi: 10.1016/j.neucom.2018.03.070

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Text detection in natural scene images based on color prior guided MSER

Xiangnan Zhang, Xinbo Gao∗, Chunna Tian

School of Electronic Engineering, Xidian University, Xi'an 710071, China

Abstract


In this paper, we focus on text detection in natural scene images, which is conducive to content-based analysis and understanding of images in the wild. This task is still an open problem and usually involves two key issues: text candidate extraction and verification. For text candidate extraction, we introduce a color prior to guide character candidate extraction by the Maximally Stable Extremal Region (MSER) detector. The color prior is acquired by obtaining stroke-like textures with a modified Stroke Width Transform (SWT) that operates on segmented edges. For text verification, deep learning is adopted to distinguish text from non-text candidates, and the results of task-specific CNNs are fused to improve classification accuracy. The proposed framework is evaluated on the ICDAR 2013 Robust Reading Competition database. It achieves an F-score of 85.87%, which is superior to several state-of-the-art text detection methods.


Keywords: Text detection, Text candidate extraction, Maximally stable extremal region, Stroke width transform, Text verification, Deep learning

1. Introduction


Text, as the crystallization of human wisdom, contains a wealth of semantic information. As a means to access and utilize textual information in images and videos, automatic text detection plays an important role in image scene understanding [1, 2, 3, 4]. With the advances in computer vision and pattern recognition technologies, existing Optical Character Recognition (OCR) methods achieve high recognition rates on scanned documents [5]. In contrast, localizing and recognizing text in natural scenes is extremely difficult, so natural scene text detection is still an open problem. Compared with the uniform characters in document images, the fonts, scales and colors of texts in natural scene images often vary strongly. Moreover, camera-based natural scene images are susceptible to interference from noise, blur, distortion, low resolution, non-uniform illumination, partial occlusion, etc. In addition, the backgrounds of natural scene images often have complex textures, which are difficult to distinguish from real text [6]. Some detection methods adopt a sliding window to localize individual characters [7, 8] or whole words [9, 10]. These methods successfully apply general object detection approaches to text detection problems. Owing to well-designed features or deep networks, their local descriptions of images are more robust. However, as the image scale increases, the computational complexity multiplies. Taking text scale variation into account, determining an appropriate sliding-window size is another issue.

∗Corresponding author. Email addresses: [email protected] (Xiangnan Zhang), [email protected] (Xinbo Gao), [email protected] (Chunna Tian)

Connected component (CC) analysis for text extraction is popular in recent approaches [11, 12]. Unlike traditional object detection, CC-based methods exploit properties of characters such as color similarity and stroke width consistency. These properties are independent of character scale and orientation. Besides, CC analysis provides pixel-level detection results, which can be exploited for further character segmentation. The disadvantage of such approaches is that only texture features are considered while structural features are neglected during character candidate generation. Objects with similar textures, such as grass, bricks and windows, are treated as CCs, which results in false positives. With the development of deep learning, the design philosophy of text detection algorithms has been profoundly affected. Benefiting from the excellent performance of deep learning, some researchers have used this technique to solve sub-problems in text detection, such as text candidate extraction and text/non-text classification, which greatly enhances the performance of text detection [13, 14]. Existing methods improve performance by taking advantage of the powerful classification capability of deep learning. Those methods regard text/non-text detection as a general object detection task, which neglects the unique properties of characters in text. In this paper, we present a novel color prior guided Maximally Stable Extremal Region (MSER) method to extract character candidates. Then we propose a framework for scene text detection with hierarchical-clustering-based character grouping and CNN-based text classification. The pipeline of our method is presented in Fig. 1 and includes two main steps: (1) Color prior estimation with a local Stroke Width Transform (SWT) detector. In this step, we use the local SWT to obtain stroke-like texture regions as coarse text candidates that carry color information. To convert the traditional SWT to the local SWT, we segment the edges in advance.


(2) Color prior guided text detection. We extract character proposals with the color prior guided MSER and remove improper ones with a two-stage filtering consisting of morphological rules and a CNN. Single-link clustering is also utilized to group characters.

The main contributions of this paper are summarized as follows:

(1) We propose a color prior guided MSER, which differs from the traditional MSER. The traditional MSER obtains all regions with stable colors, but these regions are not character-oriented. Since the characters in text usually have homogeneous colors, we integrate the color prior of characters into MSER, resulting in a character-oriented MSER that extracts the color-homogeneous regions likely to belong to text. This eliminates most irrelevant homogeneous regions and evidently improves Precision.

(2) We propose a novel edge segmentation method to apply SWT to character-level stroke extraction. First we obtain an edge map in which the edges are one pixel wide and each edge pixel value is assigned by a structured edge detector. This retains more edge information than Canny does. Then, we use a method similar to chain-code tracing to obtain closed and isolated character boundaries.

(3) We adopt the local SWT to estimate the color prior for MSER instead of using SWT for text proposal generation. The advantages of the local SWT are as follows: (i) The traditional global SWT does not verify whether two parallel edges belong to the same character during stroke extraction. Our local SWT runs on each character individually to avoid interference from the edges of adjacent characters. (ii) The traditional SWT runs twice on each image to cope with both light text on dark backgrounds and dark text on light backgrounds. Since we process only one character at a time, we run SWT once on each image, which improves efficiency.

(4) We train dual CNNs to classify text/non-text regions. Connected component (CC) analysis provides both pixel-level labeled MSER maps and gray image regions cropped from the original scene images. We train two CNNs to cope with these two types of data and classify text/non-text comprehensively. By introducing more classification information, the evaluation results are more convincing.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 proposes our novel color prior based MSER for character candidate extraction. In Section 4, we adopt hierarchical clustering to group character candidates into word proposals, and we use CNNs to eliminate non-text regions in Section 5. Section 6 verifies the performance of the proposed method, and conclusions are drawn in Section 7.

Figure 1: Flowchart of the proposed system.


As previously mentioned, the two most important assumptions in text detection are similar color and consistent stroke width within each character. Based on these two assumptions SWT [11] and MSER [17] are two representative character extraction methods in the field of text detection.

2. Related Work

The proposed approach belongs to the CC-based methods, and deep learning is also incorporated in our framework. We therefore focus on reviewing CC-based methods and work that applies deep learning to text detection. For other scene text detection methods, readers are referred to the related surveys [6, 15, 16].

SWT transforms the pixel data from gray values to the most likely stroke width. It was first proposed by Epshtein et al. [11] and has become one of the most famous stroke feature representations. SWT is based on the assumption that the edge of a stroke is parallel to its opposite edge, and that the distances between these parallel lines, perceived as stroke widths, are consistent within the same character. Epshtein et al. merely treated SWT as a means of candidate extraction [11]. However, due to the complex backgrounds of natural scene images, the extracted edges are often too messy to guide the identification of strokes, which narrows the applicability of SWT to character candidate extraction. Nevertheless, since the SWT algorithm is easy to implement and effective at describing stroke properties, it is an important ingredient of several subsequent algorithms [12, 18, 19]. Yin et al. [12] adopted a simplified SWT algorithm in their framework, which estimated stroke widths in only four directions [20]. They regarded stroke width as one of several important character features instead of the sole feature for character determination. Theoretically, this should be much more robust than locating text with SWT directly. Yao et al. [18] grouped neighboring pixels in the SWT image into CCs using a simple association rule. Compared with the original SWT in [11], there is no essential difference apart from more sophisticated subsequent processing. Some researchers adapt SWT to poor edge detection by adding constraints. Huang et al. [19] presented a new operator based on SWT, called the Stroke Feature Transform (SFT). They introduced color consistency as an auxiliary cue to detect edges and determine the stroke range more accurately. In general, SWT is an effective way to measure character stroke width, but it is not well suited to locating characters under the complicated conditions of natural scene images, so combining SWT with another, edge-free character detector would be a better course.


Matas et al. [21] proposed MSER to solve the wide-baseline stereo problem. It has been observed that text components usually tend to form homogeneous color regions, which conforms to the definition of an Extremal Region (ER). Therefore, MSER is used as the basis of candidate extraction in many text detection frameworks. Chen et al. [22] proposed an edge-enhanced MSER, which combined the complementary properties of Canny [23] edges and MSER to cope with blurred images. Its drawback is the over-dependence on Canny's performance. Neumann et al. [24] used ERs instead of MSERs to generate text candidates, and unsuitable ERs were eliminated by a sequential classifier. This method is robust and runs in real time. In particular, MSER-based methods achieved state-of-the-art performance in the ICDAR 2013 and ICDAR 2015 competitions [25, 26]. Yin et al. [12] designed a fast and effective algorithm to prune character candidates extracted by MSER using the strategy of minimizing regularized variations. In the following year, they proposed a multi-stage clustering algorithm for grouping MSER components to detect multi-oriented text [27]. Both of these works employed MSER to prepare character candidates; benefiting from an effective pruning strategy and hierarchical clustering, they achieved excellent performance. Huang et al. [28] introduced a CNN to verify the regions obtained by MSER. This operation not only improved the accuracy of identifying true text candidates, but also separated merged characters, which commonly occur in unsupervised MSER. In conclusion, MSER is robust enough for the candidate generation task. For that reason, researchers usually extract as many candidates as possible to promote Recall, and prior guided MSER is a significant research direction for alleviating the pressure that excessive candidates put on the classifier. In addition, with the emergence of deep neural networks, many approaches have been proposed to leverage CNNs for scene text detection. Wang et al. [8] proposed an end-to-end text recognition CNN with unsupervised pre-training. This framework consists of two independent parts: text detection and character recognition. They relied on the response map obtained by a CNN-based sliding window to locate text regions, and within these regions they recognized words using a beam search algorithm with a lexicon [29]. Similar to [8], Jaderberg et al. [30] established a feature-sharing CNN architecture for both text detection and word recognition. In their further work [31], word candidates were represented by bounding boxes generated with Edge Boxes and a trained aggregate channel feature detector [32], and a random forest classifier was used to prune improper proposals. Zhang et al. [13] used a Fully Convolutional Network (FCN) model to predict a salient map of text regions, then combined the holistic feature with local character features to extract candidates. In their work, text line hypotheses were estimated by combining the salient map and character components. Finally, another FCN classifier was used to predict the centroid of each character in order to remove false hypotheses. He et al. [33] presented a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from image components.


To improve detection accuracy, they used the reconstruction of character components as supervision when extracting text features. Gupta et al. [34] introduced a framework based on FCRN and YOLO to detect text in natural images. More importantly, their contribution on generating synthetic images with text in clutter made it possible to build larger databases of artificial natural scene images containing specific text. Liu et al. [35] proposed a new CNN-based method to detect text with tighter quadrangles. They were the first to put forward prior quadrilateral sliding windows and to use a shared Monte Carlo method to compute the polygonal overlap area quickly and accurately. Zhou et al. [36] presented an FCN-based multi-task network whose pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images with a single neural network. Liao et al. [37] proposed TextBoxes, inspired by SSD, a recent development in object detection. SSD aims to detect general objects in images but fails on words that have extreme aspect ratios; the text-box layers in TextBoxes solve this problem. Tian et al. [38] proposed a weakly supervised scene text detection method. He et al. [39] proposed an attention mechanism that roughly identifies text regions via an FCN. There is no doubt that CNNs now dominate object detection, but how to import the intrinsic features of text into a deep learning framework is significant for discriminating text from general objects in natural scene images.


3. Character Candidates Extraction with Color Prior Guided MSER

3.1. Extraction Algorithm Overview

For grayscale images, MSER is an excellent connected-region extraction algorithm, adopted by many popular methods to generate character candidates. The basic principle of MSER is to obtain a series of binary image frames by changing the threshold, as shown in Fig. 2, and to regard the stable areas across frames, i.e., those virtually unchanged over a range of thresholds, as ERs [21]. We use Q_m to represent the area of a region segmented by the threshold m, and ∆ represents the variation of m. When q(m) in Eq. (1) is a local minimum, Q_m belongs to an MSER, which means the region is stable with respect to the variation of the threshold m. A lower ∆ results in more MSERs.


q(m) = (Q_{m+∆} − Q_{m−∆}) / Q_m ,   m ∈ [0, 255].   (1)
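To make the stability criterion concrete, the following Python/NumPy sketch (our illustration, not the component-tree implementation of [21]) thresholds a grayscale image at every level m, tracks the area Q_m of the connected component containing a chosen seed pixel, and reports the thresholds where q(m) in Eq. (1) attains a local minimum. The image file name and the seed location are placeholders.

import numpy as np
import cv2

def stability_curve(gray, seed, delta=1):
    """Area Q_m of the extremal region containing `seed` for every threshold m,
    and the stability q(m) = (Q_{m+delta} - Q_{m-delta}) / Q_m from Eq. (1).
    A didactic sketch, not the component-tree MSER of [21]."""
    areas = np.zeros(256, dtype=np.float64)
    for m in range(256):
        # Extremal region of a dark-on-light character: pixels with value <= m.
        binary = (gray <= m).astype(np.uint8)
        _, labels = cv2.connectedComponents(binary)
        lbl = labels[seed]                      # component containing the seed pixel
        areas[m] = (labels == lbl).sum() if lbl != 0 else 0
    q = np.full(256, np.inf)
    for m in range(delta, 256 - delta):
        if areas[m] > 0:
            q[m] = (areas[m + delta] - areas[m - delta]) / areas[m]
    return areas, q

if __name__ == "__main__":
    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)       # placeholder image
    _, q = stability_curve(img, seed=(img.shape[0] // 2, img.shape[1] // 2))
    stable = [m for m in range(1, 255) if q[m] < q[m - 1] and q[m] < q[m + 1]]
    print("locally stable thresholds:", stable)

A smaller ∆ (the `delta` argument) makes the local-minimum test easier to satisfy, which is why a lower ∆ yields more MSERs.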

Figure 2: The principle of MSER. With the increase of threshold m, the ERs are spreading and each growing area is marked with red pixels. When the area of a region is stable we regard this region as MSER.



The traditional MSER-based method can only handle single-channel images. Thus, we must either first convert color natural scene images to grayscale or apply MSER to each color channel separately. In this pretreatment, the loss of character color information may cause text and background to mix. We therefore propose a novel color prior guided MSER to extract character candidates, which includes the following steps:

1) Coarse stroke extraction. Stroke candidates are extracted using the local SWT algorithm. Compared with the traditional SWT algorithm [11], the local SWT proposed in this paper halves the computational cost and suppresses false positives between characters. More details are given in Section 3.2.

2) Stroke filtering. Stroke candidates are clustered into groups by the single-link clustering algorithm with learned parameters, and isolated strokes are eliminated.

3) Prior guided MSER extraction. The detected strokes are projected onto the original image to obtain the stroke color as prior information. This prior information then guides the threshold selection in the MSER process.


Figure 3: The process of SWT. p and q determine the start and end of the ray r, and the length of r is taken as the width of the stroke.

3.2. Local Stroke Extraction


Experience suggests the hypothesis that each character possesses a consistent stroke width. Based on this hypothesis, SWT has been used to detect text in natural scenes, and its simplicity makes it easy to handle different fonts and languages [11]. SWT transforms the image data from RGB values to the most likely stroke width for each pixel. First, it extracts image edges and sets the initial width value of each pixel to ∞. For each edge pixel p, the gradient direction d_p is calculated. Then, we develop the ray

r = p + n · d_p   (2)


until it reaches another edge pixel q, as shown in Fig. 3, where n represents the step size. If the gradient direction d_q at q is roughly opposite to d_p, and d_p is approximately perpendicular to the edge direction, each pixel on the ray is assigned the width ‖p − q‖ unless it already has a lower value.

There are mainly two drawbacks in the traditional SWT method. First, in practice there may be bright text on a dark background or vice versa. In order to accommodate both situations, the authors applied the algorithm twice, once along d_p and once along −d_p [11]. Although this addresses character color diversity, the repeated pass takes twice as much time and doubles the false positives. The second problem with the traditional SWT is its over-reliance on edge detection. The start and end of the ray r are determined only by the two edge pixels p and q, but whether these pixels belong to the same character is never ascertained. The traditional SWT extracts the edge information of the whole image without any segmentation, so there might be rays between adjacent characters, which leads to false stroke widths being assigned to pixels outside characters.
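For concreteness, the fragment below sketches the basic SWT ray casting of Eq. (2) in Python/NumPy. It is a simplified, single-polarity illustration (bright text on a dark background; the opposite polarity would follow −d_p), not the authors' implementation, and the Canny thresholds and the opposite-gradient tolerance are illustrative values.

import numpy as np
import cv2

def swt(gray, max_width=60):
    """Simplified Stroke Width Transform for bright strokes on a dark background."""
    edges = cv2.Canny(gray, 100, 200) > 0
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy) + 1e-9
    dx, dy = gx / mag, gy / mag                        # unit gradient direction d_p
    h, w = gray.shape
    swt_map = np.full((h, w), np.inf)

    for y, x in zip(*np.nonzero(edges)):
        ray = [(y, x)]
        for n in range(1, max_width):                  # r = p + n * d_p  (Eq. 2)
            cy = int(round(y + n * dy[y, x]))
            cx = int(round(x + n * dx[y, x]))
            if not (0 <= cy < h and 0 <= cx < w):
                break
            ray.append((cy, cx))
            if edges[cy, cx]:
                # accept the ray only if the gradient at q is roughly opposite to d_p
                if dx[y, x] * dx[cy, cx] + dy[y, x] * dy[cy, cx] < -0.7:
                    width = np.hypot(cy - y, cx - x)
                    for ry, rx in ray:
                        swt_map[ry, rx] = min(swt_map[ry, rx], width)
                break
    return swt_map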

These two drawbacks show that SWT is not well suited to global image processing, because it cannot judge whether a ray r lies inside a character or not. Many character extraction approaches based on SWT have also encountered the problem that the transformed stroke widths are unreliable when the edge detection results are messy. To tackle this problem, we apply SWT to local image regions obtained by segmenting the edges at the character level. When each local image region contains only one character, the direction d_p of the rays r can be fixed and is not disturbed by adjacent edges. SWT adopts Canny, the classical edge detection algorithm, to detect edges without any fine tuning. Canny uses both high and low thresholds to filter the image gradient and obtain a one-pixel-wide object boundary; its final output is a binary image [23]. In order to segment the edges effectively, we run a structured edge detector [40] on the input image alongside the Canny detector to get accurate edge pixel values. E_c and E_s denote the edge maps of the Canny and structured edge detectors, respectively. Then, the entrywise product is adopted to fuse these two matrices.


E_e = E_c ◦ E_s .   (3)

In Eq. (3), ◦ is a binary operation that takes two matrices of the same size and produces another matrix E_e whose elements are the products of the corresponding elements of E_c and E_s. The structured edge detector generates continuous edge values instead of the discrete binary edge labels produced by the Canny detector. Meanwhile, through the "◦" operation, Canny's one-pixel-wide object boundaries are preserved, which is vital for the subsequent edge segmentation because it reduces the complexity of the edge tracing path. In the edge map E_e, the edges are one pixel wide and each edge pixel value is assigned by the structured edge detector. We then use a method similar to chain-code tracing to obtain closed and isolated character boundaries.
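As a minimal sketch of Eq. (3), the fragment below fuses the two edge maps with an element-wise product. It assumes OpenCV's contrib module, whose structured-edge detector follows Dollár and Zitnick [40] and requires a pre-trained structured-forest model; the model path is a placeholder.

import cv2
import numpy as np

def fused_edge_map(bgr, model_path="structured_edge_model.yml.gz"):
    """E_e = E_c ◦ E_s (Eq. 3): one-pixel-wide Canny locations weighted by the
    soft structured-edge response. `model_path` is a placeholder for a
    pre-trained structured-forest model file."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    E_c = (cv2.Canny(gray, 100, 200) > 0).astype(np.float32)       # binary, thin edges

    sed = cv2.ximgproc.createStructuredEdgeDetection(model_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    E_s = sed.detectEdges(rgb)                                      # soft edges in [0, 1]

    return E_c * E_s                                                # entrywise product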


Figure 4: Edge detection results. (a) Input grayscale image. (b) The result of the Canny detector. (c) The result of the structured edge detector. (d) The fusion of (b) and (c). In the Canny edge map, edge pixels are labeled with "1" and non-edge pixels with "0". In the structured edge map, the edge pixels are assigned continuous values in the range 0 to 1, but the spatial localization of the edge labels is inaccurate. The fusion of these two edge maps also suppresses some of the messy edges, as shown in (d).

Algorithm 1 shows the basic steps of our edge segmentation procedure. In practice, we notice that edges may be disconnected. To overcome this problem, we extend the trace by several pixels along the current tracing direction if Step 8 of Algorithm 1 returns false. We treat these imaginary pixels as real edge pixels and assign them the value of the last real edge pixel; Algorithm 1 then works on these pixels to verify the validity of the edge completion. In another case, when there are multiple next-point candidates among the eight-direction neighbours of the current point (x_c, y_c), we traverse them recursively to ensure the integrity of the edge segmentation.


Through Algorithm 1 we obtain a label map M that indexes each segmented edge with consecutive natural numbers. By traversing the label values in M and reconstructing the segmented edges, a set of isolated edges is extracted. For each single edge, we first assign the SWT direction d_p. Unlike the global SWT, which must handle texts of varied colors, our local SWT copes with only one character of a certain color at a time. According to the gradient direction at the edge, it is easy to restrict the ray r between parallel boundaries. Running SWT with the specified direction on each edge, we obtain stroke-like textures, which are treated as character proposals.


3.3. Strokes filtering


Algorithm 1 Edge Segmentation Algorithm
Require: Edge map E_e; threshold th;
Ensure: Label map M;
1: Set current edge group label l = 1;
2: Obtain the initial point (x_i, y_i) such that E_e(x_i, y_i) = max(E_e);
3: if max(E_e) == 0 then
4:     Go to Step 17;
5: end if
6: Set current point (x_c, y_c) = (x_i, y_i);
7: Find the maximum value E_e(x_n, y_n) among the eight-direction neighbours of (x_c, y_c);
8: if abs(E_e(x_n, y_n) − E_e(x_c, y_c)) / E_e(x_c, y_c) < th then
9:     l = l + 1;
10:    Go to Step 2;
11: else
12:    E_e(x_c, y_c) = 0;
13:    M(x_c, y_c) = l;
14:    (x_c, y_c) = (x_n, y_n);
15:    Go to Step 7;
16: end if
17: return M;

Characters usually exist in groups; thereby, if we cluster the stroke candidates, isolated stroke candidates rarely belong to characters. From stroke to sentence there are three levels, corresponding to character, word and phrase, respectively. Intuitively, hierarchical clustering is particularly suitable for this stroke grouping task. One of the most popular hierarchical clustering algorithms is single-link clustering, and the elongated clusters produced by single-link clustering are advantageous for stroke grouping [41]. For more details, please refer to Section 4. After stroke filtering and clustering, we obtain several groups of strokes, each of which can be matched to a region of the color input image.


3.4. Prior guided MSER extraction

From the analysis in Section 3.1 we can see that the MSER process is accompanied by a loss of color information. Neumann et al. [24] considered the RGB and HSI color spaces and an additional intensity-gradient channel to prevent the loss of color information, a trick commonly used in many approaches. However, they simply increase the number of color channels, hoping to find a good color representation for each character, without considering the color prior of the text. The mean color of a group of strokes can be easily calculated and used as a color prior for the corresponding region, which indicates that characters with a color similar to this prior may be present in the region.



Figure 5: The process of local SWT extraction and clustering. (a) shows the input image and (b) the segmented edges in different colors. (c) shows the local SWT map, in which the regions clustered into the same group are indicated with the same color.

Figure 7: The dendrogram obtained using the single-link algorithm.

The strategy of color prior guidance is to convert the color value of text into a certain gray value. Here we adopt the Euclidean distance to calculate the similarity S between each pixel and the color prior as

S_{i,j} = √( (R_{i,j} − R_p)^2 + (G_{i,j} − G_p)^2 + (B_{i,j} − B_p)^2 ) ,   (4)



where R, G and B are the three color channels of the region to be processed, and S has the same dimensions as them. R_p, G_p and B_p denote the channels of the color prior. Let (i, j) be the matrix index; the similarity between each pixel and the prior is computed by traversing the matrix. In the similarity matrix, darker regions have a higher similarity to the text color prior. Thus, we can apply MSER to the similarity matrix to obtain the stable regions whose colors are similar to the text color prior. In this case, we can constrain the thresholds to make MSER more targeted. The improved color prior guided MSER is defined in Eq. (5):


q(m) = (Q_{m+∆} − Q_{m−∆}) / Q_m ,   m ∈ [0, Th_c] ,   (5)

where Th_c is the maximum desirable threshold, set by experience.
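Eqs. (4) and (5) translate directly into a per-pixel similarity map followed by a threshold-limited MSER pass. The sketch below computes the map and emulates the constraint m ∈ [0, Th_c] by saturating pixels farther from the prior than Th_c; this masking is one possible realization of the constraint rather than necessarily the authors' exact implementation, and the Th_c value is illustrative.

import numpy as np

def color_similarity_map(image_rgb, prior_rgb):
    """S from Eq. (4): Euclidean distance of every pixel to the color prior
    (R_p, G_p, B_p). Darker values mean higher similarity to the text color."""
    diff = image_rgb.astype(np.float32) - np.asarray(prior_rgb, dtype=np.float32)
    return np.sqrt((diff ** 2).sum(axis=2))

def limit_thresholds(S, th_c=120):
    """Emulation of m ∈ [0, Th_c] in Eq. (5): pixels farther from the prior than
    Th_c are saturated to 255, so they can never join an extremal region while
    the thresholds sweep from dark to bright."""
    S = np.clip(S, 0, 255)
    return np.where(S > th_c, 255, S).astype(np.uint8)

The saturated similarity map can then be passed to any standard MSER detector, which now only responds to regions whose color is close to the prior.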

4. Character Candidates Grouping

The typical process of a single-link clustering algorithm is illustrated in Fig. 7. In single-link clustering, the clusters with the minimum distance are merged hierarchically. If we assume that A, B, C, D, E, F and G in Fig. 7 represent characters, they are first merged into words, then into phrases, and finally form a sentence. Because of the hierarchical structure of text, single-link clustering is appropriate for grouping character candidates. Single-link clustering is a variant of hierarchical clustering [41], which creates a hierarchically nested clustering tree based on the similarities between data points. These similarities are obtained by calculating the distances between clusters of data points; in particular, in the single-link algorithm the distance between two clusters is the distance between their nearest data points. By calculating the similarities between clusters and merging the nearest pairs iteratively, the single-link clustering tree is constructed. In order to measure the similarity of character candidates, we adopt the weighted sum of features proposed by Yin et al. [12] as the distance function. The distance function d(u, v; w) is defined as

d(u, v; w) = w^T x_{u,v} ,   (6)

where u and v denote two given data points and the vector w represents the weights learned by the distance metric learning algorithm proposed in [12]; x_{u,v} is the feature vector calculated in the feature space detailed as follows. Characters in the same sentence often have similar size, color and stroke width. Moreover, in terms of spatial distribution, the positional relationship between characters is also a considerable clustering cue. Based on these assumptions, Yin et al. [12] established a feature space from which the vector x_{u,v} is extracted. With the minimal outer rectangle method, it is convenient to measure u's width w_u and height h_u; data point v's width w_v and height h_v are obtained in the same way. Let (x_u, y_u) be the top-left corner coordinate of u, s_u its stroke width, and c^1_u, c^2_u, c^3_u the mean values of its three color channels (and likewise for v). The feature components are as follows:

• Interval

  abs(x_v − x_u − w_u) / max(w_u, w_v)   if x_u < x_v ,
  abs(x_u − x_v − w_v) / max(w_u, w_v)   otherwise.   (7)

• Width and height differences

  abs(w_u − w_v) / max(w_u, w_v) ,   abs(h_u − h_v) / max(h_u, h_v).   (8)

• Top and bottom alignments

  arctan( abs(y_u − y_v) / abs(x_u + w_u/2 − x_v − w_v/2) ) ,
  arctan( abs(y_u + h_u − y_v − h_v) / abs(x_u + w_u/2 − x_v − w_v/2) ).   (9)


Figure 6: Comparison of the traditional MSER and the color prior guided MSER. (a) and (b) are the gray image and the MSER map obtained from it. (c) and (d) are the color prior guided MSER. If the color information of the text is known in advance, the non-text background area can be suppressed effectively.

• Color difference

  √( (c^1_u − c^1_v)^2 + (c^2_u − c^2_v)^2 + (c^3_u − c^3_v)^2 ) / 255 .   (10)

• Stroke width difference

  abs(s_u − s_v) / max(s_u, s_v).   (11)

Utilizing Eq. (6), we compute the distances between all pairs of the N character proposals, obtaining C(N, 2) distances that serve as single-link clustering similarities. Hierarchical grouping information is then obtained from the clustering tree by thresholding. We follow the text grouping method presented by Yin et al. [12]. Text detection methods based on CC analysis involve sub-problems such as text extraction, grouping and verification in turn. In this paper, we focus on introducing color priors to improve the pertinence of text candidate extraction and on promoting the credibility of text verification using dual CNNs. Character clustering is not our main contribution; in principle, it can be replaced by other character clustering methods.
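In practice, the grouping step builds the feature vector x_{u,v} of Eqs. (7)–(11) for every candidate pair, weights it with the learned w (Eq. (6)), and feeds the condensed distance list to an off-the-shelf single-linkage routine. Below is a compact sketch using SciPy; the candidate attributes (bounding box, stroke width, mean color), the weight vector and the dendrogram cut value are placeholders for quantities produced earlier in the pipeline.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pair_features(u, v):
    """x_{u,v} from Eqs. (7)-(11). Each candidate is a dict with keys
    x, y, w, h (bounding box), s (stroke width) and c (mean color triple)."""
    mw, mh = max(u["w"], v["w"]), max(u["h"], v["h"])
    interval = (abs(v["x"] - u["x"] - u["w"]) if u["x"] < v["x"]
                else abs(u["x"] - v["x"] - v["w"])) / mw                    # Eq. (7)
    dx_mid = abs(u["x"] + u["w"] / 2 - v["x"] - v["w"] / 2) + 1e-9
    return np.array([
        interval,
        abs(u["w"] - v["w"]) / mw, abs(u["h"] - v["h"]) / mh,               # Eq. (8)
        np.arctan(abs(u["y"] - v["y"]) / dx_mid),                           # Eq. (9)
        np.arctan(abs(u["y"] + u["h"] - v["y"] - v["h"]) / dx_mid),
        np.linalg.norm(np.subtract(u["c"], v["c"])) / 255.0,                # Eq. (10)
        abs(u["s"] - v["s"]) / max(u["s"], v["s"]),                         # Eq. (11)
    ])

def group_candidates(cands, w, cut=1.0):
    """Single-link grouping with the learned distance d(u, v; w) = w^T x_{u,v} (Eq. 6)."""
    n = len(cands)
    dists = [w @ pair_features(cands[i], cands[j])       # condensed C(N,2) distance list
             for i in range(n) for j in range(i + 1, n)]
    tree = linkage(np.asarray(dists), method="single")   # single-link dendrogram
    return fcluster(tree, t=cut, criterion="distance")   # cut the tree into groups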


5. Text Candidates Classification


Figure 8: The schematic of the joint classifier.


Deep learning has a powerful ability to handle classification problems, and in the general case the inputs to be classified are regions cropped from the original images. This can cause the target to mingle with a variable background. Unlike sliding-window-based text candidate extraction methods, MSER can provide accurate pixel-level text/non-text labels. We can reconstitute binary character images by mapping these labels onto zero matrices. Obviously, these artificially generated images exclude interference from the background, but they are strongly affected by the performance of the MSER operator. To obtain robust classification results, we train two CNNs separately, one with a collected gray-scale character dataset and one with the generated binary character dataset. Finally, we remove inappropriate text candidates by combining the results of the two CNNs. Owing to the proven good performance of VGG16 [42], we adopt it as the classification network, initialized with pre-trained weights. Different from the standard VGG16 network, all input images are resized to 32 × 32, which is enough to retain the specific information of characters.

Based on the pre-trained VGG model, the classification network is fine-tuned on a dedicated dataset. The dimensions of the parameters in the fully connected layers differ from those of the original network because of the modified input size; therefore, the parameters of the fully connected layers are completely retrained. In practice, two identical networks are adopted in the classification process and trained in the same way. By preparing pertinent training data for MSER proposals and for text/non-text regions, we obtain two CNN models: an MSER classifier and a character classifier.
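The classifiers in this paper are trained with Caffe; purely as an illustration of the architectural choice (32 × 32 inputs, pre-trained convolutional weights, fully connected layers replaced and retrained for two classes), an equivalent PyTorch sketch could look as follows. With 32 × 32 inputs, VGG16's five pooling stages leave a 1 × 1 × 512 feature map, so the classifier head is resized accordingly; the head sizes and the replication of the single-channel binary maps to three channels are our assumptions, not the paper's exact configuration.

import torch.nn as nn
from torchvision import models

def make_text_classifier(pretrained=True):
    """VGG16 backbone for 32x32 text/non-text patches; the original fully
    connected layers are replaced and retrained, as described in Section 5."""
    vgg = models.vgg16(pretrained=pretrained)
    vgg.avgpool = nn.Identity()                 # 32x32 input -> 1x1x512 feature map
    vgg.classifier = nn.Sequential(
        nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(256, 2),                      # text / non-text
    )
    return vgg

# Two identical networks trained in the same way: one for binary MSER maps
# (replicated to three channels), one for gray-scale patches from the image.
mser_classifier = make_text_classifier()
char_classifier = make_text_classifier()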


Training the MSER classifier requires binary images, each containing a single character. Labeling text/non-text regions in real scene images at the pixel level in large quantities is a heavy task, and manually annotated negative samples are not necessarily consistent with the actual situation. To obtain negative samples close to reality, we run MSER on natural scene images and store the non-character ERs generated during the process. Fortunately, ICDAR provides a segmented training image set in which the locations of characters are well marked. To simulate the generation of non-character ERs in actual operation, we run MSER on these images and compare every ER with the ground truth. By repeating this process at different scales and in different color channels, we obtain a sufficient number of realistic negative samples. For positive samples, we adopt automatically generated standard character images as training data. Classifying text/non-text regions requires areas containing text cropped from natural scene images as positive training data, with random background regions taken as negative samples.


In order to comprehensively consider the scores of these two classifiers, we define the final score as

S = 1 / ( α/S_m + (1 − α)/S_c ) ,   (12)


where S_m is the score output by the MSER classification network and S_c is the score output by the character classification network. In Eq. (12), α balances the weights of these two classifiers. From the assumption that text tends to appear in lines or columns, single-link clustering is employed to group character candidates at the word or sentence level. Predictably, some candidates may not be clustered with others and thus become isolated. Given the way humans display text, candidates within groups tend to have higher confidence of being characters than isolated candidates, so we set a lower threshold for candidates belonging to a group. In this way, we reduce the effect of occlusion or uneven illumination. Fig. 9 shows that incomplete ERs still remain in the group.
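Eq. (12) is a weighted harmonic mean of the two classifier scores, so a candidate must score reasonably on both networks to survive. A small sketch of the fusion and of the relaxed threshold for grouped candidates follows; the scores are assumed to lie in (0, 1], and the numeric thresholds are illustrative, not the paper's values.

def joint_score(s_m, s_c, alpha=0.5):
    """Eq. (12): weighted harmonic mean of the MSER-classifier score s_m
    and the character-classifier score s_c (both assumed in (0, 1])."""
    return 1.0 / (alpha / s_m + (1.0 - alpha) / s_c)

def keep_candidate(s_m, s_c, in_group, alpha=0.5, t_group=0.4, t_isolated=0.6):
    """Grouped candidates get a lower acceptance threshold than isolated ones,
    since characters rarely occur alone (threshold values are illustrative)."""
    return joint_score(s_m, s_c, alpha) >= (t_group if in_group else t_isolated)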

6. Experimental Results

We evaluate the proposed text detection approach on the ICDAR [43] database, which is widely used for benchmarking scene text detection algorithms. Profiting from its mature online evaluation system, we can easily obtain performance indicators with high credibility. The ICDAR competition includes four challenges, covering scene text localization, character segmentation, etc. In this paper, the text localization task, Challenge 2, is the focus of our research. There are 233 natural scene images containing horizontal text for testing in this challenge. In this experiment, DetEval [44] was selected to evaluate the performance of the algorithm. We choose this evaluation method because it can cope with one-to-one, one-to-many (one box corresponding to many words) and many-to-one matches between ground-truth annotations and detection results. The F-score is the harmonic mean of Precision and Recall, as in Eq. (13):

f = 2 × p × r / (p + r) ,   (13)

where r and p denote Recall and Precision, and f denotes the F-score as a comprehensive index.

6.1. Implementation details

The image resolutions in the ICDAR 2013 test set vary widely, from 355 × 200 to 3888 × 2592. This is mainly due to differences in image compression quality, but the field of view is similar from image to image. To reduce the computational cost, we resize images so that the maximum side length is 1500. Bissacco et al. [7] applied MSER to each image at several scales: 1, 0.5, 0.25 and 0.12. However, more scales are not always better; from experience, we chose two zoom ratios, 1 and 0.125, to handle small and large text, respectively. Increasing the number of color channels to be tested is another way to improve Recall. In our framework we obtain similarity maps with color priors and treat them as extra channels; for a further discussion, please refer to Section 6.3. During edge segmentation we extend the search chains in at most 3 directions unless the attenuation of the edge value at the next point is greater than 20%. Then, the stroke texture is extracted by SWT with the standard parameters in [11]. The proposed candidate extraction approach is based on MSER, which is affected by the following parameters: ∆ in Eq. (1) controls how the variation is calculated, and the maximal variation v+ filters out areas that are too unstable. In our experiments, we set ∆ = 1 and v+ = 0.25, and the pixel numbers of MSER regions are limited to between 20 and 7000. The CNN classifiers are trained with Caffe [45]. For the character classifier there are 20,000/10,000 generated positive/negative samples and 20,000/30,000 positive/negative samples collected from the ICDAR training set. Note that our training samples only contain English letters and digits. To train the MSER classifier we generate 70,000 binary characters and collect 200,000 negative samples as detailed in Section 5. The accuracies of these two classifiers on the validation set are both over 95%.
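For reference, the detector parameters and the scale handling described above can be set up as in the following sketch, assuming OpenCV's MSER implementation (the paper's own code may differ in details).

import cv2

# MSER with the reported parameters: ∆ = 1, maximal variation v+ = 0.25,
# region size limited to 20-7000 pixels
# (positional arguments: delta, min_area, max_area, max_variation).
mser = cv2.MSER_create(1, 20, 7000, 0.25)

def rescale_for_detection(image, max_side=1500, ratios=(1.0, 0.125)):
    """Resize so that the longer side is at most `max_side`, then return the two
    zoom ratios used in the experiments (1 and 0.125) for small and large text."""
    h, w = image.shape[:2]
    scale = min(1.0, max_side / float(max(h, w)))
    base = cv2.resize(image, None, fx=scale, fy=scale) if scale < 1.0 else image
    return [cv2.resize(base, None, fx=r, fy=r) if r != 1.0 else base for r in ratios]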


Figure 9: Illustration of the classification of candidate groups.


6.2. Text localization results




Methods               Recall(%)   Precision(%)   F-score(%)
Our method            82.89       89.07          85.87
Shi et al. [46]       83.00       87.70          85.30
Tian et al. [47]      83.98       83.69          83.84
Zheng et al. [48]     77.92       89.90          83.48
Gupta et al. [34]     75.50       92.00          83.00
Zhang et al. [13]     78.00       88.00          83.00
He et al. [33]        73.00       93.00          82.00
Tian et al. [49]      75.89       85.15          80.25
Zhang et al. [50]     74.00       88.00          80.00
Lu et al. [51]        69.58       89.22          78.19
Yin et al. [12]       69.28       88.80          77.83
Epshtein et al. [11]  73.24       81.53          77.16
Wu et al. [52]        70.00       84.00          76.00
Baseline              35.07       60.95          44.52

Table 1: Experimental results based on the DetEval rule on the ICDAR 2013 dataset.


The evaluation of text localization results on the ICDAR 2013 dataset with the DetEval rule is presented in Table 1.


Our approach is compared with several state-of-the-art methods: Shi et al. [46], Tian et al. [47], Zheng et al. [48], Gupta et al. [34], Zhang et al. [13], He et al. [33], Tian et al. [49], Zhang et al. [50], Lu et al. [51], Yin et al. [12], Epshtein et al. [11] and Wu et al. [52]. Our method achieves the highest F-score, 85.87%. Among the methods listed in Table 1, He et al. [33] have the best Precision, 93%, and Tian et al. [47] have the best Recall, 83.98%. Benefiting from the color prior guided MSER, text areas with low contrast are enhanced before extraction and complex backgrounds are suppressed. Therefore, the Recall is increased and the false detection rate is decreased during character candidate extraction. With the joint classifier proposed in Section 5, improper candidates are further validated and filtered. All of these make both the Recall and the Precision of our method better. Fig. 10 presents some examples of our experimental results. The outputs of the proposed text detection system are bounding boxes denoted by the coordinates of their upper-left and lower-right corners. In the illustration, text areas are marked with blue rectangles. Note that our method can adequately handle scene images with text in various fonts, low contrast and complex backgrounds. Compared with the methods of He et al. and Gupta et al., Precision is our weakness. Classification errors bring a certain number of false positives. Text-like textures, such as tires, windows and fences, often interfere with the validation of the classifier, as shown in Fig. 11(a). Owing to the well-trained classifiers, this is a rare situation. Occlusion, uneven illumination, or distortion in a word region may lead to incomplete word detection, which causes most of the false positives in the experiment, as shown in Fig. 11(b).



Figure 12: Illustration of some missing detections. The missing detections are marked with red rectangles.


Figure 13: Illustration of some missing detections. The missing detections are marked with red rectangles.


Figure 11: Illustration of some detected false positives. The false positives are marked with red rectangles.

Obviously, incomplete word detection is also treated as missed detection in the evaluation, which decreases Recall. Conversely, if too many words are merged into one text region, as shown in Fig. 12, shorter words such as "Gt." in the text region are considered undetected. Besides the missed detections caused by erroneous character grouping, some properties of the text itself also make detection difficult. In Fig. 13(a) the contrast between text and background is too low to distinguish the text region. Fig. 13(b) demonstrates the effect of character size on text detection: recalling all small text regions is genuinely hard. There are some flourishes in Fig. 13(c) that are hardly handled by our classifier; therefore, our method tends to have low Recall on images containing word art.

The ICDAR 2015 dataset features incidental scene text images taken with Google Glass without care for positioning, image quality, or viewpoint. The dataset contains 1000 training images and 500 testing images. The evaluation of text localization results on the ICDAR 2015 dataset is presented in Table 2. Like our method, the methods proposed by Yin et al. [12] and Zheng et al. [48] are based on CC analysis, and among these we achieve the best performance. Both Yin et al. [12] and Zheng et al. [48] adopt MSER to extract text candidates, which are pruned with an ER tree; furthermore, Zheng et al. [48] employ a CNN to verify text candidates. By incorporating the color prior into MSER and dual CNNs into text classification, our method exploits the characteristics of text and classifies text more confidently, which improves the detection performance. Compared with the FCN-based method proposed by Zhang et al. [13] in CVPR 2016, our Recall is about 1% higher but our Precision is lower. On the ICDAR 2013 database, our method achieves the best performance. However, benefiting from larger training data, end-to-end text detection networks inspired by general object detection networks adapt better to bigger and more complex datasets.

Methods            Recall(%)   Precision(%)   F-score(%)
Our method         42.08       55.74          48.91
Zhang et al. [13]  43.00       71.00          54.00
Zheng et al. [48]  39.53       61.68          48.18
Yin et al. [12]    32.11       49.59          38.98

Table 2: Experimental results on the ICDAR 2015 dataset.

There are two types of methods cited in this paper: texture analysis based methods and end-to-end network based ones. In relatively complex cases such as the ICDAR 2015 and COCO-Text datasets, end-to-end network based methods usually achieve better performance than texture analysis based ones. Zhang et al. [13] and Gupta et al. [34] employ networks to cope with single tasks in their cascading systems. Zhou et al. [36] and Shi et al. [46] propose multi-task networks that detect text through classification and box regression. All of these methods perform well on the ICDAR 2015 and COCO-Text datasets. In general, the networks in text detection approaches are annotation guided, which means that the networks output predictions close to the annotations provided. When a network is migrated to another language, it is necessary to prepare enough samples in that language to retrain the whole network. Our method is based on texture analysis with generalized assumption priors, so the training cost is lower. Thus, the advantages of our method include: (1) using fewer training samples to handle testing scenarios of moderate complexity; (2) easy transfer to new languages; (3) easy implementation with good performance.


Figure 10: Illustration of some detected true positives.


6.3. Evaluation of different impact factors

Another comparison experiment validates the improvement to MSER and the performance of the joint classifier; Table 3 shows the results. Note that T1 is our method with the color prior guided MSER, implemented by adding the color similarity map to the four commonly used channels (the gray image and the R, G and B channels) as a fifth channel, or even more. T2 is the same as T1 except that it does not employ the color similarity map. Compared with T1, T3 only uses the character classifier to validate grayscale candidate regions. The comparison between T1 and T2 shows that the introduction of the new channels improves Recall without decreasing Precision. Owing to the extra channels, more candidates are given the chance to be detected, and since the extra channels are generated under the guidance of the color prior, false positives are rarely introduced. As analyzed earlier, more character candidates make the combinations of words more complete, which slightly improves Precision. Compared with T1, T3 has lower Precision and similar Recall. This is because in most cases the MSER classifier and the character classifier both perform well, each with its own favorable situations, and the joint classifier can prune false positives better.




Experiments   Recall(%)   Precision(%)   F-score(%)
T1            82.89       89.07          85.87
T2            80.40       88.61          84.31
T3            82.68       83.02          82.85

Table 3: Evaluation of different impact factors on the performance of our method.

6.4. Speed comparison

Since we adopt a step-by-step process and all of the code except the CNNs runs on the CPU, our runtime is not superior to that of the end-to-end networks. Our method is run on an Ubuntu computer with a 3.2 GHz Intel(R) Core(TM) i5-3470 CPU, 8 GB RAM and an NVIDIA GeForce GTX 1080 GPU. On average, an image in ICDAR 2013 costs 0.5 s for color prior estimation and 0.7 s for candidate extraction and grouping at ratio 1, and less than 0.1 s at ratio 0.125. In this process the time consumed by the CNNs is negligible. For extremely complex or large images, the running time is still less than 5 s. The overall speed comparison on ICDAR 2013 is given in Table 4.



Zhou et al. [36], Shi et al. [46] and Liao et al. [37] design multi-task networks to handle text detection. Those networks only need one forward pass per image, with little pre-processing or post-processing. Benefiting from the advantages of graphics cards in parallel computing, these end-to-end single-shot text detection networks are superior in efficiency. Zhang et al. [13] also employ an FCN in their system to predict the centroid of each character; since they adopt a cascading structure, the post-processing steps are coded in MATLAB, which wastes a lot of time. Yin et al. [12] propose a texture analysis based method that is one of the classics in this area. Unfortunately, their code is not released, so we reimplemented their method in our computing environment but failed to achieve the efficiency reported in their paper (FPS: 2.3). Our method belongs to the same category as Yin's method in many procedures, but we clearly improve the accuracy with comparable efficiency.


Methods            FPS
Zhou et al. [36]   11.5
Shi et al. [46]    7.7
Liao et al. [37]   9.4
Zhang et al. [13]  0.47
Yin et al. [12]    1.2
Our method         0.79

Table 4: Overall speed comparison among different methods.

7. Conclusions and Future Work

In this paper, we present a novel character candidate extraction approach by introducing a color prior into the traditional MSER detector. To obtain an appropriate color prior, we segment edges under the guidance of edge intensity and run SWT on each local edge map so that a single character is handled at a time. Based on the positions of the strokes, we obtain the color information of the potential text. Supplying the color prior to the standard MSER ensures that enough character candidates are extracted. High Precision is ensured by hierarchical clustering and the fusion of task-specific CNNs. On the benchmark dataset, our system achieves state-of-the-art performance. However, our work can only handle English text; training CNNs for Chinese text to extend our system to Chinese text detection in natural scene images is our future work.

8. Acknowledgment

This work is supported by the National Key Research and Development Program of China (No. 2016QY01W0200), Natural Science Foundation of China (No. 61571354), Natural Science Basis Research Plan in Shaanxi Province of China (No. 2017JM6001), and National Natural Science Foundation of China (No. 61501349).

9. References

[1] S. S. Tsai, H. Chen, D. M. Chen, G. Schroth, R. Grzeszczuk, B. Girod, Mobile visual search on printed documents using text and low bit-rate features, in: 18th IEEE International Conference on Image Processing, ICIP 2011, Brussels, Belgium, September 11-14, 2011, 2011, pp. 2601– 2604. doi:10.1109/ICIP.2011.6116198. URL http://dx.doi.org/10.1109/ICIP.2011.6116198 [2] D. B. Barber, J. D. Redding, T. W. McLain, R. W. Beard, C. N. Taylor, Vision-based target geo-location using a fixed-wing miniature air vehicle, Journal of Intelligent and Robotic Systems 47 (4) (2006) 361–382. doi: 10.1007/s10846-006-9088-7. URL http://dx.doi.org/10.1007/s10846-006-9088-7 [3] Y. Zhu, G. Xu, D. J. Kriegman, A real-time approach to the spotting, representation, and recognition of hand gestures for human-computer interaction, Computer Vision and Image Understanding 85 (3) (2002) 189– 208. doi:10.1006/cviu.2002.0967. URL http://dx.doi.org/10.1006/cviu.2002.0967 [4] G. N. DeSouza, A. C. Kak, Vision for mobile robot navigation: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2) (2002) 237–267. doi: 10.1109/34.982903. URL http://dx.doi.org/10.1109/34.982903 [5] R. Smith, An overview of the tesseract OCR engine, in: 9th International Conference on Document Analysis and Recognition (ICDAR 2007), 23-26 September, Curitiba, Paran´a, Brazil, 2007, pp. 629–633. doi:10.1109/ICDAR.2007.56. URL http://doi.ieeecomputersociety.org/10.1109/ICDAR. 2007.56 [6] Y. Zhu, C. Yao, X. Bai, Scene text detection and recognition: recent advances and future trends, Frontiers of Computer Science 10 (1) (2016) 19–36. doi:10.1007/s11704-015-4488-0. URL http://dx.doi.org/10.1007/s11704-015-4488-0 [7] A. Bissacco, M. Cummins, Y. Netzer, H. Neven, Photoocr: Reading text in uncontrolled conditions, in: IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013, pp. 785–792. doi:10.1109/ICCV.2013.102. URL http://dx.doi.org/10.1109/ICCV.2013.102 [8] K. Wang, B. Babenko, S. J. Belongie, End-to-end scene text recognition, in: IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, 2011, pp. 1457–1464. doi:10.1109/ICCV.2011.6126402. URL http://dx.doi.org/10.1109/ICCV.2011.6126402 [9] J. Lee, P. Lee, S. Lee, A. L. Yuille, C. Koch, Adaboost for text detection in natural scene, in: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011, 2011, pp. 429–434. doi:10.1109/ICDAR.2011.93. URL http://dx.doi.org/10.1109/ICDAR.2011.93 [10] C. Yao, X. Bai, W. Liu, A unified framework for multioriented text detection and recognition, IEEE Trans. Image Processing 23 (11) (2014) 4737–4749. doi:10.1109/TIP.2014.2353813. URL https://doi.org/10.1109/TIP.2014.2353813 [11] B. Epshtein, E. Ofek, Y. Wexler, Detecting text in natural scenes with stroke width transform, in: The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, 2010, pp. 2963–2970. doi:10.1109/CVPR. 2010.5540041. URL http://dx.doi.org/10.1109/CVPR.2010.5540041 [12] X. Yin, X. Yin, K. Huang, H. Hao, Robust text detection in natural scene images, IEEE Trans. Pattern Anal. Mach. Intell. 36 (5) (2014) 970–983. doi:10.1109/TPAMI.2013.182. URL http://dx.doi.org/10.1109/TPAMI.2013.182 [13] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. 
Bai, Multi-oriented text detection with fully convolutional networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 4159–4167. doi:10.1109/CVPR.2016.451. URL http://doi.ieeecomputersociety.org/10.1109/CVPR. 2016.451 [14] S. Zhu, R. Zanibbi, A text detection system for natural scenes with convolutional feature learning and cascaded classification, in: 2016 IEEE

AN US

Methods Zhou et al. [36] Shi et al. [46] Liao et al. [37] Zhang et al. [13] Yin et al. [12] Our method

References

AC

CE

PT

ED

In this paper, we present a novel character candidate extraction approach by introducing color prior to the traditional MSER detector. To obtain appropriate color prior, we segment edges under the guidance of edge intensity and run SWT on local edge map to cope with single character. Depend on the position of stroke we can get the color information of the potential text. Utilizing color prior to standard MSER ensures extracting enough character candidates. High Precision is ensured by hierarchical clustering and fusion of specific task CNNs. On benchmark dataset, our system achieves the state-of-the-art performance. However, our work can only handle English detection. Then training CNNs for Chinese texts to extend our system to natural scene image Chinese text detection is our future work.

8. Acknowledgment This work is supported by the National Key Research and Development Program of China (No. 2016QY01W0200), Natural Science Foundation of China (No. 61571354), Natural Science Basis Research Plan in Shaanxi Province of China (No. 2017JM6001), National Natural Science Foundation of China (No. 61501349). 11




[28] W. Huang, Y. Qiao, X. Tang, Robust scene text detection with convolution neural network induced MSER trees, in: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 612, 2014, Proceedings, Part IV, 2014, pp. 497–511. doi:10.1007/ 978-3-319-10593-2_33. URL http://dx.doi.org/10.1007/978-3-319-10593-2_33 [29] C. Liu, M. Koga, H. Fujisawa, Lexicon-driven segmentation and recognition of handwritten character strings for japanese address reading, IEEE Trans. Pattern Anal. Mach. Intell. 24 (11) (2002) 1425–1437. doi: 10.1109/TPAMI.2002.1046151. URL http://dx.doi.org/10.1109/TPAMI.2002.1046151 [30] M. Jaderberg, A. Vedaldi, A. Zisserman, Deep features for text spotting, in: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV, 2014, pp. 512– 528. doi:10.1007/978-3-319-10593-2_34. URL http://dx.doi.org/10.1007/978-3-319-10593-2_34 [31] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Reading text in the wild with convolutional neural networks, International Journal of Computer Vision 116 (1) (2016) 1–20. doi:10.1007/ s11263-015-0823-z. URL http://dx.doi.org/10.1007/s11263-015-0823-z [32] P. Doll´ar, R. Appel, S. J. Belongie, P. Perona, Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1532–1545. doi:10.1109/TPAMI.2014.2300479. URL http://dx.doi.org/10.1109/TPAMI.2014.2300479 [33] T. He, W. Huang, Y. Qiao, J. Yao, Text-attentional convolutional neural network for scene text detection, IEEE Trans. Image Processing 25 (6) (2016) 2529–2541. doi:10.1109/TIP.2016.2547588. URL http://dx.doi.org/10.1109/TIP.2016.2547588 [34] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 2315–2324. doi:10.1109/CVPR.2016.254. URL http://dx.doi.org/10.1109/CVPR.2016.254 [35] Y. Liu, L. Jin, Deep matching prior network: Toward tighter multioriented text detection, CoRR abs/1703.01425. URL http://arxiv.org/abs/1703.01425 [36] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, EAST: an efficient and accurate scene text detector, CoRR abs/1704.03155. URL http://arxiv.org/abs/1704.03155 [37] M. Liao, B. Shi, X. Bai, X. Wang, W. Liu, Textboxes: A fast text detector with a single deep neural network, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017, pp. 4161–4167. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/ view/14202 [38] S. Tian, S. Lu, C. Li, Wetext: Scene text detection under weak supervision, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 1501–1509. doi: 10.1109/ICCV.2017.166. URL https://doi.org/10.1109/ICCV.2017.166 [39] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, X. Li, Single shot text detector with regional attention, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 3066– 3074. doi:10.1109/ICCV.2017.331. URL https://doi.org/10.1109/ICCV.2017.331 [40] P. Doll´ar, C. L. Zitnick, Fast edge detection using structured forests, IEEE Trans. Pattern Anal. Mach. Intell. 37 (8) (2015) 1558–1570. doi:10. 1109/TPAMI.2014.2377715. URL http://dx.doi.org/10.1109/TPAMI.2014.2377715 [41] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: A review, ACM Comput. Surv. 
31 (3) (1999) 264–323. doi:10.1145/331499.331504. URL http://doi.acm.org/10.1145/331499.331504 [42] K. Simonyan, A. Zisserman, Very deep convolutional networks for largescale image recognition, CoRR abs/1409.1556. URL http://arxiv.org/abs/1409.1556 [43] 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Nancy, France, August 23-26, 2015, IEEE Computer Society, 2015. URL http://ieeexplore.ieee.org/xpl/mostRecentIssue. jsp?punumber=7321714

AN US

[15]

Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 625–632. doi: 10.1109/CVPR.2016.74. URL http://dx.doi.org/10.1109/CVPR.2016.74 Q. Ye, D. S. Doermann, Text detection and recognition in imagery: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 37 (7) (2015) 1480–1500. doi:10.1109/TPAMI.2014.2366765. URL http://dx.doi.org/10.1109/TPAMI.2014.2366765 X. Yin, Z. Zuo, S. Tian, C. Liu, Text detection, tracking and recognition in video: A comprehensive survey, IEEE Trans. Image Processing 25 (6) (2016) 2752–2773. doi:10.1109/TIP.2016.2554321. URL http://dx.doi.org/10.1109/TIP.2016.2554321 L. Neumann, J. Matas, A method for text localization and recognition in real-world images, in: Computer Vision - ACCV 2010 - 10th Asian Conference on Computer Vision, Queenstown, New Zealand, November 8-12, 2010, Revised Selected Papers, Part III, 2010, pp. 770–783. doi: 10.1007/978-3-642-19318-7_60. URL http://dx.doi.org/10.1007/978-3-642-19318-7_60 C. Yao, X. Bai, W. Liu, Y. Ma, Z. Tu, Detecting texts of arbitrary orientations in natural images, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012, pp. 1083–1090. doi:10.1109/CVPR.2012.6247787. URL http://dx.doi.org/10.1109/CVPR.2012.6247787 W. Huang, Z. Lin, J. Yang, J. Wang, Text localization in natural images using stroke feature transform and text covariance descriptors, in: IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013, pp. 1241–1248. doi:10.1109/ICCV. 2013.157. URL http://dx.doi.org/10.1109/ICCV.2013.157 X. Yin, X. Yin, H. Hao, K. Iqbal, Effective text localization in natural scene images with mser, geometry-based grouping and adaboost, in: Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, 2012, pp. 725–728. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp? arnumber=6460237 J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions, in: Proceedings of the British Machine Vision Conference 2002, BMVC 2002, Cardiff, UK, 2-5 September 2002, 2002, pp. 1–10. doi:10.5244/C.16.36. URL http://dx.doi.org/10.5244/C.16.36 H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, B. Girod, Robust text detection in natural images with edge-enhanced maximally stable extremal regions, in: 18th IEEE International Conference on Image Processing, ICIP 2011, Brussels, Belgium, September 11-14, 2011, 2011, pp. 2609–2612. doi:10.1109/ICIP.2011.6116200. URL http://dx.doi.org/10.1109/ICIP.2011.6116200 J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6) (1986) 679–698. doi:10.1109/TPAMI. 1986.4767851. URL http://dx.doi.org/10.1109/TPAMI.1986.4767851 L. Neumann, J. Matas, Real-time lexicon-free scene text localization and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 38 (9) (2016) 1872– 1885. doi:10.1109/TPAMI.2015.2496234. URL http://dx.doi.org/10.1109/TPAMI.2015.2496234 D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almaz´an, L. de las Heras, ICDAR 2013 robust reading competition, in: 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, August 2528, 2013, 2013, pp. 1484–1493. doi:10.1109/ICDAR.2013.221. URL http://dx.doi.org/10.1109/ICDAR.2013.221 D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. 
Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, E. Valveny, ICDAR 2015 competition on robust reading, in: 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Nancy, France, August 23-26, 2015 [43], pp. 1156–1160. doi:10.1109/ICDAR.2015.7333942. URL http://dx.doi.org/10.1109/ICDAR.2015.7333942 X. Yin, W. Pei, J. Zhang, H. Hao, Multi-orientation scene text detection with adaptive clustering, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1930–1937. doi:10.1109/TPAMI.2014.2388210. URL http://dx.doi.org/10.1109/TPAMI.2014.2388210

[44] C. Wolf, J. Jolion, Object count/area graphs for the evaluation of object detection and segmentation algorithms, IJDAR 8 (4) (2006) 280–296. doi:10.1007/s10032-006-0014-0. URL https://doi.org/10.1007/s10032-006-0014-0
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03 - 07, 2014, 2014, pp. 675–678. doi:10.1145/2647868.2654889. URL http://doi.acm.org/10.1145/2647868.2654889
[46] B. Shi, X. Bai, S. J. Belongie, Detecting oriented text in natural images by linking segments, CoRR abs/1703.06520. URL http://arxiv.org/abs/1703.06520
[47] C. Tian, Y. Xia, X. Zhang, X. Gao, Natural scene text detection with MC-MR candidate extraction and coarse-to-fine filtering, Neurocomputing 260 (2017) 112–122. doi:10.1016/j.neucom.2017.03.078. URL https://doi.org/10.1016/j.neucom.2017.03.078
[48] Y. Zheng, Q. Li, J. Liu, H. Liu, G. Li, S. Zhang, A cascaded method for text detection in natural scene images, Neurocomputing 238 (2017) 307–315. doi:10.1016/j.neucom.2017.01.066. URL https://doi.org/10.1016/j.neucom.2017.01.066
[49] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, C. L. Tan, Text flow: A unified text detection system in natural scene images, in: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4651–4659. doi:10.1109/ICCV.2015.528. URL https://doi.org/10.1109/ICCV.2015.528
[50] Z. Zhang, W. Shen, C. Yao, X. Bai, Symmetry-based text line detection in natural scenes, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 2558–2567. doi:10.1109/CVPR.2015.7298871. URL https://doi.org/10.1109/CVPR.2015.7298871
[51] S. Lu, T. Chen, S. Tian, J. Lim, C. L. Tan, Scene text extraction based on edges and support vector regression, IJDAR 18 (2) (2015) 125–135. doi:10.1007/s10032-015-0237-z. URL https://doi.org/10.1007/s10032-015-0237-z
[52] H. Wu, B. Zou, Y. Zhao, Z. Chen, C. Zhu, J. Guo, Natural scene text detection by multi-scale adaptive color clustering and non-text filtering, Neurocomputing 214 (2016) 1011–1025. doi:10.1016/j.neucom.2016.07.016. URL https://doi.org/10.1016/j.neucom.2016.07.016

Chunna Tian received the B.S., M.S., and Ph.D. degrees from Xidian University, Xi'an, China, in 2002, 2005, and 2008, respectively. From 2006 to 2007, she was with the Visual Computing and Image Processing Lab of Oklahoma State University (OSU) as a visiting student. She is currently an Associate Professor with the School of Electronic Engineering, Xidian University, Xi'an, China. Her research interests include image processing, machine learning, and pattern recognition.


Xiangnan Zhang received the B.S. and M.S. degrees from Xidian University, Xi'an, China, in 2013 and 2015, respectively. He is currently a Ph.D. candidate with the School of Electronic Engineering, Xidian University, Xi'an, China. His research interests include computer vision and deep learning.

Xinbo Gao received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 1994, 1997, and 1999, respectively. From 1997 to 1998, he was a Research Fellow at the Department of Computer Science, Shizuoka University, Shizuoka, Japan. From 2000 to 2001, he was a Postdoctoral Research Fellow at the Department of Information Engineering, the Chinese University of Hong Kong, Hong Kong. Since 2001, he has been with the School of Electronic Engineering, Xidian University. He is currently a Cheung Kong Professor of the Ministry of Education, a Professor of Pattern Recognition and Intelligent Systems, and the Director of the State Key Laboratory of Integrated Services Networks, Xi'an, China. His current research interests include multimedia analysis, computer vision, pattern recognition, machine learning, and wireless communications. He has published six books and around 200 technical articles in refereed journals and proceedings. Prof. Gao is on the editorial boards of several journals, including Signal Processing (Elsevier) and Neurocomputing (Elsevier). He has served as General Chair/Co-Chair, Program Committee Chair/Co-Chair, or PC Member for around 30 major international conferences. He is a Fellow of the Institution of Engineering and Technology and a Fellow of the Chinese Institute of Electronics.