Accepted Manuscript
Detecting natural scenes text via auto image partition, two-stage grouping and two-layer classification
Anna Zhu, Guoyou Wang, Yangbo Dong
PII: S0167-8655(15)00171-3
DOI: 10.1016/j.patrec.2015.06.009
Reference: PATREC 6245
To appear in: Pattern Recognition Letters
Received date: 28 October 2014
Accepted date: 8 June 2015
Please cite this article as: Anna Zhu, Guoyou Wang, Yangbo Dong, Detecting natural scenes text via auto image partition, two-stage grouping and two-layer classification, Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.06.009
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
• We present a new system to detect text in natural scene images.
• The image is partitioned into unconstrained sub-images through the statistical distribution of sampling points.
• A two-stage grouping method and a two-layer classification mechanism are designed to group and classify candidate text regions.
Pattern Recognition Letters journal homepage: www.elsevier.com
Detecting natural scenes text via auto image partition, two-stage grouping and two-layer classification
Anna Zhu, Guoyou Wang∗∗, Yangbo Dong
State Key Lab for Multispectral Information Processing Technology, Huazhong University of Science and Technology, Wuhan 430074, China
ABSTRACT
Text detection in natural scene images is an important and challenging task in image analysis. In this paper, we present a robust system to detect natural scene text according to text region appearances. The framework includes three parts: auto image partition, two-stage grouping and two-layer classification. The first part partitions images into unconstrained sub-images through the statistical distribution of sampling points. The designed two-stage grouping method performs grouping within each sub-image in the first stage and connects components across the partitioned image regions in the second stage, grouping connected components (CCs) into text regions. A two-layer classification mechanism is then designed for classifying candidate text regions: the first layer computes a similarity score of region blocks, and the second layer is an SVM classifier using HOG features. We add a normalization step to rectify perspective distortion before candidate text region classification, which improves the accuracy and robustness of the final output. The proposed system is evaluated on four types of datasets, including two ICDAR Robust Reading Competition datasets, a born-digital image dataset, a video image dataset and a perspective distortion image dataset. The experimental results demonstrate that our proposed framework outperforms state-of-the-art localization algorithms and is robust in dealing with multiple background outliers. © 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Nowadays, more and more people use handheld devices equipped with high-resolution cameras to record scenes in their daily life. Information analysis of these scene images has attracted much attention in the computer vision field. Among this information, text is intuitive and plays a vital role in various applications such as content-based image retrieval, scene understanding and vision-based navigation. Due to its immense potential for commercial applications and its applicability to human-computer interaction, research on text extraction is being pursued both in academia and industry. Generally, text extraction is decomposed into three stages: text detection, text binarization and text recognition (OCR). They answer the questions "where is the text?", "what is the text?" and "what is the text content?" respectively. In this paper, we focus on the text detection problem. However, detecting text from natural scene images faces many difficulties (Jung et al., 2004), for instance cluttered backgrounds, geometrical distortion, and varied text orientations and appearances. Much
∗∗ Corresponding author: Tel.: +86-186-2770-8137; e-mail: [email protected] (Guoyou Wang)
work in this area has been done, including yearly competitions (Shahab et al., 2011; Karatzas et al., 2013) and new algorithms (Jung et al., 2004; Zhang et al., 2013), to improve accuracy and reduce computational complexity. Text detection algorithms can be roughly classified into two categories: one finds text component groups, the other finds text blocks. The former uses distinct geometric and/or color features at the CC level to find candidate text (or character) components. This category can be viewed as content-based image partition, grouping spatial pixels into connected components conditionally. Usually, geometrical analysis is applied to separate text components from non-text components, and the remaining CCs with similar properties are gathered to form text regions. Such methods may use local gradient features and uniform colors of text characters (Yi and Tian, 2011), the maximally stable extremal region (MSER) algorithm (Koo and Kim, 2013), intensity histogram based and shape filters (Liu and Sarkar, 2008), the stroke width transform (Epshtein et al., 2010), K-means clustering (Shivakumara et al., 2011) and a mathematical morphology based method for multilingual text detection (Liu et al., 2008), among others. The detected text components of this category can be used directly for text recognition.
Fig. 2. The flowchart of our proposed method.
However, such methods fail when text components are not homogeneous, and their processing time increases with image complexity since more CCs must be co-analyzed. The alternative category attempts to classify patches as text or non-text by texture analysis and then merges neighboring text patches to generate text blocks. Usually, such methods use a multi-scale strategy, partition images into patches with sliding windows, and then analyze region features such as wavelet decomposition coefficients at different scales (Ye et al., 2005), gradient-based maps and histograms of block patterns (Li et al., 2008), or different gradient edge features (mean, standard deviation, energy, entropy, inertia and local homogeneity) of image blocks (Shivakumara et al., 2010), with classification tools such as SVM, AdaBoost or ANN. These methods capture the interrelationship within text and treat the character string as a whole, which allows accurate detection even in noisy images. However, they are relatively slow because the candidate regions are not content based: they are generated from densely sampled pixels or at fixed steps over multiple scales and sizes. Their performance also degrades for non-horizontally aligned text. To overcome these difficulties and take advantage of both categories, hybrid approaches have been proposed. In Pan et al. (2011), a region-based method first estimates text confidence and scale information in an image pyramid; a conditional random field (CRF) model is then applied to filter out non-text components. Huang et al. (2014) combine MSERs with sliding-window methods, using a convolutional neural network (CNN) to separate text components from text-like outliers after the MSER operator, which dramatically reduces the number of windows scanned and enhances detection of low-quality text. Such hybrid methods first roughly select candidate text regions with one category of techniques and then use other strategies to confirm them and filter out non-text regions. In this paper, we use a different way to detect text, depending on the appearance of the detected candidate text regions. As shown in Fig. 1, natural scene text appears in different forms: single character regions, connected character regions or individual character regions. Single character detection is closer to character recognition, so we do not consider this type. For the other two forms, we design a two-layer filtering mechanism to classify text and non-text regions. Our proposed method is CC-based and uses both the intrinsic characteristics of text and the individual properties of characters, which allows it to effectively detect text with different fonts, sizes, colors, and with varying color and shape of the attachment surface.
The framework is a coarse-to-fine process, starting from the whole image, moving to individual parts, and then to integration. From the view of granulation (Yao, 2007), it involves processes in two directions: decomposition and construction. Decomposition divides a larger granule into smaller, lower-level granules; it is a top-down process. Construction forms a larger, higher-level granule from smaller, lower-level sub-granules; it is a bottom-up process. Here text is considered the larger granule and CCs are the lower-level granules. We consider text as an integral structure, which is the opposite of the MSER view. We then analyze the features of each connected component in the extracted sub-regions and group the candidate characters into text strings. Finally, a text candidate verification step refines the detected text regions. The main contributions of our framework are as follows. (1) Besides analyzing individual character features, text string structure can also be used for image partition. The rectilinear characteristic of text strings and the parallel-edge characteristic of characters are employed to form the location and gray-level distribution of representative points. Based on this distribution, an image is partitioned into several sub-images, each containing the horizontal position and color information of a potential text region. (2) The proposed features for unary component classification are valid for both individual character components and joined text string components. We also use a two-stage grouping method mixed with region-based analysis to locate candidate text regions, which can detect multi-polarity text. (3) Most previous CC-based methods group only individual characters and ignore the fact that multiple characters may be joined into a single connected component. In this paper, we propose three types of text region appearance and analyze them individually. (4) Characters in a text string normally differ from one another, and this characteristic can be used to compare the similarity of regions. The first layer of classification uses a similarity score based on this characteristic, which filters out most repetitive backgrounds. The rest of this paper is organized as follows. Section 2 details the proposed method. Experiments and results are presented in Section 3 and conclusions are drawn in Section 4.
Fig. 1. Examples of different text appearances in natural scene images.
2. The proposed method
In our method, we choose the Y-Cr-Cb space and the a, b channels of the L-a-b color space to find text regions, since these channels show the best performance in our experiments, as shown in Section 3. Our method contains three parts: auto image partition, two-stage grouping and two-layer classification. The flowchart of the proposed approach is depicted in Fig. 2.

2.1. Auto image partition

In this stage, a natural scene image is partitioned into several sub-images without predefined parameters. We use a trained ANN classifier to classify the CCs in each sub-image.

2.1.1. Combined edge map from multi-channel

As parallel edges are among the most distinctive features of text, we first extract edges in the image.
Unlike standard Canny detection performed on a single channel or on the gray-scale image, we modify the Canny edge detector (Canny, 1986) to obtain an improved edge map from multiple channels. We convolve the five channel images with a Gaussian filter sequentially and calculate the intensity gradient and direction of each pixel in these images separately. Then, by searching each pixel's maximum intensity gradient among the five channel images, we compose a single max-gradient map and record the corresponding direction in an orientation map. After that, non-maximum suppression is applied to the max-gradient map along the gradient direction, and two thresholds are used to find edges. In our system, we set the low threshold to 0.05 and the high threshold to 0.2. The results in Fig. 3 demonstrate that our method preserves the integrity of character edges after combining the five channels and performs better than single-channel edge detection.
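As a rough illustration of this step, the following Python sketch (our illustration, not the authors' code) composes the per-pixel maximum-gradient and orientation maps from the five channels. The Gaussian kernel size, the use of Sobel derivatives and the OpenCV calls are assumptions; standard Canny non-maximum suppression and hysteresis (thresholds 0.05 and 0.2) would then be applied to max_mag and orient.

import cv2
import numpy as np

def multi_channel_gradient(bgr):
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2Lab)
    channels = [ycrcb[:, :, 0], ycrcb[:, :, 1], ycrcb[:, :, 2],
                lab[:, :, 1], lab[:, :, 2]]
    max_mag = np.zeros(bgr.shape[:2], np.float32)
    orient = np.zeros(bgr.shape[:2], np.float32)
    for ch in channels:
        smooth = cv2.GaussianBlur(ch.astype(np.float32), (5, 5), 1.0)
        gx = cv2.Sobel(smooth, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(smooth, cv2.CV_32F, 0, 1, ksize=3)
        mag = np.hypot(gx, gy)
        mask = mag > max_mag                      # keep the strongest response per pixel
        max_mag[mask] = mag[mask]
        orient[mask] = np.arctan2(gy, gx)[mask]   # record the matching gradient direction
    return max_mag, orient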
Fig. 3. Comparison of the Canny edge detector in a single channel and in multiple channels. (a) Examples of original scene images. (b)(c)(d) Edge maps detected by Canny in the gray-scale image, the b channel and the Y channel. (e) Edge maps from our proposed multi-channel edge detection method.
Fig. 4. (a) The processed original images in Y space with representative points marked in green. (b) Distribution of gray level and horizontal position. (c) Clusters for image partition.
2.1.2. Cluster based image partition

In this step, we partition the image into several layers by assigning pixels with similar intensity and spatial position to the same layer. In each channel, instead of counting all the pixels in an image, we sample representative points for further processing. We follow the steps below to find the representative points (RPs). (1) We scan each row and extract the points lying midway between two neighboring edge points whose orientations belong to [π/3, π/2] or [−π/2, −π/3]. These points are marked as Gi = {g1i, g2i, ..., gni}, where gni is the intensity of the nth extracted point in the ith row. (2) If the intensity difference of two neighboring points in Gi is less than Tg, that is |gpi − g(p+1)i| < Tg, we mark g(p+1)i as a non-representative point; such points are deleted after all points in Gi have been scanned. Here, we set Tg to 10. (3) If a point in Gi has an intensity difference with the other points in Gi greater than 5, it is also deleted. The remaining points are the RPs. Examples are shown in Fig. 4(a), with RPs marked as green dots. We count the intensity distribution of the RPs in each row. The RPs are obtained from the ridges of vertical parallel edges. As text regions have more complex texture features than smooth invariant regions, most text ridge points are contained in the RPs.
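The following sketch (illustrative only, not the authors' implementation) mirrors steps (1)-(3). The hypothetical helper representative_points assumes a boolean edge map and an orientation map, e.g. from the multi-channel detector sketched above, and returns (row, column) positions of the RPs.

import numpy as np

def representative_points(gray, edge_map, orient, t_g=10, t_o=5):
    # edge_map: boolean edge mask; orient: gradient orientation in radians.
    o = np.mod(orient, np.pi)
    near_vertical = edge_map & (np.abs(o - np.pi / 2) <= np.pi / 6)  # step (1): near-vertical edges
    rps = []
    for y in range(gray.shape[0]):
        xs = np.flatnonzero(near_vertical[y])
        mids = [(int(x0) + int(x1)) // 2 for x0, x1 in zip(xs[:-1], xs[1:])]
        vals = [int(gray[y, m]) for m in mids]
        kept = [m for i, m in enumerate(mids)
                if i == 0 or abs(vals[i] - vals[i - 1]) >= t_g]      # step (2): drop near-duplicate neighbours
        kvals = [int(gray[y, m]) for m in kept]
        kept = [m for m, v in zip(kept, kvals)
                if all(abs(v - u) <= t_o for u in kvals)]            # step (3): drop intensity outliers
        rps.extend((y, m) for m in kept)
    return rps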
The statistical analysis of the training text image dataset (Chen and Yuille, 2004) shows that the x derivatives tend to be larger in the central (i.e., text) region and smaller in the regions above and below the text. Based on this fact, we extract RPs between two vertical-direction edges. The RPs in text regions have similar intensities and close spatial positions, so they form clusters, while RPs in background regions without text disperse over different intensities. Based on this characteristic, we search the distribution map with an 11×11 window to find the location and intensity of potential text: if there is no other point in the window besides the centered one, the center point is set to 0. Fig. 4(b) shows the RPs' intensity and position distribution for the images in Fig. 4(a). In order to compute the number of partitioned sub-images, we group the discrete RPs on the distribution map into clusters and generate a binary cluster image for display. In each row, if the difference between two RPs' gray values is less than 5, the points on the route between them are marked as 1; the same operation is performed in each column. The left and right boundary values of each cluster give the intensity range of potential text, while the cluster's top and bottom boundary values may not be the real boundaries of the potential text region. This arises because not all ridges of vertical strokes are extracted as RPs. For example, for a text consisting of an uppercase letter 'I' and a lowercase letter 's', the ridge points in the top part of 'I' are not considered RPs, so the top and bottom boundaries of its cluster are not the real boundaries of the text 'Is'. Therefore, we rebuild the distribution of all ridge points of vertical parallel edges, then separately search neighboring points vertically from the top and bottom boundaries of each original cluster to find the new boundaries.
If two clusters have a similar color range and are vertically close to each other, they are connected in the vertical direction. If the height of a cluster is less than 15 or its width exceeds 100, no sub-image is partitioned for it. The final number of clusters is the number of partitioned sub-images. The partitioned sub-images do not have the full size of the original image; each has the original image's width and the corresponding cluster's height. The clusters of the images in Fig. 4(a) are shown in Fig. 4(c); both images are partitioned into six parts where text may exist. The partitioned regions are composed of many connected components constrained by the different clusters. Fig. 5 shows the partition results for the images in Fig. 4(a).
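A small sketch of the distribution-map clustering follows. It assumes dist_map is the binary (row position × gray level) occupancy map of the RPs described above; isolated points are suppressed with the 11×11 window and, as a simplification, only row-wise bridging of gray-level gaps below 5 is shown before connected clusters are labeled.

import cv2
import numpy as np

def cluster_distribution(dist_map):
    occ = dist_map.astype(np.float32)
    counts = cv2.boxFilter(occ, -1, (11, 11), normalize=False) - occ  # other RPs in each 11x11 window
    kept = dist_map & (counts > 0)                                     # drop isolated RPs
    bridged = kept.copy()
    for r in range(kept.shape[0]):                                     # bridge gray-level gaps < 5 per row
        cols = np.flatnonzero(kept[r])
        for c0, c1 in zip(cols[:-1], cols[1:]):
            if c1 - c0 < 5:
                bridged[r, c0:c1 + 1] = True
    n, labels = cv2.connectedComponents(bridged.astype(np.uint8))      # clusters = candidate sub-image bands
    return labels, n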
Fig. 5. Partitioned sub-images. (a) Six image partitions of the top original image in Fig. 4(a). (b) Six image partitions of the bottom original image in Fig. 4(a).

2.1.3. Connected components analysis

In this stage, we analyze the CCs in each sub-image. The CCs can be classified into three classes: individual characters, connected text strings and non-text components. An 8-5-2 three-layer artificial neural network (ANN) classifier is chosen for its good trade-off between effectiveness and efficiency (Bishop et al., 1995). To characterize each component's geometric and textural properties, we use eight types of unary component features that are easy and efficient to compute. The definitions of these features are summarized in Table 1.

Table 1. Component feature definitions
Height: h(c) ≥ H/3
Aspect ratio: AR = h(c)/w(c)
Edge occupation ratio: EOR = E(c)/L, L = min{w(c), h(c)}
Contour compactness: CC = w(c)h(c)/C(c)
Edge pair score: EPS = EP(c)/E(c)
Occupation ratio: OR = P(c)/(w(c)×h(c))
Stroke width variance: SWV = Σ(sw − SW)²/EP(c), SW = argmax hism(sw) over the stroke width set
Maximum stroke width ratio: MSWR = C(SW)/P(c), SW ∈ [0.7SW, 1.3SW]
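For illustration, the sketch below (an assumption-laden example, not the authors' code) computes three of the unary features in Table 1 from a binary mask of a single connected component.

import cv2
import numpy as np

def unary_features(cc_mask):
    ys, xs = np.nonzero(cc_mask)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    # findContours returns (contours, hierarchy) in OpenCV 4 and
    # (image, contours, hierarchy) in OpenCV 3; [-2] selects contours in both.
    contours = cv2.findContours(cc_mask.astype(np.uint8),
                                cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]
    contour_len = sum(len(c) for c in contours)                # outer contour pixel count C(c)
    return {
        "aspect_ratio": h / w,                                 # AR = h(c)/w(c)
        "occupation_ratio": len(xs) / float(w * h),            # OR = P(c)/(w(c) x h(c))
        "contour_compactness": (w * h) / float(contour_len),   # CC = w(c)h(c)/C(c)
    }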
–Height. The height h(c) of the component should not be much smaller than the height H of its sub-image.
–Aspect ratio. The ratio of the height h(c) to the width w(c) of the component's bounding box.
–Edge occupation ratio. The ratio between the total number of edge pixels E(c) and the length L of the component.
–Contour compactness. The ratio between the bounding box area and the number C(c) of outer contour pixels of the component.
–Edge pair score. Utilizing the parallel-edge characteristic of text components, we count the number of edge points meeting the following conditions: searching along the gradient direction θ reaches a first edge point whose gradient direction lies in [θ − π/12, θ + π/12], and the number of component pixels on the path, Lsum, is larger than half of the path length. An example is shown in Fig. 6(a) and Fig. 6(b).
Fig. 6. Component feature description. (a) Two-layer components of the character "E". Blue points are edges; green and red lines are the paths of satisfied and unsatisfied edge pairs, respectively. The counted edge pairs are marked green and the unsatisfied edge pairs red in (b). (c) The histogram used to find the stroke width. Based on the stroke width, the maximum stroke width ratio is illustrated in (d). See text for details.
The green points are satisfied edge points and the red ones are not. EP(c) is the number of satisfied edge points and E(c) is the total number of edge points in a component.
–Occupation ratio. P(c) is the number of component pixels and w(c) × h(c) is the area of the component's bounding box.
–Stroke width variance. This feature is the standard deviation of the component's stroke widths. When counting the number of edge pairs, we also let these edge pairs vote on width to form a histogram hism(sw), as shown in Fig. 6(c). The width with the maximum vote is the stroke width of the component.
–Maximum stroke width ratio. We consider SW as the component's stroke width range. The idea of the stroke width transform (Epshtein et al., 2010) is used to count the number C(SW) of recovered pixels in the component whose stroke width lies in SW; these account for a large proportion of the component if it is a character. The feature is the ratio between C(SW) and the size of the component (MSWR = blue area / (blue area + white area) in Fig. 6(d)).
After classification, we group the character candidates and text string candidates if they have similar geometric properties, stroke widths or colors.

2.2. Two-stage grouping
A two-stage grouping strategy is used to group text components into text regions. The first-stage grouping is implemented in each sub-image. Then the grouping results from all sub-images of a certain channel are integrated and regrouped in the second stage.

2.2.1. First stage grouping

In this step, candidate characters are joined into text string regions according to criteria of geometric distance and spatial position. These criteria consider the top, bottom, left and right boundary coordinates XxT, XxB, XxL and XxR of each candidate character block, as well as the coordinates (Xx, Yx) of its center (see Fig. 7).
In this paper, we focus on detecting aligned text strings: position-adjacent text characters satisfying Equ. 1 are grouped.

|Yi − Yj| < Th1,   |XiT − XjT| < Th2,   |XiB − XjB| < Th2        (1)
where Th1 and Th2 are parameters of the grouping module: Th1 is a threshold on the height difference of the two CCs' centers and Th2 is a threshold on the top and bottom boundary differences of the two CCs. We set Th1 = 0.3 × min{h(ci), h(cj)} and Th2 = 0.7 × min{h(ci), h(cj)}. These criteria are applied to all pairs of adjacent candidate characters. Using the transitivity of all grouped pairs, we define a horizontal spatial distance to group the pairs into text lines. Based on the fact that the interval distances between neighboring characters in a text line are almost the same, two grouped pairs sharing one element and satisfying Equ. 2 are grouped.

|(XRL − XiR) − (XiL − XLR)| < Tl        (2)
XR denotes the candidate character block on the right-neighbor side and XL the one on the left. Tl is the allowable deviation between the left and right distance differences of a character, and we set it to 0.5 × min{w(ci), w(left block of ci), w(right block of ci)}. The coordinates of the grouped regions' bounding boxes are recorded, as well as the mean neighboring-character distance Md and the mean stroke width Msk of the grouped regions. Since characters may merge into a text string connected component and exist as an isolated region, we keep these isolated regions.
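A minimal sketch of the pairwise tests of Equ. 1 and Equ. 2 follows. Blocks are assumed to be simple dictionaries with bounding-box fields; all field names are illustrative, not from the paper.

def same_line(bi, bj):
    th1 = 0.3 * min(bi["h"], bj["h"])
    th2 = 0.7 * min(bi["h"], bj["h"])
    return (abs(bi["cy"] - bj["cy"]) < th1 and
            abs(bi["top"] - bj["top"]) < th2 and
            abs(bi["bottom"] - bj["bottom"]) < th2)            # Equ. 1

def similar_spacing(left, mid, right):
    tl = 0.5 * min(mid["w"], left["w"], right["w"])
    gap_left = mid["left"] - left["right"]
    gap_right = right["left"] - mid["right"]
    return abs(gap_right - gap_left) < tl                      # Equ. 2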
Fig. 7. Geometric parameters used for character grouping.
Fig. 8. (a) Three text region appearances. (b) Normalized results.
2.2.2. Second stage grouping

After the first-stage grouping, regions can be classified into three kinds: grouped regions, isolated character regions and isolated text string regions (denoted GR, ICR and ITR respectively). GR are the regions grouped from components in the first stage, ICR represents individual characters, and ITR are the joined characters classified by the ANN. To handle multi-polarity text (Li et al., 2008), i.e., text with multiple colors or intensities in the same line, we overlap all sub-images of a certain color space onto one map. If two adjacent regions ri, rj originally appear in different sub-images and satisfy Equ. 1, a further check is made. The possible cases of the two regions ri, rj are listed below with the corresponding grouping condition (C).
(1) ri ∈ GR, rj ∈ ICR; C: |Msk(ri) − SW(rj)| < Tsk and |Rd(ri, rj) − Md(ri)| < Td;
(2) ri ∈ GR, rj ∈ GR; C: |Msk(ri) − Msk(rj)| < Tsk and |Md(ri) − Md(rj)| < Td and |Rd(ri, rj) − Md(ri)| < Td (and likewise for Md(rj));
(3) ri ∈ ICR, rj ∈ ICR; C: |SW(ri) − SW(rj)| < Tsk;
(4) ri ∈ ITR, rj ∈ ITR; C: |SW(ri) − SW(rj)| < Tsk and Rd(ri, rj) < Td (here Md(ri) and Md(rj) are treated as 0);
(5) ri ∈ ITR, rj ∈ ICR; C: the same as case (4);
(6) ri ∈ ITR, rj ∈ GR; C: the same as case (1).
Here Rd(ri, rj) is the spacing distance, namely the nearest left-right boundary distance between the text blocks. Tsk and Td are the thresholds on the stroke width difference and the region distance difference for grouping, and we set Tsk = 4 and Td = 10. For every region pair from different origin sub-images, we identify its case above; if the corresponding condition is satisfied, the two regions are grouped into a new region in the second stage. Candidate characters that are not grouped and belong to ICR are discarded; this normally eliminates a large fraction of the false positives. The grouped regions are considered candidate text regions.
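The case-based test can be sketched as below (an illustration under our own naming, not the paper's code); each region is assumed to carry its class, mean stroke width and mean character spacing.

T_SK, T_D = 4, 10

def can_merge(ri, rj, spacing):
    sw_ok = abs(ri["stroke_width"] - rj["stroke_width"]) < T_SK
    kinds = {ri["kind"], rj["kind"]}
    if kinds == {"ICR"}:                      # case (3): stroke width only
        return sw_ok
    if "GR" in kinds:                         # cases (1), (2), (6): spacing compared with mean distance
        md = max(r["mean_dist"] for r in (ri, rj) if r["kind"] == "GR")
        return sw_ok and abs(spacing - md) < T_D
    return sw_ok and spacing < T_D            # cases (4), (5): ITR involved, M_d treated as 0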
2.3. Text/non-text region verification
In this stage, we address the text/non-text classification of candidate text regions. Text verification based on pattern classification has been shown to be powerful because it describes the intrinsic properties of text. To accomplish this, the candidate text regions are first normalized and then classified by a two-layer classification module consisting of a similarity-score filter and an SVM classifier.

2.3.1. Candidate text region normalization

Due to perspective distortion, not all CCs in a candidate text region are strictly arranged in horizontal alignment. For reliable text/non-text classification in the next step, a perspective transformation is performed for geometric normalization. In general, the projective transformation is represented by a 3 × 3 homogeneous matrix H with eight degrees of freedom (dof) (Liang, 2006). Myers et al. (2005) point out that only three dof are required from the OCR point of view: the shearing angle and two perspective foreshortenings. Equivalently, the problem is to find the locations of two vanishing points in the image plane.
The vanishing point VH is obtained by finding the text's top and bottom base lines and their intersection. We use a voting method to find the top and bottom base lines of a text block. According to the class of the CCs inside, the appearance of a candidate text region is classified into three types: connected character regions, individual character regions and hybrid text regions, as shown in Fig. 8(a). For an individual character region, we collect a set TP consisting of the top-most points of all character blocks. The slope of a line connecting any two points in TP is discretized into one of 30 levels in [−π/6, π/6], and the lines with the most-voted slope are kept. Since a base point and a slope define a line, a base point must also be chosen from TP to obtain the top base line. We vote for the points in TP by counting the endpoints of the remaining lines; the point with the highest vote is set as the base point, and ties are broken randomly. With the voted slope and base point, we draw the top base line of the text block. The same process is applied to the bottom-most point set BP to find the bottom base line. For connected character regions and hybrid text regions, accurately segmenting characters is difficult and complex (Casey and Lecolinet, 1996); therefore we use a sliding window to segment the CCs and find TP and BP. First, we calculate the absolute difference between the top-most and bottom-most points of each column and sort these values in ascending order. The value at the 3/4 position is taken as the window width, half of it as the step size, and the window height equals the text block height. TP and BP are collected within the window at each move, and the same voting method yields the two base lines. The shearing angle is estimated with the method of (Neumann and Matas, 2011): the text block is rotated to make the bottom base line horizontal, then sheared from −45 to 45 degrees in steps of 5 degrees while measuring the sum of squares of the pixel counts in each column; the shear with the highest value is taken as the result. The normalized result for the binary text block and the original image region is shown in Fig. 8(b).
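A rough sketch of the top base line voting (illustrative only; the bottom base line is handled symmetrically):

import itertools
import math
from collections import Counter

def top_base_line(top_points):                   # top_points: list of (x, y) tuples
    slope_votes, lines = Counter(), []
    for (x0, y0), (x1, y1) in itertools.combinations(top_points, 2):
        if x0 == x1:
            continue
        angle = math.atan2(y1 - y0, x1 - x0)
        if abs(angle) > math.pi / 6:
            continue
        level = min(29, int((angle + math.pi / 6) / (math.pi / 3) * 30))  # 30 slope levels
        slope_votes[level] += 1
        lines.append((level, (x0, y0), (x1, y1)))
    if not slope_votes:
        return None
    best_level = slope_votes.most_common(1)[0][0]
    endpoint_votes = Counter(p for lvl, p0, p1 in lines if lvl == best_level for p in (p0, p1))
    base_point = endpoint_votes.most_common(1)[0][0]                     # most frequent endpoint
    slope_angle = (best_level + 0.5) / 30 * (math.pi / 3) - math.pi / 6
    return base_point, math.tan(slope_angle)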
2.3.2. Two-layer text/non-text classification

First layer classification. The design of the first layer is inspired by similarity-expert learning in the character recognition problem (Smith et al., 2011).
Fig. 9. (a) Computing the Sscore of individual character regions for a text block and a background block. (b) Computing the Sscore of connected character regions.

Fig. 10. Five blocks for extracting HOG features.
Instead of using an SVM trained on SIFT feature differences between character pairs, we only use subtraction and addition operations on binary text regions to compute a similarity score Sscore. For individual character regions, Sscore is obtained from pairs of character blocks. First, we standardize the sizes of all character blocks by taking the maximum of their widths and heights and resizing them to square images of that size. Next, we add and subtract any two standardized blocks to count the overlapping and eliminated pixels (denoted Aij and Sij). The similarity score of a pair of character blocks is the ratio Sscore = Aij/Sij; an example is shown in Fig. 9(a). We accumulate the similarity scores of all CC pairs and compute the mean value. However, accurately segmenting characters from connected character regions and hybrid text regions is challenging, so we decompose these regions into several levels, as shown in Fig. 9(b): at each level the blocks of the upper level are split into two equal-size blocks, and the decomposition stops when the block width is less than its height. The integrated similarity score is computed at each level with the procedure described above, and the highest one is taken as the region's integrated similarity score. If the integrated similarity score Sscore is larger than a threshold τ, the region is considered background and rejected. The determination of τ is discussed in Section 3.
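The pairwise score can be sketched as follows (a minimal illustration, assuming binary character masks; resizing with nearest-neighbor interpolation is our assumption):

import cv2
import numpy as np

def similarity_score(block_a, block_b):
    side = max(block_a.shape + block_b.shape)                  # common square size
    a = cv2.resize(block_a.astype(np.uint8), (side, side), interpolation=cv2.INTER_NEAREST)
    b = cv2.resize(block_b.astype(np.uint8), (side, side), interpolation=cv2.INTER_NEAREST)
    overlap = int(np.count_nonzero(a & b))                     # A_ij: pixels shared by both blocks
    eliminated = int(np.count_nonzero(a ^ b))                  # S_ij: pixels removed by subtraction
    return overlap / eliminated if eliminated else float("inf")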
Second layer classification. The second layer is a support vector machine (SVM) classifier using a gradient-based HOG descriptor. We adopt the idea in (Jung et al., 2009) to compute the SVM score: a sliding window decomposes the candidate region into several samples and all their SVM scores are fused for classification. First, the text block in the original color space is converted to gray scale, since the human visual system is mainly sensitive to the brightness channel when recognizing character shapes (Fairchild, 2013).

Size normalization. To handle the varying sizes of text candidate regions, we normalize each candidate to a height of 30 pixels; its width is resized proportionally and then scaled to the nearest multiple of 30 pixels (30 × n, n ≥ 2). Both the gray-scale text candidate region and the binarized one obtained from the grouping result are normalized.
Then we use a sliding 30 × 60 window, with each window overlapping the previous one by 30 pixels, to generate several samples. These samples are classified by a trained SVM classifier and the output results are averaged to determine the final decision.

Feature extraction. The HOG descriptor is widely used in shape recognition, e.g., for humans and other solid objects (Dalal and Triggs, 2005). In this paper we use an improved HOG descriptor with a conditioned gradient-based feature. We perform a distance transformation to obtain the distance map D, and based on it the gradient is computed by Equ. 3:

∇I(i, j) = 1/2 (I(i+1, j) − I(i−1, j), I(i, j+1) − I(i, j−1)) if D(i, j) ≤ 3; no operation otherwise.        (3)
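The conditioned gradient of Equ. 3 might look as follows (a sketch under our own assumptions: contours are taken from a Canny map of the binarized block, and the OpenCV distance transform gates where central differences are evaluated):

import cv2
import numpy as np

def conditioned_gradient(gray, binary_text, max_dist=3):
    edges = cv2.Canny(binary_text.astype(np.uint8) * 255, 50, 150)
    dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)     # distance to nearest contour pixel
    g = gray.astype(np.float32)
    gx = 0.5 * (np.roll(g, -1, axis=1) - np.roll(g, 1, axis=1))   # central differences (borders wrap)
    gy = 0.5 * (np.roll(g, -1, axis=0) - np.roll(g, 1, axis=0))
    near = dist <= max_dist                                        # only pixels within 3 px of an edge
    mag = np.where(near, np.hypot(gx, gy), 0.0)
    theta = np.where(near, np.mod(np.arctan2(gy, gx), 2 * np.pi), 0.0)  # orientation in [0, 2*pi)
    return mag, theta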
Large gradients arise close to edges, so we only compute gradients of pixels near the contours. The gradient orientation θ∇I(i, j) and magnitude ρ∇I(i, j) are obtained from the valid ∇I(i, j); θ∇I(i, j) is defined in the range [0, 2π]. Minetto et al. (2013) point out that the gradient distribution is largely independent of horizontal position, so a single cell in the horizontal direction performs better. Hence we design five blocks and mainly use horizontal cuts (as shown in Fig. 10). The HOG in each cell has 12 bins, each π/6 radians wide and centered at orientation k × π/6 for k = 0, 1, ..., 11. For feature normalization, the final descriptor is divided by the sum of all features plus a constant η. The input to the SVM is a 15 × 12 normalized HOG feature vector φ(x).

SVM score fusion. The SVM output score is the sum of weighted kernel distances between the test sample and the support vectors, as given in Equ. 4:
f(x) = Σ_{i=1}^{N} αi K(Φ(x), Φ(xi)) + b        (4)
where K is the kernel (Picard and Gosselin, 2011) applied to the dot product of the input feature vector and the N fixed support vectors Φ(xi), αi are real weights and b is the bias; both are obtained in the training step. The SVM output scores of the decomposed samples are averaged with Gaussian weights as in (Jung et al., 2009), and the fused SVM score is the final score of the candidate region. If it is larger than a certain threshold, the region is regarded as a text region; otherwise it is regarded as a false alarm. Experimentally, we set the threshold to −2.8 to balance the false acceptance and false rejection rates.

2.3.3. Final detected result determination

After classification, the regions from all channels are amalgamated onto one map to discriminate the candidate regions of the different channels as text or non-text.
Different channels may produce text regions with similar size and position. Hence, two text regions from different channels are considered the same text region in the final result if over 90% of both regions overlap; the boundary coordinates of this text region are obtained by averaging the two regions. If the overlapping area is less than 30% of both region sizes, or there is no overlap, the two regions are kept as separate detected text regions. In the remaining cases, we must select one of the two overlapping regions as the detected text region. Motivated by Kim's method (Kim et al., 2003), which transforms the SVM output into a probability in [0, 1] using a sigmoidal activation function, we take the region with the larger probability of the transformed SVM output as the text region.
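A sketch of this cross-channel merging rule (our illustration; box layout and field names are assumptions):

import math

def overlap_ratio(a, b):                                   # boxes as (x0, y0, x1, y1)
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / area(a), inter / area(b)

def merge_channels(region_a, region_b):                    # regions: {"box": ..., "svm": ...}
    ra, rb = overlap_ratio(region_a["box"], region_b["box"])
    if min(ra, rb) > 0.9:                                   # same region: average the two boxes
        box = tuple((x + y) / 2 for x, y in zip(region_a["box"], region_b["box"]))
        return [dict(region_a, box=box)]
    if max(ra, rb) < 0.3:                                   # distinct regions: keep both
        return [region_a, region_b]
    prob = lambda r: 1.0 / (1.0 + math.exp(-r["svm"]))      # sigmoid-scaled SVM output
    return [max(region_a, region_b, key=prob)]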
3. Experimental results

3.1. Image collection
We use four kinds of image collections, described below.
1. The ICDAR 2011 (Shahab et al., 2011) and ICDAR 2013 (Karatzas et al., 2013) text localization datasets.
2. The ICDAR 2013 born-digital image dataset.
3. Static video images: 220 static video images collected from movie segments, TV news and other TV programs. All images are normalized to 720 × 360 with text line heights ranging from 12 to 100 pixels.
4. Perspective-distorted scene images: to evaluate the accuracy and robustness of our method on perspective-distorted scene images, we collected 60 images containing scene text from shop names and signs, taken at skewed angles ranging from −30° to 30° and comprising several typefaces.
Fig. 12. Similarity score distribution for text class and non-text class.
3.2. Training data
In our system, two classifiers must be trained before testing: an ANN classifier for classifying single CCs and an SVM for text region discrimination. Features for the ANN classifier are obtained from binary images, while features for the SVM classifier are extracted from both the normalized binary text region images and the gray-scale text region images. The text region images used to train the SVM classifier can also be used for the ANN classifier through segmentation.
For text region images, there are 1697 text regions in the training scene images of the ICDAR 2011 and 2013 datasets. These training scene images are first converted to gray scale, and each of them is binarized by a clustering method (Mancas-Thillou and Gosselin, 2007); the binarized text regions are used for training the ANN classifier. The ANN classifier has three kinds of input samples, collected as follows. We segment the binarized image patches into individual character images and select 2160 individual character samples. Although we consider connected character regions in this paper, this situation rarely occurs in the training scene images, so we manually tighten and rearrange the individual characters in each binarized text image to form connected character components, yielding 1450 connected character regions. For non-text component samples, we apply the algorithm presented in the previous sections, up to the ANN classification step, to the 458 training scene images and manually select 3620 non-text CCs. The same process is used to obtain non-text regions for the SVM. Some training samples for the ANN classifier are shown in Fig. 11(a). After normalizing the binary and gray-scale text region images, we divide each of them into several 30 × 60 samples, ensuring that each gray-scale sample corresponds to its binarized counterpart. In total, we collect 3000 text string samples (3000 gray-scale text region images and the corresponding 3000 binary images) and 2500 non-text samples for training the SVM. Examples of SVM training samples are shown in Fig. 11(b).
Fig. 11. Training samples. (a) Samples for training ANN classifier. (b) Samples for training SVM classifier.
3.3. Determination of parameter τ

In the first layer of the two-layer classification system, regions with an integrated similarity score larger than the threshold τ are rejected. To set τ, we test 3000 binarized text string images and 516 symmetrical non-text patterns obtained from the 2500 non-text samples. Fig. 12 shows the distribution of the similarity scores over the interval [0, 1] with 100 bins. The horizontal axis gives the Sscore value of the testing images' integrated similarity score, and the vertical axis gives the frequency, i.e., the number of samples with that similarity score divided by the total number of samples of the corresponding class. By setting the threshold τ to 0.70, 63% of non-text region samples are filtered out and only 2.7% of text samples are classified as background. When we test the misclassified text images in the second layer, 83% of them are still classified as non-text regions. Therefore, we set τ to 0.70 in the test step to filter out symmetrical non-text regions while retaining as many text regions as possible.
Table 3. Comparison results on the ICDAR 2013 dataset

Method name        Precision   Recall   F-score
USTB TexStar       0.88        0.66     0.76
OUR METHOD         0.82        0.71     0.76
TextSpotter        0.87        0.65     0.75
CASIA NLPR         0.79        0.68     0.73
Baseline           0.61        0.35     0.44
Fig. 13. Performance evaluation of the four combination channels on the Robust Reading Dataset 2011 and 2013.
3.4. Color channel selection
Algorithms that use a single channel or gray-scale information to detect text regions suffer from information loss when the characters' colors are not consistent. We exploit multichannel information to recall color-inconsistent characters through the two-stage grouping and integrate the text regions from different channels. To evaluate the multichannel performance, we use the standard formulas (Karatzas et al., 2013) to measure recall, precision and f-measure. Four combinations of channels are evaluated on the two ICDAR datasets; the results are shown in Fig. 13. The single Y channel and the gray-scale image were also tested but gave poor results. With only a single channel, the two-stage grouping loses its function in the second stage and no post-processing is needed to select text blocks after classification, which speeds up the framework but reduces precision and recall. If more channels are added, the results change only subtly while the time cost grows. Since the combination of the Y-Cr-Cb and a, b channels shows the best performance in all respects, it is selected for the further comparisons.
3.5. Experimental results on two natural scene image datasets
We use the combined channels Y, Cr, Cb, a and b in our proposed framework. Table 2 and Table 3 show the comparison of our framework with the top-ranked entries of the ICDAR 2011 and ICDAR 2013 robust reading competitions, respectively.
The method achieves recall 68%, precision 83% and f-measure 75% on the ICDAR 2011 text localization dataset, a remarkable 4-percentage-point improvement over the best published result.
On the ICDAR 2013 dataset, the framework attains recall 71%, precision 82% and f-measure 76%, which also demonstrates satisfying performance.
Table 2. Comparison results on the ICDAR 2011 dataset

Method name          Precision   Recall   F-score
OUR METHOD           0.83        0.68     0.75
Kim's Method         0.83        0.62     0.71
Yi's Method          0.67        0.58     0.62
TH-TextLoc System    0.67        0.58     0.62
Fig. 14. Example of detected text in natural scene images
The time taken by our algorithm ranges from 212 ms to 32 s for image sizes from 350 × 200 to 3888 × 2592 on a standard PC with a 4 GHz Intel processor. The image size and edge texture complexity greatly affect the processing time, because the algorithm finds representative points and partitions the image via edge detection. On the other hand, if a text string contains more characters, the detection becomes more accurate. Some examples of detected text regions in natural scenes are illustrated in Fig. 14 with red bounding boxes.

3.6. Evaluation on born-digital images, static video images and perspective-distorted images

We further evaluate our framework on born-digital images, static video images and perspective-distorted images. Born-digital text images look like real scene text images in that both appear in complex color settings, but they are distinctly different: born-digital images may be inherently low-resolution because they are transmitted online and displayed on a screen.
Fig. 15. Examples of detected text in the other datasets. (a) Born-digital image text. (b) Static video image text. (c) Perspective-distorted image text.
On the other hand, they do not share the illumination and geometric problems of real-scene images, and their text characters and strings are more colorful. Our two-stage grouping can process such multi-polarity text. We evaluate our framework with the same measures on the ICDAR 2011 born-digital image dataset and obtain precision 0.77, recall 0.69 and f-measure 0.73. Fig. 15(a) presents some examples of detected text regions in born-digital images.
For video images, text can be classified into graphics text, which is artificially added to the video during editing, and scene text, which appears naturally in the scene captured by the camera. Our text detection approach is tested in the static scenario, where the locations of text and captions in images are fixed and have no temporal motion. The characters and strings in these images feature color uniformity, stroke width consistency and character alignment. Fig. 15(b) depicts some results of text detection in static video images. As one challenge of scene text detection is perspective distortion, we collected 60 distorted images for this experiment. HOG is a local feature descriptor that is sensitive to text orientation; with the normalization step, the precision reaches 0.86, which shows the effectiveness of geometric normalization for perspective-distorted scene text detection. The results shown in Fig. 15(c) are part of the dataset and the detected regions.
4. Conclusion
In this paper, we proposed a novel and robust text detection algorithm. Text regions were classified into three types: connected character regions, individual character regions and hybrid text regions, and all classification problems in this paper were based on these three region forms. We proposed a method to automatically partition images without predefined parameters. The features used for connected component analysis were valid for both individual character components and joined text string components. The two-stage grouping method then grouped candidate characters first within a certain sub-image and then within a certain channel, which decreased the false rejection rate. The verification step used a two-layer classifier: the first layer filtered out most bars, stripes and other repetitive shapes, and the second layer extracted HOG features from both binary and gray-scale images. The framework proceeds from sub-regions to individual characters and then to integral text regions, using both the intrinsic characteristics of text and the individual properties of characters. The detection results show that our proposed method is highly effective for natural scene text detection. In future work, we will focus on detecting text of different orientations and on modeling text structure to further improve detection performance in scene images.
References
Bishop, C.M., et al., 1995. Neural networks for pattern recognition . Canny, J., 1986. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on , 679–698. Casey, R.G., Lecolinet, E., 1996. A survey of methods and strategies in character segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 18, 690–706. Chen, X., Yuille, A.L., 2004. Detecting and reading text in natural scenes, in: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, IEEE. pp. II–366. Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, IEEE. pp. 886–893. Epshtein, B., Ofek, E., Wexler, Y., 2010. Detecting text in natural scenes with stroke width transform, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE. pp. 2963–2970. Fairchild, M.D., 2013. Color appearance models. John Wiley & Sons. Huang, W., Qiao, Y., Tang, X., 2014. Robust scene text detection with convolution neural network induced mser trees, in: Computer Vision–ECCV 2014. Springer, pp. 497–511. Jung, C., Liu, Q., Kim, J., 2009. Accurate text localization in images based on svm output scores. Image and vision computing 27, 1295–1301. Jung, K., In Kim, K., K Jain, A., 2004. Text information extraction in images and video: a survey. Pattern recognition 37, 977–997. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P., et al., 2013. Icdar 2013 robust reading competition, in: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, IEEE. pp. 1484–1493. Kim, K.I., Jung, K., Kim, J.H., 2003. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. Pattern Analysis and Machine Intelligence, IEEE Transactions on 25, 1631–1639. Koo, H.I., Kim, D.H., 2013. Scene text detection via connected component clustering and nontext filtering. Image Processing, IEEE Transactions on 22, 2296–2305. Li, J., Tian, Y., Huang, T., Gao, W., 2008. Multi-polarity text segmentation using graph theory, in: Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, IEEE. pp. 3008–3011. Liang, J., 2006. Processing camera-captured document images: Geometric rectification, mosaicing, and layout structure recognition . Liu, X., Fu, H., Jia, Y., 2008. Gaussian mixture modeling and learning of neighboring characters for multilingual text extraction in images. Pattern Recognition 41, 484–493. Liu, Z., Sarkar, S., 2008. Robust outdoor text detection using text intensity and shape features, in: Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, IEEE. pp. 1–4. Mancas-Thillou, C., Gosselin, B., 2007. Color text extraction with selective metric-based clustering. Computer Vision and Image Understanding 107, 97–107. Minetto, R., Thome, N., Cord, M., Leite, N.J., Stolfi, J., 2013. T-hog: An effective gradient-based descriptor for single line text regions. Pattern recognition 46, 1078–1090. Myers, G.K., Bolles, R.C., Luong, Q.T., Herson, J.A., Aradhye, H.B., 2005. Rectification and recognition of text in 3-d scenes. International Journal of Document Analysis and Recognition (IJDAR) 7, 147–158. Neumann, L., Matas, J., 2011. 
A method for text localization and recognition in real-world images, in: Computer Vision–ACCV 2010. Springer, pp. 770– 783. Pan, Y.F., Hou, X., Liu, C.L., 2011. A hybrid approach to detect and localize texts in natural scene images. Image Processing, IEEE Transactions on 20, 800–813. Picard, D., Gosselin, P.H., 2011. Improving image similarity with vectors of locally aggregated tensors, in: Image Processing (ICIP), 2011 18th IEEE International Conference on, IEEE. pp. 669–672. Shahab, A., Shafait, F., Dengel, A., 2011. Icdar 2011 robust reading competition challenge 2: Reading text in scene images, in: Document Analysis and Recognition (ICDAR), 2011 International Conference on, IEEE. pp. 1491– 1496. Shivakumara, P., Huang, W., Quy Phan, T., Lim Tan, C., 2010. Accurate video text detection through classification of low and high contrast images. Pattern Recognition 43, 2165–2185. Shivakumara, P., Phan, T.Q., Tan, C.L., 2011. A laplacian approach to multi-
oriented text detection in video. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33, 412–419. Smith, D.L., Field, J., Learned-Miller, E., 2011. Enforcing similarity constraints with integer programming for better scene text recognition, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE. pp. 73–80. Yao, J., 2007. A ten-year review of granular computing, in: Granular Computing, 2007. GRC 2007. IEEE International Conference on, IEEE. pp. 734– 734. Ye, Q., Huang, Q., Gao, W., Zhao, D., 2005. Fast and robust text detection in images and video frames. Image and Vision Computing 23, 565–576. Yi, C., Tian, Y., 2011. Text string detection from natural scenes by structurebased partition and grouping. Image Processing, IEEE Transactions on 20, 2594–2605. Zhang, H., Zhao, K., Song, Y.Z., Guo, J., 2013. Text extraction from natural scene image: A survey. Neurocomputing 122, 310–323.