Neurocomputing 363 (2019) 223–235
Scene video text tracking based on hybrid deep text detection and layout constraint

Xihan Wang, Xiaoyi Feng, Zhaoqiang Xia (corresponding author)
School of Electronics and Information, Northwestern Polytechnical University, China

Article info

Article history: Received 11 January 2019; Revised 16 May 2019; Accepted 27 May 2019; Available online 22 July 2019. Communicated by Dr. Kaizhu Huang.

Keywords: Scene video text; Text detection and tracking; Convolutional neural networks; Hybrid architecture; Layout constraint

Abstract: Video text in real-world scenes often carries rich high-level semantic information and plays an increasingly important role in content-based video analysis and retrieval. Therefore, scene video text detection and tracking are important prerequisites of numerous multimedia applications. However, the performance of most existing tracking methods is not satisfactory due to frequent mis-detections, unexpected camera motion and similar appearances between text regions. To address these problems, we propose a new video text tracking approach based on hybrid deep text detection and layout constraint. Firstly, a deep text detection network that combines the advantages of object detection and semantic segmentation in a hybrid way is proposed to locate possible text candidates in individual frames. Then, text trajectories are derived from consecutive frames with a novel data association method, which effectively exploits the layout constraint of text regions under large camera motion. By utilizing the layout constraint, the ambiguities caused by similar text regions are effectively reduced. We conduct experiments on four benchmark datasets, i.e., ICDAR 2015, MSRA-TD500, USTB-SV1K and Minetto, to evaluate the proposed method. The experimental results demonstrate the effectiveness and superiority of the proposed approach. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

The escalating popularity of digital camera devices and smart phones has led to the explosive growth of scene video data in daily life. Text in a video sequence can provide high-level semantic information and is closely related to video content. Meanwhile, one key characteristic of video text is temporal redundancy, which can eliminate ambiguity and provide other relevant information. Therefore, text detection and tracking technology can serve as a basis of numerous multimedia applications [1], such as navigation guidance for impaired people [2], real-time translation for tourists [3], autonomous mobile robots [4] and driving assistance systems [5].

In recent years, the tracking-by-detection framework has become popular in video text tracking [6–9]. These methods track the text trajectories in consecutive frames based on the detection results of each individual frame. For instance, in [6], a Maximally Stable Extremal Regions (MSERs) based method was used to detect the text between adjacent frames, and the detected text targets were tracked by MSERs propagation. Zuo et al. [7] proposed a multi-strategy tracking based text detection algorithm, which can effectively use the fusion of detection and tracking results to improve the stability and quality of detections. Recently, some tracking methods use deep learning techniques to detect text candidates and then perform the tracking by the min-cost flow algorithm [8] or graph matching algorithm [9]. However, the performance of text detection methods is still not satisfactory for scene videos. Frequent mis-detections bring ambiguity and increase the difficulty of text tracking. In the text tracking procedure, most studies [6–8] mainly focus on appearance cues between the tracked text and detection results, such as color histograms and aspect ratios. These appearance cues are not sufficient to discriminate text with similar color, font and size, especially text regions that are very close to each other. On the other hand, since camera motion is not always smooth and predictable, the motion trajectories of text in a video are often complicated and non-linear. This situation is more serious when the camera is moving intensely, greatly increasing the difficulty of text tracking. To summarize, scene video text tracking based on the tracking-by-detection framework is still a challenging task. The challenge is two-fold: (1) scene video text detection still needs to be improved, as frequent mis-detections increase the tracking difficulty; (2) a more effective and robust algorithm needs to be developed for multiple-text tracking under large camera motion.


Fig. 1. Flowchart of our proposed method. Given a video, scene text detection is first utilized to predict text candidates in individual frames. Then, the tracking trajectories are linked by detection results using layout constraint based text tracking. Finally, the post-processing outputs the optimized detection and tracking results.
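To make the flow in Fig. 1 concrete, the following sketch outlines the tracking-by-detection loop in Python. It is purely illustrative: detect_text, associate and the trajectory objects are placeholder names for the components developed in Sections 3 and 4, not an interface defined by the paper; only the minimum trajectory length of 8 frames comes from the implementation details in Section 5.1.

```python
# Illustrative tracking-by-detection skeleton for the pipeline in Fig. 1.
# All names below are placeholders for the components of Sections 3 and 4.

def track_video(frames, detector, tracker, min_length=8):
    """Detect text per frame, link trajectories, then post-process."""
    trajectories = []
    for t, frame in enumerate(frames):
        detections = detector.detect_text(frame)                       # hybrid deep detection (Section 3)
        trajectories = tracker.associate(trajectories, detections, t)  # layout-constraint tracking (Section 4)
    # Post-processing: short trajectories are treated as noise (Section 5.1 uses 8 frames).
    return [traj for traj in trajectories if len(traj) >= min_length]
```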

In this paper, we propose a novel text tracking approach for scene videos to address the above-mentioned issues. Based on the tracking-by-detection framework, the proposed method contains two key procedures: a text detection procedure and a text tracking procedure. The flowchart of our text tracking method is shown in Fig. 1. Firstly, to improve the efficiency and effectiveness of text detection in individual video frames, we propose a new deep network which combines the advantages of object detection and semantic segmentation in a hybrid way. The deep network consists of two branches: (1) the detection branch predicts quadrilateral bounding boxes of text regions with a two-step cascaded regression strategy; (2) the segmentation branch is utilized to augment the feature maps with strong semantic information and predict text segmentation maps. Secondly, to address the problem of multiple-text tracking under large camera motion, we propose a novel data association method, which effectively uses the layout constraint of text regions to track text trajectories. The layout constraint is represented by the relationship between text regions and is used to constrain the positions of text regions relative to each other. Besides, we introduce a new association cost function, in which the layout similarity between adjacent frames is modeled for associating multiple text regions. In addition, to reduce trajectory interruptions caused by mis-detection, the layout constraint is further utilized to re-track the missing text regions from the tracked text. With a simple trajectory optimization post-processing, the final detection and tracking results are obtained. The main contributions of this work can be summarized as follows:

• We propose a new deep network for detecting scene text in videos. The network combines the advantages of object detection and semantic segmentation in a hybrid way. To ensure the effectiveness, semantic segmentation is utilized to augment the feature maps with strong semantic information and refine the text candidates.

• We propose a novel data association approach which considers text tracking under large camera motion. The layout constraint of text regions is incorporated into a new cost function for associating multiple text regions. Using the layout constraint, the missing text regions can be re-tracked from the tracked text, reducing the ambiguities caused by frequent mis-detections.

• Our proposed method can handle arbitrary-oriented text in scene videos and achieves very competitive performance in both effectiveness and efficiency over four publicly available datasets (i.e., ICDAR 2015, MSRA-TD500, USTB-SV1K and Minetto).

The rest of this paper is organized as follows. Related work is described in Section 2. Scene video text detection and tracking are described in Section 3 and Section 4, respectively. Comparative experiments are demonstrated in Section 5. The conclusion is presented in Section 6.

2. Related work

In this section, text detection methods in images are first discussed briefly. We then review the related methods focusing on scene text detection in videos. Tracking-by-detection based tracking methods are finally reviewed in detail.

2.1. Scene text detection

Towards image text detection, numerous effective approaches [10–13] have been investigated. Generally, these conventional approaches rely on manually designed features to capture text. To cite a few, Zhang et al. [14] predicted text lines from natural images by the local symmetry property of character groups. He et al. [15] extracted text candidate components by contrast-enhanced MSERs; then, multi-level supervised information, such as the text region mask, character labels and binary text/non-text information, was used to train a Text-CNN classifier for text component filtering. Tian et al. [16] used MSERs to generate text candidates, and false alarms were removed with a coarse-to-fine character classifier. Zhang et al. [17] proposed a color prior to guide the character candidate extraction by MSERs.


Then they used a deep learning based classifier to distinguish text/non-text candidates. However, these hand-crafted features are not robust in challenging scenarios, and the performance of these methods falls behind deep neural network based methods.

Recently, deep neural network based methods [18–21] have gradually become the mainstream. These methods can be roughly classified into regression based methods and segmentation based methods. Regression based methods are mainly designed based on general object detection, such as the Single Shot MultiBox Detector (SSD) [22] and Faster-RCNN [23]. Based on SSD, Liao et al. [24] predicted text bounding boxes directly. Shi et al. [25] extracted dense text segments and links in an SSD-style network, and then text boxes were merged with the segment and link information. In order to accurately detect arbitrary-oriented text, Liao et al. [26] extracted rotation-invariant and rotation-sensitive features for oriented text box regression and classification. Ma et al. [27] used the architecture of Faster-RCNN to detect arbitrary-oriented scene text. Different from text-box regression, Lyu et al. [28] localized the positions of corner points, and text regions were further generated by grouping the detected corners. He et al. [29] presented a pixel-wise classification between text and non-text regions, and each point was used to regress the text boxes directly. Zhou et al. [30] directly predicted words or text lines of arbitrary orientations and quadrilateral shapes in a per-pixel manner.

Segmentation based text detection methods belong to another direction of text detection. These methods treat text detection as a semantic segmentation problem. Thanks to the Fully Convolutional Network (FCN) [31], text region hypotheses are estimated by the pixel-wise heat map. Zhang et al. [32] first used FCN to predict a saliency map; then they extracted characters in those text region hypotheses and grouped the characters into words or text lines in post processing. Tang et al. [33] used three convolutional neural networks (text detection, segmentation and classification) to detect the text in a cascaded way. For curved text detection, Long et al. [34] presented a method named TextSnake, which predicts the geometric properties of a text instance by a semantic segmentation network. Lyu et al. [35] proposed a unified text detection and recognition framework to identify arbitrary shapes of text.

Towards video text detection, many approaches have been proposed in the last decades. Shivakumara et al. [36] used gradient vector flow to identify the dominant edge pixels of text components, and then introduced two grouping schemes to link text candidates into text lines. Gomez et al. [6] proposed MSERs to solve real-time video text detection on low-resource mobile devices. Wu et al. [37] used a stroke width transform (SWT) [38] based method to locate multi-orientation text in videos. Liang et al. [39] combined MSERs and SWT to detect candidate text regions after text pixel enhancement. To address the problem of missed characters in videos due to skew distortion and low contrast, Yang et al. [40] proposed a multi-orientation text detection method with multi-channel and multi-scale information fusion, where MSERs were utilized in multiple channels to extract all possible character candidates; then, an effective hybrid filter was designed to eliminate false alarms. However, these conventional video text detection methods are not robust in challenging tasks.
The above-mentioned methods have achieved good performances on various benchmarks. However, most video text detectors are based on conventional methods, which are not robust in challenging video scenes. In addition, most deep model based methods are designed for image text detection and usually sacrifice efficiency to improve accuracy. In this paper, inspired by recent SSD-based general object detectors [41], we propose a novel deep network for video text detection. The deep network is a one-stage framework composed of a text region detector and semantic segmentation. The text region detector uses two inter-connected modules to predict text regions with a two-stage regression strategy. Semantic segmentation is used to encode semantic information for detecting small text and shares features with the text detector. This leads to very competitive performance in both detection accuracy and speed.

2.2. Scene text tracking

In recent years, a variety of text tracking methods have been investigated in the literature. These methods can be roughly classified into three typical tracking strategies, i.e., template matching for tracking, particle filtering and tracking-by-detection. Compared to the other two tracking frameworks, tracking-by-detection can solve the re-initialization problem caused by the loss of text regions. It is gradually becoming mainstream due to its more robust performance. Our work is closely related to the tracking-by-detection paradigm and estimates the tracking trajectories using text detection results. In the following, we review the related methods focusing on this tracking paradigm. The details of the other two scene text tracking frameworks can be found in the literature [1,42].

Tracking-by-detection methods can be regarded as a data association problem. Liu et al. [43] used stroke-like edges and corners to locate captions in videos; then, text tracking based on inter-frame information analysis was used to improve the accuracy of caption localization and segmentation. Gómez et al. [6] presented a hybrid algorithm for text detection and tracking, in which both the detection and tracking modules are based on MSERs. Rong et al. [44] proposed a scene text recognition method by tracking text regions in videos. They located scene text characters (STC) independently by an MSERs-based detector, and the STC is used to optimize the trajectory search. Zuo et al. [7] combined three tracking algorithms, namely tracking by detection, spatial-temporal content learning (STCL) [45] and linear prediction, into a multi-strategy tracking method. Yang et al. [8] proposed a motion-based tracking method, which translates text association issues into a cost-flow network and exports text trajectories from the network using a min-cost flow algorithm. However, most of these methods focus on extracting appearance features for text tracking. These appearance cues are not sufficient to discriminate multiple text regions with similar color, font and size, especially when the text regions are very close. Pei et al. [9] introduced a graph matching method for text tracking, which uses the distance similarity between multiple objects to reduce mismatches in text tracking. However, this method may not be robust when the text has multiple motion directions. Although many studies are devoted to video text tracking, few methods have focused on multiple-text tracking under large camera motion. In this paper, we propose a novel scene video text tracking method by introducing the layout constraint, which can discriminate similar text regions and is more reliable under large camera motion.

3. Hybrid deep text detection

We present the details of the proposed text detector in a hybrid framework. Inspired by RefineDet [41], we integrate a semantic segmentation module into the backbone of RefineDet, which is modified to be adaptive to the task of text detection. In this method, we employ a new loss combining the prediction of anchors, text regions and text segmentation. Specifically, our detector is composed of three parts: feature extraction, text region prediction and text-sensitive segmentation. A schematic view of our network architecture is depicted in Fig. 2 and its details are described in the following subsections.


Fig. 2. Network Architecture. Our network contains three parts: feature extraction, text region detector, and text-sensitive segmentation. The backbone is adapted from RefineDet [41]. The feature extraction is responsible for computing multi-scale convolutional features. The feature pyramid structure is utilized to enhance the convolutional features. Text region detector uses a two-stage regression strategy that includes anchor prediction and text prediction. The anchor prediction stage first adjusts the location and sizes of anchors, and then the text prediction stage further regresses the quadrilateral bounding boxes of text. Text-sensitive segmentation introduces a pixel-wise supervision of text, and shares features with the text detectors.
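The transfer-connection-block fusion sketched below illustrates the feature pyramid structure shown in Fig. 2 and detailed in Section 3.1.1. It assumes PyTorch; the exact layer arrangement and channel counts beyond the 256-channel width mentioned in the text are illustrative guesses rather than the authors' configuration.

```python
import torch.nn as nn

class TransferConnectionBlock(nn.Module):
    """Sketch of a transfer connection block (TCB), cf. RefineDet [41].

    A deconvolution enlarges the high-level feature map; the result is merged
    element-wise with the lateral low-level features processed by two 3x3
    convolutions, and a trailing 3x3 convolution produces the output.
    The 256-channel width follows Section 3.1.1; other details are assumptions.
    """

    def __init__(self, low_channels, channels=256):
        super().__init__()
        self.lateral = nn.Sequential(
            nn.Conv2d(low_channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.out = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, low_feat, high_feat):
        # Spatial sizes must match after the 2x deconvolution of the high-level map.
        merged = self.lateral(low_feat) + self.up(high_feat)
        return self.out(merged)
```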

3.1. Network architecture

3.1.1. Feature extraction

The pre-trained VGG-16 architecture [46] is adopted as the stem network. As shown in Fig. 2, we convert the last two fully-connected layers (FC6 and FC7) of VGG-16 into convolutional layers (conv6 and conv7). Then three successive layers with size of 3 × 3 × 256 are stacked above conv7 to enlarge the receptive fields. The output of these convolutional layers is denoted as F7. To detect text of different sizes, three transfer connection blocks (TCB) proposed in RefineDet [41] are used in a top-down pathway. In a TCB, particularly, a deconvolution operation is first utilized to enlarge the high-level feature maps with 256 channels. Then, they are merged with low-level feature maps (e.g., conv4, conv5 and conv6) by two 3 × 3 convolutional layers in an element-wise way. The output features of the TCB are generated through another 3 × 3 convolutional layer. This process is repeated three times to construct three TCBs. For convenience, the output features of each TCB are denoted as F6, F5 and F4 successively. Finally, these features extracted from the convolutional layers can be used for region prediction and text segmentation. More specifically, the outputs of conv4 to conv7 are used to adjust the locations and sizes of anchors, the outputs of F4 to F7 are used for predicting text regions, and the output of F4 is used for enriching the supervision semantics.

3.1.2. Text region prediction

The text region prediction uses a two-stage regression strategy, including anchor prediction and bounding box prediction.

The anchor prediction is first utilized to coarsely adjust the locations and sizes of anchors to obtain refined anchors. Then, the bounding box prediction is used to regress accurate text regions based on the refined anchors. This two-step cascaded regression outputs the prediction results, which then undergo an efficient non-maximum suppression (NMS) process. Please refer to [41] for more details of the two-stage regression strategy. In the training phase, the matching strategy between default boxes and ground-truth ones in SSD [22] is adopted. To detect scene text of different sizes, we select multiple feature layers with different-scale anchors for bounding box prediction. Different from general objects, text tends to have larger aspect ratios. Therefore, "long" default boxes with larger aspect ratios are added. Specifically, for word-based datasets (Latin languages such as English), the aspect ratios of default boxes are set to 1, 2, 3, 5, 1/2, 1/3, 1/5. For line-based datasets (long text lines such as Chinese and Japanese), we set the aspect ratios of default boxes to 1, 2, 3, 5, 7, 11, 1/2, 1/3, 1/5, 1/7, 1/11.

3.1.3. Text-sensitive segmentation

Semantic information has been shown to be helpful for text detection [28,30]. To exploit the semantic information in our proposed method, we utilize the segmentation map to promote the learning of the deep model in two aspects. (1) In the training phase, we enrich the semantic information at the low-level detection feature maps to help learn the large number of model parameters. This facilitates detecting small text in scene videos. (2) In the test phase, the generated text segmentation maps from the text-sensitive segmentation are also used to determine whether a bounding box is a text region or not.


Fig. 3. Label generation for text box detection and text sensitive segmentation. (a) Text regions are represented by horizontal rectangle (green) and vertices of the rotated rectangle (yellow). (b) Corresponding ground-truth mask of (a) for text sensitive segmentation.

The text rectangles having intersection with the corresponding segmentation map are kept, while the others are filtered out as non-text regions. Specifically, this segmentation part takes the merged detection feature maps from F4 as the input. Then, a continuous Conv1 × 1 – Deconv – Crop block with three layers is used to enlarge the resolution of the feature maps. The final text-sensitive segmentation map has the same dimensions as the input image. In the training phase, the low-level feature maps can encode more semantic information by comparing the generated segmentation map with the ground-truth supervision, which will be discussed in Section 3.2.1. The segmentation loss can be used to directly optimize the entire network together with the detection loss.
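A minimal sketch of such a segmentation branch is shown below, assuming PyTorch and assuming the merged F4 feature map has stride 8 with 256 channels (neither is stated explicitly in the paper); three Conv1×1–Deconv stages restore the input resolution, and the crop step is approximated here by interpolation to the exact input size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextSensitiveSegHead(nn.Module):
    """Sketch of the text-sensitive segmentation branch (Section 3.1.3).

    Assumes the merged F4 features have stride 8 and 256 channels; three
    Conv1x1 + deconvolution stages (2x upsampling each) bring the map back
    to input resolution, and a final 1x1 convolution scores text vs. non-text.
    """

    def __init__(self, in_channels=256, mid_channels=64):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(3):  # three Conv1x1-Deconv stages, 2x upsampling each
            layers += [
                nn.Conv2d(c, mid_channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.ConvTranspose2d(mid_channels, mid_channels, kernel_size=2, stride=2),
            ]
            c = mid_channels
        self.up = nn.Sequential(*layers)
        self.score = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, f4, input_size):
        x = self.up(f4)
        # "Crop" to the exact input size, approximated by bilinear interpolation.
        x = F.interpolate(x, size=input_size, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.score(x))  # per-pixel text probability map
```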

3.2. Model learning

3.2.1. Label generation

Besides the given coordinates of bounding boxes, the loss function in the proposed method requires additional labels for the segmentation maps. To obtain this type of ground truth, we generate a mask from the given labels, which are shown in Fig. 3(a). Before generating a mask, the rotated quadrilateral label needs to be obtained. Firstly, the bounding rectangle of the textual pixels is determined by the text coordinates. Then, we use four vertices to represent the rotated quadrilateral as labels. The text region label can be denoted as G = (G_b, G_v), where G_b = (x_0, y_0, w, h) is a horizontal rectangle, (x_0, y_0) is the center of G_b, and w and h are the width and height of G_b. The rotated quadrilateral is denoted as G_v = {v_i | i ∈ 1, 2, 3, 4}, which represents the four vertices of the rotated quadrilateral in top-left, top-right, bottom-right and bottom-left order, with v_i = (x_i, y_i) being the vertex coordinates. After that, the text region mask can be generated from the rotated quadrilateral labels. As depicted in Fig. 3(b), we first generate a zero-initialized mask and draw the quadrilateral region with the value "1". All pixels within the text region are taken as positives and marked as "1". Conversely, the background regions are marked as "0".
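The mask construction above can be sketched with OpenCV as follows. This is a minimal illustration that assumes the four vertices of each G_v are given in pixel coordinates; the helper name is ours, not from the paper.

```python
import numpy as np
import cv2

def make_text_mask(image_shape, quads):
    """Rasterize ground-truth quadrilaterals G_v into a binary segmentation mask.

    image_shape: (height, width) of the input image.
    quads: iterable of G_v, each a sequence of four (x, y) vertices in
           top-left, top-right, bottom-right, bottom-left order.
    Pixels inside any text quadrilateral are set to 1, background stays 0.
    """
    mask = np.zeros(image_shape, dtype=np.uint8)           # zero-initialized mask
    for quad in quads:
        pts = np.round(np.asarray(quad)).astype(np.int32).reshape(-1, 1, 2)
        cv2.fillPoly(mask, [pts], 1)                        # draw the quadrilateral with value "1"
    return mask
```

The counts of positive and negative pixels, |R⁺| and |R⁻|, on which the class-balancing weight of Section 3.2.2 relies, can then be read directly off this mask.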

3.2.2. Loss function The loss function for learning model parameters consists of three parts: anchor prediction loss La , text prediction loss Lt and text segmentation loss Ls . Both the anchor prediction and text prediction loss contain the confidence loss Lc and regression loss Lr with different inputs. The detailed loss function is defined as:

$$
\begin{aligned}
L &= \frac{1}{N_a} L_a + \frac{1}{N_t} L_t + \frac{\lambda_1}{N_s} L_s \\
  &= \frac{1}{N_a}\big(L_c(y_c, p^a_c) + L_r(y^a_r, p^a_r)\big) + \frac{1}{N_t}\big(L_c(y_c, p^t_c) + L_r(y^t_r, p^t_r)\big) + \frac{\lambda_1}{N_s} L_s(R, \hat{R})
\end{aligned}
\tag{1}
$$

where $y_c$ is the class label of the text region, $y^a_r$ is the anchor label for $G_b$, and $y^t_r$ is the text-region label for $G = (G_b, G_v)$. $p^a_c$ and $p^a_r$ are the prediction score and regression value in anchor prediction; $p^t_c$ and $p^t_r$ are the prediction score and regression value in text region prediction. In our method, $N_a$ and $N_t$ are the numbers of positive default boxes in the anchor prediction and text prediction, and $N_s$ is the number of pixels in the text region mask. For balancing the three losses and quick convergence, we set $\lambda_1$ to 1. Following SSD [22], we adopt a two-class softmax loss as the confidence loss $L_c$, and use the smooth L1 loss [47] as the regression loss $L_r$. We choose the balanced cross-entropy loss [48] as the text-sensitive segmentation loss $L_s$. Assume that the binary mask for the segmentation map is denoted as $R$ and the predicted map as $\hat{R}$. The loss function $L_s(R, \hat{R})$ is formulated as

$$
L_s(R, \hat{R}) = -\beta \sum_{i=1}^{|R^+|} R_i \log P(\hat{R}_i = 1) - (1 - \beta) \sum_{i=1}^{|R^-|} (1 - R_i) \log P(\hat{R}_i = 0)
\tag{2}
$$

where $\beta = |R^-|/(|R^+| + |R^-|)$ denotes the class-balancing weight, and $|R^+|$ and $|R^-|$ denote the numbers of text pixels and non-text pixels in the binary mask.

3.2.3. Hard negative mining and data augmentation

In the training stage, the matching strategy produces a large number of negative default boxes, making the samples extremely imbalanced. To address this problem, we follow the online hard negative mining strategy [49] to balance the training samples. More precisely, the ratio between negatives and positives is set to 3:1. We also use the data augmentation strategy proposed in [22] to make the model more robust to various sizes and shapes of input text.

4. Layout constraint based text tracking

Most tracking approaches mainly focus on appearance features between the tracked text and detected candidates. However, in multiple-text tracking, these features are not descriptive enough to distinguish text with similar color, font and size. Moreover, text detection in adjacent frames still has room for improvement, and frequent mis-detections increase the difficulty of text tracking. Inspired by the multi-object tracking (MOT) method [50], in this paper we introduce the relationship between text regions into our text tracking method. Usually, the relative position between different text regions on the same object remains unchanged even if the camera is moving. Based on this observation, the layout constraint is used to model the relative position through a new data association cost function, in which the positions of text regions are constrained relative to each other. The details are described in the following subsections and the proposed method is illustrated in Fig. 4.

Fig. 4. Text tracking based on layout constraint. The tracked text and their detections are represented by red boxes and blue boxes, respectively. The green lines connecting text regions denote the layout constraint. Light blue boxes represent false alarms. In the assignment process, we first match a pair of tracked text and detection, which is called the anchor. As shown in the figure, the anchor assignment "a11" assumes that the tracked text "t1" moves to the location of the detected text "d1". According to the layout constraint, we can predict the possible locations of the other tracked text. We compute the data association cost based on these anchor assignments. From the different anchor assignments, the dissimilarity cost matrix of all anchor assignments is constructed. Finally, the best assignment event is estimated by minimizing the total assignment cost.

4.1. Model formulation

The trajectory of a text region is represented by a sequence of states with respect to time. The state of a tracked text region $i$ at frame $t$ is denoted as $s^i_t = [x^i_t, y^i_t, \upsilon^i_t, \nu^i_t, w^i_t, h^i_t, a^i_t]$, where $(x^i_t, y^i_t)$ is the center coordinate, $(\upsilon^i_t, \nu^i_t)$ is the velocity, and $(w^i_t, h^i_t)$ is the shape. $a^i_t$ is the appearance feature represented by a normalized histogram in RGB space; each channel contains a 16-dimensional histogram, giving 48 dimensions in total for the three channels. All text regions at frame $t$ are denoted as $S_t$ ($s^i_t \in S_t$), $i \in N_t$, where $N_t$ denotes the number of text regions at frame $t$. Each text relationship is described by the location and velocity differences between two text regions as

$$r^{i,j}_t = \left[x^{i,j}_t,\; y^{i,j}_t,\; \upsilon^{i,j}_t,\; \nu^{i,j}_t\right] = \left[x^i_t - x^j_t,\; y^i_t - y^j_t,\; \upsilon^i_t - \upsilon^j_t,\; \nu^i_t - \nu^j_t\right] \tag{3}$$

The set of relationships for each text region $i$ is represented by $\gamma^i_t = \{r^{i,j}_t \mid \forall j \in N_t, j \neq i\}$, and all relationships at frame $t$ as $R_t$ ($\gamma^i_t \in R_t$), which are considered as the layout constraint. The tracking task can be considered as a data association problem, which estimates the tracking trajectories using text detection results. We denote a detection result $p$ at frame $t$ as $d^p_t = [x^p_{d,t}, y^p_{d,t}, w^p_{d,t}, h^p_{d,t}]$ and the set of all detected text regions at frame $t$ as $D_t$ ($d^p_t \in D_t$), $p \in M_t$. For simplicity, in the following content, we use $S$, $R$, $D$ to replace $S_t$, $R_t$, $D_t$. The assignment state between the tracked text (previous) and detection (current) is defined as a binary value $a^{i,p} \in \{0, 1\}$: if the detection $p$ is assigned to the trajectory $i$, $a^{i,p} = 1$; otherwise, $a^{i,p} = 0$. For data association, we propose a data association cost to describe the dissimilarity between trajectories and detections. The data association by a cost function is described by

$$\hat{A} = \arg\min_A C(S, R, D), \quad \text{s.t.} \quad \sum_{i \in N,\, p \neq 0} a^{i,p} \leq 1, \quad \sum_{p \in M \cup \{0\}} a^{i,p} = 1, \quad \sum_{i \in N} a^{i,0} \leq N \tag{4}$$

where $A = \{a^{i,p} \mid i \in N, p \in M\}$. $\sum_{i \in N,\, p \neq 0} a^{i,p} \leq 1$ ensures that each trajectory is assigned with at most one detection. $\sum_{p \in M \cup \{0\}} a^{i,p} = 1$ indicates that a trajectory is either detected or missed, and $a^{i,0} = 1$ means one mis-detected text. $\sum_{i \in N} a^{i,0} \leq N$ means that some text regions may be mis-detected. $C(S, R, D)$ is the cost over all possible assignments under the layout constraint, and the best data association is then estimated with the minimum cost.


To increase the stability of tracking, the layout similarity between adjacent frames is modeled for associating multiple text regions. If a detection belongs to a trajectory, then based on the layout constraint, the regions associated with the detection should have the same appearance characteristics as the remaining tracked text regions. We first set an anchor assignment between the trajectory $i$ and detection $p$. The anchor assignment $a^{i,p}$ supposes that the trajectory $i$ is transported to the location of detection $p$. Based on the layout constraint $\gamma^i$, we can predict the locations of the residual trajectories in the current frame. The data association cost function is further formulated by

$$C(S, R, D) = a^{i,p}\Big(F_s(s^i, d^p) + F_o(s^i, d^p) + \sum_{j \in N,\, j \neq i} F_a(s^j, d^p, r^{i,j})\Big), \quad i, j \in N,\; p \in M \tag{5}$$

where $F_s(s^i, d^p)$ and $F_o(s^i, d^p)$ denote the size and overlap costs of the anchor assignment. Here we compute the size and overlap costs as

$$F_s(s^i, d^p) = -\ln\left(1 - \frac{|h^i - h^p|}{2(h^i + h^p)} - \frac{|w^i - w^p|}{2(w^i + w^p)}\right), \qquad F_o(s^i, d^p) = -\ln\frac{\mathrm{area}\big(B(s^i) \cap B(d^p)\big)}{\mathrm{area}\big(B(s^i) \cup B(d^p)\big)} \tag{6}$$

where $(w^i, h^i)$ and $(w^p, h^p)$ denote the width and height of trajectory $i$ and detection $p$, respectively. In addition, the overlap cost is measured by using the overlap ratio [51] of the two bounding boxes. From the anchor position, the locations of the residual trajectories can be predicted. The appearance cost with the layout constraint is described by

$$F_a(s^j, d^p, r^{i,j}) = -\ln\sum_{b=1}^{B} \sqrt{H_b(s^{p,j})\, H_b(s^j)}, \qquad s^{p,j} = [x^p_d, y^p_d, 0, 0] + [x^{i,j}, y^{i,j}, w^j, h^j] \tag{7}$$

where $H_b(s)$ denotes the normalized histogram in RGB space, $b$ is the bin index, and $B$ is the number of bins. The area associated with the current target is predicted by the position of detection $p$ and the layout constraint $r^{i,j}$. To solve Eq. (4), all costs of data association are formulated in a matrix $C$. We then apply the Hungarian algorithm [52] to obtain the best association having the minimum cost.

4.2. Algorithm details

In our tracking approach, text detection in an individual frame is used to initialize the text trajectories and layout constraints. In order to ensure the stability of the overall layout of text regions, we simplify the relationship of text regions when two regions satisfy the following two conditions:

$$D(s^i, s^j) < \sqrt{(w^i)^2 + (h^i)^2} \tag{8}$$

$$\sqrt{(\upsilon^{i,j})^2 + (\nu^{i,j})^2} < \tau \tag{9}$$

where $D(s^i, s^j)$ represents the position distance of two text regions. To consider text regions moving in different directions, we use Eq. (9) to ensure that there is no greater relative motion between two text regions, where $(\upsilon^{i,j}, \nu^{i,j})$ represents the velocity difference between the text regions $i$ and $j$. We empirically set $\tau = 10$. The whole process of our text tracking method is summarized in Algorithm 1.

Algorithm 1. Layout Constraint Based Text Tracking.
Input: text trajectories S in the previous frame, layout constraints R of the tracked text, detections D in the current frame.
Output: updated text trajectories.
for each video frame t do
    Step 1: Data association. Compute all association costs $C(S, R, D) = a^{i,p}(F_s(s^i, d^p) + F_o(s^i, d^p) + \sum_{j \in N, j \neq i} F_a(s^j, d^p, r^{i,j}))$, $i, j \in N$, $p \in M$; for each detection, remove negligible associations between the detections D and the tracked text S using Eq. (8); obtain $\hat{A} = \arg\min C(S, R, D)$.
    Step 2: Update associated detections. Update the text trajectories with the Kalman filter: $\mathrm{Update}(S) = \{KF(s^i, d^p) \mid a^{i,p} = 1, i \in N, p \in M\}$.
    Step 3: Update unassociated text trajectories. Update each text trajectory that is not associated with any detection by the new states of the other trajectories and their layout constraints.
    Step 4: Initialize new trajectories. Initialize a new trajectory and add it into the old trajectories if a text block in the current frame does not match any existing trajectory.
    Step 5: Update. Update the layout constraints and simplify them by Eq. (8) and Eq. (9).
end for
return the updated trajectories

After tracking, with the observation $d^p$, we update the state of the trajectories by the standard Kalman filter [53] for smoothing. In Step 3, we update the state of each text trajectory that is not associated with any detection. Based on the linear motion model and the layout constraint, we predict the mis-detection location in the current frame. By doing this, we can re-track the trajectory when the text region is re-detected in the next frame. This keeps the trajectory complete under camera motion and occlusions. On the other hand, we terminate a trajectory if it is not associated with any detection for three frames.
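A compact, illustrative sketch of the association step is given below. It is not the authors' implementation: tracked states and detections are plain dictionaries, the predicted region follows s^{p,j} in Eq. (7), the appearance term uses the histogram similarity of Eq. (7), and SciPy's Hungarian solver stands in for the assignment step; hist_fn and the field names are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def size_cost(trk, det):
    # F_s in Eq. (6): penalize differences in height and width.
    dh = abs(trk["h"] - det["h"]) / (2.0 * (trk["h"] + det["h"]))
    dw = abs(trk["w"] - det["w"]) / (2.0 * (trk["w"] + det["w"]))
    return -np.log(max(1.0 - dh - dw, 1e-6))

def overlap_cost(trk, det):
    # F_o in Eq. (6): negative log of the overlap ratio (IoU) of the two boxes.
    def corners(o):
        return (o["x"] - o["w"] / 2, o["y"] - o["h"] / 2,
                o["x"] + o["w"] / 2, o["y"] + o["h"] / 2)
    ax1, ay1, ax2, ay2 = corners(trk)
    bx1, by1, bx2, by2 = corners(det)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = trk["w"] * trk["h"] + det["w"] * det["h"] - inter
    iou = inter / union if union > 0 else 0.0
    return -np.log(max(iou, 1e-6))

def appearance_cost(hist_pred, hist_trk):
    # F_a in Eq. (7): histogram similarity between predicted region and track j.
    return -np.log(max(float(np.sum(np.sqrt(hist_pred * hist_trk))), 1e-6))

def associate(tracks, detections, image, layout, hist_fn):
    """Build the anchor-assignment cost matrix of Eq. (5) and solve it.

    layout[i][j] stores the layout offset (x_ij, y_ij) between tracks i and j;
    hist_fn(image, x, y, w, h) returns a normalized RGB histogram of a region.
    """
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for p, det in enumerate(detections):
            c = size_cost(trk, det) + overlap_cost(trk, det)
            for j, other in enumerate(tracks):        # layout term over remaining tracks
                if j == i:
                    continue
                dx, dy = layout[i][j]
                # Predicted region s^{p,j}: detection position plus the stored layout offset.
                pred_hist = hist_fn(image, det["x"] + dx, det["y"] + dy, other["w"], other["h"])
                c += appearance_cost(pred_hist, other["hist"])
            cost[i, p] = c
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm [52]
    return list(zip(rows, cols)), cost
```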

5. Experiments

In this section, we introduce the implementation details and the publicly available datasets used for evaluation. Then we evaluate the performance of the proposed detection method for multi-orientation text detection over the image datasets. The effectiveness of the proposed tracking method is validated by comparing with other state-of-the-art methods on a video text dataset. Specifically, over this dataset, the detection results promoted by tracking are reported first, and then the tracking method is compared with existing tracking approaches.

5.1. Implementation details

In the training phase of text detection, the hybrid detection network is trained with a multi-scale training strategy, which consists of three stages. More specifically, the Synth Text dataset is first used to address the problem of limited data and pre-train the network. Then, the process is continued on the corresponding dataset with a 384 × 384 input size. Finally, a larger input image with size of 768 × 768 is used to achieve better detections for multi-scale text. In the aforementioned process, we experimentally set the three stages with different learning rates (10^-4 for the first two stages and 10^-5 for the last stage). In the test phase of text detection, the redundant text regions are removed by NMS, whose threshold is set to 0.2 in our work. We simply use the proportion of segmentation pixels in the bounding box to determine whether it is a text region; boxes whose proportion falls below a threshold of 0.1 are filtered out. In the text tracking procedure, we use the temporal redundancy characteristic of video text to reduce false alarms and improve the detection accuracy. A trajectory is only valid if its length is larger than a given threshold.

Otherwise, short trajectories are invalid and discarded as noise. In this paper, the threshold is set to 8.

5.2. Datasets

Synth Text: The Synth Text dataset [54] is a synthetically generated dataset which contains 800k synthesized text images. This dataset is created by blending rendered words with natural images. The word-level labels in this dataset are used to pre-train our model.

ICDAR 2015: The ICDAR 2015 dataset was proposed in Challenge 4 of the ICDAR 2015 Robust Reading Competition [18] for incidental scene text detection. The dataset is composed of 1000 training images and 500 testing images, which are captured by Google Glass with relatively low resolutions. In this dataset, the images contain multi-oriented text with annotations labeled as word-level quadrangles.

MSRA-TD500: The MSRA-TD500 dataset is published in [10]. The dataset has 300 training images and 200 test images. All images are high-resolution natural scene images. This dataset covers two languages, i.e., Chinese and English. The text in this dataset has different orientations and fonts, making the dataset highly challenging.

USTB-SV1K: The USTB-SV1K dataset [11] is obtained from Google Street View of six USA cities and contains 500 training images and 500 testing images. Texts in this dataset contain multiple orientations, views, fonts and blur.

Minetto: The Minetto dataset is published in [55]. It is annotated with bounding boxes of each word and contains 5 videos with a frame size of 640 × 480 pixels for testing.

5.3. Comparisons for text detection

Since the text detection procedure is vitally important for the subsequent tracking, we evaluate the proposed hybrid deep detection method against some state-of-the-art approaches on text detection datasets. Publicly available multi-orientation scene text datasets, including ICDAR2015 [18] and USTB-SV1K [11], are chosen for the comparison experiments.

We first evaluate our model on ICDAR2015, on which the model is fine-tuned. In testing, we set the threshold of the detection score to 0.2 and resize the input images to 768 × 768. Note that our method focuses on video text detection, so here we only show single-scale test results of the state-of-the-art methods for a fair comparison. Besides, most of the compared methods are based on the VGG-16 model. Quantitative results are presented in Table 1 under the standard evaluation protocol of Precision (P), Recall (R) and F-measure (F). The proposed method achieves competitive results, especially for the Recall: it achieves an F-measure of 0.801 and the best Recall (0.769). Since mis-detections can be removed by our tracking procedure, the final performance can be further promoted (as shown later in Table 4), so we pursue a higher recall in the detection procedure.

Table 1. Performances of different text detection methods evaluated on ICDAR2015. "–" means the result is not published.

Method            | Precision | Recall | F-measure | FPS
Zhang et al. [32] | 0.708     | 0.43   | 0.536     | 0.48
Tian et al. [18]  | 0.742     | 0.516  | 0.609     | 7.1
Yao et al. [56]   | 0.723     | 0.587  | 0.648     | 1.61
Liu et al. [19]   | 0.732     | 0.682  | 0.706     | –
SegLink [25]      | 0.731     | 0.768  | 0.75      | 8.9
EAST [30]         | 0.805     | 0.728  | 0.764     | 6.52
SSTD [20]         | 0.80      | 0.73   | 0.77      | 7.7
Lyu et al. [28]   | 0.941     | 0.707  | 0.807     | 3.6
Our method        | 0.836     | 0.769  | 0.801     | 9.1

The runtime of our proposed method is also compared with the state-of-the-art methods on ICDAR2015. All experiments are conducted on a PC equipped with a single GeForce GTX 1080Ti GPU (11 GB RAM). As shown in Table 1, the proposed method achieves 9.1 FPS (frames per second), which is faster than the other methods. The method proposed in [25] performs at 8.9 FPS, a similar runtime, but its F-measure is slightly lower than that of the proposed method. After a general comparison with these methods, our method has a better trade-off between runtime and performance.

The MSRA-TD500 is a multi-orientation dataset containing Chinese and English. We fine-tune our model on MSRA-TD500 and evaluate the performance of our proposed method. In testing, all images are resized to 768 × 768. The detection results are shown in Table 2. Our method achieves competitive results compared to other approaches. However, the method proposed by Lyu et al. [28] achieves better results as it uses additional information (e.g., MSRA-TD400) to train its model. Our method achieves an F-measure of 0.762, slightly lower than the second best result, while achieving a detection speed of 12.2 FPS, which is obviously faster than the other methods. This may be due to the following reasons: (1) the training images in MSRA-TD500 are very limited; although we increased the amount of data through data augmentation, it is still not enough to learn a deep model; (2) our method is developed to handle the task of video text detection, so it makes a trade-off between accuracy and efficiency.

Table 2. Performances of different text detection methods evaluated on MSRA-TD500. "†" means that the model is trained with additional information.

Method            | Precision | Recall | F-measure | FPS
Yao et al. [10]   | 0.63      | 0.63   | 0.60      | 0.14
Yin et al. [57]   | 0.71      | 0.61   | 0.65      | 1.25
Yin et al. [11]   | 0.81      | 0.63   | 0.74      | 0.71
Zhang et al. [32] | 0.83      | 0.67   | 0.74      | 0.48
He et al. [29]    | 0.70      | 0.77   | 0.74      | 1.1
EAST† [30]        | 0.827     | 0.616  | 0.702     | 6.52
SegLink [20]      | 0.86      | 0.70   | 0.77      | 8.9
Lyu et al.† [28]  | 0.876     | 0.762  | 0.815     | 5.7
Our method        | 0.836     | 0.70   | 0.762     | 12.2

To evaluate the generalization ability of our model, we also evaluate our method on USTB-SV1K with the model fine-tuned on ICDAR2015. USTB-SV1K is a challenging dataset with low-resolution scene images. As some images in USTB-SV1K do not have ground-truth supervision, we cannot fine-tune the deep model on USTB-SV1K. The experimental results on USTB-SV1K are given in Table 3. Without fine-tuning, the proposed method achieves the best performance compared to the state-of-the-art methods. More specifically, our method achieves an F-measure of 0.70, which is approximately 19% (from 0.51 to 0.70) higher than the conventional method (the second best method). This result shows that our method can detect the text more accurately. Moreover, our proposed method can better detect watermark text, which has not been labelled in the ground truth; Fig. 6(a) illustrates this situation.

Table 3. Performances of different text detection methods evaluated on USTB-SV1K.

Method           | Year | Precision | Recall | F-measure
Yin et al. [57]  | 2013 | 0.45      | 0.45   | 0.45
Yao et al. [58]  | 2014 | 0.44      | 0.46   | 0.45
Yin et al. [11]  | 2015 | 0.45      | 0.50   | 0.48
Tian et al. [16] | 2017 | 0.54      | 0.49   | 0.51
Our method       | –    | 0.73      | 0.67   | 0.70


Table 4. Comparative results for text detection on the Minetto dataset. Each cell gives Precision / Recall / F-measure.

Video   | Minetto et al. [55] | Zuo et al. [7]     | Yang et al. [40]   | Yang et al. [8]    | Proposed Strategy I | Proposed Strategy II
v1      | 0.55 / 0.80 / 0.63  | 0.82 / 0.62 / 0.71 | 0.82 / 0.70 / 0.76 | 0.92 / 0.83 / 0.88 | 0.96 / 0.93 / 0.94  | 0.98 / 0.93 / 0.95
v2      | 0.57 / 0.74 / 0.64  | 0.90 / 0.80 / 0.85 | 0.92 / 0.81 / 0.86 | 0.89 / 0.93 / 0.91 | 0.97 / 0.99 / 0.98  | 0.97 / 0.99 / 0.98
v3      | 0.60 / 0.53 / 0.56  | 0.75 / 0.60 / 0.67 | 0.73 / 0.64 / 0.68 | 0.84 / 0.84 / 0.84 | 0.82 / 0.86 / 0.84  | 0.83 / 0.86 / 0.84
v4      | 0.73 / 0.70 / 0.71  | 0.83 / 0.77 / 0.80 | 0.88 / 0.82 / 0.85 | 0.96 / 0.82 / 0.89 | 0.98 / 0.87 / 0.92  | 0.99 / 0.87 / 0.93
v5      | 0.60 / 0.70 / 0.63  | 0.88 / 0.62 / 0.72 | 0.89 / 0.87 / 0.88 | 0.84 / 0.76 / 0.80 | 0.83 / 0.87 / 0.85  | 0.89 / 0.87 / 0.88
Average | 0.61 / 0.69 / 0.63  | 0.84 / 0.68 / 0.75 | 0.85 / 0.77 / 0.81 | 0.89 / 0.84 / 0.86 | 0.91 / 0.90 / 0.90  | 0.93 / 0.90 / 0.91

Table 5. Comparative results for text tracking on the Minetto dataset. Each cell gives MOTP / MOTA / IDS.

Video | Zuo et al. [7]     | Pei et al. [9]     | Proposed
v1    | 0.76 / 0.74 / 15   | 0.76 / 0.75 / 14   | 0.77 / 0.86 / 4
v2    | 0.82 / 0.97 / 5    | 0.82 / 0.97 / 3    | 0.85 / 0.98 / 3
v3    | 0.70 / 0.38 / 81   | 0.70 / 0.43 / 46   | 0.76 / 0.63 / 26
v4    | 0.65 / 0.36 / 0    | 0.65 / 0.38 / 0    | 0.75 / 0.69 / 2
v5    | 0.72 / 0.36 / 0    | 0.72 / 0.36 / 0    | 0.83 / 0.63 / 0
Avg.  | 0.73 / 0.56 / 20.2 | 0.73 / 0.59 / 12.6 | 0.79 / 0.76 / 7

5.4. Comparisons of text tracking

In this section, we evaluate our tracking method in two cases. Firstly, the detection results promoted by tracking are reported to observe the impact of tracking on detection. Secondly, the performance of tracking is individually evaluated against existing tracking approaches. For the evaluation, a publicly available video dataset (i.e., Minetto's dataset) [55] is employed in both cases. Texts in this dataset are affected by distortions, occlusions and illumination.

Table 4 shows the performance comparison of the proposed method with other methods. Two strategies are employed in our proposed method: (1) Strategy I: text detection without text tracking (using single static frames); (2) Strategy II: text detection and tracking using multiple frames. The results show that our text detection method, without considering the tracking procedure, outperforms the other previous methods on four of the five videos. The average performance of Strategy I is 4% higher than the method in [8]. Then, by combining the text tracking, false alarms are reduced and the accuracy of text detection is further improved. In this case, our proposed method achieves the best performance over all five videos. As seen from Table 4, the average Precision of Strategy II is increased by 2%. Compared to the other methods, our proposed text tracking method achieves state-of-the-art performances in precision, recall and F-measure (93%, 90% and 91%).

Moreover, we also perform experiments to evaluate our text tracking method. The evaluation metrics, i.e., Multi-Object Tracking Precision (MOTP) and Multi-Object Tracking Accuracy (MOTA) in [59], are widely adopted in text tracking. Specifically, MOTP measures the average overlap between tracked objects and the ground truth, and MOTA measures the tracking trajectories against the real trajectories, combining false negatives, false positives and the mismatch rate. In addition, we also consider the number of times that a tracked trajectory changes its matched ground-truth identity (IDS). Table 5 shows the performance of the proposed method compared to other methods. From the results, the proposed method shows much better performance in terms of MOTP, MOTA and IDS. We believe this may be due to the following reasons: (1) our text detector can better locate the text regions in each video frame, which leads to fewer mis-detections and more stable tracking trajectories; (2) the layout constraint based text tracking has a better ability to discriminate different text blocks under unpredictable camera motion, even though the text blocks are close to each other and share similar appearance features; (3) using the layout constraint of text between frames, we can infer the missed text and re-track it from the tracked trajectories, which makes the IDS of the proposed method significantly lower than that of the state-of-the-art methods. Fig. 7(a) illustrates this advantage, in which more than 15 text blocks are involved for tracking in one scene.

5.5. Effect of different components

To validate the effectiveness of the text-sensitive segmentation module in our proposed text detection network, we conduct an experiment to evaluate different variants of our method. The component comparisons are summarized as follows: (1) proposed method without segmentation module, i.e., the architecture with feature extraction, text classification and bounding box regression; (2) proposed method with segmentation module, i.e., the proposed architecture with the text-sensitive segmentation module. The segmentation module is used to enrich semantic information and potentially improve the ability to detect small text. Thus, we conduct an ablation study on the ICDAR2015 dataset, a typical dataset which mainly consists of small and oriented text. The quantitative comparison is shown in Table 6.

Table 6. Evaluation of the text-sensitive segmentation module on the ICDAR2015 dataset.

Method                                      | Precision | Recall | F-measure
Proposed method without segmentation module | 0.825     | 0.754  | 0.788
Proposed method with segmentation module    | 0.836     | 0.769  | 0.801

The results show that the architecture with text-sensitive segmentation improves both recall and precision (the F-measure increases by approximately 1.3%). Therefore, the text-sensitive segmentation component is useful for text detection. The reason is that the text-sensitive segmentation module can provide rich semantic information at the low-level detection feature maps. This can potentially improve the network's ability to detect small text. On the other hand, the segmentation map is used to filter out non-text regions, further improving detection accuracy.


Fig. 5. Text tracking performance with respect to different detection loss rates. The loss rate of the detections is set to 0%, 5%, 10%, 15% and 20%. The proposed method with layout constraints shows better overall performance.

Fig. 6. Successful detection results of the proposed method and failure cases. (a) Some successful detection results on ICDAR2015 (first row), MSRA-TD500 (second row) and USTB-SV1K (third row). (b) Failure cases of our method. Green solid boxes: correct detections; Red solid boxes: false detections; Green dashed boxes: missed ground truths.

In order to verify the effectiveness of the layout constraints, we conduct experiments comparing against the tracking method without layout constraints. We choose the third video in Minetto's dataset because it involves more than 15 text blocks in one scene and is accompanied by fast camera motion. In addition, in order to verify the stability of the tracking algorithm when text regions are frequently missed, we randomly reduce the number of detected text regions in each frame by fixed ratios.
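This detection-dropout protocol can be reproduced with a few lines. The helper below is an illustrative sketch (its name and the use of Python's random module are ours); the loss ratios of 0–20% are those used in this experiment.

```python
import random

def drop_detections(detections, loss_ratio, rng=random):
    """Randomly discard a fixed fraction of a frame's detections to simulate mis-detection.

    loss_ratio: fraction of detections removed per frame
                (0.0, 0.05, 0.10, 0.15 or 0.20 in this experiment).
    """
    keep = max(0, round(len(detections) * (1.0 - loss_ratio)))
    return rng.sample(detections, keep)
```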


Fig. 7. Qualitative results of the proposed text tracking method and a failure case. (a) Text tracking on the Minetto video dataset (first row) and the ICDAR 2015 video dataset (second and third rows). (b) A failure case found in the tracking process. The dotted lines indicate that the target has changed, i.e., the text region in the next frame does not match the correct trajectory.

The frequency of text region loss is represented by the detection loss ratio, which is set to 0%, 5%, 10%, 15% and 20%, respectively. To measure the accuracy of text tracking, the MOTA is used in our experiments. The performance of the two approaches is shown in Fig. 5. It can be seen that in the case of high recall (the detection loss ratio is 0%), the method with layout constraints slightly increases the MOTA by 1.3% (from 0.618 to 0.631). This is because the text detection algorithm provides accurate detections in continuous video frames, ensuring the continuity of text trajectories and reducing the tracking difficulty, as pointed out in [60]. However, when the detection loss ratio increases gradually, the improvement brought by the layout constraints becomes more apparent. As shown in Fig. 5, the method with layout constraints increases the MOTA by 11% when the detection loss ratio is set to 20%. This proves that, when detections are frequently lost, adding layout constraints can effectively reduce trajectory interruptions and mismatches.

5.6. Qualitative analysis

Some qualitative comparisons are illustrated in Fig. 6(a). Some successful detection results on ICDAR2015 are shown in the first row. The results indicate that our method can effectively detect the text in scene images. Specifically, the proposed method is also robust to some challenging cases, such as text with different font sizes, complex backgrounds and non-uniform illumination. Furthermore, we illustrate some successful detection results on MSRA-TD500 in the second row. The results show that our method has the ability to detect long text with larger variance in orientation.

In USTB-SV1K, the low resolution is a great challenge for small text. The third row shows that our method can handle such text. It can be seen that the proposed method has good generalization ability and can detect text accurately. Although the proposed method has achieved good performance in text detection, it still fails to handle some difficult conditions. Some failure samples are shown in Fig. 6(b). For instance, when the character spacing is too large, the detection process is unable to locate the text line. In contrast, when the spacing between words is too small, the proposed method cannot distinguish these adjacent words well. Our method also fails to detect some single characters due to the lack of data. In addition, different from some segmentation-based methods, our method struggles to detect curved text. The reason is that curved text cannot be described by a quadrilateral, which limits all methods based on text box regression.

Some qualitative results of the proposed text tracking method are shown in Fig. 7(a). The first row shows the tracking results on the Minetto dataset, which involves more than 15 text blocks for tracking in one scene. From the results, we can conclude that our method can deal with multiple text regions under large camera motion. Furthermore, we also show some qualitative results on the ICDAR 2015 video dataset in the second and third rows. These videos are obtained from the ICDAR 2015 video scene text dataset (Robust Reading Competition Challenge 3), which contains a training set of 25 videos (13,450 frames in total) and a test set of 24 videos (14,373 frames in total). Unfortunately, the official evaluation interface is closed, which leads to the lack of quantitative experimental results.


As shown in Fig. 7(a), the proposed method performs better for text tracking under arbitrary camera motion due to the layout constraint. The failure situation found in the tracking process is shown in Fig. 7(b). When the text detection fails, the tracked text region in the next frame is split into two small regions. Especially when the two small regions are extremely close, the proposed tracking method might fail: the tracking target might be changed, which means that the text region in the next frame does not match the correct trajectory. The reason is that when the tracking target separates into two parts, the existing layout constraints can only match one target, which leads to a mismatch of the trajectory.

6. Conclusion

In this work, we proposed a novel video text tracking approach leveraging hybrid deep text detection and layout constraint. For locating text in individual frames, we proposed a deep text detection network that combines the advantages of object detection and semantic segmentation in a hybrid way. Then, text trajectories were derived from consecutive frames with a new data association method, which effectively exploits the layout constraint of text regions under large camera motion. By utilizing the layout constraint, the robustness of text tracking was further improved as the mis-detections and false positives were effectively reduced. Moreover, the experimental results demonstrated the effectiveness of the proposed method for scene video text detection and tracking. In the future, we are interested in constructing an on-line system based on the proposed method. Another future work will integrate the on-line system into mobile devices.

Declarations of interest

None.

Acknowledgment

This paper is partly supported by the National Natural Science Foundation of China (No. 61702419), and the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JQ6090).

References

[1] X.C. Yin, Z.Y. Zuo, S. Tian, C.L. Liu, Text detection, tracking and recognition in video: a comprehensive survey, IEEE Trans. Image Process. 25 (6) (2016) 2752–2773.
[2] H. Goto, M. Tanaka, Text-tracking wearable camera system for the blind, in: Proceedings of the IEEE International Conference on Document Analysis and Recognition, 2009, pp. 141–145.
[3] C. Yi, Y. Tian, Scene text recognition in mobile applications by character descriptor and structure configuration, IEEE Trans. Image Process. 23 (7) (2014) 2972–2982.
[4] D. Letourneau, F. Michaud, J.M. Valin, C. Proulx, Textual message read by a mobile robot, in: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, vol. 3, 2003, pp. 2724–2729.
[5] W. Wu, X. Chen, J. Yang, Incremental detection of text on road signs from video with application to a driving assistant system, in: Proceedings of the ACM International Conference on Multimedia, 2004, pp. 852–859.
[6] L. Gómez, D. Karatzas, MSER-based real-time text detection and tracking, in: Proceedings of the IEEE International Conference on Pattern Recognition, 2014, pp. 3110–3115.
[7] Z.Y. Zuo, S. Tian, W. Pei, X.C. Yin, Multi-strategy tracking based text detection in scene videos, in: Proceedings of the IEEE International Conference on Document Analysis and Recognition, 2015, pp. 66–70.
[8] X.H. Yang, W. He, F. Yin, C.L. Liu, A unified video text detection method with network flow, in: Proceedings of the International Conference on Document Analysis and Recognition, 2017, pp. 331–336.
[9] W. Pei, C. Yang, L. Meng, J. Hou, S. Tian, X. Yin, Scene video text tracking with graph matching, IEEE Access 6 (2018) 19419–19426.
Xihan Wang is currently pursuing the Ph.D. degree with the School of Electronics and Information, Northwestern Polytechnical University, China. His research interests include video text detection, text tracking, and pattern recognition.

Xiaoyi Feng received the M.S. degree from Northwest University, Xi'an, China, in 1994, and the Ph.D. degree from Northwestern Polytechnical University, Xi'an, China, in 2001. She has been a professor with the School of Electronics and Information, Northwestern Polytechnical University, since 2008. She has authored or co-authored more than 50 papers in journals and conferences. Her current research interests include computer vision, image processing, radar imagery and recognition.

Zhaoqiang Xia received the B.E. and Ph.D. degrees from Northwestern Polytechnical University, Xi'an, China, in 2008 and 2014, respectively. He was a visiting scholar at the University of North Carolina at Charlotte from 2011 to 2013. He is currently an associate professor in the School of Electronics and Information, Northwestern Polytechnical University. He has authored or co-authored more than 40 papers in journals and conferences, and has served as a reviewer for international journals and conferences. His current research interests include image processing, visual search and recognition, and statistical machine learning.