Automatic text detection and removal in video sequences

Pattern Recognition Letters 24 (2003) 2607–2623 www.elsevier.com/locate/patrec

Chang Woo Lee a, Keechul Jung b, Hang Joon Kim a,*

a Department of Computer Engineering, Kyungpook National University, 1370 Sangyuk-dong, Puk-gu, Daegu, 702-701, Republic of Korea
b School of Media, College of Information Science, Soongsil University, Seoul, Republic of Korea

* Corresponding author. Tel.: +82-53-940-8694; fax: +82-53-957-4846. E-mail address: [email protected] (H.J. Kim).

Received 14 February 2003; received in revised form 30 April 2003

Abstract

This paper proposes an approach for automatic text detection and removal in video sequences based on support vector machines (SVMs) and spatiotemporal restoration. Given two consecutive frames, first, text regions in the current frame are detected by an SVM-based texture classifier. Second, two stages are performed for the restoration of the regions occluded by the detected text regions: temporal restoration in consecutive frames and spatial restoration in the current frame. Utilizing text motion and background difference, an input video sequence is classified and a different temporal restoration scheme is applied to the sequence. Such a combination of temporal restoration and spatial restoration shows great potential for automatic detection and removal of objects of interest in various kinds of video sequences, and is applicable to many applications such as translation of captions and replacement of indirect advertisements in videos.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Text detection; Text removal; Motion estimation; Spatiotemporal restoration; Support vector machines

1. Introduction

These days, indirect advertisements are prevalent and easily found in TV programs and movie scenes. Examples include logos or banners on clothes, electrical home appliances, furniture, etc. In cases in which indirect advertisement is not permitted, such items are conventionally erased manually after shooting, or covered with tape before shooting. Unlike such methods, several approaches to the automatic detection and removal of objects of interest have recently been proposed in the areas of image processing, computer vision, and computer graphics. Objects of interest in these fields include blotches, scratches, flaws, noise, or text characters in images or videos. Detection and removal of blotches and line scratches in an image sequence (Kokaram et al., 1995) is closely related to text detection and removal in video sequences. However, the text included in images is usually bigger than blotches and scratches, and it is therefore inappropriate to use blotch/scratch-oriented techniques for text removal. Bertalmio and Sapiro (2000) presented a


method that used a partial differential equation for the inpainting problem, and applied it to remove pre-specified regions and artificially inserted text in images. Chan and Shen (2001) proposed a similar approach based on curvature-driven diffusions. As an application to flaw removal in photographs, films, and images, Wei and Levoy (2000) used tree-structured vector quantization for texture synthesis. These approaches, however, hardly produce good results when the areas surrounding the region to be restored are poorly textured. Irani and Peleg (1993) used image motion information extracted from an image sequence for image enhancement and reconstruction of occlusions. Chun and Bae (2001) presented the motion-compensated recovery of an occluded region for caption replacement. These investigations showed interesting results for occluded background recovery. However, these techniques used multiple frames for occlusion recovery, and therefore took a long time to estimate motion information. Additionally, when caption and background regions span many frames together without moving, these methods are inapplicable to the restoration of background regions occluded by captions.

In this paper, we describe an automatic text detection and removal technique for video sequences. We need two techniques for this purpose: one for detecting text regions in an image, and the other for removing the text regions and simultaneously restoring the background occluded by the text in the video sequence. For detecting the text in an image, we use a support vector machine (SVM)-based text detection method. For restoring the background regions, we perform temporal restoration between consecutive frames and then spatial restoration in the rest. 2D motion information extracted with a block matching algorithm (BMA) is used for temporal restoration, and the residual pixels remaining after temporal restoration are restored using an inpainting algorithm (Bertalmio and Sapiro, 2000). We call this combination spatiotemporal restoration. The reasons for using a combined approach of temporal and spatial restoration can be stated as follows: (1) the text regions in the compressed domain, such as MPEG, are usually

extremely degraded, and it is therefore difficult to get reasonable results using only the spatial information in a frame for the restoration of the occluded region; (2) although using several consecutive frames for the restoration may give better results, the computational burden of extracting motion information continuously through several frames is relatively high; and (3) for scene text and for stationary text on a stationary background, temporal information is not available. For these reasons, we use a combined method of temporal and spatial restoration. Fig. 1 shows how the automatic text detection and removal is performed using the combined technique. In the sequence classification step, an input video sequence is classified as one of four types utilizing the text motion and the background frame difference, so that background motions are estimated selectively; this makes the restoration more accurate. The remainder of this paper is organized as follows. Section 2 describes our text detection method with a texture classifier and several heuristics. The restoration of the occluded backgrounds using spatiotemporal restoration is then described in Section 3. Section 4 discusses the experimental results, and Section 5 draws conclusions and outlines future work.

2. Text detection

There are two primary methods for text detection in images: the connected component-based method (Lienhart and Stuber, 1996; Jain and Yu, 1998; Kim et al., 2000) and the texture-based method (Richard and Lippmann, 1991; Zhong et al., 1995; Jain and Karu, 1996; Jeong et al., 1999; Li et al., 2000; Zhong et al., 2000; Hasan and Karam, 2000; Jung, 2001; Jung and Han, 2001; Jung et al., 2002). Connected component-based methods group small components into successively larger components until all blocks are identified in the image, and then analyze the geometrical arrangement of the components that belong to characters. However, connected component-based methods are inappropriate for low-resolution and noisy video documents because they depend on the effectiveness of the segmentation method, which


Fig. 1. Flow of the combined technique for spatiotemporal restoration.

should guarantee that a character is segmented into a few connected components (Zhong et al., 2000). On the other hand, texture-based methods regard text regions as textured objects and apply Gabor filters, wavelets, the FFT, and spatial variance to exploit their textural properties. These uses of textural information for text detection are sensitive to font style, size, and color; it is therefore difficult to generate a texture filter set appropriate for each application. In this paper, we use an SVM-based texture classifier to detect text in video images. We present a brief description of the method in this section; the reader may refer to (Hastie et al., 2001; Kim et al., 2002; Schölkopf and Smola, 2002) for more details.

2.1. Support vector machines for texture classification

SVMs have recently been introduced as a method for pattern classification and non-linear regression. Given a set of labeled training examples $(x_i, y_i) \in \mathbb{R}^N \times \{\pm 1\}$, $i = 1, \ldots, l$, where $x_i$ is the input pattern for the $i$th example and $y_i$ is the corresponding desired response, an SVM finds the optimal hyperplane using the training sample, subject to the constraint $y_i(w^T x_i + b) \geq 1$ for $i = 1, 2, \ldots, l$, where $w$ is a weight vector and $b$ is a bias. That is, an SVM constructs a linear classifier in a feature space $F$ that is non-linearly related to the input space via a map $\Phi : \mathbb{R}^N \to F$. Originally, the SVM is a linear classifier. The basic idea of applying an SVM to a non-linear problem is to map the data into $F$ non-linearly, where the (linear) SVM is constructed. To ensure the linear separability of the data in $F$, the dimensionality of $F$ should be large (Cover, 1965). The practical computations of this high-dimensional mapping and of the SVM in $F$ are performed indirectly and economically with the help of kernel functions; for more details, readers are referred to (Schölkopf and Smola, 2002). The classifier is then identified as a canonical hyperplane (Vapnik, 1998) in $F$ that correctly separates the largest fraction of data points while maximizing the margin (the distance between the convex hulls of the two classes), and is represented in the input space as

$$f(x) = \sum_{i=1}^{l} y_i \alpha_i \, \Phi(x_i^*) \cdot \Phi(x) + b, \qquad (1)$$

where $\{x_i^*\}_{i=1}^{l}$ are the support vectors (SVs), the examples $x_i$ nearest to the optimal hyperplane, which correspond to a non-zero $\alpha_i$. The coefficients $\alpha_i$ and $b$ are determined by solving a large-scale quadratic programming problem (Vapnik, 1998). The appeal of SVMs lies in their strong connection to the underlying statistical learning theory. According to the structural risk minimization principle (Vapnik, 1998), a function that can classify training data accurately and which belongs to a set of functions with the lowest capacity (particularly in the VC-dimension, a measure of the capacity or expressive power of the family of classification functions realized by the learning machine) will generalize best, regardless of the dimensionality of the input space. In the case of a canonical hyperplane, minimizing the VC-dimension corresponds to maximizing the margin. As a result, for many applications, SVMs have been shown to provide a better generalization performance than traditional techniques, including neural networks and radial basis function networks (Schölkopf et al., 1997; Haykin, 1999).

For computational efficiency, the mapping $\Phi$ is often implicitly performed using a kernel function $k$ defined as $k(x, y) = \Phi(x) \cdot \Phi(y)$. Then, by selecting proper kernels $k$, various mappings or feature extractions $\Phi$ can be indirectly induced (Vapnik, 1998). One method of feature extraction is achieved by taking the $p$-order correlations between the entries $x_i$ of an input vector $x$. If $x$ represents an image pattern of raw pixel values, this amounts to mapping the input space into the space of $p$th order products (monomials) of the input pixels. It should be noted that these features cannot be extracted by simply computing all the correlations, since the required computation is prohibitive when $p$ is not small ($p > 2$): for $N$-dimensional input patterns, the dimensionality of the feature space $F$ is $(N + p - 1)!/(p!\,(N - 1)!)$. However, the introduction of a polynomial kernel facilitates work in this feature space, as the polynomial kernel of degree $p$, $k(x, y) = (x \cdot y)^p$, corresponds to the dot product of the feature vectors extracted by the monomial feature extractor $\Phi_p$ (Vapnik, 1998):

$$\Phi_p(x) \cdot \Phi_p(y) = \sum_{i_1, \ldots, i_p = 1}^{N} x_{i_1} \cdots x_{i_p} \, y_{i_1} \cdots y_{i_p} = \left( \sum_{i=1}^{N} x_i y_i \right)^p = (x \cdot y)^p. \qquad (2)$$
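Eqs. (1) and (2) can be made concrete with a short sketch. The following numpy fragment is only an illustration of the kernelized decision function, not the authors' code; the support vectors, coefficients, and bias are random placeholders, whereas in the paper they come from solving the quadratic programming problem on the training patterns.

```python
import numpy as np

def poly_kernel(x, y, p=3):
    """Polynomial kernel of degree p, equivalent to the dot product of
    p-th order monomial features (Eq. (2))."""
    return np.dot(x, y) ** p

def svm_decision(x, support_vectors, labels, alphas, bias, p=3):
    """Evaluate f(x) = sum_i y_i * alpha_i * k(x_i, x) + b (Eq. (1)).
    The sign of f(x) gives the class: positive for text, negative for non-text."""
    k = np.array([poly_kernel(sv, x, p) for sv in support_vectors])
    return float(np.sum(labels * alphas * k) + bias)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    svs = rng.normal(size=(5, 49))      # 5 dummy support vectors, 49 features
    ys = np.array([1, -1, 1, 1, -1])    # dummy labels
    alphas = rng.uniform(0.1, 1.0, 5)   # dummy Lagrange multipliers
    x = rng.normal(size=49)             # one input pattern
    print("f(x) =", svm_decision(x, svs, ys, alphas, bias=0.0))
```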

For the purpose of texture processing, an SVM can also be viewed as an implementation of the well-known multi-channel filtering scheme (Kim et al., 2002). The operation performed by a polynomial kernel in an SVM is essentially the same as that of a channel: the kernel first computes the inner product between its input $x$ and an SV $x_i^*$, and then applies a non-linear map $h(\cdot) = (\cdot)^p$. In this case, the SVs play the role of a filter bank. Even though they are not designed as filters for capturing specific frequency bands or orientations, they can still extract critical information for classification (Kim et al., 2002). Furthermore, they are automatically and optimally selected based on the criterion of generalization error for a given texture classification problem. Then, by selecting the SVs as special-purpose filters, the SVM can simultaneously identify an optimal feature extractor and the corresponding texture classifier based on a unified criterion of minimizing the bound on the test error. Based on these properties of an SVM, it can be inferred that SVMs do not necessarily require the explicit use of an external feature extractor, as an SVM can generalize well in high-dimensional spaces, thereby eliminating the need to reduce the dimensionality of the input to a moderate size. Furthermore, SVMs incorporate feature extractors within their own architecture and can use their non-linearly mapped input patterns ($\Phi(x)$ or $h(x \cdot x_i^*)$) as features for classification.

The architecture of an SVM texture classifier involves three layers with entirely different roles, as shown in Fig. 2. The input layer is made up of source nodes that connect the SVM to its environment. Its activation $x$ comes directly from an $M \times M$ window of gray values in the input image. However, instead of using all the pixels in the window, a configuration of autoregressive features, the shaded pixels in the lower-left part of Fig. 2, is used. (The same strategy is used in the training phase; accordingly, the SVs in the figure also include only the shaded pixels.) This reduces the size of the feature vector from $M^2$ to $4M - 3$ and results in improved generalization performance and classification speed. The hidden layer calculates the inner products between the input and the SVs and applies the non-linear function; the size of the hidden layer $m$ is determined as the number of SVs identified during the training phase. Taking every pixel in the image as the center of an $M \times M$ window, the texture classifier exhaustively scans the entire image to classify each pixel as text or non-text. The output is obtained by computing the inner product of the input pattern $x$ with each SV $x_i^*$ and then the weighted sum of these inner products with weights $y_i \alpha_i$, as shown in Fig. 2. The sign of the output $y$, obtained by weighting the activation of the hidden layer, then represents the class of the central pixel in the input window. For training, +1 was assigned to the text class and -1 to the non-text class; accordingly, if the SVM output for an input pattern is positive, it is classified as text. The parameters to be controlled in training the SVM are the polynomial degree $p$ and the input window size $M$. It should be remembered that the strength of an SVM is based on its automatic capacity control in terms of the VC-dimension.

Fig. 2. Architecture of color texture classifier.
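The exhaustive pixel-by-pixel scan described above can be sketched as follows. This is an illustrative fragment only: the exact shaded autoregressive pattern of Fig. 2 is not specified in the text, so the feature selection below (centre row, centre column, and both diagonals of the window) is an assumption, and decision_fn stands for any trained SVM decision function, e.g. the svm_decision sketched earlier.

```python
import numpy as np

def window_features(image, r, c, M=13):
    """Gather an autoregressive-style feature subset from the M x M window
    centred at (r, c): the centre row, centre column and both diagonals
    (about 4M - 3 distinct gray values instead of all M^2 pixels)."""
    h = M // 2
    win = image[r - h:r + h + 1, c - h:c + h + 1].astype(float)
    return np.concatenate([win[h, :],                 # centre row
                           win[:, h],                 # centre column
                           np.diag(win),              # main diagonal
                           np.diag(np.fliplr(win))])  # anti-diagonal

def classify_pixels(image, decision_fn, M=13):
    """Slide the M x M window over every pixel and mark it as text (1)
    when the SVM output is positive, as described in Section 2.1."""
    h = M // 2
    out = np.zeros(image.shape, dtype=np.uint8)
    for r in range(h, image.shape[0] - h):
        for c in range(h, image.shape[1] - h):
            if decision_fn(window_features(image, r, c, M)) > 0:
                out[r, c] = 1
    return out
```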


This capacity control takes place within a class of functions $\{f_\alpha : \alpha \in \Lambda\}$ specified a priori by the choice of $F$, or equivalently of $p$ and $M$ in the proposed method. When the training errors are negligible, as in the case of texture classification with 2000 randomly selected text and non-text patterns, the upper bound of the VC-dimension of $f_\alpha$ can be estimated by a term proportional to $r^2 \|w\|^2$, where $r$ is the radius of the smallest ball containing all the data points and $w$ is the weight vector in $F$ (Vapnik, 1998). Since both $r$ and $w$ depend on the shape of $F$, the VC-dimension depends on $p$ and $M$. Both $r^2$ and $\|w\|^2$ can be computed with kernels, without working directly in $F$ (Schölkopf et al., 1995). The estimated upper bound of the VC-dimension can then constitute the cost functional that should be minimized to obtain the best parameters. About 40 different parameter configurations were tested, in which $p$ and $M$ varied over $\{1, 2, 3, 4, 5\}$ and $\{5, 7, 9, 11, 13, 15, 19, 21\}$, respectively, and the best parameter set was found to be $(p = 3, M = 13)$.

2.2. Text region simplification

As in previous works (Jeong et al., 1999), we also simplify the output of the SVM. The result of the text region simplification is a set of rectangular bounding boxes representing the text regions. First, we produce the potential text regions using four-connected component analysis. The boundaries of each potential text region are then adjusted using the histogram distribution of the text pixels included in each bounding box. For example, at the left boundary, the new boundary is set to the left-most position at which the vertical projection value of text pixels exceeds 1/10 of the width of that bounding box. The same strategy is applied to the left, right, top, and bottom boundaries of all the bounding boxes. Then, among the potential text regions, non-text regions that can hardly include text are removed using several heuristics: (1) in a text region, the number of pixels classified as text pixels should be larger than 1/3 of the area of the bounding box; (2) the height of a text region should be larger than 1/30 of the height of the image; and (3) the width of a text region should exceed half of the height of the text region. Regions violating any of these three rules are discarded. The results of the bounding box filtering step are rectangular bounding boxes representing text regions. The intermediate results of text region detection are shown in Fig. 3. Fig. 3(a) shows an input image, (b) is the output of the SVM, in which pixels classified as text are white, (c) is the simplification of the SVM output using four-connected component analysis, in which one rectangular box represents one connected component and each bounding box represents a potential text region, and (d) is the final result of text detection, in which each connected component violating the heuristics has been filtered out after the boundaries of the bounding boxes were adjusted.

2.3. Occluded region specification

The target region of the spatiotemporal restoration is not the whole of a text region, but only the text pixels in the region, excluding the background. In order to extract only text pixels from a text region, we use a simple thresholding scheme on the assumption that the text in a text region is brighter than the background. If the text were similar in color to the background of the image, the text could not be read properly. In general, to discriminate text from the background of an image, the text has highlighting pixels whose color differs from that of both the text and the background. For example, as shown in Fig. 4(a), the artificially inserted text has black boundary pixels to emphasize it and distinguish it from the similarly colored background. The pixels surrounding the text are affected by these highlighting pixels during compression, so that the results of spatial restoration are hardly reliable; this is because the spatial restoration algorithm uses only the surrounding pixel information of the regions to be restored. Therefore, to include the highlighting pixels, remove small holes, and use more accurate surrounding information of the region to be restored in the spatial restoration, we expand the extracted text pixels by a morphological operation (dilation). For example, three or more pixels are inserted to highlight text pixels, as shown in Fig. 4(a).


Fig. 3. Intermediate results for detecting text regions: (a) input image sized 720 × 400, (b) output of the SVM filter, (c) result of four-connected component analysis, and (d) result of text detection after bounding box filtering.

Fig. 4. Results of occluded region specification: (a) example of highlighting pixels, (b) extracted text pixels, and (c) result of expanding.

Fig. 4(b), in which white pixels are text pixels, shows the extracted text pixels, not including the highlighting pixels. Fig. 4(c) is an example of expanded text, using a 3 × 3 structuring element for the dilation. We call these expanded text regions the occluded regions.
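The occluded-region specification of Section 2.3 amounts to a threshold followed by a dilation. The fragment below is a minimal sketch under the paper's assumption that text is brighter than its background; the threshold value used here is only an example, not a value from the paper.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def occluded_region_mask(gray_region, text_threshold=200):
    """Sketch of the occluded-region specification: threshold the gray values
    inside a detected text box, then dilate with a 3 x 3 structuring element
    to take in the highlighting pixels and close small holes."""
    text_pixels = gray_region >= text_threshold
    occluded = binary_dilation(text_pixels, structure=np.ones((3, 3), bool))
    return occluded
```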


3. Text removal using spatiotemporal restoration

Since temporal information is not always available, we first determine whether it is available for temporal restoration. According to the availability of temporal information, an input sequence is classified into one of four types: stationary text on a stationary background, stationary text on a dynamically varying background, moving text on a stationary background, and moving text on a dynamically varying background. After an input sequence is classified, a different spatiotemporal restoration scheme is applied to each classified sequence. We perform occluded region restoration using a combined method of temporal restoration in consecutive frames and spatial restoration in the residual regions. The spatiotemporal restoration is computed as a weighted sum of the temporal and spatial restorations, as in Eq. (3):

$$\hat{I}(i, j, t) = \alpha \hat{I}_T(i, j, t) + (1 - \alpha) \hat{I}_S(i, j, t), \qquad (3)$$

where $\alpha$ is a weighting coefficient, and $\hat{I}_T(i, j, t)$ and $\hat{I}_S(i, j, t)$ are the results of temporal restoration and spatial restoration, respectively, at time $t$ for the pixel at position $(i, j)$.

3.1. Sequence classification

Most of the text in a video sequence has several characteristics. For example, sometimes it is fixed in one position, sometimes it is not, and sometimes it is scrolled horizontally or vertically across consecutive frames. Therefore, if we make use of these characteristics for spatiotemporal restoration, the computational time can be reduced and more accurate motion estimation can be achieved. For example, if it is known in advance that the background is stationary, the time for estimating background motion vectors becomes unnecessary. Furthermore, when a sequence has frame differences in background regions resulting from noise rather than from real movements of the camera or objects, incorrect temporal restoration produced by misestimated motion can be prevented. Thus the occluded regions are restored using accurate motion information in the temporal restoration step. In this step, an input sequence is classified using simple classification rules in which text motion and background frame difference are utilized. Table 1 shows the rules for classifying an input image sequence. Given an input video sequence, we first estimate the text motion in each detected text region for the sequence classification. If the estimated text motion is greater than or equal to T1, which is set to 1 in the proposed method, the text moves more than a pixel in some direction between the two frames, so we consider that text motion exists and classify the sequence as a moving text sequence. After determining whether the text moves or not, background subtraction between the two consecutive frames is performed to see whether the background moves. If the frame difference of the background is less than T2, the sequence is classified as a stationary background sequence. The background motion is then selectively estimated using the BMA only if a background frame difference exists. To determine the threshold value T2, we use an adaptive thresholding technique, a histogram-based approach for motion detection (Habili et al., 2000). Due to noise and changes of illumination in addition to the true motion of objects or the camera, not all the non-zero differences in a frame difference image are induced by real motion. To overcome this problem and estimate only motion that actually occurs, the threshold value is determined with the adaptive thresholding technique.
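Expressed as code, the classification rules of Table 1, shown next, reduce to two threshold tests. The sketch below is only an illustrative rendering: T1 = 1 pixel follows the paper, whereas T2 is fixed to a sample value here instead of the adaptive threshold (Habili et al., 2000) used in the paper.

```python
def classify_sequence(text_motion, bg_frame_diff, T1=1.0, T2=12.0):
    """Four-type sequence classification of Table 1."""
    text_moves = text_motion >= T1
    bg_varies = bg_frame_diff >= T2
    if not text_moves and not bg_varies:
        return "stationary text on stationary background"
    if not text_moves and bg_varies:
        return "stationary text on varying background"
    if text_moves and not bg_varies:
        return "moving text on stationary background"
    return "moving text on varying background"
```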

Table 1
Four-type classification rules of a sequence for spatiotemporal restoration

Text motion | Frame difference in background region | Type
< T1        | < T2                                  | Stationary text on stationary background
< T1        | ≥ T2                                  | Stationary text on varying background
≥ T1        | < T2                                  | Moving text on stationary background
≥ T1        | ≥ T2                                  | Moving text on varying background

Fig. 5 shows an example sequence of moving text on a stationary background in which the characteristics of the text and background are distinctly exposed. Figs. 5(a) and (b) are a current frame and a previous frame, respectively. Fig. 5(c) is the binary difference image between (a) and (b), in which gray level 255 is assigned to the non-zero differences and 0 to the zero differences, and (d) is the frame difference image in which the threshold value 12, generated by the adaptive thresholding technique, is applied to the non-zero differences. That is, white pixels possess a difference of greater than 12 or less than -12.

3.2. Motion estimation

Since no temporal information for text removal is available in the cases of scene text or stationary text on a stationary background, we have to restore the occluded regions using only a spatial restoration algorithm. In the other cases, however, we can estimate text motions and background motions, and approximate the motions of the occluded regions. Therefore, we utilize these motions for temporal restoration. In our proposed method, the text motion is estimated using a region-based model and the background motion using a block-based model. For estimating motion vectors we assume that the background and the text regions have only simple linear motion, so we can use a simple translational model, the minimum absolute difference (MAD) (Bovik, 2000), as in Eq. (4):

$$\mathrm{MAD}(d_1, d_2) = \frac{1}{N_1 N_2} \sum_{(n_1, n_2) \in B} \left| I(n_1, n_2, t) - I(n_1 + d_1, n_2 + d_2, t - 1) \right|, \qquad (4)$$

where $B$ denotes an $N_1 \times N_2$ block, $(d_1, d_2)$ is a candidate motion vector, $t$ denotes the current frame, and $t - 1$ denotes the corresponding previous frame. In text motion estimation in particular, $B$ stands for the text region and $N_1 N_2$ is the number of pixels of the text region. The displacement estimate is then given by

$$(\hat{d}_1, \hat{d}_2) = \arg\min_{(d_1, d_2)} \mathrm{MAD}(d_1, d_2), \qquad (5)$$

where $-M_1 \le d_1 \le M_1$ and $-M_2 \le d_2 \le M_2$, and $M_1$ and $M_2$ are the height and width of a search window, respectively. The resulting motion field is found by
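The exhaustive block-matching search of Eqs. (4) and (5) can be sketched as follows. This is a plain illustration of the BMA, not the authors' implementation; the paper's region-based text-motion variant simply replaces the rectangular block with the whole text region.

```python
import numpy as np

def mad(block, prev, top, left, d1, d2):
    """Mean absolute difference of Eq. (4) for one candidate displacement."""
    h, w = block.shape
    cand = prev[top + d1:top + d1 + h, left + d2:left + d2 + w]
    return np.mean(np.abs(block.astype(float) - cand.astype(float)))

def estimate_motion(curr, prev, top, left, h, w, search=16):
    """Exhaustive search of Eq. (5) within +/- 'search' pixels."""
    block = curr[top:top + h, left:left + w]
    best, best_d = np.inf, (0, 0)
    for d1 in range(-search, search + 1):
        for d2 in range(-search, search + 1):
            if (0 <= top + d1 and top + d1 + h <= prev.shape[0] and
                    0 <= left + d2 and left + d2 + w <= prev.shape[1]):
                cost = mad(block, prev, top, left, d1, d2)
                if cost < best:
                    best, best_d = cost, (d1, d2)
    return best_d, best
```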

Fig. 5. Results of adaptive thresholding on the moving text on stationary background sequence: (a) current frame, (b) previous frame, (c) frame difference image, and (d) frame difference image after applying an adaptive thresholding technique.


an exhaustive search within a small rectangular window (±16 pixels vertically and horizontally). In the background motion estimation step, even if a block includes only one text pixel, we do not estimate the background motion of that block. Instead, using Eq. (6), the motion of an occluded region, $\mathrm{MAD}_o$, is approximated by the average of its eight neighboring background motions, excluding the text blocks:

$$\mathrm{MAD}_o = \frac{1}{n} \sum_{k \notin TB} \mathrm{MAD}_k(d_1^b, d_2^b), \qquad (6)$$

where $k$ is the index of a neighboring block, $TB$ stands for the set of text blocks, and $n$ is the number of neighboring blocks that are not text blocks. If there is no background motion around a text block, we do not approximate the motion of the occluded region. Fig. 6 shows the result of motion estimation. Fig. 6(a) shows a current frame in a video sequence, and in (b) two kinds of motions exist: one is the background motion and the other is the motion of the occluded regions shown in the blocks inside the bold line. In Fig. 6(b), the gray regions represent the occluded regions whose motions are to be approximated using the motions of the surrounding background blocks. There is no text motion in this sequence.

3.3. Spatiotemporal restoration

3.3.1. Temporal restoration

To restore the occluded regions in a video sequence, we use both spatial and temporal information associated with the text regions. For the temporal restoration, we use the text motions estimated in the sequence classification and the background motions selectively estimated using Eqs. (4) and (5). The temporal restoration at time $t$ using the motion estimation results is performed as in Eq. (7):

$$\hat{I}_T(i, j, t) = I(i + d_1^b, j + d_2^b, t - 1), \qquad (7)$$

where $(i + d_1^b, j + d_2^b) \notin \mathrm{TextArea}(i, j, t - 1)$. TextArea stands for the occluded regions in each frame, and $\mathrm{TextArea}(i, j, t - 1) = \mathrm{TextArea}(i + d_1^t, j + d_2^t, t)$. $I(i, j, t - 1)$ is the pixel value at position $(i, j)$ at time $t - 1$, and $d^t$ and $d^b$ are the text motion vector and the background motion vector of the block including $(i, j)$, respectively. After forward mapping from the previous frame to the current frame using the estimated motion vectors, the text pixels of the occluded regions in the current frame are restored temporally by copying the corresponding background pixels of the previous frame. If a pixel to be copied from the previous frame is itself a text pixel, we do not copy it.

3.3.2. Spatial restoration

To remove the occluded pixels remaining after temporal restoration, we use the 'image inpainting' algorithm as the spatial restoration algorithm. The image inpainting algorithm devised by Bertalmio and Sapiro (2000) is a method for restoring damaged parts of an image and automatically filling in the selected regions with surrounding information. Since the topology of text regions is hardly known in advance and the surrounding texture is degraded in the video sequences dealt with in the proposed method, we use the inpainting algorithm, which is insensitive to the surrounding texture and the topology of the region to be filled in.
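The forward-mapping temporal restoration of Eq. (7), described in Section 3.3.1 above, is illustrated by the following sketch. It assumes a single background motion vector for simplicity; per-block motions and the weighting of Eq. (3) (alpha = 1 where a pixel is temporally restored, 0 elsewhere) are omitted, so this is not the authors' implementation.

```python
import numpy as np

def temporal_restore(curr, prev, occluded_mask, prev_text_mask, d1b, d2b):
    """Replace each occluded pixel (i, j) of the current frame by the pixel
    (i + d1b, j + d2b) of the previous frame, unless that source pixel is
    itself a text pixel (Eq. (7)). Returns the restored frame and the mask
    of pixels still missing, which go to the spatial restoration step."""
    restored = curr.copy()
    still_missing = occluded_mask.copy()
    H, W = curr.shape[:2]
    ys, xs = np.nonzero(occluded_mask)
    for i, j in zip(ys, xs):
        si, sj = i + d1b, j + d2b
        if 0 <= si < H and 0 <= sj < W and not prev_text_mask[si, sj]:
            restored[i, j] = prev[si, sj]
            still_missing[i, j] = False
    return restored, still_missing
```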

Fig. 6. Example of motion estimation: (a) current frame and (b) result of motion estimation.


As detailed by Bertalmio and Sapiro (2000), the basic idea of the inpainting algorithm is that the information surrounding the region to be inpainted is propagated inside the region along the isophote lines arriving at its boundaries. At the nth inpainting iteration, the improved version of the image pixel $I^n(i, j)$ is

$$I^{n+1}(i, j) = I^n(i, j) + \Delta t \, I_t^n(i, j), \qquad (8)$$

where $I^n(i, j)$ is each of the image pixels inside the region to be inpainted and $\Delta t \, I_t^n(i, j)$ stands for the update of the image pixel $I^n(i, j)$ with the rate of improvement $\Delta t$. This update term contains not only the information to propagate, but also the isophote direction, computed using a 2-D Laplacian operator.

Inpainting algorithm:
1. Preprocessing (anisotropic diffusion).
2. For each pixel in the region to be inpainted:
   2.1. compute the vertical/horizontal smoothness variations using the 2-D Laplacian;
   2.2. compute the isophote direction;
   2.3. propagate the 2-D smoothness variations along the isophote line;
   2.4. compute the magnitude of the pixel to be inpainted;
   2.5. update the value of the pixel through the combination of 2.1–2.4.
3. Iterate steps 2.1–2.5.

During the spatial restoration, anisotropic diffusion (Perona and Malik, 1990; Bertalmio and Sapiro, 2000) was interleaved once every ten inpainting iterations to ensure insensitivity to noise and to preserve the sharpness of edges:

$$\frac{\partial I}{\partial t}(x, y, t) = g(\nabla I)\, \kappa(x, y, t)\, |\nabla I(x, y, t)| \quad \forall (x, y) \in \Omega, \qquad (9)$$

where $g(\nabla I) = e^{-(\|\nabla I\|/K)^2}$, $\kappa(x, y, t)$ is the Euclidean curvature of the isophote line, and $\nabla I$, $\Omega$, and $K$ are the image gradient, the region to be inpainted, and a normalizing constant, respectively. We ran the spatial restoration algorithm until the pixel values in the regions to be restored no longer changed.
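A strongly simplified single iteration of the inpainting update of Eq. (8) is sketched below: it propagates the Laplacian of the image along the isophote direction inside the masked region. It follows the spirit of Bertalmio and Sapiro (2000) but is not their implementation; the full algorithm also interleaves the anisotropic diffusion of Eq. (9) and iterates to convergence.

```python
import numpy as np

def inpaint_step(img, mask, dt=0.1, eps=1e-8):
    """One simplified inpainting update (Eq. (8)) inside the masked region."""
    img = img.astype(float)
    # 2-D Laplacian of the image: the "smoothness" information to propagate
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
           np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)
    # Change of the Laplacian along the image axes
    dlap_y, dlap_x = np.gradient(lap)
    # Isophote (level-line) direction: perpendicular to the image gradient
    gy, gx = np.gradient(img)
    norm = np.sqrt(gx ** 2 + gy ** 2) + eps
    iso_x, iso_y = -gy / norm, gx / norm
    # Propagate the Laplacian along the isophote line
    update = dlap_x * iso_x + dlap_y * iso_y
    out = img.copy()
    out[mask] = img[mask] + dt * update[mask]
    return out
```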


4. Experimental results

We have tested the proposed method using video clips collected from movies and animations. The sizes of the video frames used during the experiments ranged from 352 × 240 to 720 × 400, extracted from video clips with a frame rate of 15 frames/s. In each frame, the sizes of the text characters ranged from 6 × 6 to 35 × 36 and the font widths of the characters ranged from 2 to 5. After text region detection, we extracted the text pixels from the detected text bounding boxes. The result of text extraction is a binary image in which only text pixels are visible. We then applied the expanding routine to the binary image; the expanded areas covered not only the highlighting pixels but also the small holes. The SVM was trained using 150 video scenes containing text, of various sizes, manually captured from different sources such as movies, TV programs, and animations; M × M patches of the text areas are used as training samples of the text class and the other patches as samples of the non-text class. In other words, we manually separate these scenes into text parts and non-text parts, and both are then used for training the SVM as positive and negative examples, respectively. In this investigation, we tested the proposed method on the four types of video sequences. We estimated the text motion using a region-based model and the background motions using a block-based model. In the spatiotemporal restoration, if a pixel at (i, j) is temporally restored, α in Eq. (3) is set to 1, and otherwise to 0. In other words, during temporal restoration, the pixels of the occluded regions in the current frame are replaced with the corresponding pixels in the previous frame with no additional post-processing; the rest is the target region of the spatial restoration. All the experiments were performed on a Pentium 4 at 1.7 GHz using Visual C++. In the experiment on a 640 × 272 image sequence having only background motion, the processing time was 2.5 s for text detection, 0.1 s for temporal restoration using only text motion, and 75 s for spatial restoration. In another experiment, on a 384 × 288 image sequence having text and background motion, the processing time was 1.5 s for text detection, 0.1 s for temporal restoration using only background motion, and 93 s for spatial restoration.

In the experiments on the sequences of stationary text on a stationary background, the spatiotemporal restoration scheme is trivial, and the occluded regions are restored using only the spatial restoration algorithm. In this experiment, the occluded regions are restored in the same way as in Bertalmio and Sapiro's (2000) method, except that the regions to be removed were specified automatically. If the occluded region is large and its surrounding areas are poorly textured, restoring it using only spatial information has disadvantages in computational time and in the visual quality of the restored scenes. Typically, compressed videos lose information in proportion to the compression rate. For this reason, the spatial information of the area surrounding the occluded regions is degraded, so it is difficult to obtain reliable results from the spatial restoration alone. However, for lack of temporal information, we have to restore the occluded regions using only spatial information.

In the experiments on the sequences of stationary text on a dynamically varying background, since the background moves dynamically, temporal information is available for the occluded region restoration. Spatiotemporal restoration can give better results than spatial restoration alone in that the occluded regions can be partially restored in the temporal restoration, so the target regions become smaller and the spatial information around the pixels to be spatially restored becomes more accurate. The text motion confirms that the text did not move, whereas the background frame difference indicates that the background moved; thus it is easily seen that motions existed in the background regions. This video sequence was classified as a case of stationary text on a dynamically varying background. In the current frame, the uncovered background regions induced by the movements of the camera or objects were involved in the temporal restoration. Motion-compensated temporal restoration takes the form of Eq. (7).

Assuming that the occluded regions have the same motion as the neighboring blocks, we approximated the motion of each occluded region by the average of its eight neighboring background motions using Eq. (6), excluding the text blocks. Then we copied the pixel values from the corresponding positions in the previous frame into the proper positions in the current frame, after checking that the pixels to be copied were not text pixels. The dimension of the blocks used in the motion estimation was 8 × 8 pixels. Fig. 7 shows the experimental results for the sequence of stationary text on a dynamically moving background. Figs. 7(a) and (b) are a current and a previous frame, sized 640 × 272, respectively. Fig. 7(c) is the difference image between (a) and (b), in which the white pixels are the non-zero differences, and (d) is the difference image when the threshold value 21, generated by the adaptive thresholding technique, is applied; that is, white pixels possess a difference of greater than 21 or less than -21. Fig. 7(e) shows the occluded regions specified after expanding the extracted text pixels; the small isolated regions are those incorrectly detected by the SVM-based text classifier. Fig. 7(f) shows the target regions for temporal restoration, in which dark gray regions represent regions that have no motion or whose motion cannot be approximated according to Eq. (6), and light gray regions represent the logically restorable regions, composed of the blocks that have approximated motions and are the targets of the temporal restoration. Figs. 7(g) and (h) show the temporal restoration result and the final result of the spatiotemporal restoration, respectively. The logically restorable regions are not fully restored in the temporal restoration because some of the pixels to be temporally restored according to Eq. (7) map to positions that are still text areas in the previous frame; therefore, not all the logically restorable regions are actually restored. In the experiments on the sequences of moving text on a stationary background, newly covered regions occur when the text moves, and these covered regions are the target regions of temporal restoration. Since the positions of pixels to be copied according to text motions are still one of


Fig. 7. Spatiotemporal restoration results: (a) current frame, (b) previous frame, (e) occluded regions, (f) the temporally restorable regions, (g) result of temporal restoration, and (h) final result of spatiotemporal restoration.

the occluded regions in the previous frame, the positions of the target regions are not copied. In the experiments on the sequences of moving text on a dynamically varying background, the target regions are restored using both the text and background motions. In this experiment, we can restore not only the newly covered regions caused by the text's movement but also the uncovered regions caused by the background movement; in other words, the regions restorable in the temporal restoration can be bigger than in any other case. Fig. 8 shows the spatiotemporal restoration results. Figs. 8(a) and (b) are a current frame and a previous frame, sized 384 × 288; (c) is the result of text expanding, in which the characters in the upper left corner were filtered out because the region filtering rules were not satisfied; and (d) shows the temporally restorable regions: the white regions are to be restored using the text motion and the background motion together, and the light gray regions are to be restored using the text motion only. As shown in Fig. 8(e), if we restore the occluded region directly, without performing temporal restoration, it is impossible to restore the lips in the current frame using only spatial restoration. Fig. 9 shows the comparison between the spatial restoration and the spatiotemporal


Fig. 8. Results of the spatiotemporal restoration: (a) current frame, (b) previous frame, (c) result of text expanding, (d) temporally restorable areas, (e) result of temporal restoration, and (f) final result of spatiotemporal restoration.

restoration of the occluded regions in an input sequence. The input video sequence used in the third experiment is used here for a subjective video quality comparison. The magnified regions in Figs. 9(c) and (d) are used to compare the visual quality of each restoration result. The results of the spatiotemporal restoration are better than those of the spatial restoration. This is because the spatial information surrounding the occluded regions is insufficient to restore the regions correctly. In particular, in the compressed domain, the pixel-level degradation around the text considerably affects the spatially restored regions, because the spatial restoration algorithm depends on the information surrounding the occluded regions.


Fig. 9. Comparison results between the spatial restoration only and the spatiotemporal restoration: (a) result of spatial restoration, (b) result of spatiotemporal restoration, (c) magnified frame details after spatial restoration, and (d) magnified frame details after spatiotemporal restoration.

Fig. 10. Example of caption replacement: (a) input sequence, (b) caption replacement without background restoration, and (c) results of caption replacement after spatiotemporal restoration.


Fig. 10 shows that the proposed method is useful in an application of caption replacement; the captions are manually translated in this case. Additionally, it is also possible to apply this approach to indirect advertisement removal.

5. Conclusions

In this paper, an approach for automatic text detection and removal in video sequences using an SVM and spatiotemporal restoration has been proposed. The proposed method detected text regions automatically using the SVM-based texture classifier and restored the background regions occluded by the detected text using a combined approach of temporal and spatial restoration. To increase the efficiency of the restoration, an input video sequence was classified according to text motion and background frame difference. Experimental results suggest potential uses of automatic text detection and removal in digital videos. The proposed method can also be adapted to text removal in still images and to stationary or moving text on a stationary or dynamically varying background. Whenever text, captions, or brand marks need to be erased from video frames, the proposed method makes the whole process automatic. Our further research will investigate different restoration schemes to reduce the computational time and improve the visual quality of restored sequences for practical usage.

Acknowledgement

This work was supported by the Soongsil University Research Fund.

References

Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C., 2000. Image inpainting. In: Siggraph 2000 Conference Proceedings. pp. 417–424.
Bovik, A., 2000. Handbook of Image and Video Processing. Academic Press.
Chan, T., Shen, J., 2001. Inpainting, zooming, and edge coding. In: Special Session on Inverse Problems and Image Analysis at the AMS Annual Conference, January.

Chun, B.T., Bae, Y.L., 2001. A method for recovering original image for video caption area and replacing caption text. In: International Workshop on Content-Based Multimedia Indexing 2001.
Cover, T.M., 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14, 326–334.
Habili, N., Moini, A., Burgess, N., 2000. Automatic thresholding for change detection in digital video. In: Proceedings of Visual Communications and Image Processing 2000, Australia, 4067, pp. 133–142.
Hasan, Y.M.Y., Karam, L.J., 2000. Morphological text extraction from images. IEEE Trans. Image Process. 9 (11), 1978–1983.
Hastie, T., Tibshirani, R., Friedman, J., 2001. Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Haykin, S., 1999. Neural Networks: A Comprehensive Foundation, second ed. Prentice Hall, Englewood Cliffs, NJ.
Irani, M., Peleg, S., 1993. Motion analysis for image enhancement: resolution, occlusion, and transparency. J. Visual Commun. Image Represent. 4 (4), 324–335.
Jain, A.K., Karu, K., 1996. Learning texture discrimination masks. IEEE Trans. Pattern Anal. Machine Intell. 18 (2), 195–205.
Jain, A.K., Yu, B., 1998. Automatic text location in images and video frames. Pattern Recogn. 31 (12), 2055–2076.
Jeong, K.Y., Jung, K., Kim, E.Y., Kim, H.J., 1999. Neural network-based text location for news video indexing. In: Proceedings of the International Conference on Image Processing.
Jung, K., 2001. Neural network-based text location in color images. Pattern Recogn. Lett. 22, 1503–1515.
Jung, K., Han, J.H., 2001. Mean shift-based text location using texture property. In: International Workshop on Content-Based Multimedia Indexing.
Jung, K., Kim, K.I., Han, J.H., 2002. Text extraction in real scene images on planar planes. In: ICPR 2002, vol. 3. pp. 469–472.
Kim, E.Y., Jung, K., Jeong, K.Y., Kim, H.J., 2000. Automatic text region extraction using cluster-based templates. In: International Conference on Application and Pattern Recognition and Digital Techniques. pp. 418–421.
Kim, K.I., Jung, K., Park, S.H., Kim, H.J., 2002. Support vector machines for texture classification. IEEE Trans. Pattern Anal. Machine Intell. 24 (11), 1542–1550.
Kokaram, A.C., Morris, R.D., Fitzgerald, W.J., Rayner, P.J.W., 1995. Interpolation of missing data in image sequences. IEEE Trans. Image Process. 4 (11), 1509–1519.
Lienhart, R., Stuber, F., 1996. Automatic text recognition in digital videos. In: SPIE, The International Society for Optical Engineering. pp. 180–188.
Li, H., Doerman, D., Kia, O., 2000. Automatic text detection and tracking in digital video. IEEE Trans. Image Process. 9 (1), 147–156.
Perona, P., Malik, J., 1990. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Machine Intell. 12 (7), 629–639.

Richard, M.D., Lippmann, R.P., 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput. 3, 461–483.
Schölkopf, B., Burges, C.J.C., Vapnik, V., 1995. Extracting support data for a given task. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining. pp. 252–257.
Schölkopf, B., Sung, K.K., Burges, C.J.C., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V., 1997. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Process. 45, 2758–2765.


Schölkopf, B., Smola, A.J., 2002. Learning with Kernels. The MIT Press.
Vapnik, V., 1998. Statistical Learning Theory. John Wiley & Sons, New York.
Wei, L.Y., Levoy, M., 2000. Fast texture synthesis using tree-structured vector quantization. In: Siggraph 2000 Conference Proceedings. pp. 479–488.
Zhong, Y., Karu, K., Jain, A.K., 1995. Locating text in complex color images. Pattern Recogn. 28 (10), 1523–1535.
Zhong, Y., Zhang, H., Jain, A.K., 2000. Automatic caption localization in compressed video. IEEE Trans. Pattern Anal. Machine Intell. 22 (4), 385–392.