Procedia Computer Science 96 (2016) 744–753

20th International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2016, 5-7 September 2016, York, United Kingdom

Scene Text Deblurring in Non-stationary Video Sequences

Margarita Favorskaya*, Vladimir Buryachenko
Siberian State Aerospace University, 31 Krasnoyarsky Rabochy av., Krasnoyarsk, 660037 Russian Federation

* Corresponding author. Tel.: +7-391-291-9240; fax: +7-391-91-9147. E-mail address: [email protected]

Abstract

Text detection in natural scenes burdened by imperfect shooting conditions and blurring artifacts is the subject of the present paper. The text as a linguistic component provides a significant amount of information for scene understanding, scene categorization, image retrieval, and many other challenging problems. Usually, real video sequences suffer from a superposition of complicated impacts that are often analyzed separately. The main attention is focused on text detection under geometric distortions, blurring, and camera shooting artifacts. The original methodology based on the analysis of the gradient sharp profiles includes the automatic text detection in fully or partially blurred frames of a non-stationary video sequence. Also, a blind technique of blurred text restoration is discussed. Additionally, some results of the text detection are mentioned. The detection results for corrupted text fragments from the test dataset ICDAR 2015 achieve 76–83% and surpass by 40–52% the detection results for text fragments not processed by the deblurring procedure.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of KES International.

Keywords: text deblurring; scene text detection; non-blind kernel; blind kernel; non-stationary video sequence

1. Introduction

Scene text detection in a still image or a sequence of images plays a significant role in image annotation, image indexing, traffic sign recognition, license plate recognition, and the recognition of signboards and information boards as assistance to elderly people, among others. The scene text is a part of an image or frame, unlike the imposed artificial text, such as subtitles, logotypes, information about sport competitions, and so on. The task of scene text detection and recognition faces many challenging impacts, including text attributes (font sizes, alignment, colors), luminance in a scene (shadow, brightening, contrast), geometric location (orientation, perspective), scene complexity

(cluttered background, moving objects), and visual artifacts (noise, distortions due to a blur). The distortion due to a motion blur or defocusing is a major issue among these impacts because blurring changes shapes significantly. Moreover, most of the methods that are useful for detecting and recognizing the inherent and imposed text fail to give satisfactory results under blur influence, e.g., the conventional feature-based, texture-based, and edge-based methods.

It is well known that some intrinsic or extrinsic factors may lead to quality degradation of images, among which blurring is the main one. The situation is further complicated by the superposition of the shooting conditions with camera shakes and jitters during, for example, hand-held shooting. Five causes of blurring can be mentioned (the first four items are extrinsic reasons, while the last one is intrinsic):

• Blurring caused by fast object motion, like a moving car. This type is called a motion blur.
• Blurred images may appear from the long exposure time required in dark lighting conditions, when a hand-held camera cannot be stabilized well. This type is called a camera shake blur.
• A defocus blur emerges in the resultant images obtained from visual equipment having a single focus plane.
• Some atmospheric turbulence impacts, such as fog or smoke, produce a blurred image.
• The cause of an intrinsic physical blur lies in a lens system, if the lenses have different refractive indices for different wavelengths of light.

Even though image blurring in photography may be motivated as an aesthetic representation, blurred images in computer vision are often considered as corrupted ones. Also, hereinafter it is reasonable to consider the camera shakes and jitters as a wider extrinsic reason, provoking 3D geometric distortions and, in a particular case, a camera shake blur. In the current research, various cases, including types of blurring, degree of blurring, and degree of blur coverage, are studied under the non-stationary conditions of camera shooting.

In the remainder of this paper, Section 2 presents a literature survey of the related works. Text properties and limitations are analyzed in Section 3. The proposed methodology of the scene text detection under blurring and shooting artifacts is described in Section 4, while Section 5 provides a deblurring technique for the extracted letters. The comparative experimental results are drawn in Section 6. The conclusions in Section 7 complete the paper.

2. Related Work

The study of image blurring caused by defocusing and/or diffraction began in the 1990s. Some mathematical models were built based on the Gaussian kernel1 in order to improve human depth-from-blur perception. At the same time, image deblurring, as a difficult and practically useful task, attracts many researchers. A good analytical survey of the deblurring methods was presented by Wang and Tao2, according to which most deblurring methods are grouped into the following categories: the Bayesian inference framework, variational methods, sparse representation-based methods, homography-based modelling, region-based methods, and other methods:

• A probability hypothesis in the Bayesian framework can be adapted to estimate the imposed uncertainty attributes on either the unknown sharp image or the unknown blur kernel, or both. The commonly used estimators are the Maximum A Posteriori (MAP), minimum mean square error3, and variational Bayesian methods4. Thus, in the MAP approach, a classic non-blind algorithm called Richardson-Lucy (RL) deconvolution5,6 can be mentioned. Levin et al.4 show how to make the MAP estimation successfully recover the true blur kernel.
• Variational methods are typically used as approximation methods, incorporating regularization techniques into a constraint space. In contrast to classical regularizers based on the first-order derivatives, second-order regularization techniques are nowadays also being developed in the deblurring framework7.
• Sparse representation was motivated by the sparse properties of natural scenes. It can be applied in many computer vision applications, such as denoising, inpainting, super-resolution, and deblurring8.
• Homography-based modelling was generally proposed to simulate the blur effect induced by the camera shake as a spatially variant deblurring using a set of multiple kernels or homographies9. Usually this approach is amplified by the "temporal" homographies, if it is possible.
• Region-based methods create the blur model based on locally consistent kernels in each region, even at each point. The efficient filter flow model was proposed by Hirsch et al.10. In the case of the object motion blur, the blurred image is segmented into several regions, each of which has its own motion properties and blur type. The localized frequency representation of each local region can be independently transformed into the frequency domain in order to define a local kernel11. A close approach to deblurring an image containing multiple moving objects was developed by Kim et al.12, where blur segmentation, kernel estimation, and image restoration were alternately processed in a unified variational framework.
• Other methods include the projection-based methods, kernel regression, stochastic deconvolution, and spectral analysis.


The five types of blur in video sequences mentioned above in Section 1 can be reduced roughly to two main causes: an object motion blur and a defocus blur. The aim of motion deblurring is to recover a sharp image of a scene from the captured blurred image. One of the conventional ways is to use the Point Spread Function (PSF), which shows how a single point is spread on the receiver. The PSF can be estimated either from a single image or, more accurately, from multiple images. The PSF is the basis of the blind and non-blind motion deblurring methods. In blind motion deblurring methods, the motion PSF is unknown, while in non-blind methods the motion PSF is given and used to recover the latent image from a blurry image. Often the PSF is not only a camera path but a combination of defocus, movement, and intersections. Objectively, the blind deblurring methods have an ill-posed mathematical solution. Therefore, some approximations were developed to solve this complicated task. Fergus et al.13 were the first to propose the kernel estimation.

Most feature-based, texture-based, and edge-based methods are sensitive to blurring and may fail to produce robust features, texture properties, and perfect edges for the blurred regions. In the case of video sequences, many methods utilize the temporal information for text detection in order to enhance low-contrast text components. Huang14 detected motion in 30 consecutive frames to synthesize a motion image. Then the synthesized motion image was used to filter out the moving candidate text regions. Mi et al.15 proposed an approach for text extraction based on edge features using multiple frames. However, edge-based methods degrade when blurring exists in frames.

The number of deblurring models related to the text in natural scenes is very restricted, among which one can mark the following ones. Pan et al.16 employed an effective L0-regularized prior based on intensity and gradient for text image deblurring. However, the L0-norm leads to an NP-hard problem, which makes it expensive in time complexity and problematic for frame processing. Cho et al.17 solved the task of blind deconvolution using the specific properties of the handwritten and printed text in documents. This method was an extension of the commonly used optimization framework for image deblurring by intuitive text properties, which were incorporated in the optimization process. Wang et al.18 proposed an alternating minimization algorithm for recovering images from blurred and noised observations with the total variation L2 regularization. This algorithm can be applied to the anisotropic and also isotropic forms of total variation discretizations with good edge preservation. However, experiments with text embedded in a scene were not implemented. Cao et al.19 suggested using text-specific multi-scale dictionaries for scene text deblurring in order to improve the visual quality of the blurred images. A series of text-specific multi-scale dictionaries and a natural scene dictionary were created and learned for separately modelling the priors on the text and non-text fields.

Early works were devoted to removing spatially invariant blurs. However, these methods often fail since the captured real blur kernels are often spatially varying because of a depth variation, a camera shake, etc. The conventional way to model the spatially varying blur is to treat the image representation as a set of piecewise uniform blur regions20,21,22. Their performance depends on accurate segmentation results on all regions in the image. Different approaches to model a non-uniform blur as a linear combination of different blurry intermediate images captured by the camera along the motion trajectory have been proposed by Gupta et al.23 and Whyte et al.9. These methods concentrate on handling 3D camera shakes at the cost of assuming a constant scene depth. Fergus et al.13 introduced a technique for removing the effects of unknown camera shakes from a single image. These authors exploited the research in natural image statistics, which shows that photographs of natural scenes typically obey specific distributions of image gradients. The Bayesian approach was adopted, allowing to find the blur kernel implied by a distribution of probable images. Then an image is restored using a standard blind deconvolution algorithm.

The further processing deals with symbol extraction, which can be realized in two ways: with or without binarization. The first way is very simple and efficient if suitable threshold values are determined. Some classical methods, such as Otsu's method24, Sauvola's and Pietikainen's method25, or the method of Wolf et al.26, can be recommended. They are based on local information, global information, or their combination.


The second way includes methods that extract a large number of features based on maximally stable extremal regions, scale-invariant features, or histograms of oriented features; such methods use their own classifiers with a large number of training samples27,28. This short literature review demonstrates the active interest in blurring/deblurring problems. However, complexity is a restricting factor, which does not permit finding a uniform solution in non-blind and blind techniques.

3. Propositions, Limitations, and Metrics

The analysis of text properties in natural scenes and impact factors leads to the following propositions:

• Usually the text has a high contrast against nearby background regions and is large enough to be noticed.
• Each sign has a near-uniform and bright color.
• The background region may be very cluttered and unpredictable.
• The text line may be a straight line or a circular arc.
• Geometric distortions of the text lines depend on a viewpoint and may be affine, perspective, or fitted randomly.
• The text may appear in any part of the image or frame.
• In the case of a video sequence, a set of sequential frames containing text can include fully or partially blurred frames. A set of sequential frames can be affected by camera shakes and jitters (Fig. 1).
• It is assumed that the blur function influences the low-frequency component of an image.
• Detailed (or focused) information is concentrated in the high-frequency component of an image.
• The additive random noise is considered as the high-frequency component of an image due to its high rate of change in the intensity function.

Fig. 1. (a) the frames from the heavily blurred video Sam_1.avi29; (b) the full-frame gradients of the heavily blurred video; (c) the frames from the weakly blurred video Auto1.avi30; (d) the full-frame gradients of the weakly blurred video.


In order to determine the presence of a text in a video sequence distorted by blurring and camera shakes, an algorithm based on quality assessment metrics was designed. Suppose that a set of frames between two consecutive keyframes is analyzed. The algorithm ought to estimate the blurring degree in each frame, the presence of a natural scene text in the blurred regions of each frame, and the camera shakes and jitters relative to the sequential frames. A frame can be blurred, fully or partially, with a different defocusing degree. It means that difficulties of text detection may appear even at this stage, where the False Rejection Ratio (FRR) values ought to be minimal. Notice that the location of the text is relatively constant in a set of frames; the deviations are instigated by the camera shakes and jitters. For a classification of blurred and non-blurred frames, some quality assessment metrics were proposed, among which are the following ones:

• The Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) extracts the point-wise statistics of local normalized luminance signals31. The BRISQUE is trained on features obtained from both natural and distorted images and is based on human judgments of the quality of these images.
• The Blind/No-Reference (NR) Image Quality Assessment (IQA) operates in the spatial domain and analyzes the regular statistical properties of natural images32.
• The Global Phase Coherence (GPC) compares the likelihood of the image with the likelihood of all possible images sharing the same Fourier power spectrum33. The likelihood is measured by the total variation, and the numerical estimation is realized by a Monte-Carlo simulation.
• The Sharpness Index (SI) is closely related to the GPC, but it uses Gaussian random fields instead of random phase images34.

On the one hand, all values of the metrics mentioned above may be normalized and weighted in a common assessment expression. On the other hand, not all metrics are useful for blurred text detection. The text is characterized by a high density of edges, mostly in the vertical and horizontal directions. It means that some structured, albeit blurred, information can help detect the distorted text fragments.

4. Blurred Text Detection

The blurred images cannot have sharp edges; thus, the gradient magnitude distribution has a greater relative mass at small or zero values. The proposed method of blurred text detection includes the following steps:

• Detect the presence of a text in the blurred and non-blurred frames at the coarse level of the Gaussian pyramid using the Gradient Sharp Profiles (GSPs).
• If one or several non-blurred frames with the text are detected, then consider them as the latent frames from the point of view of blurring; otherwise, extract all "good" blurred frames (with the maximum values of the sharpness scores) and choose the best one as a reference frame.
• Extract additional frames that follow the reference frame. The number of additional frames is a tuning parameter with the recommended values 3–7.
• Align the text fragments in the selected frames and apply an iterative procedure using pseudo-latent images.
• Improve the text fragments at the coarse level of the Gaussian pyramid using a stroke filter.
Since there is no a priori information about the text fragments and the text can appear in any place of the frame, it is desirable to detect the connected text fragments using the "good" frames, i.e., the frames that are non-blurred or blurred with a small value of the sharpness score. For this purpose, all analyzed frames are transformed to the coarse level of the Gaussian pyramid in order to decrease the blur degree. Notice that the RGB frames ought to be represented in the gray-scale space in order to obtain the gradient information suitable for detecting whether a frame is blurred or not. In this research, a 2D Laplacian filter was used to obtain the gradient frames. Since the text fragments have mainly horizontal or vertical directions, let us build the horizontal GSPs along the rows of a frame. The gradient profiles can be used to measure the sharpness of the text edges if the edges remain visible in the blurred image. In the common case, the gradient profiles describe the distribution of the gradient magnitudes along a gradient direction.
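As an illustration, this preprocessing can be sketched in a few lines of Python. The luminance weights and the pyramid depth below are assumptions for the sketch; the paper specifies only a gray-scale conversion, a 2D Laplacian filter, and the coarse level of a Gaussian pyramid.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def gradient_frame(rgb):
    # Gray-scale conversion; the BT.601 luminance weights are an assumption,
    # the paper only states that RGB frames are converted to gray scale.
    gray = rgb[..., :3].astype(float) @ np.array([0.299, 0.587, 0.114])
    # 2D Laplacian filter produces the gradient frame used for the GSPs.
    return np.abs(laplace(gray))

def coarse_pyramid_level(frame, levels=2):
    # Coarse level of the Gaussian pyramid: smooth, then subsample by 2.
    for _ in range(levels):
        frame = gaussian_filter(frame, sigma=1.0)[::2, ::2]
    return frame
```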


Without loss of generality, suppose that the estimators of the text fragments in the horizontal or vertical directions prevail relative to random directions. The sharpness at each pixel of an image row i can be measured by the square root of the unbiased sample variance of the gradient magnitudes V_j(GM(x_i)) in the 1D surrounding of a pixel p(x_i, y_j) in the OX direction, as given by Eq. 1, where GM(x_i) is the gradient value at a pixel p(x_i, y_j) under y_j = const, i ∈ [1, N], j ∈ [1, M], N and M are the sizes of an image, and k ∈ [−n, +n] is an internal index over the 1D surrounding of the pixel p(x_i, y_j) in the OX direction.

V_j(GM(x_i)) = \sqrt{ \left[ \sum_{k=-n}^{n} GM^2(x_{i+k}) - \frac{1}{2n+1} \left( \sum_{k=-n}^{n} GM(x_{i+k}) \right)^2 \right] / (2n) }    (1)

The sharper the gradient profile, the smaller the value of V_j(GM(x_i)). A comparative example is depicted in Fig. 2. The gradient profiles in Fig. 2c differ vastly in width; however, a possibility to detect the blurred text remains.
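A minimal sketch of Eq. 1 over one image row follows, assuming the reconstruction of the unbiased sample variance given above; the window half-width n is a tuning parameter.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def gsp_sharpness(gm_row, n=4):
    # Square root of the unbiased sample variance of the gradient
    # magnitudes in the window [-n, +n] around each pixel (Eq. 1).
    win = 2 * n + 1
    row = gm_row.astype(float)
    s1 = uniform_filter1d(row, size=win) * win        # sum of GM
    s2 = uniform_filter1d(row ** 2, size=win) * win   # sum of GM^2
    var = (s2 - s1 ** 2 / win) / (win - 1)
    return np.sqrt(np.maximum(var, 0.0))  # clamp tiny negatives from rounding
```

Per the statement above, smaller values of the returned profile correspond to sharper text edges along the row.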

Fig. 2. (a) the sharp and blurred frames from Video_5_3_2.avi35 with the imposed horizontal lines; (b) the sharp and blurred gradient frames; (c) 1D-gradient profiles of the sharp and blurred frames.

A similar (to Eq. 1) expression may be written for the OY direction. Also, Eq. 1 can be reinforced by the logarithmic score in difficult cases:

V_j(\log GM(x_i)) = \sqrt{ \left[ \sum_{k=-n}^{n} \log^2 GM(x_{i+k}) - \frac{1}{2n+1} \left( \sum_{k=-n}^{n} \log GM(x_{i+k}) \right)^2 \right] / (2n) }    (2)

The received GSPs are analyzed for the presence of regular vertical and horizontal outlines in the gradient profiles, with subsequent localization of compact areas with a high density of symbol edges. In order to obtain more reliable estimators for text restoration, two frames with the closest disposition and scaling and the minimum blurring impact ought to be chosen from the given set of analyzed frames. This task can be accomplished for a non-stationary video sequence if the camera movement is absent or very slow.


The unlucky and lucky cases are depicted in Fig. 3. The search for the regions of interest can be accomplished by a procedure similar to a block matching algorithm, especially its fast modifications, for example, the block-based gradient descent search algorithm36.
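The alignment step can be illustrated with a plain full-search block matching sketch; the paper recommends the faster block-based gradient descent search36, so the exhaustive search here is a simplification for clarity.

```python
import numpy as np

def match_block(ref, cur, top, left, size=16, radius=8):
    # Find the displacement (dy, dx) minimizing the sum of absolute
    # differences (SAD) between a block of the reference frame and the
    # candidate blocks of the current frame.
    block = ref[top:top + size, left:left + size].astype(float)
    best = (np.inf, 0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > cur.shape[0] or x + size > cur.shape[1]:
                continue
            sad = np.abs(block - cur[y:y + size, x:x + size]).sum()
            if sad < best[0]:
                best = (sad, dy, dx)
    return best[1], best[2]
```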

Fig. 3. (a) the text fragments from video Sam_1.avi29; (b) the text fragments from video Video_25_5_2.avi35.

At the final step, the modified stroke filter for Latin symbols37 is applied for text localization. The stroke filter is a discrete analogue of the second derivative of a continuous function. It extracts the stroke features according to the width, orientation, and scale responses. As a result, a stroke width map and a stroke orientation map can be built in each text fragment.
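A simplified stand-in for the modified stroke filter37 can be sketched as a discrete second-derivative response at several candidate stroke widths; the actual filter also produces orientation and scale responses, which are omitted here.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def stroke_maps(gray, widths=(3, 5, 7, 9), axis=1):
    # For each candidate width w, compare the mean intensity of a band of
    # width w with the two flanking bands shifted by w: a discrete analogue
    # of the second derivative at the stroke scale.
    responses = []
    for w in widths:
        band = uniform_filter1d(gray.astype(float), size=w, axis=axis)
        left = np.roll(band, -w, axis=axis)
        right = np.roll(band, w, axis=axis)
        responses.append(np.abs(2.0 * band - left - right))
    stack = np.stack(responses)
    width_map = np.take(widths, stack.argmax(axis=0))  # stroke width map
    response_map = stack.max(axis=0)                   # stroke response map
    return width_map, response_map
```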

5. Scene Text Deblurring

The conventional model of a blurred image B(x, y) is a convolution of a reference (latent) image R(x, y) with the kernel (the PSF) K(x, y) used to model the motion blur, plus an additive noise term Z(x, y), as given by Eq. 3.

B(x, y) = R(x, y) \otimes K(x, y) + Z(x, y)    (3)
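A direct simulation of Eq. 3 is useful for testing any restoration step; the PSF and the noise level below are arbitrary assumptions for the sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def blur_image(latent, psf, noise_sigma=2.0, seed=0):
    # B = R (*) K + Z: convolve the latent image with the PSF and add
    # zero-mean Gaussian noise (Eq. 3).
    rng = np.random.default_rng(seed)
    blurred = convolve2d(latent.astype(float), psf, mode='same', boundary='symm')
    return blurred + rng.normal(0.0, noise_sigma, latent.shape)

# Example PSF: horizontal motion blur of length 9 (an arbitrary choice).
psf = np.zeros((9, 9))
psf[4, :] = 1.0 / 9.0
```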

The problem of blind deblurring is heavily ill-posed due to the infinite number of solutions of Eq. 3. This challenge attracts much attention, and various blind deblurring methods have been proposed, especially for natural landscape images. At the same time, the methods recovering sharp and clean text from a blurry image or video sequence are developed slowly, while methods considering several sources of distortion are practically absent. According to the non-uniform motion blur model proposed by Cho et al.38, a conventional motion blur model can be expressed by Eq. 4, where b, r, and z are the m × 1 vector representations of B, R, and Z, respectively, m is the number of pixels in the image B, T_i is the m × m transformation matrix that produces a 2D translation of the image r, and {w_i} is a set of weighting values with a total sum equal to 1.

b = \sum_i w_i T_i r + z    (4)
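Eq. 4 describes the blurred image as a weighted sum of translated copies of the latent image; a minimal sketch follows (the noise term z is left out).

```python
import numpy as np

def uniform_motion_blur(r, shifts, weights):
    # b = sum_i w_i T_i r (Eq. 4 without the noise term); each T_i is a
    # 2D translation and the weights are expected to sum to 1.
    b = np.zeros(r.shape, dtype=float)
    for (dy, dx), w in zip(shifts, weights):
        b += w * np.roll(r.astype(float), shift=(dy, dx), axis=(0, 1))
    return b

# Usage: a short horizontal camera path with uniform weights.
# blurred = uniform_motion_blur(latent, [(0, d) for d in range(-2, 3)], [0.2] * 5)
```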

Eqs. 3 and 4 are different representations of the same model if the camera shakes contain translations only. If the camera shakes involve a rotation or scaling, Eq. 3 cannot represent the non-uniform, spatially varying motion blurs; different PSFs per pixel or per block are required. Tai et al.39 proposed to replace T_i by a general homography P_i in order to obtain the non-uniform motion blur model under camera shakes:

b = \sum_i w_i P_i r + z    (5)
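The same construction with homographies (Eq. 5) can be sketched using OpenCV warps; the trajectory of 3 × 3 matrices P_i is assumed here to come from an estimation step such as the one described below.

```python
import cv2
import numpy as np

def nonuniform_motion_blur(r, homographies, weights):
    # b = sum_i w_i P_i r (Eq. 5 without the noise term); each 3x3
    # homography P_i warps the latent image to one pose of the camera
    # along the shake trajectory.
    h, w = r.shape[:2]
    src = r.astype(np.float32)
    b = np.zeros_like(src)
    for P, wi in zip(homographies, weights):
        b += wi * cv2.warpPerspective(src, np.asarray(P, np.float32), (w, h))
    return b
```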

There are eight parameters to estimate in the homography matrix P_i (Eq. 5), in comparison with two translation parameters in the translation matrix T_i. Notice that the ill-posed conventional blind motion deblurring problem can be transformed into a well-posed image regularization problem40,41. When a set of blurred images of the same scene is available, one can avoid some problems of single-image deblurring. Because the amount or distribution of blur may differ between frames, the benefit of using the mutually complementary information about a scene is evident. Several frames may help suppress effectively the ringing artifacts and the image noise, as well as improve the restoration process by applying a pseudo-latent image. If two input images containing the blurred text, b1 and b2, are chosen, then the PSF can be estimated based on the pseudo-latent images b1 and b2 alternately in each iteration. In other words, when the image b2 is considered as a pseudo-latent image, the PSF (w_j^(1,i), P_j^(1,i)) of the first image b1 can be estimated. Then, in the same manner, the PSF (w_j^(2,i), P_j^(2,i)) of the second image b2 can be obtained. When both PSFs of the images b1 and b2 are received, the latent image R_it can be restored using Eq. 6, where k is an index of the input frames, j is the number of an iteration, and λ_{R_it} is a weight of the regularization term ρ(r):

R_it = \arg\min_r \sum_k \left\| b_k - \sum_i w_j^{(k,i)} P_j^{(k,i)} r \right\|^2 + \lambda_{R_it} \, \rho(r)    (6)
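A skeleton of this alternating scheme is sketched below; estimate_psf and restore_latent are hypothetical placeholders for the PSF estimation and the regularized minimization of Eq. 6, which the paper realizes via iterative reweighted least squares42.

```python
def two_frame_deblur(b1, b2, estimate_psf, restore_latent, iterations=5):
    # First pass: each blurred frame serves as the pseudo-latent image for
    # estimating the other frame's PSF (w_j, P_j); afterwards the
    # intermediate latent image R_it takes over that role.
    pseudo1, pseudo2 = b2, b1
    latent = (b1 + b2) / 2.0  # crude initialization
    for _ in range(iterations):
        psf1 = estimate_psf(blurred=b1, pseudo_latent=pseudo1)
        psf2 = estimate_psf(blurred=b2, pseudo_latent=pseudo2)
        latent = restore_latent([b1, b2], [psf1, psf2])  # minimizes Eq. 6
        pseudo1 = pseudo2 = latent  # reuse R_it in the next iteration
    return latent
```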

Then the intermediate latent image R_it is used to estimate the PSFs in the next iteration. The regularization term is determined according to the known approach based on the iterative reweighted least squares method42.

6. Experimental Results

For the experiments, several test video sequences were used. The quality of text localization depends strongly on the blur degree. The steps of the text localization using a stroke filter are depicted in Fig. 4. Depending on the blur degree, the algorithm provides different results, viz. 30–80% of true text localization. Notice that Fig. 4 illustrates the results without the deblurring procedure.

Fig. 4. (a, f) the original weakly and heavily blurred frames from video Bus.avi43, respectively; (b, g) the candidate text fragments detected by the Maximally Stable Extremal Region (MSER) feature detector; (c, h) the non-text fragments are removed based on the threshold geometric sizes; (d, i) the stroke filter application; (e, j) the detected text and non-text fragments.

The iterative deblurring process based on two heavily blurred frames, 50 and 55, from Video_42_2_3.avi35 is illustrated in Fig. 5. The iterative deblurring gradually improves the sharpness of the text fragment but causes ringing (Fig. 5c) and graininess (Fig. 5e). The final frame (Fig. 5f) at the last iteration contains the sharp text without artifacts.

Fig. 5. (a, b) two blurred frames from Video_42_2_3.avi35; (c-f) an iterative process of deblurring.

The SI permits obtaining objective estimators44. The SI is defined by Eq. 7, where μ and σ are the expectation and the standard deviation of the periodic total variation TV(I(x, y)), I(x, y) is an intensity value at point (x, y), and Φ is the tail of the Gaussian distribution defined by Eq. 8.

SI(I(x, y)) = -\log_{10} \Phi\left( \frac{\mu - TV(I(x, y))}{\sigma} \right)    (7)

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2} \, dt    (8)
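Eqs. 7 and 8 reduce to a one-liner once the Gaussian tail is expressed through the complementary error function; μ, σ, and the total variation value are assumed to be computed elsewhere34,44.

```python
import math

def sharpness_index(tv_value, mu, sigma):
    # SI = -log10( Phi((mu - TV)/sigma) ), where Phi(x) is the upper tail
    # of the standard Gaussian (Eqs. 7 and 8); Phi(x) = erfc(x/sqrt(2))/2.
    phi = 0.5 * math.erfc((mu - tv_value) / (sigma * math.sqrt(2.0)))
    return -math.log10(max(phi, 1e-300))  # guard against log10(0)
```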

The dataset ICDAR 2015 contains more than 1,500 high-resolution images including text fragments in frames with different backgrounds. Also, the dataset ICDAR 2015 involves video sequences obtained with a moving camera, which causes a motion blur and a defocus blur. The results of the text detection are indicated in Table 1.

Table 1. Estimators of the text detection in video sequences with different blurring degrees.

Video sequence         Frame resolution   Sharpness Index   Text detection by stroke filter, %   Text detection by stroke filter and deblurring, %
Bus.avi43              1817 × 853         0.44              56                                   77
Video_42_2_3.avi35     1280 × 720         0.16              25                                   63
Video_25_5_2.avi35     800 × 480          0.87              65                                   89
Video_5_3_2.avi35      720 × 480          0.16              43                                   61
Video_2_1_2.avi35      720 × 480          0.69              67                                   71

The experimental results from Table 1 show that the proposed deblurring procedure improves the text detection in comparison with the known algorithms, e.g., a stroke filter alone.

7. Conclusions

Among the five types of blur, two, the object motion blur and the defocus blur, often appear in video sequences. The subject of this investigation is the scene text detection in non-stationary and blurred video sequences. The proposed original procedure based on the gradient sharp profiles detects the text fragments directly, without a preliminary decision on whether a frame is blurred or not. Such a way is more suitable for practical applications than the sequential elicitation and compensation of all possible artifacts. However, nowadays the detection of the text fragments that are blurred and transformed due to the camera shakes and jitters, as well as the text deblurring, cannot be regarded as real-time tasks. The obtained detection results for the corrupted text fragments show 76–83% on average for video sequences from the test dataset ICDAR 2015.

Acknowledgments

This work was supported by the Russian Fund for Basic Researches, grant number 16-07-00121 A.

References

1. Nguyen TC, Huang TS. Image blurring effects due to depth discontinuities: blurring that creates emergent image details. Image and Vision Computing 1992;10(10):689-698.
2. Wang R, Tao D. Recent progress in image deblurring. arXiv preprint arXiv:1409.6838, 2014.
3. Schmidt U, Schelten K, Roth S. Bayesian deblurring with integrated noise estimation. IEEE Conference on Computer Vision and Pattern Recognition 2011, 2625-2632.
4. Levin A, Weiss Y, Durand F, Freeman WT. Understanding blind deconvolution algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(12):2354-2367.
5. Richardson WH. Bayesian-based iterative method of image restoration. Journal of the Optical Society of America 1972;62(1):55-59.
6. Lucy LB. An iterative technique for the rectification of observed distributions. The Astronomical Journal 1974;79:745-754.
7. Papafitsoros K, Schönlieb CB. A combined first and second order variational approach for image reconstruction. Journal of Mathematical Imaging and Vision 2014;48(2):308-338.
8. Cai JF, Ji H, Liu C, Shen Z. Blind motion deblurring from a single image using sparse approximation. IEEE Conference on Computer Vision and Pattern Recognition 2009, 104-111.
9. Whyte O, Sivic J, Zisserman A, Ponce J. Non-uniform deblurring for shaken images. International Journal of Computer Vision 2012;98(2):168-186.


10. Hirsch M, Sra S, Scholkopf B, Harmeling S. Efficient filter flow for space-variant multiframe blind deconvolution. IEEE Conference on Computer Vision and Pattern Recognition 2010, 607-614.
11. Chakrabarti A, Zickler T, Freeman WT. Analyzing spatially-varying blur. IEEE Conference on Computer Vision and Pattern Recognition 2010, 2512-2519.
12. Kim TH, Ahn B, Lee KM. Dynamic scene deblurring. IEEE International Conference on Computer Vision 2013, 3160-3167.
13. Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT. Removing camera shake from a single photograph. ACM Transactions on Graphics 2006;25(3):787-794.
14. Huang X. A novel approach to detecting scene text in video. 4th International Congress on Image and Signal Processing 2011;1:469-473.
15. Mi C, Xu Y, Lu H, Xue X. A novel video text extraction approach based on multiple frames. 5th International Conference on Information, Communications and Signal Processing 2005, 678-682.
16. Pan J, Hu Z, Su Z, Yang MH. Deblurring text images via L0-regularized intensity and gradient prior. IEEE Conference on Computer Vision and Pattern Recognition 2014, 2901-2908.
17. Cho H, Wang J, Lee S. Text image deblurring using text-specific properties. European Conference on Computer Vision 2012, 524-537.
18. Wang Y, Yang J, Yin W, Zhang Y. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences 2008;248-272.
19. Cao X, Ren W, Zuo W, Guo X, Foroosh H. Scene text deblurring using text-specific multiscale dictionaries. IEEE Transactions on Image Processing 2015;24(4):1302-1314.
20. Ji H, Wang K. A two-stage approach to blind spatially-varying motion deblurring. IEEE Conference on Computer Vision and Pattern Recognition 2012, 73-80.
21. Cho S, Matsushita Y, Lee S. Removing non-uniform motion blur from images. IEEE 11th International Conference on Computer Vision 2007, 1-8.
22. Levin A. Blind motion deblurring using image statistics. In: Schölkopf B, Platt J, Hoffman T, editors. Advances in Neural Information Processing Systems, 19. Cambridge, MA, USA: MIT Press; 2006. p. 841-848.
23. Gupta A, Joshi N, Zitnick CL, Cohen M, Curless B. Single image deblurring using motion density functions. 11th European Conference on Computer Vision 2010;I:171-184.
24. Otsu N. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 1979;9(1):62-66.
25. Sauvola J, Pietikainen M. Adaptive document image binarization. Pattern Recognition 2000;33:225-236.
26. Wolf C, Jolion JM, Chassaing F. Text localization, enhancement and binarization in multimedia documents. 16th International Conference on Pattern Recognition 2002;2:1037-1040.
27. Mishra A, Alahari K, Jawahar CV. Top-down and bottom-up cues for scene text recognition. IEEE Conference on Computer Vision and Pattern Recognition 2012, 2687-2694.
28. Phan TQ, Shivakumara P, Tian S, Tan CL. Recognizing text with perspective distortion in natural scene images. IEEE International Conference on Computer Vision 2013, 569-576.
29. Video Sam_1.avi. [Online]. Available: https://youtu.be/L-vTVj_qfE8.
30. Video Auto_1.avi. [Online]. Available: https://youtu.be/4s9F1kQ8zFY.
31. Mittal A, Moorthy AK, Bovik AC. Blind/referenceless image spatial quality evaluator. Asilomar Conference on Signals, Systems and Computers 2011, 723-727.
32. Mittal A, Soundararajan R, Bovik AC. Making a 'completely blind' image quality analyzer. IEEE Signal Processing Letters 2013;20(3):209-212.
33. Blanchet G, Moisan L, Rouge B. Measuring the global phase coherence of an image. 15th IEEE International Conference on Image Processing 2008, 1176-1179.
34. Blanchet G, Moisan L. An explicit sharpness index related to global phase coherence. IEEE International Conference on Acoustics, Speech and Signal Processing 2012, 1065-1068.
35. ICDAR 2015. Challenge 3: "Text in Videos". [Online]. Available: http://rrc.cvc.uab.es/?ch=3&com=downloads.
36. Favorskaya M. Motion estimation for objects analysis and detection in videos. In: Kountchev R, Nakamatsu K, editors. Advances in Reasoning-Based Image Processing Intelligent Systems, ISRL, 29. Berlin Heidelberg: Springer-Verlag; 2012. p. 211-253.
37. Favorskaya M, Zotin A, Damov M. Intelligent inpainting system for texture reconstruction in videos with text removal. International Congress on Ultra Modern Telecommunications and Control Systems 2010, 867-874.
38. Cho S, Cho H, Tai YW, Lee A. Non-uniform motion deblurring for camera shakes. Computer Graphics Forum 2012;31(7):2183-2192.
39. Tai YW, Tan P, Gao L, Brown MS. Richardson-Lucy deblurring for scenes under projective motion path. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(8):1603-1618.
40. Pruessner A, O'Leary DP. Blind deconvolution using a regularized structured total least norm algorithm. SIAM Journal on Matrix Analysis and Applications 2003;24(4):1018-1037.
41. Fu H, Barlow J. A regularized structured total least squares algorithm for high-resolution image reconstruction. Linear Algebra and its Applications 2004;391:75-98.
42. Levin A, Fergus R, Durand F, Freeman WT. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics 2007;26(3):article no. 70.
43. Bus.avi. [Online]. Available: http://www.youtube.com/watch?v=c3h5p4R0-rw.
44. Moreno P, Calderero F. Evaluation of sharpness measures and proposal of a stop criterion for reverse diffusion in the context of image deblurring. 8th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 2013, 69-77.
