Procedia Computer Science Volume 53, 2015, Pages 216–223 2015 INNS Conference on Big Data
MapReduce Based Text Detection in Big Data Natural Scene Videos Abdelkarim Ben Ayed, Mohamed Ben Halima and Adel M. Alimi REGIM-Lab: REsearch Groups in Intelligent Machines, University of Sfax, ENIS, BP 1173, Sfax, 3038, Tunisia {abdelkarim.benayed.tn, mohamed.benhlima, adel.alimi}@ieee.org
Abstract Text detection is a basic step in many computer vision applications including video Optical Character Recognition (OCR), video indexation, understanding video content, etc. Actually, several sources such as mobile devices, monitoring cameras and social networks are generating every day billions of videos with different formats and uncertainty. Such videos require new methods to apply text detection. In this paper, we are introducing a novel text detection technique using many blocks coming from frame decomposition. Each block is analyzed and classified which allowed the extraction of text coordinates using MapReduce programming model. To validate our approach, we test it on YouTube Video Text (YVT) dataset and we found that the running speed of this approach can be more than 2 times as fast as classic approach. Keywords: video text detection, MapReduce, big data
1 Introduction Social networks, smart phones, tablets, monitoring cameras and many other sources are generating every second a large quantity of unstructured videos. In most video processing applications, text detection is an important step that can be used to extract and recognize embedded text, such information is very important mainly for OCR and indexing systems. Besides, many videos nowadays have very large dimensions like full HD and ultra HD videos, which require very high computation resources to analyze these videos. Text in videos can be classified into two types: artificial text and natural scene text. Artificial text is added manually on the video such as subtitles for better understanding of the video content. Natural scene text is part of the video such as road signs and magazine labels. Natural scene texts are usually much more complex to be processed than artificial texts. In the remaining of this paper we are going to focus only on natural scene text detection.
216 Selection and peer-review under responsibility of the Scientific Programme Committee of INNS-BigData2015 c The Authors. Published by Elsevier B.V.
doi:10.1016/j.procs.2015.07.297
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
Many challenges are encountered for natural scene text detection in videos (Risnumawan, Shivakumara and Chan) and (Yi and Tian) such as the low contrast between texts and their backgrounds, complex backgrounds, complex font (small size, different types, and different orientations) and the presence of non-text objects having text similar properties. Besides, other challenges are related to video acquisition such as the low quality of frames in compressed videos and the blurring camera effect in videos. In addition, high computation time for most robust text detection algorithms is another big challenge that we are going to resolve in this paper. Videos contains treasures of text information that could be very useful. However, traditional data analysis tools and relational databases are not suitable to manage such complex data, called big data. Big data can be characterized by four aspects called the 4Vs (Volume, Velocity, Variety and Veracity). Volume represents the large size of data that starts from one terabyte or more, Velocity the high speed of data acquisition, Variety the high diversity of data types and formats that require to be stored and analyzed together and Veracity the uncertainty of data. Few works have already used MapReduce for text detection and text processing. In (Chen, Zhang and Ma), the authors have presented a distributed text classification system based on MapReduce for text feature extraction. The proposed system have improved the text classification processing speed for massive data text. Other big data analytics projects are presented in (Ben Ayed, Ben Halima and Alimi). The remaining sections of this paper are organized as follows. In section 2, we give a brief presentation of parallel computing techniques including Big Data analytics. In section 3, we present examples of previous works on text detection in natural scene images and videos. In section 4, we propose text detection system for videos based on texture analysis and MapReduce programming model. Finally, conclusion and perspectives are presented in section 5.
2 Big Data and Parallel Computing Old computation systems used scale up and scale out solution to handle the increasing demand of computations, but this solutions have some disadvantages. Scale up solutions are very expensive and limited by a technical barrier, whereas scale out solutions require many engineering efforts. Big data analytics solutions such as Apache Hadoop based on parallel architecture however, allows overcoming all these problems via a framework that abstract and handling most of the engineering effort caused by parallel architectures.
2.1 Hadoop Framework Components Hadoop is used with very large data sets computer clusters of commodity hardware. Java gives an open source code framework called Apache Hadoop. Hadoop is composed of multiple layers including mainly Hadoop Distributed File System (HDFS) for storage and MapReduce programming model and other high-level programming languages like Hive (data warehouse language using SQL-92 queries), Pig (data flows oriented language using Pig Latin programming language) and Hbase (A sparse database for storing large quantities of data). The passage from Hadoop 1 to Hadoop 2 include new layers in Hadoop architecture. Doug Cutting developed Hadoop in 2005 based on Google File System (GFS) and Google MapReduce (Ghemawat, Gobioff and Leung) and (Dean and Ghemawat). Today, many big data solutions are available, but the most popular is the open source Apache Hadoop (Apache, Apache Hadoop). A Hadoop system is composed usually of hundreds or thousands of standard machines running Linux where a master machine is managing the storage tasks “Data Nodes” and processing tasks “Task Tracker”. “Name Node” and “Job Tracker” are the components that manage respectively the storage tasks and the processing tasks.
217
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
2.2 Hadoop Distributed File System (HDFS) and MapReduce programming model Apache Hadoop framework uses the redundant and distributed storage system HDFS that stores and replicates files segments in multiple machines. A master node manage data splitting and replication in the other workers nodes used for both data storage and processing, see Figure 1. Hadoop is based on MapReduce as a programming model, MapReduce is composed of two functions to be written by the user: “Map” function that divide problems to smaller ones and “Reduce” function that combine the results so that all the details of distributed and parallel computations are abstracted. The role of the main server is to manage the communications with user application and the other workers nodes, which free the master node from the overload of computation and improve the system performance and mainly the bandwidth, see Figure 2.
Figure 1: Hadoop Distributed File System (HDFS) Architecture (Apache, HDFS Architecture Guide)
Figure 2: MapReduce architecture (Pluta)
218
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
3 State of the Art on Video Text Detection To detect text in video frames, certain text properties are exploited such as text high contrast compared to the background, text discontinuity, text regularity, etc. Current techniques of text detection can be classified mainly in two categories: region and texture based text detection methods. Usually these methods are anticipated by preprocessing steps and followed by refinement steps to improve the quality of text detection and handle the noise and outliers objects.
3.1 Region-based text detection methods Region-based methods ; including connected components and edge methods ; starts by segmenting the input images or video frames into candidate regions, then region features like size, width, height, contrast, spatial frequency and alignment are analyzed to extract text regions. LeBourgois (LeBourgeois) have used max-entropy thresholding binarization method then the segmentation step finally; he adds another step to split the connected characters. He consider background as the dominant part in the image and assumes all characters have a certain fixed size. Antani et al. (Antani, Gargi and Crandall) have proposed connected based method that starts by enhancing the low text contrast during binarization, then extract the connected components which will be filtered based on a set of characteristics like height, width, aspect ratio, etc. Finally, a collinearity measure of components is used to determine the correct components polarity. Kim (Kim) have applied LCQ (Local Color Quantization) for each color before segmentation, then candidate text lines are extracted using connected components method, after that similar feature text regions are merged. This method have high processing time that is why input images are converted to 256-color images. Lienhart and Effelsberg (Lienhart and Effelsberg) have used directly RGB color images. First, monochromacity and contrast features are used for grouping pixels in connected components, then similar color regions are merged together, finally other characteristics such as compactness, width/height ranges and ratios are used to filter-out non-text regions. Gllavata et al. (Gllavata, Ewerth and Freisleben) approach include four steps. First, the input image is converted to YUV color space, then, an edge detection technique is used to segment the image, after that, text regions are detected using horizontal projection then segmented text regions are enhanced based on geometric properties. Ben Halima et al. (Ben Halima, Karray and Alimi, A comprehensive method for Arabic video text detection, localization, extraction and recognition) and (Ben Halima, Karray and Alimi, Arabic text recognition in video sequences) have used edge detection technique. First, vertical edges are connected using dilatation operator for text bloc localization. Then, text lines are extracted using horizontal projection, next, standard deviation and entropy features of eight pixel neighbors are used to classify the pixels in text and non-text regions with fuzzy C-Means clustering algorithm. Region-based text detection methods are fast and accurate, but require similar text gray-level.
3.2 Texture-based text detection methods Texture-based methods treat the text as a special type of texture and uses texture analysis to locate text regions. These methods analyze the blocks of pixel texture using statistical properties or filters like Haar wavelet. Then these blocks are classified either to text or background texture regions using machine learning classifiers like SVM or neural network. Wu et al. (Wu, Manmatha and Riseman) have proposed texture-based segmentation system using derivatives of Gaussian filters and non-linear transformations to create a feature vector for each pixel, in a second step, feature vectors are used to classify pixels into text or non-text pixels, finally, a text cleaning process is made using a fixed threshold. Li and Doerman (Li and Doermann) have presented
219
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
another texture approach using a window of 16x16 pixels to analyze the texture of these blocks then classify them into text and non-text blocks using three-layer neural network. Ye et al. (Ye, Huang and Gao) have proposed a texture based method using image gradient and wavelet transformation to characterize the pixels texture then uses SVM to classify these characteristics and identify the text regions. Sayahi et Ben Halima (Sayahi and Ben Halima) have proposed an approach that starts with resizing images to 240x320 and making binarization with dilatation and erosion to reduce execution time. Then, decompose the image in 20x20 pixel blocks and apply Haar wavelet transformation to generate 8-features vector for each block. Then, a neural network of nine hidden layers and two output layers is used to classify the blocks. Finally, some refinements are applied. Region-based text detection methods have the advantages to handle images containing texts with different gray levels, but they require high computation time and they have low detection accuracy.
4 Proposed text detection system A typical video text detection system starts by offline video acquisition from a certain database or online from monitoring cameras, uploaded videos on internet websites, social networks, etc. Most text detection algorithms are designed to extract texts from gray level and binary images that is why we need to decompose the videos to several frames then transform the frame images to gray level and binary to reduce additional details and noise and to simplify the content of the image and accelerate the frame processing tasks. The choice of the binarization algorithm is critical for the quality of segmentation and the speed of system processing, usually we use an adaptive, based threshold binarization. We decided to use a texture-based text detection method since it can handle various languages and fonts with different colors, size and orientations. Texture-based approach analyses images using a sliding window, usually the images are divided into blocks of 20x20 pixels, then, we use Haar Wavelet transformation on each block to extract a list of characteristics including the mean, the energy, the entropy, the correlation, the homogeneity factor, the second and third central moment, etc. We choose wavelet transformation for feature extraction due to its good location performance (Ye, Huang and Gao). The extracted feature vectors of each region are then classified into text regions or non-text region using classification algorithms such as fuzzy C-Means, SVM, Neural Networks, etc. We choose Neural Network for feature vector classification since it gives very high classification results compared to other algorithms (Benenson). Finally, close text regions are merged together to extract the final text regions coordinates, see Figure 3. To deal with texture-based approach high computation time, we can reduce the frame size (Sayahi and Ben Halima), reduce the number of features (Ye, Huang and Gao) or to use parallel computation architecture. We decided to use Apache Hadoop since it allows easily having parallel computation architecture with distributed storage and computing model and because videos analysis is suffering more and more from the four Vs of big data (Volume, Velocity, Variety and Veracity). The steps of blocks analysis to extract features and the step of feature vector classification require high computation resources, besides; these tasks are independents, which allows the parallel execution of these two steps with high improvement in system computation time. The same steps are used for the first steps (acquisition, frame decomposition and binarization) using sequential algorithms. After that, the master node distribute the blocks to the workers machines using a distributed and redundant file system (HDFS). This file system improve the performance and the reliability of our system. Next, analysis and classification of blocks is executed on n blocks using MapReduce programming model as Map jobs on workers machines then results are assembled via a Reduce task to obtain final text region coordinates, see Figure 4.
220
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
Segmentation of frame blocks
Start
Wavelet block analysis
Video acquisition
Feature vector classification
Video frame decomposition
Extraction of text coordinates
RGB to gray level Transformation
End
Binarization Figure 3: Flow Chart of Classic Text Detection Process
Segmentation of frame blocks
Start Video acquisition
…………………… Analysis of block (1)
Analysis of block (n) Map
Video frame decomposition RGB to gray level Transformation
Classification of block (1)
Classification of block (n)
…………………… Extraction of text coordinates
Reduce
Binarization End Figure 4: Flow Chart of Map Reduce Based Text Detection Process
5 Experimental results For the evaluation of the proposed algorithm, we used a PC equipped with Intel Core-i7 CPU and 8GB RAM running Microsoft Windows 8.1 for testing the classic algorithm and two Amazon Elastic MapReduce m3.xlarge instances (Amazon Web Services) for testing MapReduce algorithm.
221
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
We have tested our approach on YouTube Video Text (YVT) dataset (Nguyen, Wang and Belongie) and (University of California). The dataset is composed of 30 HD videos that contain both scene texts and artificial texts. Each video is composed of 15 seconds with a frame rate of 30 frames per second, which is the equivalent of more than 12,000 frames. Results and comparisons show that computation time is twice reduced using map reduce model, see Table 1. Original image
Text detection
Classic model time
Map Reduce time
19.17 sec
7.68 sec
14.55 sec
7.10 sec
25.06 sec
10.61 sec
Table 1: Text detection results and computation time comparisons
6 Conclusion We presented a novel approach for texture-based text detection in videos using the Apache Hadoop big data analytics. The proposed system decompose video frames in multiple fixed-size blocks, analyze the blocs using Har wavelet transformation to generate feature vectors which are classified with a neural network classifier into text blocks and non-text blocks. The proposed system allows detecting texts of various languages and fonts with different colors, size and orientations with less than a half computation time due to the MapReduce parallel programming model. Further processing is required on the extracted regions based on these regions geometrical characteristics to eliminate noisy regions and exclude text-like regions.
Acknowledgements The authors would like to acknowledge the financial support of this work by grants from General Direction of Scientific Research (DGRST), Tunisia, under the ARUB program.
References Amazon Web Services, Inc. Amazon EMR . Mai 2015. .
222
MapReduce Based Text Detection in Big Data NSV
A. Ben Ayed, M. Ben Halima and A. M. Alimi
Antani, S., et al. "Extraction of text in video." Technical Report CSE-99-016. 1999. Apache. Apache Hadoop. 2015. . —. HDFS Architecture Guide. April 2013. . Ben Ayed, A., M. Ben Halima and A. M. Alimi. "Big Data Analytics for Logistics and Transportation." Accepted in The 4th IEEE International Conference on Advanced Logistics and Transport (ICALT'2015). Ed. IEEE. Valenciennes, France: IEEE, 2015. Ben Halima, M., H. Karray and A. M. Alimi. "A comprehensive method for Arabic video text detection, localization, extraction and recognition." Advances in Multimedia Information Processing PCM 2010 (2010): 648-659. —. "Arabic text recognition in video sequences." CoRR (2013): 603-608. datasets results. 30 Sptember 2014. Benenson, Rodrigo. Classification . Chen, L., et al. "Applied-Information Technology with Distributed Text Feature Extraction Method Based on MapReduce." Advanced Materials Research 1046 (2014): 444-448. Dean, J. and S. Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM January 2008: 107-113. Ghemawat, S., H. Gobioff and S.T. Leung. "The Google file system." Proceedings of the nineteenth ACM symposium on Operating systems principles. New York, USA: ACM, 2003. 29-43. Gllavata, J., R. Ewerth and B. Freisleben. "A robust algorithm for text detection in images." Proceedings of the 3rd International Symposium on Image and Signal Processing and Analysis (IEEE ISPA’2003). IEEE, 2003. 611-616. Inc, Wikimedia. Apache Hadoop. 2015. . Kim, P.-K. "Automatic Text Location in Complex Color Images Using Local Color Quantization." TENCON 99. Cheju Island: IEEE, 1999. 629-632. LeBourgeois, F. "Robust Multifont OCR System from Gray Level Images." International Conference on Document Analysis and Recognition. Washington, DC, USA: ACM, 1997. 1–5. Li, H. and D. Doermann. "Automatic Text Tracking In Digital Videos." IEEE Second Workshop on Multimedia Signal Processing, 1998 . Redondo Beach, California, USA: IEEE, 1998. 21-26. Lienhart, R. and W. Effelsberg. "Automatic Text Segmentation and Text Recognition for Video Indexing." Multimedia Systems (2000): 69-81. Nguyen, P. X., K. Wang and S. Belongie. "Video text detection and recognition: Dataset and benchmark." IEEE Winter Conference on Applications of Computer Vision (WACV'2014). Steamboat Springs, CO: IEEE, 2014. 776-783. Pluta, Mike. Hadoop V1 Architecture Overview. 2014. . Risnumawan, A., et al. "A robust arbitrary text detection system for natural scene images." Expert Systems with Applications 41.18 (2014): 8027-8048. Sayahi, S. and M. Ben Halima. "An intelligent and robust multi-oriented image scene text detection." The 6th International Conference of Soft Computing and Pattern Recognition (IEEE SoCPaR’2014). Ed. IEEE. Tunis, Tunisia: IEEE, 2014. 418-422. University of California, San Diego. Datasets. 2014. . Wu, V., R. Manmatha and E. M. Riseman. "Finding Text in Images." DL '97 Proceedings of the second ACM international conference on Digital libraries. New York, USA: ACM, 1997. 23-26. Ye, Q., et al. "Fast and robust text detection in images and video frames." Image and Vision Computing (2005): 565-576. Yi, C. and Y. Tian. "Text detection in natural scene images by stroke gabor words." International Conference on Document Analysis and Recognition. Beijing, China: IEEE, 2011. 177-181.
223